Concept

Hyperparameter Optimization

The automated search for the best configuration settings for an ML model - learning rate, batch size, architecture choices, regularisation strength - that cannot be learned during training and must be specified in advance.

Added May 18, 2026

Every ML model has two kinds of parameters. Model parameters (weights) are learned during training via gradient descent. Hyperparameters are configuration choices that govern the training process itself - the learning rate, the batch size, the number of layers, the dropout rate, the weight decay - that must be specified before training begins and cannot be optimised by gradient descent directly.

Choosing good hyperparameters is critical: the difference between a well-tuned and poorly-tuned model can be larger than the difference between different model architectures. Hyperparameter optimization (HPO) is the process of systematically searching the hyperparameter space to find settings that maximise model quality on a validation set.

The simplest HPO approach is grid search: define a discrete set of values for each hyperparameter and train a model for every combination. Grid search covers every point on the grid, which is workable for small search spaces, but the number of runs grows exponentially with the number of hyperparameters (the curse of dimensionality: a 10-hyperparameter grid with 5 values each requires 5^10 = 9,765,625 training runs, nearly 10 million).
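As a rough illustration of how a grid is enumerated and how quickly it grows, here is a minimal sketch in plain Python. The hyperparameter names and candidate values are hypothetical, chosen only for the example.

```python
from itertools import product

# Hypothetical grid: 3 hyperparameters with 3 candidate values each.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.25, 0.5],
}

def grid_configs(grid):
    """Yield every combination of values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_size(values_per_hyperparameter, num_hyperparameters):
    """Number of training runs a full grid needs."""
    return values_per_hyperparameter ** num_hyperparameters

configs = list(grid_configs(grid))
print(len(configs))      # 27, i.e. 3^3
print(grid_size(5, 10))  # 9765625, i.e. 5^10
```

In a real search, each dict in `configs` would be passed to a training run and scored on the validation set.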

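Random search improves on grid search by sampling each hyperparameter combination at random from the search space. Counterintuitively, random search is often more efficient than grid search for high-dimensional spaces. A grid only ever tests a handful of distinct values per hyperparameter, however many runs it uses, whereas every random trial draws a fresh value for every hyperparameter. If only a few hyperparameters matter significantly, random search therefore explores many more distinct values of those important hyperparameters for the same number of trials.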
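The coverage argument can be made concrete with a small sketch. It assumes a toy two-dimensional search space on the unit square, where only the first coordinate is taken to matter; both the space and the trial count are illustrative.

```python
import random

def grid_points(values_per_axis):
    """A 2-D grid over [0, 1] x [0, 1] with evenly spaced values per axis."""
    axis = [i / (values_per_axis - 1) for i in range(values_per_axis)]
    return [(x, y) for x in axis for y in axis]

def random_points(num_trials, seed=0):
    """The same number of trials, each coordinate drawn uniformly at random."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(num_trials)]

grid = grid_points(3)    # 9 trials
rand = random_points(9)  # 9 trials

# Suppose only the first hyperparameter (x) affects validation quality.
distinct_x_grid = len({x for x, _ in grid})
distinct_x_rand = len({x for x, _ in rand})
print(distinct_x_grid)  # 3
print(distinct_x_rand)  # 9
```

With the same budget of 9 trials, the grid probes the important hyperparameter at 3 values and random search probes it at 9.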

Bayesian optimisation treats HPO as a sequential decision problem. After each trial, a surrogate model is updated to capture the relationship between hyperparameter settings and validation performance. The classic surrogate is a Gaussian process; tree-structured Parzen estimators and random forests are common alternatives. An acquisition function selects the next hyperparameter setting to try by trading off exploration (trying settings where the surrogate's uncertainty is high) and exploitation (trying settings that the surrogate predicts will perform well). Frameworks like Optuna, Hyperopt, and SMAC implement this family of methods (Optuna and Hyperopt default to a tree-structured Parzen estimator, and SMAC uses a random-forest surrogate) and often need fewer trials than random search to reach good performance, though the gain is not guaranteed for every problem.
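The loop described above can be sketched in a few dozen lines of plain Python. This is a toy under stated assumptions, not how any particular library works: it uses a Gaussian process with a fixed squared-exponential kernel, an upper-confidence-bound acquisition function, a one-dimensional search space given as a list of candidates, and a hypothetical `toy_objective` standing in for validation performance. Real implementations also fit the kernel's own settings and handle mixed or conditional search spaces.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    # Standardise observations so the unit-variance prior is a sensible scale.
    y_mean = sum(ys) / len(ys)
    y_std = (sum((y - y_mean) ** 2 for y in ys) / len(ys)) ** 0.5 or 1.0
    z = [(y - y_mean) / y_std for y in ys]
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, z)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return y_mean + y_std * mean, y_std * math.sqrt(var)

def bayes_opt(objective, candidates, n_init=3, n_iter=10, kappa=2.0):
    """Maximise objective over candidates using the surrogate and UCB."""
    step = max(1, len(candidates) // n_init)
    xs = candidates[::step][:n_init]      # evenly spread initial trials
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        remaining = [c for c in candidates if c not in xs]
        if not remaining:
            break

        def ucb(c):
            mean, std = gp_posterior(xs, ys, c)
            return mean + kappa * std     # exploitation + exploration

        x_next = max(remaining, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)

def toy_objective(x):
    """Hypothetical validation score, peaking at x = 0.3."""
    return -(x - 0.3) ** 2

candidates = [i / 50 for i in range(51)]
best_x, best_y, n_evals = bayes_opt(toy_objective, candidates)
print(best_x, best_y, n_evals)
```

The `kappa` parameter controls the trade-off: larger values favour candidates where the surrogate is uncertain, smaller values favour candidates with a high predicted score.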

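Multi-fidelity methods further accelerate HPO by using cheap proxies for expensive full evaluations, such as training for fewer epochs or on a subset of the data. Successive halving starts many configurations on a small budget, keeps only the best-performing fraction, and repeats with a larger budget per survivor, so that resources concentrate on promising configurations. Its extension Hyperband runs successive halving several times with different trade-offs between the number of configurations and the starting budget for each. Neural Architecture Search (NAS) extends HPO to the architecture itself, treating the model structure as a hyperparameter to be optimised.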
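A minimal sketch of successive halving, assuming a halving factor of 2 and a hypothetical `toy_evaluate` function that stands in for "train this configuration for `budget` epochs and return a validation score":

```python
import math

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configs each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (number of configs evaluated, budget per config) per round
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((len(survivors), budget))
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0], history

def toy_evaluate(config, budget):
    """Hypothetical score: best near lr = 10^-3.5, improving with more budget."""
    return -abs(math.log10(config["lr"]) + 3.5) - 1.0 / budget

configs = [{"lr": lr} for lr in [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]]
best, history = successive_halving(configs, toy_evaluate)
print(best)     # {'lr': 0.0003}
print(history)  # [(8, 1), (4, 2), (2, 4)]
```

Each round spends the same total budget (8 x 1, 4 x 2, 2 x 4), but only the final two configurations receive the largest per-configuration budget. In practice, low-budget scores are noisy predictors of final quality, which is the risk Hyperband hedges against.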

Modern HPO integrations (Weights & Biases Sweeps, Ray Tune, SageMaker Automatic Model Tuning, Vertex AI Vizier) provide cloud-scale parallel hyperparameter searches that can run hundreds of trials simultaneously, compressing what would be weeks of sequential search into hours.

Analogy

Tuning a race car before a race. The car's basic design is fixed (like the choice of model family), but settings like fuel mixture ratio, ignition timing, and tyre pressure can be adjusted. A systematic tuning process tries different combinations of settings on a test track (validation set), measures lap times (model quality), learns from each run, and progressively converges on a strong configuration. Hyperparameter optimization is this systematic tuning process applied to ML models.

Real-world example

Training a transformer-based text classifier, a team uses Optuna to search over learning rate (1e-5 to 1e-3, log scale), batch size (16, 32, 64, 128), warmup steps (0, 100, 500, 2000), and dropout rate (0.0 to 0.5). In this illustrative scenario, random search might need on the order of 200 trials to cover the space adequately, while Optuna's model-based search finds a configuration achieving 89.2% F1 in 40 trials by leveraging information from prior trials. The best configuration has a learning rate of 3e-4 and batch size 32 - not values that intuition would have suggested first.
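The search space in this example can be written down as a small sampler. The sketch below uses plain Python rather than Optuna, and the dictionary keys are names chosen for illustration; it shows in particular what sampling the learning rate "on a log scale" means.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from the search space described above."""
    log_low, log_high = math.log10(1e-5), math.log10(1e-3)
    return {
        # Log scale: sample the exponent uniformly, then exponentiate.
        "learning_rate": 10 ** rng.uniform(log_low, log_high),
        "batch_size": rng.choice([16, 32, 64, 128]),
        "warmup_steps": rng.choice([0, 100, 500, 2000]),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(40)]
for config in configs[:3]:
    print(config)
```

Sampling the exponent uniformly gives each decade (1e-5 to 1e-4, 1e-4 to 1e-3) roughly equal attention, whereas sampling the learning rate itself uniformly would place about 90% of trials above 1e-4.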

Why it matters

Hyperparameter choice has a large impact on model quality, sometimes larger than the choice of model architecture. Without systematic HPO, practitioners rely on defaults and intuition, frequently leaving significant performance on the table. Understanding HPO methods - their assumptions, tradeoffs, and scaling properties - is essential for anyone who wants to train models that perform at their potential rather than at the level of arbitrarily chosen defaults.
