Concept

Bayesian Optimization

A sample-efficient method for tuning a model's hyperparameters: it builds a probabilistic model of how settings affect performance and uses it to choose which settings to try next.

Added May 18, 2026

Training a neural network involves dozens of choices: learning rate, batch size, weight decay, dropout rate, the architecture's width and depth. These are called hyperparameters, and they are not learned during training - they are set before training and held fixed. The wrong choices can make a model fail to learn or learn slowly; the right choices can make the difference between a mediocre and an excellent result.

Searching for good hyperparameters is expensive because each evaluation requires a full (or at least partial) training run. Grid search - trying every combination of a predefined grid of values - is systematic, but the number of runs grows exponentially with the number of hyperparameters. Random search - sampling combinations at random - often matches or beats grid search for the same budget, because it does not spend runs re-testing the same few values of the hyperparameters that matter most. It is still wasteful, though, since it treats each trial as independent and learns nothing from previous results.
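To make the exponential growth concrete, here is a small sketch that counts the runs a grid search would need. The hyperparameter names and candidate values are illustrative assumptions, not recommendations.

```python
import itertools
import math

# Hypothetical search grid: names and values are for illustration only.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size": [16, 32, 64, 128],
    "weight_decay": [0.0, 0.01, 0.1],
    "dropout": [0.0, 0.1, 0.3, 0.5],
}

# Grid search trains one model per element of the Cartesian product,
# so the run count is the product of the per-hyperparameter list lengths.
n_trials = math.prod(len(values) for values in grid.values())

# Enumerating the combinations explicitly gives the same count.
combinations = list(itertools.product(*grid.values()))


def grid_size(values_per_param: int, n_params: int) -> int:
    """Number of runs for a grid with the same resolution on every axis."""
    return values_per_param ** n_params
```

With five values per axis, four hyperparameters already need 625 runs and eight need 390,625, which is why exhaustive grids stop being practical quickly.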

Bayesian optimisation does better by maintaining a probabilistic model of the relationship between hyperparameter settings and model performance. This surrogate model (typically a Gaussian process) learns from every trial: given that learning rate 0.001 produced a validation loss of X, update the probabilistic model of how learning rate relates to loss. The next hyperparameter setting to try is chosen to balance exploration (trying settings in unexplored regions) with exploitation (trying settings near where the surrogate model predicts high performance).
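The surrogate idea can be sketched with a tiny Gaussian process over a single hyperparameter. This is a minimal illustration under several assumptions that are not from the text above: the input is the base-10 logarithm of the learning rate, the kernel is a squared-exponential (RBF) with a fixed length scale, the observed losses are made up, and the prior mean is set to the average observed loss. Real libraries fit the kernel settings to the data and handle many dimensions.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby settings get similar losses."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of the loss at x_new."""
    y_mean = sum(ys) / len(ys)               # assumed constant prior mean
    centred = [y - y_mean for y in ys]
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new, length_scale) for a in xs]
    alpha = solve(K, centred)                # K^-1 (y - mean)
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                     # K^-1 k_star
    var = rbf(x_new, x_new, length_scale) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


# Made-up trials: log10(learning rate) -> validation loss.
xs = [-4.0, -3.0, -2.0]
ys = [0.9, 0.5, 0.7]
```

At a setting that has already been tried, the posterior mean reproduces the observed loss and the standard deviation is close to zero. Far from any trial, the prediction falls back to the prior mean and the uncertainty returns to its prior level, which is the signal the search uses to decide where exploration is worthwhile.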

The exploration-exploitation balance is formalised through an acquisition function - a formula that scores candidate settings based on both their predicted performance and the model's uncertainty about them. Settings that the surrogate model confidently predicts will be good score high. Settings in uncertain regions also score high because they might be excellent and we haven't tried them. This principled balance allows Bayesian optimisation to find near-optimal hyperparameters in many fewer trials than grid or random search.
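One widely used acquisition function is expected improvement. The sketch below is written for minimising a validation loss and assumes the surrogate's prediction at each candidate is a normal distribution with a given mean and standard deviation. The candidate names and numbers are invented for illustration, and `xi` is an optional exploration margin.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_loss, xi=0.0):
    """Expected amount by which a candidate beats the best loss so far.

    mean, std: surrogate prediction for the candidate (assumed normal).
    best_loss: lowest validation loss observed in earlier trials.
    """
    gain = best_loss - mean - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(gain, 0.0)
    z = gain / std
    # First term rewards a good predicted mean (exploitation),
    # second term rewards uncertainty (exploration).
    return gain * normal_cdf(z) + std * normal_pdf(z)


best_loss = 0.50
# Hypothetical candidates: (predicted mean loss, predicted std).
candidates = {
    "confident_good": (0.45, 0.02),
    "uncertain": (0.60, 0.30),
    "confident_bad": (0.80, 0.02),
}
scores = {
    name: expected_improvement(mean, std, best_loss)
    for name, (mean, std) in candidates.items()
}
next_setting = max(scores, key=scores.get)
```

With these made-up numbers, both the confidently good candidate and the uncertain one receive meaningful scores, the uncertain one slightly higher, while the confidently bad candidate scores essentially zero.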

For large language model fine-tuning, where a single training run can take hours or days, needing fewer trials means a fixed compute budget can reach better configurations. Modern hyperparameter optimisation tools such as Optuna, Ray Tune, and Weights & Biases all offer Bayesian optimisation methods.
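As a rough idea of what this looks like in practice, here is a sketch of a Weights & Biases sweep configuration written as a Python dictionary. The metric name, hyperparameter names, and ranges are assumptions chosen for illustration. The launch calls are shown only as comments so the snippet runs without the library installed, and the library's own documentation should be checked for current options.

```python
# Hypothetical sweep configuration; names and ranges are illustrative.
sweep_config = {
    "method": "bayes",  # ask for Bayesian optimisation rather than grid or random
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "dropout": {"distribution": "uniform", "min": 0.0, "max": 0.5},
        "batch_size": {"values": [16, 32, 64]},
    },
}

# Launching would look roughly like this (train is a user-defined function
# that reads the chosen settings and logs "val_loss"):
#
#   import wandb
#   sweep_id = wandb.sweep(sweep=sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train, count=20)
```

Searching the learning rate on a logarithmic scale is a common choice because useful values often span several orders of magnitude.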

Analogy

A detective who, instead of randomly visiting every possible suspect, builds a profile from each interview that narrows the space of who to investigate next. Each piece of evidence updates the probabilistic model of who is most likely guilty, guiding increasingly targeted investigation. Bayesian optimisation is this detective approach applied to hyperparameter search: each trial updates the model of the performance landscape, guiding the next experiment.

Real-world example

DeepMind reported using Bayesian optimisation to tune AlphaGo's hyperparameters several times during development, including those controlling its game-tree search. Each trial was expensive because it required playing many games. In one reported case, tuning before the match against Lee Sedol raised the agent's self-play win rate from 50% to 66.5%.

Why it matters

Hyperparameter tuning is often the difference between a model that works well and one that works very well. For organisations fine-tuning models for production deployment, where the target performance has a direct business impact, systematic hyperparameter search can provide meaningful quality improvements. Bayesian optimisation makes this search more sample-efficient - extracting more improvement per trial.

Related concepts

Fine-tuning, Gradient Accumulation
