latentbrief

Gradient Descent

The algorithm that trains neural networks - iteratively adjusting parameters in the direction that reduces the model's error.

Added May 21, 2026 · 3 min read

Training a neural network means finding the right values for millions or billions of parameters. You cannot try every possible combination - the search space is unimaginably vast. Gradient descent makes the search feasible: rather than guessing blindly, it uses the local slope of the loss to decide which way to move every parameter at each step.

The key mathematical object is the gradient: a vector that points in the direction of steepest increase in the loss function. To reduce the loss, you move in the opposite direction - the direction of steepest decrease. This is gradient descent: compute the gradient of the loss with respect to every parameter, then subtract a small fraction of it from each parameter value.
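The update rule described above can be sketched in a few lines. This is a minimal illustration, not a training recipe: the two-parameter bowl-shaped loss, its minimum at (3, -1), and the names `loss`, `grad` and `gradient_descent` are all assumptions chosen for the example.

```python
def loss(w):
    # Hypothetical loss: a simple bowl whose lowest point is at w = (3, -1).
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def grad(w):
    # Gradient of the loss above, worked out by hand.
    # It points in the direction of steepest increase.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        g = grad(w)
        # Move against the gradient: subtract a small fraction of it
        # from each parameter.
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

Starting from (0, 0), the parameters move steadily towards the minimum of this toy loss. A real network differs mainly in scale: the same subtraction is applied to every one of its parameters.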

The fraction subtracted is controlled by the learning rate - a hyperparameter that determines the step size. A high learning rate makes large adjustments and learns fast but may overshoot good solutions, oscillate, or even diverge. A low learning rate makes small, careful adjustments but learns slowly and may stall on a plateau or in a poor local minimum. Most modern training uses adaptive learning rate algorithms like Adam, which adjust the effective learning rate per parameter based on running averages of recent gradients and their squares.
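The effect of the step size is easy to see on a one-parameter toy problem. The sketch below assumes a hypothetical loss w² (gradient 2w) and three arbitrarily chosen learning rates; it uses plain gradient descent, not Adam.

```python
def run(learning_rate, steps=50, w=1.0):
    # Plain gradient descent on the toy loss w**2, whose gradient is 2*w.
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w


too_high = run(1.1)    # each step overshoots the minimum and lands further away
moderate = run(0.3)    # shrinks quickly towards the minimum at 0
too_low = run(0.001)   # moves in the right direction, but barely

print(too_high, moderate, too_low)
```

With the high rate the parameter grows in magnitude on every step, with the moderate rate it collapses towards zero, and with the low rate it has barely left its starting value after 50 steps.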

The gradient is computed by backpropagation: an algorithm that efficiently calculates the partial derivative of the loss with respect to each parameter, that is, how much the loss would change if that parameter changed slightly. It works by applying the chain rule of calculus backward through the network, from the output to the input, reusing intermediate results along the way. Without backpropagation, training deep networks would be computationally infeasible.
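To make the chain rule concrete, here is a hand-written backward pass for a deliberately tiny network. Everything about it is an assumption for illustration: a single tanh hidden unit, two weights `w1` and `w2`, a squared-error loss, and made-up input values. The result is checked against a finite-difference estimate.

```python
import math


def forward(w1, w2, x, y):
    # Tiny hypothetical network: one tanh hidden unit, one linear output.
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    return h, y_hat, loss


def backward(w1, w2, x, y):
    # Chain rule applied from the loss back towards the input.
    h, y_hat, _ = forward(w1, w2, x, y)
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h           # y_hat = w2 * h
    dloss_dh = dloss_dyhat * w2
    dh_dz = 1.0 - h ** 2                  # derivative of tanh
    dloss_dw1 = dloss_dh * dh_dz * x      # z = w1 * x
    return dloss_dw1, dloss_dw2


def numerical_grad(w1, w2, x, y, eps=1e-6):
    # Central finite differences: slow, but a useful independent check.
    d1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    d2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return d1, d2


analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numerical_grad(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)
```

The finite-difference check needs two extra forward passes per parameter, which is why it does not scale to large networks. The backward pass obtains every derivative in one sweep by reusing `dloss_dyhat` and `dloss_dh`.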

In practice, gradient descent is not run on the full training dataset at once. Instead, stochastic gradient descent (SGD) uses a single random example per step, and mini-batch gradient descent uses a small random subset, to compute an estimate of the true gradient. This is noisier than using all the data but much faster per step, and the noise is often beneficial - it can help the optimiser escape poor local minima and saddle points.
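A mini-batch loop might look like the following sketch. The setting is assumed for illustration only: synthetic data drawn from the line y = 2x + 1 with a little noise, a two-parameter linear model, a mean squared error loss, and arbitrary choices of batch size and learning rate.

```python
import random

random.seed(0)

# Synthetic data from a hypothetical "true" line y = 2x + 1, plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]


def minibatch_sgd(data, learning_rate=0.1, batch_size=32, epochs=50):
    w, b = 0.0, 0.0
    data = list(data)  # copy so shuffling does not reorder the caller's list
    for _ in range(epochs):
        random.shuffle(data)  # a fresh random order each pass over the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error, estimated on this batch only.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


w, b = minibatch_sgd(data)
print(w, b)
```

Each update sees only 32 examples, so individual steps are noisy, yet the estimates end up close to the slope and intercept used to generate the data.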

Analogy

Imagine trying to find the lowest point in a hilly landscape in thick fog. You cannot see the entire landscape, so you cannot jump to the lowest point directly. Instead, you feel the slope of the ground immediately around you and take a step downhill. Repeat this many times and you will, hopefully, end up in a valley, though not necessarily the deepest one. Gradient descent does the same thing in a parameter space with billions of dimensions.

Real-world example

Consider training a spam classifier on millions of emails. For each batch of emails, the model predicts a spam probability for every message, the cross-entropy loss is computed against the true labels, backpropagation calculates the gradient of that loss with respect to each parameter, and gradient descent nudges every parameter slightly in the direction that reduces the error on that batch. After enough steps across enough data, the parameters settle into values that classify spam reliably.
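The loop above can be sketched end to end on a toy scale. This is only an illustration under heavy assumptions: the "emails" are synthetic points with two hypothetical standardised features, the model is logistic regression instead of a deep network (so the backward pass reduces to a single formula), and all hyperparameters are arbitrary.

```python
import math
import random

random.seed(0)


def make_email(is_spam):
    # Two hypothetical standardised features; spam clusters around (1.5, 1.0).
    centre = (1.5, 1.0) if is_spam else (-1.5, -1.0)
    return [random.gauss(c, 1.0) for c in centre], 1.0 if is_spam else 0.0


emails = [make_email(i % 2 == 0) for i in range(400)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def cross_entropy(weights, bias, examples):
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for features, label in examples:
        p = min(max(predict(weights, bias, features), eps), 1.0 - eps)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(examples)


def train(examples, learning_rate=0.1, batch_size=20, epochs=30):
    weights, bias = [0.0, 0.0], 0.0
    examples = list(examples)
    for _ in range(epochs):
        random.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            grad_w, grad_b = [0.0, 0.0], 0.0
            for features, label in batch:
                # For sigmoid + cross-entropy, d(loss)/d(logit) = p - label.
                error = predict(weights, bias, features) - label
                for j, f in enumerate(features):
                    grad_w[j] += error * f / len(batch)
                grad_b += error / len(batch)
            weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b
    return weights, bias


initial_loss = cross_entropy([0.0, 0.0], 0.0, emails)
weights, bias = train(emails)
final_loss = cross_entropy(weights, bias, emails)
accuracy = sum((predict(weights, bias, f) > 0.5) == (y == 1.0) for f, y in emails) / len(emails)
print(initial_loss, final_loss, accuracy)
```

The loss falls from its starting value as the weights move away from zero, and most of the synthetic emails end up on the correct side of the decision boundary. Accuracy here is measured on the training data itself, so it says nothing about how a real classifier would generalise.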

Why it matters

Gradient descent, in one variant or another, is the training algorithm behind virtually all of deep learning. Without it, there would be no large language models, no image generators, no speech recognisers. Much of what makes modern AI remarkable is the product of gradient descent run at enormous scale. Understanding it explains why AI training requires so much compute and why it can fail - bad learning rates, poor data, or architectural issues all show up as gradient descent failing to converge.
