Concept
Gradient Clipping
A simple but essential training stabiliser that prevents extremely large gradient updates from destabilising a model - one of those small techniques without which training large models would frequently fail.
Added May 18, 2026
During neural network training, the gradient tells the model which direction to adjust its weights and by how much. Usually these gradients are well-behaved: modest in size, pointing in reasonable directions. But occasionally, particularly in the early stages of training or when the model encounters unusual data, a gradient can become extremely large - what practitioners call an "exploding gradient." A single very large gradient update can dramatically disrupt the weights, erasing many steps of careful learning in one catastrophic update.
Gradient clipping is the straightforward solution: before applying any gradient update, check whether the gradient's magnitude (its norm, most commonly the L2 norm taken over all parameters together) exceeds a threshold. If it does, scale the whole gradient down so its magnitude equals the threshold. The direction of the gradient is preserved - the model still knows which way to adjust its weights - but the step size is capped at a safe maximum. If the gradient is within the threshold, it passes through unchanged.
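The rule described above fits in a few lines. The following is a minimal sketch in plain Python, assuming the gradient is a flat list of floats and the magnitude is measured as an L2 norm; the function name `clip_by_global_norm` and its return shape are illustrative choices, not any particular library's API. Frameworks ship their own versions of this (PyTorch's `torch.nn.utils.clip_grad_norm_`, for example), which would normally be used instead.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so that their L2 norm is at most `max_norm`.

    Returns a tuple (clipped_grads, original_norm). The direction of the
    gradient vector is unchanged; only its length is capped.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Within the threshold: pass through unchanged.
        return list(grads), total_norm
    # Over the threshold: shrink every component by the same factor.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# A gradient of norm 5 clipped to norm 1 keeps its 3:4 direction.
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]
```

Because every component is multiplied by the same factor, the ratio between components (the direction) is untouched. This differs from clipping each component to a range independently, which can change the direction of the gradient.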
The threshold, called the clipping norm, is a hyperparameter typically set between 0.5 and 5.0 for language model training, with 1.0 a common choice. The specific value requires some tuning - too low and you slow learning by clipping legitimate gradients; too high and you fail to catch the genuinely dangerous large gradients that cause instability. In practice, most large language model training runs use gradient clipping as a matter of course, and treat clipping events as diagnostic signals: if clipping is happening very frequently, something may be wrong with the learning rate, batch size, or data.
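One way to treat clipping events as a diagnostic signal is to count how often the gradient norm exceeds the threshold. The sketch below is a hypothetical helper: `ClipMonitor`, its fields, and the list of norms are all made up for illustration. What counts as "too frequent" is left to the practitioner, since the text does not give a figure.

```python
class ClipMonitor:
    """Tracks the fraction of steps whose gradient norm exceeded the threshold."""

    def __init__(self, max_norm):
        self.max_norm = max_norm
        self.steps = 0
        self.clipped_steps = 0

    def record(self, grad_norm):
        """Call once per training step with the pre-clipping gradient norm."""
        self.steps += 1
        if grad_norm > self.max_norm:
            self.clipped_steps += 1

    @property
    def clip_rate(self):
        """Fraction of recorded steps that triggered clipping."""
        return self.clipped_steps / self.steps if self.steps else 0.0


monitor = ClipMonitor(max_norm=1.0)
# Made-up gradient norms standing in for values logged during training.
for grad_norm in [0.4, 0.7, 3.2, 0.5, 0.6, 12.0, 0.5, 0.4]:
    monitor.record(grad_norm)

print(f"clipped {monitor.clip_rate:.0%} of steps")  # clipped 25% of steps
```

A persistently high clip rate suggests the threshold is doing the optimiser's job for it, which is the cue to revisit the learning rate, batch size, or data.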
Large language model training runs are expensive enough that a single training crash represents enormous wasted cost. Gradient clipping is one of the safety nets that prevents such crashes. Alongside careful learning rate scheduling, gradient accumulation, and mixed-precision training, it is part of the standard toolkit that makes month-long training runs on thousands of GPUs reliably successful rather than fragile.
Gradient clipping also interacts with other training choices. Models with residual connections (all modern transformers) are more resistant to exploding gradients than earlier architectures, but not immune. The combination of residual connections, layer normalisation, careful initialisation, and gradient clipping produces the stable training dynamics that modern large model training depends on.
Analogy
A speed limiter on a car that prevents acceleration beyond a safe maximum, regardless of how hard the driver pushes the pedal. The car still accelerates in the intended direction - the limiter does not redirect steering. But if a moment of driver error or external disruption would otherwise cause a dangerous burst of speed, the limiter caps it. Gradient clipping is the same: cap the size, preserve the direction.
Real-world example
During the training of very large models, gradient norms are logged continuously. In the training logs for LLaMA and similar models, you can occasionally see spikes in gradient norm that trigger clipping - often correlating with particularly unusual data batches. The clipping event is visible in the logs but does not interrupt training; without clipping, those same spikes could cause loss divergence and require restarting from a checkpoint.
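The effect of a single spike can be illustrated with a toy problem. The sketch below is not a model of real training: it runs gradient descent on the one-parameter function f(w) = 0.5 * w**2 and injects one artificial gradient of 1e6 to stand in for an unusual batch. All names and numbers are assumptions chosen for illustration.

```python
def train(clip_norm=None, steps=20, lr=0.1, spike_step=10):
    """Gradient descent on f(w) = 0.5 * w**2 with one injected gradient spike.

    If `clip_norm` is given, the gradient's magnitude is capped at that value
    while its sign (its direction, in one dimension) is kept.
    """
    w = 1.0
    for step in range(steps):
        grad = w  # derivative of 0.5 * w**2
        if step == spike_step:
            grad = 1e6  # artificial spike standing in for an unusual batch
        if clip_norm is not None and abs(grad) > clip_norm:
            grad = clip_norm if grad > 0 else -clip_norm
        w -= lr * grad
    return w


w_unclipped = train()
w_clipped = train(clip_norm=1.0)
print(f"final |w| without clipping: {abs(w_unclipped):.1f}")
print(f"final |w| with clipping:    {abs(w_clipped):.4f}")
```

Without clipping, the single spike throws the weight tens of thousands of units away from the minimum at 0; with clipping, the run ends close to it. On this convex toy problem the unclipped run would eventually recover, whereas a real network knocked that far off course may not, which is why a restart from a checkpoint can be needed.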
Why it matters
Gradient clipping is one of those techniques whose importance is only fully appreciated when it is missing. Without it, large model training is significantly more likely to experience instability, requiring expensive restarts from checkpoints and wasting compute time. Its value is largely invisible during successful training runs - it quietly prevents the failures that do not happen.