
Concept

Mixed Precision Training

A training technique that uses lower-precision numerical formats (FP16 or BF16) for most computations while maintaining higher-precision (FP32) master copies of weights - cutting memory usage and accelerating training without sacrificing model quality.

Added May 18, 2026

Neural network training and inference involve enormous numbers of floating-point operations. The standard 32-bit floating-point format (FP32) uses 4 bytes per value. Training a 7-billion-parameter model with FP32 requires 28GB just for the weights - before accounting for gradients, optimiser states, and activations. Mixed precision training is the now-standard approach of using lower-precision formats where possible to reduce memory footprint and increase computational throughput.

Modern GPUs have specialised hardware (Tensor Cores on NVIDIA GPUs, Matrix Engines on other accelerators) that can execute 16-bit floating-point matrix operations far faster than 32-bit operations - up to 8x faster for certain operations on the H100. Mixed precision exploits this by using 16-bit formats for the bulk of computation while retaining 32-bit precision where it matters for training stability.

The standard mixed precision approach maintains FP32 master copies of the model weights (the authoritative parameters). For each forward and backward pass, the weights are cast to FP16 or BF16 for computation. The backward pass produces gradients in 16-bit, which are accumulated in FP32 and used to update the FP32 master weights. The FP16/BF16 copies used for computation are ephemeral; only the FP32 masters are permanently stored. This sounds like it wastes memory on duplicate weights, but the computation (which dominates training time) runs at 16-bit speed, and activations - which often account for a large share of memory consumption at large batch sizes and long sequence lengths - are stored in 16-bit.
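The value of the FP32 master copy can be illustrated with a minimal Python sketch. It assumes a single scalar weight and a constant gradient, and it imitates the storage formats with two hypothetical helpers (`to_fp16`, `to_fp32`) built on the standard library's `struct` module. It is an illustration of the rounding behaviour, not a training loop from any particular framework.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def train(steps=100, lr=0.01, grad=0.01):
    master = to_fp32(1.0)     # FP32 master weight (authoritative copy)
    fp16_only = to_fp16(1.0)  # weight kept only in FP16, for comparison
    for _ in range(steps):
        g = to_fp16(grad)  # gradient as produced by a 16-bit backward pass
        # Update applied to the master copy, rounded to FP32 after each step.
        master = to_fp32(master - lr * to_fp32(g))
        # Same update applied to a weight that is rounded back to FP16.
        fp16_only = to_fp16(fp16_only - lr * g)
    return master, fp16_only

master, fp16_only = train()
print(master)     # roughly 0.99
print(fp16_only)  # still 1.0: each step is smaller than FP16 can resolve near 1.0
```

Each update here is about 1e-4, while the gap between adjacent FP16 values just below 1.0 is about 4.9e-4, so the FP16-only weight never moves. The FP32 master accumulates all 100 updates.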

FP16 and BF16 are distinct 16-bit formats with different tradeoffs. FP16 has more mantissa bits (10, giving more precision in the represented value) but a narrower exponent range (5 bits, with a maximum finite value of 65,504), making it prone to overflow (values too large for the format) and underflow (values too small, rounded to zero) during gradient computations. BF16 has fewer mantissa bits (7) but the same 8-bit exponent range as FP32, making it more numerically stable for training. On hardware that supports it, BF16 has become the usual choice for training, while FP16 is now primarily used for inference.
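The range-versus-precision tradeoff can be checked directly. The sketch below is a standard-library approximation: `to_fp16` and `to_bf16` are hypothetical helper names, and the BF16 rounding is done by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even, without handling NaN or values outside the FP32 range.

```python
import math
import struct

def to_fp16(x):
    """Round to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_bf16(x):
    """Round to bfloat16 by keeping the top 16 bits of the FP32 encoding.

    Uses round-to-nearest-even; NaN and FP32 overflow are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias before truncation
    return struct.unpack(">f", struct.pack(">I", (bits >> 16) << 16))[0]

for value in (70000.0, 1e-8, 1.001):
    print(value, "-> fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

In this sketch 70000.0 overflows to infinity in FP16 but survives in BF16 as 70144.0, and 1e-8 rounds to zero in FP16 but stays non-zero in BF16. In the other direction, 1.001 keeps a fractional part in FP16 (1.0009765625) and collapses to 1.0 in BF16.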

FP16 training requires loss scaling to address the underflow problem: the loss is multiplied by a large scale factor before the backward pass, which by the chain rule scales every gradient by the same factor and shifts small gradients into the representable range. The gradients are then divided by the scale factor before the weight update. This prevents small gradient values from vanishing to zero in FP16 representation. Dynamic loss scaling adjusts the scale factor automatically: it lowers the scale (and typically skips the update) when gradient overflow is detected, and raises it again after a run of steps without overflow.
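A small sketch of the idea, using only the standard library. The `DynamicLossScaler` class, its halving/doubling rule, and the `growth_interval` default are illustrative assumptions rather than any framework's API, and the scale is applied to one gradient value directly instead of to a loss.

```python
import math
import struct

def to_fp16(x):
    """Round to IEEE half precision; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

class DynamicLossScaler:
    """Hypothetical scaler: halve on overflow, double after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=3):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            self.scale /= 2.0  # back off; a real loop would also skip this step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= 2.0  # try a larger scale again
                self.good_steps = 0

scaler = DynamicLossScaler()
true_grad = 1e-8

unscaled = to_fp16(true_grad)               # underflows to 0.0 in FP16
scaled = to_fp16(true_grad * scaler.scale)  # representable after scaling
recovered = scaled / scaler.scale           # unscale in higher precision

# A gradient of 1.0 overflows at this scale, so the scaler backs off.
overflowed = math.isinf(to_fp16(1.0 * scaler.scale))
scaler.update(overflowed)
print(unscaled, recovered, scaler.scale)
```

The unscaled gradient is lost entirely, while the scaled-then-unscaled value is recovered to within FP16's relative precision. The overflow check at the end halves the scale from 65536 to 32768.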

Beyond training, quantised inference uses even lower precision (INT8, INT4, or 4-bit NF4 formats) for serving deployed models, trading a small amount of model quality for dramatic memory and throughput improvements.
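As a rough guide to what these formats mean for serving, the weights-only footprint is the parameter count multiplied by the bits per weight. The sketch below assumes 1 GB = 1e9 bytes and ignores quantisation metadata such as per-block scales, so real deployments will be somewhat larger; `weight_gb` is a hypothetical helper name.

```python
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP16": 16, "INT8": 8, "INT4": 4, "NF4": 4}

def weight_gb(num_params, fmt):
    """Weights-only size in GB (1 GB = 1e9 bytes), ignoring quantisation metadata."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(fmt, weight_gb(7e9, fmt), "GB")
```

For a 7-billion-parameter model this gives 28GB in FP32, 14GB in either 16-bit format, 7GB in INT8, and 3.5GB in a 4-bit format.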

Analogy

A construction company that uses high-precision laser measurements for the critical structural joints of a building but uses tape measures for less critical dimensions. The precision is applied where it matters (load-bearing structures, alignment) and lower-precision tools are used where minor rounding errors are inconsequential (rough framing, non-structural elements). The result is faster construction with the same structural integrity as if everything were laser-measured.

Real-world example

Training a 13B-parameter language model in FP32 would require 52GB for weights, another 52GB for gradients (stored in FP32), and 104GB for the two Adam optimiser moments - about 208GB in total, so at least three A100 80GB GPUs are needed for weights, gradients, and optimiser states before any activations are stored. With mixed precision training using BF16, the working weights are 26GB and gradients can be stored in BF16 at 26GB, but the FP32 master weights (52GB) and FP32 Adam moments (104GB) remain, so this part of the footprint is not smaller than in the FP32 case. The memory benefit comes from activations, which are roughly half the size in BF16, and training is often around 2x faster, depending on the model and hardware, due to BF16 Tensor Core utilisation.
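The per-component arithmetic can be reproduced with a short calculator. It is a sketch under stated assumptions: Adam keeps two FP32 moments per parameter (8 bytes), mixed precision keeps an FP32 master copy alongside BF16 weights and gradients, 1 GB = 1e9 bytes, and activations are excluded. `model_state_gb` is a hypothetical helper name.

```python
def model_state_gb(num_params, mixed_precision):
    """Per-component model-state memory in GB, excluding activations."""
    if mixed_precision:
        bytes_per_param = {
            "bf16 weights": 2,
            "bf16 gradients": 2,
            "fp32 master weights": 4,
            "fp32 Adam moments (m, v)": 8,
        }
    else:
        bytes_per_param = {
            "fp32 weights": 4,
            "fp32 gradients": 4,
            "fp32 Adam moments (m, v)": 8,
        }
    return {name: num_params * b / 1e9 for name, b in bytes_per_param.items()}

for mixed in (False, True):
    breakdown = model_state_gb(13e9, mixed_precision=mixed)
    label = "mixed precision" if mixed else "full FP32"
    print(label, breakdown, "total:", sum(breakdown.values()), "GB")
```

Under these assumptions both configurations come to 16 bytes per parameter, so the saving from mixed precision shows up in activation memory and throughput rather than in weights and optimiser states.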

Why it matters

Mixed precision training is one of the foundational techniques that enabled the rapid scaling of models throughout the 2020s. By cutting activation memory roughly in half and substantially increasing throughput for the cost of a small implementation effort, it became the default training configuration for any serious ML training workload. Understanding it is important for anyone setting up training infrastructure or trying to fit large models on available hardware.
