Quantization-Aware Training (QAT)

Training a model while simulating the numerical precision it will run at after deployment - producing compressed models that stay accurate even when their weights are stored in low-precision formats.

Added May 18, 2026 · 3 min read

Neural network weights are typically stored and computed in 32-bit or 16-bit floating-point format during training. These high-precision numbers allow gradients to flow accurately during backpropagation and the model to converge reliably. But when it comes time to deploy a large model - particularly for inference on consumer hardware or edge devices - this precision is expensive in memory and compute. Quantization is the process of reducing precision, typically to 8-bit integers (INT8) or 4-bit integers (INT4), to make the model smaller and faster.
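To make the precision reduction concrete, here is a minimal sketch of one common scheme: symmetric, per-tensor quantization with a single shared scale. The function names and the sample weights are illustrative assumptions, not taken from any particular library, and real toolchains often use per-channel scales or zero-points instead.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers using one shared scale (per-tensor, symmetric)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]


# Hypothetical weights chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_symmetric(weights, num_bits=8)
restored = dequantize(q, scale)
print(q)         # [42, -127, 0, 90, -55]
print(restored)  # close to the originals; 0.003 collapses to 0.0
```

Only the integers and the scale need to be stored. The small weight 0.003 rounding to zero is an example of the fine-grained detail that lower precision discards.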

The problem with quantizing a model after training is that the model was never optimised to work well at lower precision. Weights that were distinct in 32-bit can collapse onto the same value, or lose important fine-grained distinctions, when rounded to 8 or 4 bits. The quantization error - the difference between the original and quantized weights - accumulates across layers and can significantly degrade quality, especially in layers whose weights span a wide range or contain outliers.

Quantization-Aware Training addresses this by simulating quantization during training itself. The weights are still stored in high precision (for accurate gradient computation), but the forward pass uses simulated low-precision arithmetic - fake quantize operations that round weights to the target precision and then convert them back to floating point. Because rounding has a zero gradient almost everywhere, the backward pass typically treats it as the identity (the straight-through estimator), so updates still reach the high-precision weights. The model learns to distribute its weights in ways that are robust to this rounding. By the time training is complete, the model's parameters have settled into configurations that function well even at low precision.
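The training loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not how any framework implements QAT: the two-weight linear model, the fixed scale of 0.25, the learning rate, and all function names are hypothetical. Gradients pass through the rounding step as if it were the identity (the straight-through estimator).

```python
def fake_quantize(w, scale, num_bits=4):
    """Quantize then immediately dequantize, so the value stays a float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def train_with_fake_quant(data, scale=0.25, num_bits=4, lr=0.05, steps=200):
    latent = [0.0, 0.0]  # high-precision weights that receive the updates
    n = len(data)
    for _ in range(steps):
        # The forward pass sees only the rounded weights.
        w_q = [fake_quantize(w, scale, num_bits) for w in latent]
        grad = [0.0, 0.0]
        for x, y in data:
            err = w_q[0] * x[0] + w_q[1] * x[1] - y
            # Straight-through estimator: treat d(w_q)/d(latent) as 1, so the
            # gradient of the squared error is applied to the latent weights.
            grad[0] += 2 * err * x[0] / n
            grad[1] += 2 * err * x[1] / n
        latent = [w - lr * g for w, g in zip(latent, grad)]
    return latent, [fake_quantize(w, scale, num_bits) for w in latent]


# Hypothetical target weights, deliberately off the 0.25 grid.
true_w = [0.8, -0.45]
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
data = [(x, true_w[0] * x[0] + true_w[1] * x[1]) for x in inputs]

latent, deployed = train_with_fake_quant(data)
print(latent)    # high-precision weights hovering near a rounding boundary
print(deployed)  # grid-aligned weights, each within one step of the target
```

The loss is always measured on the rounded weights, so the latent weights are pushed toward values whose rounded versions predict well. Production QAT usually also learns or calibrates the scale, which this sketch keeps fixed.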

The result is a quantized model that typically retains more accuracy than one quantized after training, with the advantage growing as the bit width shrinks. The quality gap between a QAT INT8 model and the full-precision original is typically small enough to be acceptable for most applications, while the benefits are substantial: a model quantized to 8 bits needs a quarter of the weight memory of a 32-bit model, runs faster on hardware with native INT8 support, and costs less to serve at scale.

For 4-bit quantization - which fits large models onto consumer GPUs - QAT is even more critical, because 4 bits give only 16 representable levels against 256 for 8-bit, so each weight is rounded far more coarsely. Techniques like QLoRA combine 4-bit quantization with LoRA fine-tuning, allowing models to be both compressed and adapted to specific domains simultaneously.
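A rough way to see how much coarser 4-bit rounding is: quantize the same values at both bit widths and compare the average round-trip error. The sketch below assumes symmetric per-tensor quantization and a hypothetical set of weights spread evenly over [-1, 1]; real weight distributions and quantization schemes will give different ratios.

```python
def roundtrip_error(weights, num_bits):
    """Mean absolute error after quantizing and dequantizing with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical weights: 1001 values evenly spaced from -1.0 to 1.0.
weights = [i / 500 - 1.0 for i in range(1001)]

err8 = roundtrip_error(weights, 8)
err4 = roundtrip_error(weights, 4)
print(f"INT8 mean error: {err8:.5f}")
print(f"INT4 mean error: {err4:.5f}")
print(f"ratio: {err4 / err8:.1f}x")
```

The step size grows from 1/127 to 1/7 of the largest weight, so the 4-bit error is more than an order of magnitude larger on this input.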

Analogy

A photographer who learns to compose shots specifically for how they will look after printing on newsprint, rather than composing for perfect display on a high-resolution monitor and then printing. The newsprint version will look much better if the photographer anticipated its constraints during composition - leaving room for how ink spreads, choosing contrast that survives the coarser medium.

Real-world example

QLoRA (Quantized LoRA), published in 2023, combined 4-bit quantization with LoRA fine-tuning in a way that made it possible to fine-tune 65-billion parameter models on a single GPU with 48GB of memory - hardware that could not hold the model's 16-bit weights (roughly 130GB), let alone the gradients and optimizer state needed for full fine-tuning. QLoRA keeps the 4-bit base weights frozen and trains small higher-precision LoRA adapters on top of them, so the adapters are learned with the quantization error already present. The resulting models were competitive with models fine-tuned in 16-bit precision.

Why it matters

Quantization is what makes large language models practical to run on accessible hardware. A 70-billion parameter model in 16-bit precision requires about 140GB of GPU memory for its weights alone - multiple high-end data centre GPUs. Quantized to 4-bit, the weights take about 35GB - within reach of a single 48GB workstation GPU or a pair of 24GB consumer cards. This difference determines whether a model can be deployed locally, on mobile, or in cost-sensitive production environments.
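The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by eight bits per byte. The sketch below counts weights only, in decimal gigabytes; it ignores activations, the KV cache, and the small overhead of stored quantization scales, so real deployments need somewhat more. The function name is an illustrative assumption.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 70e9  # a 70-billion parameter model

fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
int4_gb = weight_memory_gb(params, 4)

print(fp16_gb, int8_gb, int4_gb)  # 140.0 70.0 35.0
```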

Related concepts

Inference

Using a trained AI model to make predictions or generate outputs - the fast, cheap counterpart to training's slow, expensive computation.

LoRA (Low-Rank Adaptation)

The most widely used technique for efficiently fine-tuning large language models - adapting billions of parameters to new tasks by updating only a tiny fraction of the total weight count.

Parameter-Efficient Fine-Tuning (PEFT)

A family of techniques for adapting large language models to specific tasks by updating only a small fraction of their parameters - making fine-tuning accessible without massive compute budgets.
