latentbrief
Concept

Knowledge Distillation

A training technique where a small model learns to imitate a larger one - capturing most of the large model's capability at a fraction of its size and cost.

Added May 18, 2026

Training a large, capable language model from scratch is extraordinarily expensive - weeks of compute on thousands of GPUs, costing millions of dollars. But once that large model exists, you do not necessarily need its full size for every use case. Knowledge distillation is the technique for extracting a large model's knowledge into a smaller, cheaper-to-run model.

The key insight of distillation is that a trained model's output distribution contains richer information than just the correct label. When a large model processes an input, it produces a full probability distribution over all possible outputs. For a next-word prediction, it might assign 60% probability to "the," 15% to "a," 10% to "this," and small probabilities to thousands of other words. These "soft labels" contain information about which outputs are similar and which are different - information that a one-hot "hard" label, which marks a single output as correct and every other output as equally wrong, does not capture.
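
To make the contrast concrete, here is a minimal sketch in plain Python. It assumes a toy four-word vocabulary in which all the "other words" are lumped into one `<other>` bucket, and a hypothetical `student` distribution; neither comes from a real model. It shows that a hard label carries no information about alternatives (zero entropy), and that a student which picks the right top word but ranks the alternatives badly is only penalised for that when it is scored against the soft label.

```python
import math

# Toy 4-word vocabulary; probabilities echo the example above, with the
# remaining mass lumped into a single "<other>" bucket (an assumption).
vocab = ["the", "a", "this", "<other>"]
hard_label = [1.0, 0.0, 0.0, 0.0]      # one-hot: only the correct word
soft_label = [0.60, 0.15, 0.10, 0.15]  # teacher's full distribution

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(target, predicted):
    """Cross-entropy (in nats) of a predicted distribution against a target."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

# Hypothetical student: right about "the", but it nearly rules out "a" and "this".
student = [0.60, 0.01, 0.01, 0.38]

loss_vs_hard = cross_entropy(hard_label, student)  # only sees the top word
loss_vs_soft = cross_entropy(soft_label, student)  # also sees the alternatives

print(f"entropy of hard label: {abs(entropy(hard_label)):.3f} bits")
print(f"entropy of soft label: {entropy(soft_label):.3f} bits")
print(f"student loss vs hard label: {loss_vs_hard:.3f}")
print(f"student loss vs soft label: {loss_vs_soft:.3f}")
```

The loss against the soft label comes out larger because the student's poor ranking of "a" and "this" is invisible to the hard label.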

A student model trained to match these soft probability distributions learns from much richer signal than one trained only on correct answers. The student learns not just what the right answer is, but something about the structure of the teacher's uncertainty - which alternatives are reasonable, which are clearly wrong, and in what ways different outputs relate to each other.

Distillation works through a modified training objective. The student model is trained to minimise the difference between its output distribution and the teacher's, using a measure like KL divergence. The temperature parameter in the softmax (which converts raw scores into probabilities) is often raised above 1 during distillation training, for both the teacher and the student. Dividing the raw scores by a higher temperature flattens the resulting distribution, making the teacher's output "softer" and exposing more of its relative preferences across outputs.
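
The objective described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the `alpha` weighting between the soft and hard terms, and the multiplication of the soft term by the squared temperature are common conventions that the text above does not specify, and real training code would use a tensor library and operate on batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term."""
    # Soft-target term: teacher and student share the same raised temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by T^2 is a common convention (an assumption here) that keeps the
    # soft term's gradient magnitude comparable as the temperature changes.
    soft_term = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1 - alpha) * hard_term

# Made-up logits for a 3-word vocabulary.
teacher_logits = [4.0, 2.0, 1.0]
student_logits = [2.5, 0.5, 2.0]

print(softmax(teacher_logits, temperature=1.0))  # peaked
print(softmax(teacher_logits, temperature=4.0))  # flatter, same ranking
print(distillation_loss(student_logits, teacher_logits, true_index=0))
```

Setting `alpha=1.0` trains on the teacher's distribution alone, while lower values mix in the ground-truth label.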

Beyond output distillation, more advanced techniques include layer-to-layer distillation - training the student's internal representations to match the teacher's at corresponding layers - and attention distillation - matching the teacher's attention patterns. These richer forms of distillation can produce better students, but they require access to the teacher's internal activations rather than just its outputs.
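
A minimal sketch of layer-to-layer distillation follows, again in plain Python with toy numbers. Everything here is hypothetical: the `layer_map` pairing of student layers to teacher layers, the single shared projection matrix `weights` (needed because the student's hidden size is usually smaller than the teacher's), and the choice of mean squared error as the matching loss. In practice the projection would be learned alongside the student, often one per layer pair.

```python
def project(hidden, weights):
    """Map a student hidden vector to the teacher's width (matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_distillation_loss(student_layers, teacher_layers, layer_map, weights):
    """Average MSE between projected student layers and their paired teacher layers."""
    losses = [mse(project(student_layers[s], weights), teacher_layers[t])
              for s, t in layer_map]
    return sum(losses) / len(losses)

# Toy activations: a 2-layer student of width 2 and a 4-layer teacher of width 3.
student_layers = [[1.0, 0.0], [0.0, 1.0]]
teacher_layers = [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
layer_map = [(0, 1), (1, 3)]  # (student layer index, teacher layer index)
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2 projection, fixed for the demo

loss = feature_distillation_loss(student_layers, teacher_layers, layer_map, weights)
print(loss)
```

This term is typically added to the output-level loss rather than used on its own.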

Analogy

Picture a master chef who records not just their recipes but also their commentary on why certain techniques work, what to look for when a dish is going right or wrong, and how different approaches compare. An apprentice who learns from this rich commentary develops better instincts than one who only gets a list of steps. The soft labels in distillation are the teacher model's commentary, not just its answers.

Real-world example

DistilBERT, released by Hugging Face in 2019, distilled BERT-base into a model that is 40% smaller and about 60% faster while retaining 97% of BERT's language-understanding performance on benchmarks. More recently, the technique has been applied to decoder models: Google's Gemma 2 2B was trained by distillation from a larger model, and Microsoft's Phi-2 relied heavily on synthetic training data generated by larger models, allowing both to punch above their weight in terms of capability per parameter.

Why it matters

Knowledge distillation is central to making AI practical at scale. Not every use case needs a 70-billion-parameter model, and running one is expensive. Distillation lets the investment in training a large model carry over into smaller, faster, cheaper models that can run on limited hardware - on laptops, phones, or edge devices - while maintaining much of the quality that the large model provides.
