DistilBERT

A compressed version of BERT that retains 97% of its performance while being 40% smaller and 60% faster at inference - a landmark demonstration of knowledge distillation applied to large language models.

Added May 18, 2026 · 2 min read

BERT was powerful but large by 2019 standards. Running it on devices with limited memory, or serving it fast enough for real-time applications, was difficult. Hugging Face's DistilBERT showed that a much smaller model could capture most of BERT's capability if trained carefully through a technique called knowledge distillation.

Knowledge distillation trains a small "student" model to imitate a large "teacher" model, rather than learning purely from labelled data. The student is trained not just to predict the correct label for each example, but to match the teacher's probability distribution across all labels - including its uncertainties. If the teacher thinks an input is 70% positive sentiment and 30% negative, the student learns to produce a similar distribution, not just the binary "positive" answer. This richer training signal encodes the teacher's nuanced understanding into the student.
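The soft-target idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Hugging Face's training code: the function names, the example logits, and the temperature value are all hypothetical choices made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over the labels [positive, negative].
# The teacher is 70% / 30% at temperature 1, as in the example above.
teacher = [math.log(0.7), math.log(0.3)]
matching_student = [math.log(0.7), math.log(0.3)]
overconfident_student = [5.0, -5.0]  # right label, but ignores the teacher's uncertainty

loss_matching = distillation_loss(teacher, matching_student)
loss_overconfident = distillation_loss(teacher, overconfident_student)
print(loss_matching, loss_overconfident)
```

A student that reproduces the teacher's distribution gets a loss of zero, while one that is merely confident in the correct label is still penalised. That is the extra signal a hard label alone would not provide.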

For DistilBERT specifically, Hugging Face combined distillation with other techniques: the student was initialised by taking every other layer from the trained BERT model (layer-wise initialisation), halving the depth from 12 layers to 6, and an additional cosine-distance objective pushed the student's hidden-state vectors to point in the same direction as the teacher's, so the student matched the teacher's internal representations and not just its final output.
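Both ideas are easy to picture in code. The sketch below is illustrative only: the helper names are hypothetical, the layers are stand-in strings rather than real weights, the exact layer indices copied in practice may differ from a plain stride of two, and a cosine-distance loss is one way to express an alignment objective of this kind.

```python
import math

def select_student_layers(teacher_layers, stride=2):
    """Keep every `stride`-th teacher layer to initialise a shallower student."""
    return teacher_layers[::stride]

def cosine_alignment_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity: 0 when the vectors point the same way, 2 when opposite."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

# BERT-base has 12 transformer layers; the student keeps 6 of them.
teacher_layers = [f"bert_layer_{i}" for i in range(12)]  # placeholders, not real weights
student_layers = select_student_layers(teacher_layers)
print(student_layers)

# Hypothetical hidden-state vectors for one token.
aligned = cosine_alignment_loss([1.0, 2.0], [2.0, 4.0])    # same direction
misaligned = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])  # orthogonal
print(aligned, misaligned)
```

Starting from copied teacher layers gives the student a far better starting point than random weights, and the alignment term keeps its internal representations close to the teacher's while it trains.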

The result: DistilBERT has 66 million parameters versus BERT-base's 110 million, making it 40% smaller, and it runs 60% faster at inference. On the GLUE benchmark suite of language understanding tasks, it retains 97% of BERT's performance. For many production use cases, this trade-off is a clear win - you pay a small quality cost for a large efficiency gain.
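The headline figures follow from simple arithmetic on the parameter counts and the speed-up. The snippet below works through it, assuming "60% faster" means a 1.6x speed-up (an interpretation, not a measured value) and using hypothetical variable names.

```python
student_params = 66_000_000   # DistilBERT
teacher_params = 110_000_000  # BERT-base

size_ratio = student_params / teacher_params  # fraction of the teacher's size
size_reduction = 1 - size_ratio               # how much smaller the student is

speedup = 1.6             # assumption: "60% faster" read as 1.6x the speed
time_ratio = 1 / speedup  # fraction of the teacher's inference time

print(f"Size reduction: {size_reduction:.0%}")
print(f"Inference time relative to BERT: {time_ratio:.1%}")
```

Under that reading, a 60% speed-up means each request takes about five-eighths of the time, not 40% of it. The two percentages describe different quantities and should not be swapped.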

DistilBERT inspired a wave of compression research, including work on larger models. As GPT-style decoder models grew to billions of parameters, distillation became even more valuable: DistilGPT2 applied the same recipe to GPT-2, and many later compressed models followed the template DistilBERT established.

Analogy

Condensing a thick textbook into a thorough study guide. The study guide cannot fit everything, but if it captures the most important patterns and principles in a digestible form, students who learn from it can perform nearly as well as those who read the full textbook. The study guide (student model) learned by studying the textbook (teacher model), not by starting from scratch.

Real-world example

Many production deployments of NLP in customer service, content moderation, and search ranking use DistilBERT or similar compressed models rather than full-size BERT, because latency and memory cost matter more than squeezing out the last few percent of accuracy. Taking the 60% speed-up at face value, serving 1,000 requests per second with DistilBERT might need a bit under two-thirds of the compute that full BERT would, for similar accuracy.

Why it matters

DistilBERT demonstrated that model compression through distillation was practical at scale, which opened up deployment of sophisticated NLP models on hardware that full-size models could not run on. This became increasingly important as AI moved from data centres to mobile devices and embedded systems.

Related concepts

BERT

Google's landmark 2018 language model that introduced bidirectional pre-training - a model that reads text in both directions simultaneously and set new standards for understanding tasks.

Foundation Model

A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.

Inference

Using a trained AI model to make predictions or generate outputs - the fast, cheap counterpart to training's slow, expensive computation.