Lottery Ticket Hypothesis

The finding that large neural networks contain small sub-networks that can be trained to match the full network's performance - suggesting that much of a model's capacity may be redundant.

Added May 18, 2026 · 3 min read

When you train a large neural network, you initialise millions or billions of parameters randomly and then adjust them through gradient descent. The resulting trained network is dense - every parameter has been adjusted and presumably contributes to performance. But in 2019, MIT researchers Jonathan Frankle and Michael Carbin reported a surprising finding: dense, randomly initialised networks contain much smaller sub-networks - "winning lottery tickets" - that, when trained in isolation from their original initialisation, can match the full network's performance.

The lottery ticket hypothesis gets its name from the analogy: when you buy many lottery tickets, most are losers and a few are winners. In a randomly initialised neural network, most sub-networks are losers - they will not train well. But a few sub-networks happen to be initialised in configurations that are particularly amenable to learning. These winning tickets are the ones that do most of the useful learning in the full network; the rest of the parameters are effectively redundant.

The experimental method for finding these tickets is called iterative magnitude pruning: train the full network, identify the weights with the smallest magnitudes (presumed least important), remove them, reset the remaining weights to their original initialisation values, and retrain. Repeating this cycle prunes the network a little further each round. The surviving weights, trained from their original initialisations, often match the full network's performance at 10-20% of its parameter count.
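The train, prune, reset, retrain loop described above can be sketched on a toy problem. This is a minimal illustration, not the original experimental code: the linear model, the synthetic data, the function names (`train`, `prune_smallest`, `iterative_magnitude_pruning`), and the settings (5 rounds, 20% pruned per round) are all assumptions chosen to keep the example small.

```python
import random

def train(init, mask, data, lr=0.1, epochs=300):
    """Fit a linear model by gradient descent, keeping pruned weights at zero."""
    w = [wi * mi for wi, mi in zip(init, mask)]
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [(wi - lr * g / len(data)) * mi for wi, g, mi in zip(w, grad, mask)]
    return w

def prune_smallest(trained, mask, fraction):
    """Remove the smallest-magnitude `fraction` of the weights still alive."""
    alive = [j for j, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * fraction)
    new_mask = list(mask)
    for j in sorted(alive, key=lambda k: abs(trained[k]))[:n_prune]:
        new_mask[j] = 0
    return new_mask

def iterative_magnitude_pruning(init, data, rounds=5, fraction=0.2):
    """Train, prune, rewind to `init`, repeat; return the mask and final weights."""
    mask = [1] * len(init)
    for _ in range(rounds):
        trained = train(init, mask, data)  # every round restarts from the original init
        mask = prune_smallest(trained, mask, fraction)
    return mask, train(init, mask, data)

# Toy problem (hypothetical): 10 inputs, only the first 3 influence the target.
rng = random.Random(0)
true_w = [3.0, -2.0, 2.5] + [0.0] * 7
xs = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(40)]
data = [(x, sum(t * xi for t, xi in zip(true_w, x))) for x in xs]
init = [rng.uniform(-0.1, 0.1) for _ in range(10)]

mask, ticket = iterative_magnitude_pruning(init, data)
print("surviving weights:", sum(mask), "of", len(mask))
```

Pruning a fixed share of the surviving weights each round, rather than everything at once, is what makes the procedure iterative. With the assumed 20% per round, a fraction of 0.8^n of the weights remains after n rounds, so reaching the 10-20% range takes roughly 8 to 10 rounds (0.8^8 is about 0.17, 0.8^10 about 0.11).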

The implications are significant for understanding why overparameterised networks train well. The explanation the hypothesis offers is that having many parameters gives gradient descent a good chance of finding at least one winning ticket - one sub-network in a good initialisation configuration. Smaller networks contain fewer lottery tickets and thus have a lower probability of containing a good one.
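A toy calculation makes the lottery intuition concrete. It assumes, purely for illustration, that a network contains k candidate sub-networks and that each one is a winning ticket independently with the same small probability p. Neither the symbols nor the independence assumption come from the original work, and real sub-networks share weights, so this is only a rough sketch of the argument:

```latex
P(\text{at least one winning ticket}) = 1 - (1 - p)^{k}
```

Under these assumptions the probability approaches 1 as k grows, which mirrors the claim that a larger network, with more candidate sub-networks, is more likely to contain a good one.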

Practically, the hypothesis informed the development of pruning and sparsification techniques. If large networks contain small capable sub-networks, finding and deploying those sub-networks at inference time offers significant efficiency gains. Research on sparse networks, structured pruning, and model compression draws on insights from the lottery ticket framework.
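To see where the inference-time saving could come from, here is a minimal sketch that stores a pruned weight vector as (index, value) pairs and skips the zeros during a dot product. The helper names `to_sparse` and `sparse_dot` and the example numbers are hypothetical; in practice the gains depend on sparse-aware libraries or hardware, which this toy does not model.

```python
def to_sparse(weights):
    """Keep only (index, value) pairs for the weights that survived pruning."""
    return [(j, w) for j, w in enumerate(weights) if w != 0.0]

def sparse_dot(sparse_weights, x):
    """Dot product that touches only the surviving weights."""
    return sum(w * x[j] for j, w in sparse_weights)

# Hypothetical pruned layer: 3 of 10 weights survive.
dense = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.5, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

sparse = to_sparse(dense)
dense_out = sum(w * xi for w, xi in zip(dense, x))   # 10 multiplications
sparse_out = sparse_dot(sparse, x)                   # 3 multiplications
print(len(sparse), "stored weights; outputs match:", dense_out == sparse_out)
```

Both computations give the same result, but the sparse version stores and multiplies only the surviving weights.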

Analogy

Hiring a hundred people for a project when you only need ten experts. You do not know in advance which ten will turn out to have the right combination of skills and intuitions, so you hire broadly and let performance reveal the best team. The lottery ticket hypothesis says neural networks work similarly: the large random initialisation is the broad hiring; training identifies the winning team.

Real-world example

Researchers applying lottery ticket pruning to BERT-style models found that sparse sub-networks retaining only 10-40% of parameters could match the full model's performance on many downstream tasks, provided the sub-network was identified correctly. This suggested that BERT's large parameter count is useful for training (providing many possible winning tickets) but that not all of it is necessary for inference - the winning ticket alone would suffice.

Why it matters

The lottery ticket hypothesis reframed the intuition behind why large models work. It suggests the value of overparameterisation is not that every parameter matters, but that having many parameters gives gradient descent better odds of finding a capable sub-network. This understanding has driven research into more principled pruning, better initialisation strategies, and the theoretical foundations of why neural network training succeeds as reliably as it does.

Related concepts

Foundation Model

A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.

Knowledge Distillation

A training technique where a small model learns to imitate a larger one - capturing most of the large model's capability at a fraction of its size and cost.

Parameter-Efficient Fine-Tuning (PEFT)

A family of techniques for adapting large language models to specific tasks by updating only a small fraction of their parameters - making fine-tuning accessible without massive compute budgets.
