VI · MLOps & Infrastructure · Advanced

Model Pruning

A model compression technique that removes unnecessary weights from a trained neural network - reducing model size and inference cost by identifying and deleting parameters that contribute minimally to the model's outputs.

Added May 18, 2026 · 3 min read

Large neural networks are notoriously over-parameterised: they have far more parameters than the minimum needed to represent the function they have learned. Pruning exploits this by identifying parameters that are near-zero or contribute minimally to model outputs and removing them, producing a smaller, faster model that ideally retains most of the original model's quality.

The motivation is practical. A model with 70% of its weights removed needs less memory and, when the hardware or runtime can exploit the sparsity, runs faster; it can also be deployed on hardware that could not fit the original. For edge deployment - running models on phones, embedded devices, or IoT hardware - pruning is often essential rather than optional.

Pruning strategies differ in how parameters are selected, what is pruned, and when. Magnitude pruning (the simplest selection criterion) removes weights whose absolute values fall below a threshold - the intuition being that small weights contribute little to model outputs. More sophisticated importance-based pruning estimates each weight's contribution by measuring how much removing it would increase the loss, using gradient information or second-order approximations such as Optimal Brain Damage (OBD), which uses only the diagonal of the Hessian, and Optimal Brain Surgeon (OBS), which uses the full inverse Hessian.
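The difference between the two selection criteria can be made concrete with a minimal NumPy sketch. The helper names (prune_by_score, magnitude_scores, obd_scores) and the curvature values in h are hypothetical, chosen only to illustrate the idea; the OBD score used is the standard diagonal saliency, half the Hessian diagonal times the squared weight, which assumes the model sits at a loss minimum.

```python
import numpy as np

def prune_by_score(weights, scores, sparsity):
    """Zero the `sparsity` fraction of weights with the lowest scores.

    Returns (pruned_weights, keep_mask).
    """
    k = int(round(sparsity * weights.size))   # number of weights to drop
    keep = np.ones(weights.size, dtype=bool)
    if k > 0:
        # Indices of the k lowest-scoring weights (partial sort).
        drop = np.argpartition(scores.ravel(), k - 1)[:k]
        keep[drop] = False
    keep = keep.reshape(weights.shape)
    return weights * keep, keep

def magnitude_scores(weights):
    # Magnitude pruning: importance is simply |w|.
    return np.abs(weights)

def obd_scores(weights, hessian_diag):
    # OBD saliency: estimated loss increase from zeroing w_i,
    # assuming a diagonal Hessian at a loss minimum.
    return 0.5 * hessian_diag * weights ** 2

w = np.array([0.1, 1.0, -0.5, 2.0])
h = np.array([100.0, 0.01, 1.0, 0.001])   # hypothetical curvature values

_, keep_mag = prune_by_score(w, magnitude_scores(w), sparsity=0.5)
_, keep_obd = prune_by_score(w, obd_scores(w, h), sparsity=0.5)

print("magnitude keeps indices:", np.flatnonzero(keep_mag))  # 1 and 3
print("OBD keeps indices:      ", np.flatnonzero(keep_obd))  # 0 and 2
```

In this toy case the two criteria disagree completely: the smallest weight sits in a high-curvature direction, so OBD keeps it while magnitude pruning discards it.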

Structured versus unstructured pruning reflects the hardware implications. Unstructured (weight-level) pruning removes individual weights, creating sparse weight matrices. While this achieves high compression ratios, sparse matrix operations are not well-supported by standard GPU hardware, which is optimised for dense operations. Realising speedups from unstructured pruning requires specialised sparse execution libraries or hardware. Structured pruning removes entire components - neurons, attention heads, layers, or convolutional filters - producing dense (non-sparse) weight matrices that execute efficiently on standard hardware without specialised support. The tradeoff: structured pruning achieves lower compression ratios for the same quality loss.
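The shape consequences are easy to see on a toy two-layer network. The sketch below is illustrative only: the layer sizes, the median threshold, and the choice of L2 row norm as the neuron-importance score are assumptions, not a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(size=(d_hidden, d_in))    # hidden = relu(W1 @ x)
W2 = rng.normal(size=(d_out, d_hidden))   # y = W2 @ hidden

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Unstructured: zero the 50% smallest-magnitude entries of W1.
# The matrix keeps its shape, so a dense kernel does the same work as before.
threshold = np.median(np.abs(W1))
W1_sparse = np.where(np.abs(W1) > threshold, W1, 0.0)

# Structured: drop the 50% of hidden neurons with the smallest L2 row norm.
order = np.argsort(np.linalg.norm(W1, axis=1))
dropped = order[: d_hidden // 2]
kept = np.sort(order[d_hidden // 2:])
W1_small = W1[kept, :]    # remove the neurons' rows in W1 ...
W2_small = W2[:, kept]    # ... and the matching columns in W2

print("unstructured:", W1_sparse.shape, "zeros:", int((W1_sparse == 0).sum()))
print("structured:  ", W1_small.shape, W2_small.shape)

# The smaller dense network computes the same function as the original
# network with the dropped neurons zeroed out.
x = rng.normal(size=d_in)
W1_masked = W1.copy()
W1_masked[dropped, :] = 0.0
assert np.allclose(forward(W1_small, W2_small, x), forward(W1_masked, W2, x))
```

The unstructured result still occupies a 16x8 matrix, whereas the structured result is a genuinely smaller pair of dense matrices that any standard matrix-multiply routine handles without sparse support.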

Pruning timing also varies. Post-training pruning is applied to an already-trained model and is often followed by a fine-tuning phase to recover lost quality. Iterative pruning alternates between pruning steps and fine-tuning, gradually increasing sparsity. Pruning during training (sparse training) enforces sparsity constraints throughout the training process.
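Iterative pruning can be sketched on a toy linear-regression problem, where "fine-tuning" is just a few hundred steps of gradient descent with the pruning mask held fixed. Everything here is an assumption made for illustration: the synthetic data, the three-step sparsity schedule, and the learning rate. Real pipelines apply the same loop per layer or globally across a network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 features matter
y = X @ w_true + 0.01 * rng.normal(size=n)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

def fine_tune(w, mask, steps=200, lr=0.05):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / n
        w = (w - lr * grad) * mask
    return w

# Start from a dense "pre-trained" model.
mask = np.ones(d)
w = fine_tune(rng.normal(size=d), mask)

for target_sparsity in (0.25, 0.5, 0.75):   # gradually increase sparsity
    k = int(target_sparsity * d)            # total weights pruned so far
    # Already-pruned weights have magnitude 0, so they sort first
    # and remain pruned in later rounds.
    drop = np.argsort(np.abs(w))[:k]
    mask = np.ones(d)
    mask[drop] = 0.0
    w = fine_tune(w * mask, mask)           # recover quality before the next round
    print(f"sparsity={target_sparsity:.2f}  mse={mse(w):.5f}")
```

Because only five weights carry signal in this synthetic problem, the error should stay close to the noise floor even at 75% sparsity; on real models each round typically costs some quality that fine-tuning only partly recovers.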

The Lottery Ticket Hypothesis (Frankle & Carbin, 2018) proposed that large networks contain small subnetworks ("winning tickets") that, when trained in isolation from their original initialisation, can match the full network's performance. While this is theoretically compelling, finding these tickets reliably at scale has proven difficult.

For large language models, attention head pruning removes entire attention heads that contribute minimally to outputs, and layer dropping removes full transformer layers. Both have been reported to reduce model size by roughly 20-40% with small quality losses, typically when the pruned model is then retrained in a distillation pipeline.
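Mechanically, removing an attention head means slicing that head's columns out of the query, key, and value projections and its rows out of the output projection. The NumPy sketch below shows this for a single attention layer without biases; the function names, dimensions, and the choice of which heads to keep are hypothetical, and in practice the kept heads would come from an importance score measured on validation data.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq/Wk/Wv: (d_model, n_heads*d_head); Wo: (n_heads*d_head, d_model)."""
    seq = x.shape[0]
    d_head = Wq.shape[1] // n_heads

    def split(t):
        # (seq, n_heads*d_head) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    ctx = (attn @ v).transpose(1, 0, 2).reshape(seq, n_heads * d_head)
    return ctx @ Wo

def prune_heads(Wq, Wk, Wv, Wo, n_heads, keep_heads):
    """Physically remove heads: slice columns of Q/K/V and rows of the output projection."""
    d_head = Wq.shape[1] // n_heads
    cols = np.concatenate(
        [np.arange(h * d_head, (h + 1) * d_head) for h in keep_heads]
    )
    return Wq[:, cols], Wk[:, cols], Wv[:, cols], Wo[cols, :]

rng = np.random.default_rng(0)
d_model, n_heads, d_head, seq = 16, 4, 4, 5
Wq, Wk, Wv = (rng.normal(size=(d_model, n_heads * d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))
x = rng.normal(size=(seq, d_model))

keep_heads = [0, 2]   # hypothetical: assume heads 1 and 3 scored lowest
Wq_p, Wk_p, Wv_p, Wo_p = prune_heads(Wq, Wk, Wv, Wo, n_heads, keep_heads)
out_pruned = multi_head_attention(x, Wq_p, Wk_p, Wv_p, Wo_p, n_heads=len(keep_heads))

params_before = sum(m.size for m in (Wq, Wk, Wv, Wo))
params_after = sum(m.size for m in (Wq_p, Wk_p, Wv_p, Wo_p))
print(params_before, "->", params_after)   # 1024 -> 512
```

The pruned layer is an ordinary dense attention layer with two heads, so it needs no sparse kernels. Its output matches what the original layer would produce if the contributions of heads 1 and 3 were zeroed, which is the approximation that fine-tuning or distillation then has to compensate for.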

Analogy

Pruning a rose bush. A mature rose bush has many branches, but experienced gardeners know that removing crossing branches, dead wood, and weaker shoots - keeping only the most productive canes - produces a plant that actually flowers more vigorously. The pruned plant is smaller but better. Model pruning does the same: removing redundant and low-contributing parameters produces a leaner model that fits better on deployment hardware while preserving most of the capability that matters.

Real-world example

Consider an illustrative scenario. A team trains a BERT-large model (340M parameters) for text classification. Using structured attention head pruning, they identify and remove 70% of attention heads by measuring each head's contribution to validation performance. Because the attention projections account for only about 100M of BERT-large's parameters, the resulting model has roughly 270M parameters. After a fine-tuning phase on task-specific data, it achieves 97% of the original model's F1 score while running 2.3x faster at inference. The smaller model is then deployed on a CPU-only inference server, eliminating the need for GPU instances.

Why it matters

Pruning is one of the primary techniques for making large models deployable under hardware and cost constraints. It complements quantisation (reducing numerical precision) and knowledge distillation (training a smaller model to imitate a larger one) to form the toolkit of model compression. Understanding pruning - its tradeoffs between compression ratio, quality loss, and hardware compatibility - is essential for anyone deploying ML models in resource-constrained environments.

Related concepts

Knowledge Distillation

A training technique where a small model learns to imitate a larger one - capturing most of the large model's capability at a fraction of its size and cost.

Mixed Precision Training

A training technique that uses lower-precision numerical formats (FP16 or BF16) for most computations while maintaining higher-precision (FP32) master copies of weights - cutting memory usage and accelerating training without sacrificing model quality.