IX · Specialized Domains · Advanced

Active Learning

A machine learning paradigm where the model identifies which unlabelled examples would be most valuable to label next - reducing the amount of human annotation needed by focusing labelling effort on the most informative data points.

Added May 18, 2026 · 4 min read

Labelling data for supervised learning is expensive: hiring domain experts to annotate medical images, transcribe audio, or label sentiment requires time, skill, and money. Random sampling of data to label is wasteful - many labelled examples provide redundant information while others would greatly improve model performance. Active learning addresses this by letting the model guide the labelling process, querying the human annotator (the oracle) for labels on the examples the model is most uncertain about or would learn the most from.
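The query, label, retrain cycle described above can be sketched as a pool-based loop. This is a minimal illustration under stated assumptions, not a reference implementation: the nearest-centroid model, the least-confidence score, and every name here (`fit_centroids`, `predict_proba`, `active_learning_loop`, `oracle`) are hypothetical choices made for the example, and the "oracle" is simulated by looking up a known label.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy model (assumption): one centroid per class seen so far."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_proba(model, X):
    """Softmax over negative distances to each class centroid."""
    _, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def active_learning_loop(X_pool, oracle, n_seed=10, n_rounds=5, batch_size=10, seed=0):
    """Pool-based loop: fit on labelled data, score the pool, query the oracle."""
    rng = np.random.default_rng(seed)
    # Start from a small random seed set, since no model exists yet.
    seed_idx = rng.choice(len(X_pool), size=n_seed, replace=False)
    labels = {int(i): oracle(int(i)) for i in seed_idx}
    for _ in range(n_rounds):
        labelled = sorted(labels)
        y = np.array([labels[i] for i in labelled])
        model = fit_centroids(X_pool[labelled], y)
        unlabelled = np.array([i for i in range(len(X_pool)) if i not in labels])
        if unlabelled.size == 0:
            break
        proba = predict_proba(model, X_pool[unlabelled])
        # Least-confidence score: higher means the model is less sure.
        uncertainty = 1.0 - proba.max(axis=1)
        query = unlabelled[np.argsort(-uncertainty)[:batch_size]]
        for i in query:
            labels[int(i)] = oracle(int(i))  # ask the annotator for a label
    return labels

# Simulated pool: two Gaussian blobs, with the true labels standing in for the oracle.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)
labels = active_learning_loop(X, oracle=lambda i: int(y_true[i]),
                              n_seed=10, n_rounds=3, batch_size=5)
print(len(labels), "examples labelled out of", len(X))
```

In a real pipeline the `oracle` call would be a request to a human annotation queue, and the toy model would be replaced by whatever classifier is being trained.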

Uncertainty sampling is the most intuitive active learning strategy: select the unlabelled examples where the model is least sure of the correct label. For classifiers, uncertainty can be measured by the entropy of the predicted class distribution (high entropy = uncertain), by least confidence (one minus the highest predicted class probability), or by the margin (the gap between the top two predicted class probabilities, where a small gap means the example lies near the decision boundary). If the model is highly confident about an example, labelling it provides little new information; if it is uncertain, labelling it resolves an important ambiguity.
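A short sketch of three common uncertainty scores computed from predicted class probabilities may make the differences concrete. The function names and the example probabilities are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def least_confidence(proba):
    """One minus the top class probability; higher = more uncertain."""
    return 1.0 - proba.max(axis=1)

def margin(proba):
    """Gap between the two most probable classes; smaller = more uncertain."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(proba, eps=1e-12):
    """Shannon entropy of the predicted distribution; higher = more uncertain."""
    return -(proba * np.log(proba + eps)).sum(axis=1)

# Hypothetical predictions for three unlabelled examples over three classes.
proba = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # spread across all classes
    [0.50, 0.49, 0.01],   # torn between two classes
])

pick_lc = int(np.argmax(least_confidence(proba)))
pick_margin = int(np.argmin(margin(proba)))
pick_entropy = int(np.argmax(entropy(proba)))
print(pick_lc, pick_margin, pick_entropy)
```

On these made-up inputs the scores do not all agree: least confidence and entropy favour the example whose probability is spread over all three classes, while the margin favours the example torn between two classes. Which behaviour is preferable depends on the task.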

Query by Committee (QBC) maintains an ensemble of models trained on different subsets of the labelled data or with different hyperparameters. Examples where the committee members disagree most are selected for labelling - the disagreement signals that the models lack the information to make a consistent decision. Bayesian active learning estimates the expected information gain from labelling each candidate example and selects the query with the highest expected gain, grounding the selection in information-theoretic terms.
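Both ideas can be sketched with a few lines of NumPy. The first function measures committee disagreement as vote entropy over hard predictions; the second is one common Monte Carlo estimate of expected information gain (entropy of the averaged prediction minus the average entropy of each member), which treats the committee members as samples from a posterior over models. Both function names and the input arrays are assumptions made for illustration.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: (n_members, n_examples) array of predicted class indices."""
    n_members = votes.shape[0]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    frac = counts / n_members
    # Treat 0 * log(0) as 0 so unanimous votes score exactly zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(frac > 0, frac * np.log(frac), 0.0)
    return -terms.sum(axis=1)

def expected_information_gain(member_proba, eps=1e-12):
    """member_proba: (n_members, n_examples, n_classes) predicted probabilities."""
    mean_proba = member_proba.mean(axis=0)
    entropy_of_mean = -(mean_proba * np.log(mean_proba + eps)).sum(axis=1)
    mean_of_entropy = -(member_proba * np.log(member_proba + eps)).sum(axis=2).mean(axis=0)
    return entropy_of_mean - mean_of_entropy

# Three committee members voting on three examples (hypothetical values).
votes = np.array([[0, 0, 1],
                  [0, 1, 2],
                  [0, 1, 0]])
disagreement = vote_entropy(votes, n_classes=3)

# Two members: they agree on a 50/50 guess for example 0,
# and confidently contradict each other on example 1.
member_proba = np.array([[[0.5, 0.5], [1.0, 0.0]],
                         [[0.5, 0.5], [0.0, 1.0]]])
gain = expected_information_gain(member_proba)
print(disagreement, gain)
```

The second toy input highlights a difference from plain uncertainty sampling: when every member agrees that an example is a coin flip, the estimated gain is near zero, whereas confident contradiction between members yields a high score.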

Diversity-based sampling addresses a weakness of uncertainty sampling: the most uncertain examples can be clustered together, so a batch may contain many similar examples that become redundant once any one of them is labelled. Coreset approaches select a subset of unlabelled examples that covers the unlabelled data distribution as broadly as possible, so that every remaining point lies close to a selected one. BADGE (Batch Active learning by Diverse Gradient Embeddings) combines uncertainty (the magnitude of gradient embeddings) and diversity (k-means++ seeding over those embeddings, which favours points that are far apart) in a single batch selection strategy.
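A greedy k-center selection, a heuristic often used to approximate coreset selection, gives a feel for how diversity-based batches are built. This sketch is not BADGE and not any specific paper's implementation; the function name `k_center_greedy`, the use of Euclidean distance, and the toy embeddings are assumptions. It also assumes `budget` does not exceed the number of unlabelled points.

```python
import numpy as np

def k_center_greedy(embeddings, labelled_idx, budget):
    """Repeatedly pick the point farthest from everything chosen so far."""
    # Distance from each point to its nearest labelled (or already selected) point.
    min_dist = np.full(len(embeddings), np.inf)
    for i in labelled_idx:
        d = np.linalg.norm(embeddings - embeddings[i], axis=1)
        min_dist = np.minimum(min_dist, d)

    selected = []
    for _ in range(budget):
        j = int(np.argmax(min_dist))          # least-covered point
        selected.append(j)
        d = np.linalg.norm(embeddings - embeddings[j], axis=1)
        min_dist = np.minimum(min_dist, d)    # update coverage
    return selected

# Hypothetical 1-D embeddings: a tight cluster near 0, a pair near 5, one point at 10.
embeddings = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [10.0]])
batch = k_center_greedy(embeddings, labelled_idx=[0], budget=2)
print(batch)
```

With index 0 already labelled, the selection skips the other points in the same tight cluster and instead reaches for the distant regions of the embedding space, which is the redundancy-avoiding behaviour described above.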

Expected Model Change and Expected Error Reduction are more computationally intensive strategies that estimate, for each candidate example, how much labelling it would change the model or reduce overall error. These are the most theoretically motivated but also the most expensive to compute at scale.
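One tractable instance of Expected Model Change is expected gradient length: score each candidate by the size of the gradient update it would cause, averaged over the labels the current model thinks are possible. The sketch below assumes a binary logistic regression model without a bias term and a log-loss objective; the function name and the example weights are hypothetical. Expected Error Reduction is not shown, since it generally requires retraining the model for every candidate and label.

```python
import numpy as np

def expected_gradient_length(X, w):
    """Expected log-loss gradient norm per candidate, for binary logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ w))       # P(y = 1 | x) under the current model
    x_norm = np.linalg.norm(X, axis=1)
    # For label y the gradient w.r.t. w is (p - y) * x, so its norm is |p - y| * ||x||.
    norm_if_positive = (1.0 - p) * x_norm  # y = 1
    norm_if_negative = p * x_norm          # y = 0
    # Weight each outcome by the model's own belief in that label.
    return p * norm_if_positive + (1.0 - p) * norm_if_negative

w = np.array([1.0, 0.0])                   # hypothetical current weights
X = np.array([[0.0, 1.0],                  # uncertain, small norm
              [4.0, 0.0],                  # confident, large norm
              [0.0, 3.0]])                 # uncertain, large norm
scores = expected_gradient_length(X, w)
best = int(np.argmax(scores))
print(scores, best)
```

For this model the score simplifies to 2p(1-p)‖x‖, so it rewards examples that are both uncertain and large in feature norm. The general case has no such closed form, which is where the computational cost mentioned above comes from.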

Active learning is applied in medical image annotation (prioritising ambiguous radiology cases for expert review), NLP annotation (sampling uncertain texts for human labelling), scientific discovery (selecting experiments to run based on model uncertainty), and content moderation (routing borderline cases to human reviewers).

Limitations: active learning assumes that the model's uncertainty estimate is well-calibrated and that the queried examples are labelled by an infallible oracle. In practice, deep neural networks are often overconfident (poorly calibrated), and human annotators make mistakes or disagree. Label noise in actively selected examples can be particularly damaging: the examples the model is uncertain about are often the ambiguous ones that annotators also find hardest, and each queried label carries a lot of weight in a small labelled set.

Analogy

Consider a student studying for an exam who, rather than rereading material they already know well, focuses on the specific concepts they are unsure about. They might quiz themselves, identify which questions they cannot confidently answer, and then study only those topics. This targeted study is more efficient than rereading the entire textbook because it concentrates effort where the student stands to gain the most. Active learning applies the same strategy to machine learning: concentrate labelling effort on the examples the model is most uncertain about, maximising the information gained per label.

Real-world example

A medical imaging team builds a cancer detection model for rare tumour types. They have 100,000 unlabelled scans and can afford to have a radiologist label 5,000 of them. Random selection would label 5,000 typical (mostly non-tumour) scans - providing little information about the rare positive cases. Active learning instead selects scans based on model uncertainty: in the first round, a model trained on a small initial labelled set flags scans where its predicted cancer probability is near 50% (maximally uncertain). After 500 of these are labelled, the model is retrained. It becomes more confident on most scans but remains uncertain on a different set, which are then queried. In this illustrative scenario, after 5,000 labels the actively trained model matches or outperforms one trained on 50,000 randomly selected labels - reaching the same quality with a tenth of the annotation.

Why it matters

Active learning directly addresses the annotation bottleneck that limits ML deployment in many specialised domains. Where data is abundant but labels are scarce and expensive (medical imaging, legal documents, scientific datasets, moderation queues), active learning can substantially reduce annotation cost for equivalent model performance, in favourable cases by as much as an order of magnitude. Understanding it is important for designing efficient ML annotation pipelines, especially in domains where expert annotation is the primary limiting resource.

Related concepts

Continuous Training

The automated process of regularly retraining ML models on fresh data as part of a production ML system - ensuring that models stay current as the world changes rather than degrading on stale distributions.

Data Pipeline

The automated sequence of steps that moves raw data from its sources through transformation, validation, and loading into the storage systems that ML training and inference depend on - the plumbing that makes ML systems run.

Synthetic Data

Artificially generated data that mimics the statistical properties of real data - used to augment scarce training sets, preserve privacy, simulate rare events, and train AI systems when real-world data collection is impossible or prohibited.