latentbrief

Concept

Continuous Training

The automated process of regularly retraining ML models on fresh data as part of a production ML system - ensuring that models stay current as the world changes rather than degrading on stale distributions.

Added May 18, 2026

A model trained once is a model trained on history. The world changes: user behaviour shifts, language evolves, new products launch, fraud patterns adapt, market conditions move. The accuracy of a model trained six months ago will gradually degrade as the distribution it was trained on diverges from the distribution it now serves. Continuous training is the practice of automatically retraining models on a schedule or in response to detected degradation, keeping them current without manual intervention.

Continuous training (CT) is distinct from continuous integration/continuous deployment (CI/CD) in software, though they share philosophical roots. In CT, what is being continuously built and deployed is not code but trained model artifacts. A CT system manages the full training cycle: gathering fresh data, triggering training jobs, evaluating results against quality gates, registering the new model, and (if it passes evaluation) deploying it to replace the current production model.
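The cycle described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Registry`, `ModelVersion`, `run_ct_cycle`, and the callables passed in are hypothetical names, and the gate used here (meet a minimum score and beat the current production score) is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ModelVersion:
    version: int
    score: float  # evaluation metric, higher is better (assumption)


@dataclass
class Registry:
    """Hypothetical in-memory model registry."""
    versions: List[ModelVersion] = field(default_factory=list)
    production: Optional[ModelVersion] = None

    def register(self, score: float) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, score=score)
        self.versions.append(mv)
        return mv


def run_ct_cycle(
    registry: Registry,
    gather_data: Callable[[], Any],
    train: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    min_score: float,
) -> ModelVersion:
    """One pass: gather data, train, evaluate, register, deploy if gated in."""
    data = gather_data()
    model = train(data)
    score = evaluate(model)
    candidate = registry.register(score)  # every run is registered, pass or fail

    beats_production = (
        registry.production is None or score > registry.production.score
    )
    if score >= min_score and beats_production:
        registry.production = candidate  # "deploy" by swapping the pointer
    return candidate
```

Registering every candidate, including those that fail the gate, keeps a record of what was trained even when nothing is deployed.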

Triggers for retraining vary. Schedule-based retraining runs on a fixed cadence - daily, weekly, monthly - appropriate for use cases where the data distribution changes gradually and predictably. Event-based retraining triggers on specific conditions: when a data drift monitor detects significant distribution shift, when model performance drops below a threshold on live traffic, or when a sufficiently large batch of labelled examples accumulates from human feedback. Some systems use online learning, continuously updating model parameters from streaming data without distinct training runs.
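As a rough sketch, the schedule-based and event-based triggers above can be combined into a single check that reports why a retrain is due. The function name, the `TriggerConfig` fields, and every threshold value are illustrative assumptions; the drift score and live metric are assumed to come from monitoring that already exists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TriggerConfig:
    # All defaults are placeholder values, not recommendations.
    max_age: timedelta = timedelta(days=7)  # schedule-based cadence
    drift_threshold: float = 0.2            # drift monitor score
    min_live_metric: float = 0.80           # performance floor on live traffic
    min_new_labels: int = 10_000            # labelled examples accumulated


def retrain_reasons(
    now: datetime,
    last_trained: datetime,
    drift_score: float,
    live_metric: float,
    new_labels: int,
    cfg: Optional[TriggerConfig] = None,
) -> List[str]:
    """Return every trigger that fired; an empty list means no retrain."""
    cfg = cfg or TriggerConfig()
    reasons = []
    if now - last_trained >= cfg.max_age:
        reasons.append("schedule")
    if drift_score > cfg.drift_threshold:
        reasons.append("drift")
    if live_metric < cfg.min_live_metric:
        reasons.append("performance")
    if new_labels >= cfg.min_new_labels:
        reasons.append("new_labels")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the pipeline record why each retrain was started.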

The continuous training pipeline must handle several concerns that ad hoc training does not. Data freshness and quality gates: ensuring the new training data meets quality standards before training. Evaluation against holdout sets: the new model must outperform the current production model on a representative evaluation set before deployment. Rollback readiness: if a newly deployed model degrades, the system should automatically revert to the previous version. Lineage preservation: every automatic retrain should be logged with the same metadata as a manually initiated training run.
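Two of these concerns, the data quality gate and rollback readiness, are easy to illustrate. The sketch below assumes training rows arrive as dictionaries and that a single live metric with a fixed floor decides whether a deployment has degraded; `data_gate_failures`, `Deployment`, and the thresholds are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


def data_gate_failures(
    rows: Sequence[Dict[str, object]],
    required: Sequence[str],
    max_missing_rate: float = 0.01,
    min_rows: int = 1000,
) -> List[str]:
    """Return human-readable gate failures; empty means the data may be used."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"too few rows: {len(rows)} < {min_rows}")
    for col in required:
        missing = sum(1 for r in rows if r.get(col) is None)
        rate = missing / len(rows) if rows else 1.0
        if rate > max_missing_rate:
            failures.append(f"column {col!r} missing rate {rate:.2%}")
    return failures


@dataclass
class Deployment:
    """Keeps deployed version ids, newest last, so a revert is always possible."""
    history: List[str] = field(default_factory=list)

    def deploy(self, version: str) -> None:
        self.history.append(version)

    @property
    def current(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def rollback_if_degraded(self, live_metric: float, floor: float) -> bool:
        """Revert to the previous version when the live metric falls below floor."""
        if live_metric < floor and len(self.history) > 1:
            self.history.pop()
            return True
        return False
```

The rollback only works because the previous version is retained; a system that overwrites the production artifact in place has nothing to revert to.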

CT introduces new failure modes. Error propagation from contaminated data: if a data pipeline bug corrupts the training data, for example by leaking the label into the features, continuous training will carry the error into every subsequent model until the bug is found. Feedback loops: models that influence the data they will later be trained on can enter degenerate cycles - a recommendation model that trains on its own recommendations may collapse into narrow filter bubbles.

The maturity model for CT goes from no automation (models retrained manually) to full automation (models retrain, evaluate, and deploy without human involvement). Most production systems land somewhere in the middle: automated retraining with human approval gates before deployment.

Analogy

A news publication's editorial process. If the publication froze its reporting at a specific date and never updated, its coverage of current events would become increasingly irrelevant. Instead, journalists continuously gather new information, write updated articles, and publish them through an editorial review process. Continuous training does the same for ML models: new data flows in, models are retrained and reviewed, and updated versions replace stale ones - keeping the model's knowledge current.

Real-world example

A search ranking model is retrained weekly on the previous 90 days of search sessions. Each Sunday night, an automated pipeline collects fresh training data from the feature store, launches a training job on GPU clusters, evaluates the new model against a fixed holdout set, and compares its ranking quality metrics to those of the current production model. If the new model is at least 1% better on NDCG@10, the pipeline promotes it to production with an automatic traffic ramp from 1% to 100% over 24 hours. If any quality gate fails, the pipeline pages the on-call engineer instead of deploying.
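The promotion gate in this example might look like the sketch below. Several details are assumptions rather than part of the example: NDCG is computed with linear gain and a log2 discount, "1% better" is read as a relative improvement, and the ramp uses evenly spaced steps that grow geometrically. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def should_promote(
    candidate_ndcg: float,
    production_ndcg: float,
    min_relative_gain: float = 0.01,  # "1% better", read as relative (assumption)
) -> bool:
    return candidate_ndcg >= production_ndcg * (1 + min_relative_gain)


def ramp_schedule(
    total_hours: float = 24.0,
    steps: int = 6,
    start_pct: float = 1.0,
    end_pct: float = 100.0,
) -> List[Tuple[float, float]]:
    """(hour offset, traffic %) pairs, growing geometrically from start to end."""
    ratio = (end_pct / start_pct) ** (1 / (steps - 1))
    return [
        (total_hours * i / (steps - 1), round(start_pct * ratio ** i, 2))
        for i in range(steps)
    ]
```

In practice the per-query NDCG@10 values would be averaged over the holdout set before being passed to `should_promote`.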

Why it matters

Continuous training is the difference between an ML system that improves over time and one that degrades. Without it, production models drift as the world changes and require manual monitoring and intervention to maintain performance. With it, ML systems behave more like software services - automatically updated, consistently monitored, and resilient to distribution shift. Understanding CT is essential for anyone operating ML systems beyond research prototypes.
