VI · MLOps & Infrastructure · Advanced

Model Monitoring

The continuous measurement of a deployed ML model's behaviour, input distributions, and output quality in production - the operational layer that detects when models are degrading before business impact becomes severe.

Added May 18, 2026 · 3 min read

Deploying a model is not the end of the ML engineering work - it is the beginning of an ongoing operational responsibility. Model monitoring is the practice of continuously measuring what a deployed model is doing: what inputs it is receiving, what outputs it is producing, and whether those outputs are any good. Without monitoring, model degradation is detected late, often only after significant business impact has accumulated.

Model monitoring operates across several dimensions. Technical monitoring tracks serving infrastructure health: latency percentiles, error rates, throughput, memory usage, and GPU utilisation. These are the same metrics any production service requires and are typically handled by standard infrastructure monitoring tools.
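As a rough illustration of the serving-layer metrics listed above, the following sketch summarises one window of request logs. The function names and the input shape (a list of latencies in milliseconds and a list of HTTP status codes) are assumptions made for this example; a production system would normally read these numbers from its infrastructure monitoring tool rather than compute them by hand.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def serving_health(latencies_ms, status_codes):
    """Summarise one monitoring window of request logs (hypothetical input shape)."""
    server_errors = sum(1 for code in status_codes if code >= 500)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": server_errors / len(status_codes),
    }
```

Tail percentiles such as p99 are reported alongside the median because a small share of very slow requests can be invisible in an average.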

Statistical monitoring tracks the data flowing through the model. Feature distribution monitoring checks whether the statistical properties of input features match the training data baseline - detecting data drift before it degrades performance. Prediction distribution monitoring watches the model's output distribution: if a model that used to predict approximately 5% positive class rate begins predicting 20%, something has changed. Schema monitoring catches structural changes in incoming data - a missing field, a changed encoding, a new category value the model was not trained on.
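One common way to quantify feature drift is the Population Stability Index (PSI), which compares binned fractions of a baseline sample and a production sample. The sketch below is a minimal version under stated assumptions: equal-width bins derived from the baseline, ten bins by default, and a small `eps` floor to avoid taking the log of zero. These choices are illustrative, not a standard any particular platform mandates.

```python
import math

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    Bin edges come from the baseline's range; current values outside that
    range are counted in the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

A value near zero means the two samples fill the bins in similar proportions. The same function can be applied to prediction scores to watch the output distribution, for example when a positive rate moves from around 5% to 20%.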

Performance monitoring is the most direct measurement but requires ground truth labels. When labels are available quickly (clicks are observed within minutes, fraud labels often within 24-48 hours), actual model accuracy, precision, recall, and F1 can be computed against production predictions. When labels arrive with significant delay or not at all (loan default, long-horizon churn), proxy metrics and upstream indicators must substitute.
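A minimal sketch of how delayed labels might be joined back to logged predictions follows. It assumes binary predictions and labels stored as dictionaries keyed by a request ID, which is a simplification chosen for the example. The `coverage` field is an addition of this sketch: it reports what share of predictions have a label so far, which helps when reading metrics computed on a partial set.

```python
def classification_metrics(predictions, labels):
    """Score only the predictions whose ground truth has arrived.

    predictions, labels: dicts mapping request_id -> 0 or 1 (assumed format).
    """
    tp = fp = fn = tn = 0
    for request_id, predicted in predictions.items():
        if request_id not in labels:
            continue  # label has not arrived yet
        actual = labels[request_id]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    labelled = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / labelled if labelled else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "coverage": labelled / len(predictions) if predictions else 0.0,
    }
```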

Business metric monitoring connects model performance to outcomes: does the recommendation model's click-through rate match expectations? Is the pricing model's conversion rate within expected bounds? These downstream metrics provide the ultimate signal of model value but are often noisier and more confounded than direct model quality metrics.
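Because business metrics are noisy, a bounds check usually needs to account for sample size. The sketch below shows one simple option: flag a rate only when it falls outside a z-score band around the expected rate, using the binomial standard error. The function name, the default of three standard errors, and the normal approximation are all assumptions of this example, and the check does nothing to address confounding.

```python
import math

def rate_out_of_bounds(successes, trials, expected_rate, z=3.0):
    """Return True if the observed rate is more than z standard errors from expected_rate.

    Uses the normal approximation to the binomial, so it is only
    reasonable when trials is large enough.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    standard_error = math.sqrt(expected_rate * (1 - expected_rate) / trials)
    observed = successes / trials
    return abs(observed - expected_rate) > z * standard_error
```

With an expected click-through rate of 5%, 400 clicks in 10,000 impressions falls outside the band, while 3 clicks in 100 impressions does not, since the band is much wider at that sample size.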

Alerting thresholds and human escalation paths complete the monitoring system. Raw metrics are only useful if they trigger actions: automated retraining when drift exceeds a threshold, human review when an anomaly does not match known patterns, rollback when a new deployment underperforms the previous version.
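The mapping from signals to actions described above could be sketched as a small routing function. Everything here is hypothetical: the `Signals` fields, the action names, and the default PSI threshold are placeholders for whatever a given team actually measures and automates.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """One evaluation window's monitoring signals (hypothetical fields)."""
    max_feature_psi: float          # largest drift score across input features
    unexplained_anomaly: bool       # anomaly that matches no known pattern
    new_version_metric: float       # quality metric of the newly deployed version
    previous_version_metric: float  # same metric for the version it replaced

def choose_actions(signals, psi_threshold=0.2):
    """Translate monitoring signals into the actions they should trigger."""
    actions = []
    if signals.new_version_metric < signals.previous_version_metric:
        actions.append("rollback")       # new deployment underperforms the previous one
    if signals.max_feature_psi > psi_threshold:
        actions.append("retrain")        # drift exceeds the threshold
    if signals.unexplained_anomaly:
        actions.append("human_review")   # no automated response is known to be safe
    return actions
```

Keeping the rules in one function makes the escalation policy easy to review and test separately from the metrics that feed it.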

ML observability platforms like Evidently, WhyLabs, Arize, Fiddler, and Aporia provide specialised tooling for these ML-specific monitoring needs, typically integrating with standard infrastructure monitoring (Prometheus, Grafana, Datadog) for the serving layer.

Analogy

The instrument cluster in an aircraft cockpit, combined with ground control monitoring. The engine gauges (technical monitoring) report engine health and system status. The flight instruments (statistical monitoring) track whether altitude, speed, and heading match the flight plan. Air traffic control (business metric monitoring) watches whether the aircraft is on course to arrive at the right destination at the right time. A problem in any layer triggers escalation immediately, rather than waiting until the plane lands to evaluate whether the flight went well.

Real-world example

A loan approval model is deployed with a monitoring stack that tracks: input feature distributions against training baselines (alert if the Population Stability Index, PSI, of any feature exceeds 0.2), prediction score distribution (alert if the mean prediction score changes by more than 10%), 30-day loan performance metrics (alert if the early default rate exceeds the historical baseline), and serving latency (alert if p99 latency exceeds 500 ms). A data pipeline bug that introduces null values in a key feature field triggers the schema monitor within 15 minutes, long before the degraded predictions affect enough loan decisions to show up in business metrics.
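A schema monitor of the kind that catches the null value in this example can be very simple. The sketch below checks each incoming record for missing fields, unexpected nulls, wrong types, and category values not seen in training. The field names and allowed categories in `LOAN_SCHEMA` are invented for illustration and are not taken from any real loan model.

```python
# Hypothetical schema for illustration only.
LOAN_SCHEMA = {
    "income": {"type": float, "nullable": False},
    "employment_type": {
        "type": str,
        "nullable": False,
        "allowed": {"salaried", "self_employed", "retired"},
    },
}

def schema_violations(record, schema):
    """Return a list of human-readable problems found in one input record."""
    problems = []
    for name, rule in schema.items():
        if name not in record:
            problems.append(f"{name}: missing field")
            continue
        value = record[name]
        if value is None:
            if not rule.get("nullable", False):
                problems.append(f"{name}: unexpected null")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        allowed = rule.get("allowed")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: unseen category {value!r}")
    return problems
```

A check like this runs per record and needs no labels or baseline statistics, which is why it can fire within minutes of a pipeline bug.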

Why it matters

ML systems fail silently. Unlike a crashed web server that triggers immediate alerts, a degraded model continues producing outputs - just wrong ones. Without monitoring, model quality can deteriorate for months before business impact surfaces in a board report. Model monitoring is what makes ML operations responsible: it provides the observability infrastructure that turns model deployment from a one-time event into an ongoing, managed system.

Related concepts

Continuous Training

The automated process of regularly retraining ML models on fresh data as part of a production ML system - ensuring that models stay current as the world changes rather than degrading on stale distributions.

Data Drift

The gradual or sudden shift in the statistical properties of data that a deployed ML model receives compared to the data it was trained on - the most common cause of silent model degradation in production.

Model Serving

The infrastructure layer that takes a trained ML model and makes it available to receive requests, run predictions, and return results at production scale - the bridge between a trained artifact and a live application.
