VI · MLOps & Infrastructure · Advanced

Data Pipeline

The automated sequence of steps that moves raw data from its sources through transformation, validation, and loading into the storage systems that ML training and inference depend on - the plumbing that makes ML systems run.

Added May 18, 2026 · 3 min read

Every ML model depends on data, and that data rarely arrives in the form the model needs. Raw data lives in operational databases, event streams, file systems, and external APIs. It must be collected, cleaned, transformed, validated, and stored before it is useful for training or inference. A data pipeline is the automated workflow that handles this end-to-end process.

ML data pipelines typically involve four stages. Ingestion: collecting data from source systems through database replication, event stream consumers, API polling, or file watchers. Validation: checking that arriving data conforms to expected schemas and statistical properties, and flagging missing values, unexpected distributions, and schema changes. Transformation: applying cleaning, normalisation, encoding, and feature engineering logic to convert raw records into model-ready feature vectors. Storage: loading the transformed data into the target store - a training dataset in object storage, feature values in a feature store, or embedding vectors in a vector database.
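The four stages can be sketched as plain functions chained together. This is a minimal illustration, not a production design: the record layout, the `REQUIRED_FIELDS` and `COUNTRY_CODES` tables, and the in-memory list standing in for a feature store are all assumptions made for the example.

```python
# Hypothetical raw records, standing in for rows read from a source system.
RAW_EVENTS = [
    {"user_id": "u1", "amount": "12.50", "country": "DE"},
    {"user_id": "u2", "amount": None, "country": "FR"},   # missing value
    {"user_id": "u3", "amount": "7.00"},                  # missing column
    {"user_id": "u4", "amount": "30.00", "country": "DE"},
]

REQUIRED_FIELDS = ("user_id", "amount", "country")
COUNTRY_CODES = {"DE": 0, "FR": 1}  # toy categorical encoding (assumption)


def ingest():
    """Ingestion: yield raw records (a real pipeline would read a DB, stream, or API)."""
    yield from RAW_EVENTS


def validate(records):
    """Validation: split records into valid ones and rejects with a reason."""
    valid, rejected = [], []
    for rec in records:
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
        if missing:
            rejected.append((rec, f"missing fields: {missing}"))
        else:
            valid.append(rec)
    return valid, rejected


def transform(records):
    """Transformation: turn validated records into numeric feature vectors."""
    return [
        {
            "user_id": r["user_id"],
            "features": [float(r["amount"]), float(COUNTRY_CODES[r["country"]])],
        }
        for r in records
    ]


def store(rows, sink):
    """Storage: append to an in-memory list standing in for a feature store."""
    sink.extend(rows)
    return len(rows)


feature_store = []
valid, rejected = validate(ingest())
written = store(transform(valid), feature_store)
print(f"wrote {written} rows, rejected {len(rejected)}")  # wrote 2 rows, rejected 2
```

Rejected records are kept alongside a reason rather than dropped silently, so a sudden rise in rejects can be monitored.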

Pipeline orchestration tools like Apache Airflow, Prefect, Dagster, and Metaflow manage the scheduling, dependency resolution, and monitoring of these multi-step workflows. They model a pipeline as a DAG (directed acyclic graph) that defines which steps depend on which others, and they add retry logic for failed steps and observability into pipeline health.
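To make dependency resolution and retries concrete, here is a small sketch using only Python's standard-library `graphlib`. It does not use the API of any orchestrator named above. The step names, the simulated transient failure, and the `run_with_retries` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its value set (hypothetical step names).
DAG = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "store": {"transform"},
    "report": {"validate"},
}

attempts = {}  # step name -> number of times it was attempted


def run_step(name):
    """Pretend to run a step; 'transform' fails once to exercise the retry path."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "transform" and attempts[name] == 1:
        raise RuntimeError("simulated transient failure")


def run_with_retries(name, max_attempts=3):
    """Retry a failing step up to max_attempts times, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_step(name)
            return
        except RuntimeError:
            if attempt == max_attempts:
                raise


def run_dag(dag):
    """Run every step after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        run_with_retries(step)
    return order


order = run_dag(DAG)
print(order, attempts)
```

Real orchestrators add scheduling, parallel execution of independent steps, backoff between retries, and persistent run state, none of which this sketch attempts.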

ML pipelines have particular requirements beyond general ETL pipelines. Point-in-time correctness: when creating training datasets, historical feature values must reflect what was known at each training example's timestamp, not what is known today; otherwise information from the future leaks into training. Data versioning: training datasets should be versioned so models can be retrained against exactly the same data. Lineage tracking: any model prediction should be traceable back through the serving features to the raw source records.
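A rough sketch of the first two requirements, using only the standard library. It assumes feature history is held as `(timestamp, value)` pairs sorted by timestamp. `FEATURE_HISTORY`, `feature_as_of`, and `dataset_version` are illustrative names, and the content hash is one simple way to fingerprint a dataset, not a full versioning system.

```python
import bisect
import hashlib
import json

# Hypothetical feature history per entity: (timestamp, value), sorted by timestamp.
FEATURE_HISTORY = {
    "u1": [(100, 0.2), (200, 0.5), (300, 0.9)],
}


def feature_as_of(entity_id, ts):
    """Return the latest value recorded at or before ts, or None if none exists."""
    history = FEATURE_HISTORY.get(entity_id, [])
    timestamps = [t for t, _ in history]
    idx = bisect.bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


def build_training_rows(labels):
    """Join each label with the feature value known at the label's timestamp."""
    return [
        {"entity_id": e, "ts": ts, "feature": feature_as_of(e, ts), "label": y}
        for e, ts, y in labels
    ]


def dataset_version(rows):
    """Fingerprint the dataset content so a retraining run can pin exact data."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


labels = [("u1", 150, 0), ("u1", 250, 1), ("u1", 50, 0)]
rows = build_training_rows(labels)
print([r["feature"] for r in rows])  # [0.2, 0.5, None]
print(dataset_version(rows))
```

The label at timestamp 150 sees the value 0.2, not the later 0.5 or 0.9. A naive join against the latest value would hand every row 0.9 and leak future information into training.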

Streaming pipelines (built on Apache Kafka, Apache Flink, or Amazon Kinesis) handle real-time data, enabling features that depend on very recent events - a user's clicks in the past minute, the fraud signal that arrives 30 seconds after a transaction. Batch pipelines handle the large historical datasets used for model training and periodic retraining.
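A feature such as "clicks in the past minute" amounts to a sliding-window count. The sketch below keeps the window in memory and assumes events for each user arrive in timestamp order. Stream processors handle out-of-order events, state persistence, and scaling, all of which are ignored here. The `ClicksInWindow` class is a hypothetical name for illustration.

```python
from collections import deque


class ClicksInWindow:
    """Count each user's events within the trailing window (in-order events assumed)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = {}  # user_id -> deque of event timestamps

    def _evict(self, queue, now):
        # Drop timestamps that have fallen out of the window (now - window, now].
        while queue and queue[0] <= now - self.window_seconds:
            queue.popleft()

    def record(self, user_id, ts):
        queue = self._events.setdefault(user_id, deque())
        queue.append(ts)
        self._evict(queue, ts)

    def count(self, user_id, now):
        queue = self._events.get(user_id)
        if not queue:
            return 0
        self._evict(queue, now)
        return len(queue)


window = ClicksInWindow(window_seconds=60.0)
for ts in (0.0, 10.0, 50.0):
    window.record("u1", ts)

print(window.count("u1", now=55.0))   # 3
print(window.count("u1", now=65.0))   # 2
print(window.count("u1", now=200.0))  # 0
```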

Data quality issues in pipelines are a leading cause of ML model failures. Silent data bugs - a source schema change that nullifies a feature, a backfill error that introduces label leakage, a timezone bug that shifts timestamps - can degrade model performance in ways that look like model drift but are actually data problems.
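One way to surface a silent bug such as a renamed source column is to compare each batch against an expected schema and a baseline null rate. The following is a sketch under assumptions: the `check_batch` function, the 10-point tolerance, and the example column names are invented for illustration, and real checks would usually also cover value distributions.

```python
def check_batch(rows, expected_columns, baseline_null_rate, tolerance=0.10):
    """Return a list of human-readable problems found in a batch of dict rows."""
    if not rows:
        return ["batch is empty"]

    problems = []
    expected = set(expected_columns)
    seen = set().union(*(row.keys() for row in rows))

    if expected - seen:
        problems.append(f"missing columns: {sorted(expected - seen)}")
    if seen - expected:
        problems.append(f"unexpected columns: {sorted(seen - expected)}")

    for col in expected_columns:
        # A column that is absent from a row counts as null for that row.
        null_rate = sum(row.get(col) is None for row in rows) / len(rows)
        allowed = baseline_null_rate.get(col, 0.0) + tolerance
        if null_rate > allowed:
            problems.append(
                f"null rate for {col!r} is {null_rate:.2f}, allowed {allowed:.2f}"
            )
    return problems


expected = ["user_id", "age"]
baseline = {"user_id": 0.0, "age": 0.02}

healthy = [{"user_id": 1, "age": 31}, {"user_id": 2, "age": 45}]
# Upstream renamed "age" to "user_age": the feature silently becomes all nulls.
renamed = [{"user_id": 1, "user_age": 31}, {"user_id": 2, "user_age": 45}]

print(check_batch(healthy, expected, baseline))  # []
for problem in check_batch(renamed, expected, baseline):
    print(problem)
```

Run before the transformation stage, a check like this turns a schema change into a loud pipeline failure instead of a quiet drop in model quality.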

Analogy

The supply chain and logistics network behind a manufacturing plant. The plant (the ML model) needs specific inputs delivered consistently: the right materials, prepared to specification, arriving on schedule. The supply chain (the data pipeline) handles everything from raw material sourcing to delivery at the factory door. If the supply chain breaks down or delivers the wrong materials, the factory cannot produce quality output regardless of how good the manufacturing process is.

Real-world example

A ride-sharing company's dynamic pricing model needs real-time inputs: current demand density by grid cell, driver supply, weather conditions, local event calendars, and historical pricing data. Each of these comes from a different source system. A data pipeline continuously ingests these streams, joins them spatially and temporally, applies validation to catch sensor failures or schema changes, computes derived features, and makes them available in a low-latency feature store so the pricing model can generate a fare estimate within 200 ms of a ride request.
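The spatial and temporal join in this example can be approximated by bucketing events into grid cells and time windows, then combining the streams on the shared bucket key. The cell size, the window length, the event fields, and the `demand_per_driver` feature below are all assumptions made for illustration, not details of any real pricing system.

```python
import math
from collections import Counter, defaultdict

CELL_SIZE_DEG = 0.01   # hypothetical grid resolution in degrees
WINDOW_SECONDS = 60    # hypothetical time bucket


def bucket(event):
    """Map an event to a (grid cell, time window) key."""
    cell = (
        math.floor(event["lat"] / CELL_SIZE_DEG),
        math.floor(event["lon"] / CELL_SIZE_DEG),
    )
    return cell, event["ts"] // WINDOW_SECONDS


def join_demand_supply(requests, driver_pings):
    """Join ride requests with distinct active drivers per cell and window."""
    demand = Counter(bucket(e) for e in requests)
    drivers = defaultdict(set)
    for ping in driver_pings:
        drivers[bucket(ping)].add(ping["driver_id"])

    features = {}
    for key, n_requests in demand.items():
        n_drivers = len(drivers.get(key, ()))
        features[key] = {
            "requests": n_requests,
            "drivers": n_drivers,
            # Avoid division by zero when no driver is in the cell.
            "demand_per_driver": n_requests / max(n_drivers, 1),
        }
    return features


requests = [
    {"lat": 52.5231, "lon": 13.4105, "ts": 120},
    {"lat": 52.5239, "lon": 13.4109, "ts": 150},
    {"lat": 52.5301, "lon": 13.4105, "ts": 130},
]
driver_pings = [
    {"driver_id": "d1", "lat": 52.5235, "lon": 13.4101, "ts": 125},
    {"driver_id": "d1", "lat": 52.5236, "lon": 13.4102, "ts": 140},
]

features = join_demand_supply(requests, driver_pings)
print(features[bucket(requests[0])])
```

Repeated pings from the same driver within a bucket count once, so supply reflects distinct drivers rather than ping volume.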

Why it matters

Data pipelines are often the primary source of ML system failures in production. A model trained on clean data but served with different or degraded data will perform poorly, and because nothing in the model itself has changed, the failure is hard to diagnose. Understanding data pipelines - their architecture, failure modes, and monitoring requirements - is essential for anyone building or maintaining production ML systems.

Related concepts

Continuous Training

The automated process of regularly retraining ML models on fresh data as part of a production ML system - ensuring that models stay current as the world changes rather than degrading on stale distributions.

Data Drift

The gradual or sudden shift in the statistical properties of data that a deployed ML model receives compared to the data it was trained on - the most common cause of silent model degradation in production.

Feature Store

A centralised data layer that stores, manages, and serves the computed features that ML models use for both training and inference - eliminating the costly problem of different teams recomputing the same features differently.