VI · MLOps & InfrastructureAdvanced

Distributed Training

Training ML models across multiple GPUs or machines simultaneously by partitioning the work - essential for models too large to fit on a single device or datasets too large to process in reasonable time on one machine.

Added May 18, 2026 · 3 min read

Distributed training is the reason frontier AI models can exist at all. Without it, model scale would be bounded by single-GPU memory limits - roughly 10-20 billion parameters rather than the hundreds of billions in current systems. Understanding distributed training explains the economics of AI development (requiring expensive GPU clusters), the research challenges in scaling (communication overhead, parallelism strategies), and why large model training is concentrated at a small number of organisations with the infrastructure to run it.

Modern large language models and multimodal systems have billions to hundreds of billions of parameters. A model with 70 billion parameters at 16-bit precision requires over 140GB of GPU memory just to store the weights - far exceeding the 80GB capacity of even the largest current GPUs (H100). Training these models requires distributing the work across dozens, hundreds, or thousands of GPUs simultaneously. Distributed training is the set of strategies and systems that make this possible.

The primary approaches to distributed training reflect different ways to partition the problem. Data parallelism splits the training dataset across devices: each device holds a complete copy of the model and processes a different subset of the batch. After each backward pass, gradients are averaged across all devices (all-reduce operation) and all model copies update synchronously. Data parallelism scales throughput but does not help when the model itself exceeds a single device's memory.

Model parallelism addresses memory limits by splitting the model itself across devices. Tensor parallelism splits individual operations: a large matrix multiply is partitioned across GPUs, each computing a portion of the result. Pipeline parallelism splits the model's layers into stages, with different stages assigned to different devices and micro-batches flowing through the pipeline. Context parallelism, increasingly important for long-context models, splits the sequence dimension across devices so that attention computation over very long sequences can be distributed.

Fully Sharded Data Parallel (FSDP) and DeepSpeed ZeRO are frameworks that address memory by sharding not just parameters but also gradients and optimizer states across devices, so no single device holds the full set of any of these. A model that would require 1TB of memory on a single device can be trained on 16 devices each holding 64GB by distributing all of these components.

Communication overhead is the central challenge in distributed training. All-reduce operations that synchronise gradients require high-bandwidth, low-latency interconnects between devices. Within a single machine, NVLink (NVIDIA's GPU interconnect) provides hundreds of GB/s of bandwidth. Across machines, InfiniBand or high-speed Ethernet connects nodes. The hierarchy of interconnect speeds shapes which parallelism strategies are efficient: tensor parallelism (which requires very frequent communication) is typically confined within a single node, while pipeline parallelism (less frequent communication) spans nodes.

Flops utilisation (MFU - Model FLOPs Utilisation) measures how efficiently a distributed training setup uses the theoretical compute capacity of its GPUs. Production training runs at large AI labs typically achieve 30-50% MFU, with the remainder lost to communication overhead, memory management, and other inefficiencies.

Analogy

Building a massive bridge across a wide river using many construction crews working simultaneously. No single crew can assemble the entire span alone - it is too large and would take too long. Instead, different crews work on different segments (pipeline parallelism), others work on components of each segment in parallel (tensor parallelism), and logistics teams coordinate material delivery across all sites (data parallelism for batches). The engineering challenge is coordinating all this activity so the final structure comes together correctly despite the distributed workforce.

Real-world example

Training GPT-3 (175 billion parameters) at OpenAI required a cluster of thousands of A100 GPUs. The training used 3D parallelism: data parallelism across replicas of the full model, tensor parallelism to split matrix operations across GPUs within each node, and pipeline parallelism to distribute layers across nodes. The training run consumed millions of GPU-hours. A single GPU would have taken thousands of years to complete the same computation.

Why it matters

Distributed training is the reason frontier AI models can exist at all. Without it, model scale would be bounded by single-GPU memory limits - roughly 10-20 billion parameters rather than the hundreds of billions in current systems. Understanding distributed training explains the economics of AI development (requiring expensive GPU clusters), the research challenges in scaling (communication overhead, parallelism strategies), and why large model training is concentrated at a small number of organisations with the infrastructure to run it.

In the news

No recent coverage - search for Distributed Training.

Related concepts

Gradient Checkpointing

A memory-efficiency technique that trades extra computation for reduced memory usage during training by discarding intermediate activations and recomputing them during the backward pass rather than storing them throughout.

Mixed Precision Training

A training technique that uses lower-precision numerical formats (FP16 or BF16) for most computations while maintaining higher-precision (FP32) master copies of weights - cutting memory usage and accelerating training without sacrificing model quality.

Model Serving

The infrastructure layer that takes a trained ML model and makes it available to receive requests, run predictions, and return results at production scale - the bridge between a trained artifact and a live application.

← Back to concepts