Sample Packing

A training efficiency technique that concatenates multiple short sequences into a single long training example - eliminating wasted padding and significantly improving GPU utilisation.

Added May 18, 2026 · 3 min read

Language model training processes text in fixed-length sequences. When a dataset contains sequences of varying lengths - as most real-world text datasets do - the standard approach is to pad shorter sequences with special tokens to reach the target length. In a batch where sequences average 200 tokens but are padded to 2,048, roughly 90% of the positions are padding that does not contribute to learning. The model still computes attention over those padding positions, burning GPU cycles on nothing.
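The padding arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `padding_fraction` and the eight example lengths are made up for illustration, chosen so that the average is 200 tokens.

```python
def padding_fraction(lengths, max_len):
    """Fraction of batch positions that are padding when every
    sequence is padded up to max_len tokens."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    if any(n < 0 or n > max_len for n in lengths):
        raise ValueError("each length must be between 0 and max_len")
    real_tokens = sum(lengths)
    total_positions = len(lengths) * max_len
    return 1 - real_tokens / total_positions


# Hypothetical batch: eight sequences averaging 200 tokens, padded to 2,048.
lengths = [120, 150, 180, 200, 200, 220, 250, 280]
frac = padding_fraction(lengths, max_len=2048)
print(f"{frac:.1%} of positions are padding")  # 90.2% of positions are padding
```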

Sample packing addresses this by concatenating multiple documents end-to-end, separated by end-of-sequence tokens, until the target length is filled. Instead of one 200-token document plus 1,848 padding tokens, you pack ten 200-token documents, each followed by its separator, into a single sequence of roughly 2,000 tokens. Nearly every position in the batch contributes real training signal. The model learns from about ten times as many actual examples for the same amount of compute per batch.
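A simple way to picture the packing step is a greedy, in-order pass over already-tokenized documents. The sketch below is an illustration under stated assumptions, not how any particular library does it: `pack_documents` and the `EOS_ID` value are hypothetical, and real pipelines may instead use bin-packing heuristics or split long documents across sequences.

```python
EOS_ID = 0  # hypothetical end-of-sequence token id, for illustration only


def pack_documents(docs, max_len, eos_id=EOS_ID):
    """Greedily concatenate tokenized documents, each followed by eos_id,
    into sequences of at most max_len tokens (documents are kept whole)."""
    packed = []
    current = []
    for doc in docs:
        piece = list(doc) + [eos_id]
        if len(piece) > max_len:
            raise ValueError("document exceeds max_len; truncate or split it first")
        # Start a new packed sequence when this document no longer fits.
        if len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current.extend(piece)
    if current:
        packed.append(current)
    return packed


# Ten 200-token documents fit into one sequence: 10 * (200 + 1 EOS) = 2,010 tokens.
docs = [[7] * 200 for _ in range(10)]
packed = pack_documents(docs, max_len=2048)
print(len(packed), len(packed[0]))  # 1 2010
```

The remaining 38 positions would still be padded, which is why packed batches are close to, rather than exactly, padding-free.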

The attention mask is the critical implementation detail. Documents packed together should not attend to each other: a document's tokens should only attend to tokens within the same document, not to tokens from adjacent packed documents. Getting this right requires careful masking to ensure that information does not leak across document boundaries during the attention computation. Without proper masking, the model learns spurious correlations between the end of one document and the beginning of the next.
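The masking rule can be sketched as a block-diagonal causal mask built from the lengths of the packed documents. This assumes a decoder-only (causal) model, and `packed_attention_mask` is a hypothetical helper written in plain Python for readability. Training frameworks typically build the equivalent tensor, or pass document boundaries to a fused attention kernel.

```python
def packed_attention_mask(doc_lengths):
    """Return mask[i][j] == True when position i may attend to position j:
    both positions belong to the same document and j is not in the future."""
    doc_ids = []
    for doc_index, length in enumerate(doc_lengths):
        doc_ids.extend([doc_index] * length)
    n = len(doc_ids)
    return [
        [doc_ids[i] == doc_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]


# Two packed documents of 2 and 3 tokens (separators counted in the lengths).
mask = packed_attention_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

The empty lower-left block is the point: the second document's tokens never see the first document, even though both sit in the same sequence.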

The throughput improvements from sample packing are significant. For datasets with many short sequences (conversational data, code snippets, news articles), packing can improve GPU utilisation from 30-40% to over 90% of theoretical maximum, with proportional reductions in training time and cost. For pre-training runs that last weeks or months, these efficiency gains translate into substantial savings.

Sample packing is now standard in most efficient pre-training and fine-tuning pipelines. Libraries such as Hugging Face TRL and Axolotl, along with other LLM training frameworks, support it as an optional but commonly recommended setting for datasets with variable-length sequences.

Analogy

Packing a moving van efficiently versus haphazardly. Haphazard packing leaves large gaps between items, wasting space. Efficient packing rearranges items to fill every gap. Sample packing does the same for training batches: instead of one item per slot with the rest empty (padding), pack multiple real items end to end until the slot is full.

Real-world example

When fine-tuning LLaMA on conversational datasets where individual exchanges average 150-300 tokens, enabling sample packing typically reduces training time by 50-70% for the same number of gradient updates. At the scale of a week-long fine-tuning run on a cluster of GPUs, this can be the difference between completing an experiment and running out of compute budget.

Why it matters

Sample packing represents the kind of engineering efficiency that compounds into meaningful capability differences at scale. Two teams with the same hardware and budget but different training pipelines can get dramatically different amounts of useful training out of that compute within a month. The team with better padding efficiency sees more data and trains more effectively. These engineering details are part of why well-resourced labs can achieve more with similar hardware than less optimised operations.

Related concepts

Fine-tuning

Taking a general-purpose AI model and giving it additional training on a specific subject, so it becomes noticeably better at that particular domain.

Gradient Accumulation

A training trick that simulates large batch sizes on hardware with limited memory - by accumulating gradient updates over multiple small batches before applying them.

Instruction Datasets

Curated collections of instruction-response pairs used to fine-tune language models into helpful assistants - the training data that teaches models what being useful looks like.
