Flash Attention

A faster, memory-efficient way of computing attention in transformer models - the engineering breakthrough that made very long context windows practical.

Added May 18, 2026 · 2 min read

The attention mechanism at the heart of transformer models is powerful but computationally expensive. The standard implementation stores a matrix of attention scores in GPU memory, with one entry for every pair of tokens, so the memory required grows as the square of the sequence length: doubling the sequence quadruples the memory. This was not a problem for short sequences, but as researchers tried to build models that could process longer documents - thousands or tens of thousands of tokens - GPU memory became a hard wall that blocked further scaling.

Flash Attention, introduced in 2022, solved this with a surprisingly elegant insight: never materialise the full attention matrix in GPU main memory. Instead, split the inputs into small tiles that fit in the fast memory on the chip, process one tile at a time while keeping a few running softmax statistics, and recompute intermediate values during the backward pass rather than storing them. This is called an IO-aware algorithm because it is designed around the architecture of modern hardware - specifically, the fact that moving data between slow main memory and fast on-chip memory is often the real bottleneck, not the number of calculations.
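The tiling idea can be sketched in plain Python for a single attention head. This is a minimal illustration rather than the actual GPU kernel: the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are assumptions made for this example, and the sketch tiles only the keys and values, whereas a real kernel also tiles the queries and keeps each tile in on-chip memory. The running-maximum rescaling is the standard "online softmax" trick, which lets each tile be processed without ever holding a full row of scores.

```python
import math


def naive_attention(q, k, v):
    """Reference version: builds a full row of scores before the softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but at most `block_size` scores exist at any moment."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf  # largest score seen so far
        running_sum = 0.0        # softmax denominator, relative to running_max
        acc = [0.0] * dim_v      # unnormalised weighted sum of value rows
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results so they are relative to new_max.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                acc = [a + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Calling both functions on the same small `q`, `k` and `v` lists should give results that agree to within floating-point rounding for any `block_size`, which is the sense in which the tiled computation is exact rather than an approximation.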

The result is attention that computes exactly the same mathematical operation as the standard approach, producing the same outputs up to floating-point rounding, but uses dramatically less memory and runs significantly faster. Memory use now grows linearly with sequence length rather than quadratically, which is what makes very long context windows feasible. The number of calculations is still quadratic; what changes is how much data has to be stored and moved. Without Flash Attention, models with 100,000-token or 200,000-token context windows would require impractically large amounts of GPU memory.
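A rough back-of-the-envelope calculation shows the difference in scaling. The sketch below rests on illustrative assumptions that are not taken from any particular model: 2-byte (16-bit) scores, a single attention head, and a batch of one. The helper names are hypothetical, and the "row statistics" figure is a simplification that counts only a running maximum and a running sum per query row, assumed to be 4 bytes each.

```python
def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one full seq_len x seq_len score matrix (one head, batch of 1)."""
    return seq_len * seq_len * bytes_per_score


def row_statistics_bytes(seq_len, bytes_per_stat=4):
    """Memory for a running max and a running sum per query row."""
    return 2 * seq_len * bytes_per_stat


for n in (8_000, 100_000, 200_000):
    full_gb = score_matrix_bytes(n) / 1e9
    stats_mb = row_statistics_bytes(n) / 1e6
    print(f"{n:>7} tokens: full matrix {full_gb:6.2f} GB, row statistics {stats_mb:5.2f} MB")
```

Under these assumptions, the full score matrix for a single head at 100,000 tokens would take about 20 GB, against less than 1 MB of running statistics. Real models multiply the first figure by the number of heads and the batch size, which is why the quadratic term becomes prohibitive so quickly.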

Flash Attention has been widely adopted and is built into mainstream deep learning frameworks. It is now considered standard infrastructure rather than a research novelty - an invisible layer in the stack that most large models benefit from. Subsequent versions (Flash Attention 2 and 3) have extended the approach with further optimisations, continuing to push the practical limits of what context lengths are achievable at reasonable cost.

Analogy

It is like doing a large jigsaw puzzle on a small table. Instead of spreading all the pieces out at once (which would require a huge table), you work tile by tile - completing one small section, setting it aside, moving to the next. You get exactly the same completed puzzle, but you never needed more table space than one small working area at a time. Flash Attention tiles the attention computation the same way.

Real-world example

Context windows grew from 8,000 tokens in the original GPT-4 to 200,000 tokens in later Claude models. Providers rarely publish implementation details, but memory-efficient attention techniques such as Flash Attention are widely credited as one of the advances that made this jump practical. The same attention mechanism that would require prohibitive GPU memory at long sequences becomes tractable when you tile the computation intelligently. Most practical long-context models today rely on this or a closely related optimisation.

Why it matters

Flash Attention is one of those advances that looks unglamorous from the outside but enables everything else. It did not change what models can learn - it changed what models can affordably process. The practical race toward million-token context windows depends on this kind of infrastructure improvement as much as on raw increases in model scale.

Related concepts

Context Window

The maximum amount of text an AI can read and think about at once - everything you send it, plus the conversation history, has to fit within this limit.

KV-Cache

A memory buffer that stores the results of attention calculations so the model does not have to recompute them on every generation step - the key to making AI responses fast.

Multi-Head Attention

A way of running self-attention multiple times in parallel with different learned perspectives, so the model can pick up on several types of relationships in the same sentence at once.