KV-Cache

A memory buffer that stores the key and value vectors computed during attention so the model does not have to recompute them on every generation step - the key to making AI responses fast.

Added May 18, 2026 · 2 min read

When a language model generates text without any caching, it reprocesses the entire conversation from the beginning every time it produces a new token. In this raw, unoptimised version, generating a 500-token response would require 500 full passes through the model - each pass reprocessing the prompt plus all previously generated tokens. Almost all of that work repeats what the previous pass already did, which makes generation extraordinarily slow.

The KV-Cache is the solution. During the attention computation, every token in the sequence produces a query vector, a key vector and a value vector in each attention layer. The keys and values are what later tokens attend to when computing their contextualised representations. Crucially, in a causally masked model, the keys and values of tokens that have already been processed never change - they depend only on the token itself and everything before it, not on what comes after. So rather than recomputing them on every generation step, the model simply stores them in a cache and reuses them.

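With a KV-Cache in place, generating each new token only requires computing the query, key and value for that one new token, then running attention between its query and all the cached keys and values from previous tokens. The work in the projection and feed-forward layers becomes constant per new token instead of growing with the sequence, and the attention step scales linearly with the number of cached tokens instead of quadratically. The cost per step still grows with context length, but far more slowly than recomputing everything from scratch.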
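The reuse described above can be sketched in a few lines of NumPy. This is a minimal single-head, single-layer illustration, not how any particular framework implements it: the names `full_causal_attention` and `KVCache`, the random weights, and the sizes are all assumptions made for the example. It checks that decoding one token at a time against the cache gives the same outputs as recomputing causal attention over the whole sequence.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    """No cache: recompute Q, K, V for every token, with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide later tokens
    return softmax(scores) @ V

class KVCache:
    """Hypothetical minimal cache: one key and one value row per token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Only the new token is projected; older keys/values are reused.
        q = x @ Wq
        self.keys.append(x @ Wk)
        self.values.append(x @ Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(K.shape[-1])
        return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                      # 6 tokens, width 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

full = full_causal_attention(X, Wq, Wk, Wv)
cache = KVCache()
stepwise = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
print(np.allclose(full, stepwise))               # True
```

Each call to `step` does a fixed amount of projection work plus one dot product per cached token, which is where the linear rather than quadratic attention cost comes from.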

The trade-off is memory. KV-Cache entries accumulate as the conversation grows, and for long contexts or large batch sizes they can consume enormous amounts of GPU memory. This is one of the primary engineering constraints in deploying large language models at scale - the model weights and the KV-Cache are typically the two dominant consumers of GPU memory during inference, and fitting both within a fixed memory budget is a major focus of inference optimisation work.
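A rough way to see how quickly this adds up is to count the stored values directly: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes standard attention with 16-bit values; the function name `kv_cache_bytes` and the model configuration are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-Cache size in bytes; the leading 2 counts keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration chosen for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Under these assumptions a single 4,096-token sequence needs 2 GiB, so a batch of 64 such sequences would need 128 GiB for the cache alone.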

Techniques like grouped query attention (in which groups of query heads share a single key-value head, rather than every head maintaining its own cache entries) and paged attention (which stores the KV cache in fixed-size blocks tracked through a lookup table, much as an operating system manages RAM with virtual memory paging) were both developed largely to make the KV-Cache more memory-efficient at scale.
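The paging idea can be illustrated with a toy block-table allocator. This is only a sketch of the bookkeeping, assuming fixed-size blocks and a per-sequence table of block ids; the class `PagedKVAllocator` and its methods are hypothetical and do not correspond to any real library's API, and the actual key and value tensors are left out.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to the physical blocks it owns."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # No block yet, or the last block is full: claim a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # (physical block, slot) where this token's key/value would live.
        return table[-1], length % self.block_size

    def free(self, seq_id):
        # Return all of a finished sequence's blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")        # 3 tokens occupy 2 blocks
alloc.append_token("b")            # 1 token occupies 1 block
print(len(alloc.free_blocks))      # 1
alloc.free("a")
print(len(alloc.free_blocks))      # 3
```

Because blocks are claimed only as a sequence grows and are returned as soon as it finishes, memory does not have to be reserved up front for the maximum possible length of every conversation.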

Analogy

A scratch pad of calculations you have already done. When working through a long maths problem, you write down intermediate results rather than rederiving them from scratch each time you need them. The KV-Cache is that scratch pad for attention calculations - everything already computed is saved and reused, so only the new part needs fresh calculation.

Real-world example

When a chatbot handles hundreds of thousands of simultaneous conversations, each with its own KV-Cache growing in memory, the memory requirements become a primary cost driver. This is why AI inference providers invest heavily in KV-Cache compression and eviction strategies - deciding which cached entries to keep, compress, or discard when memory runs low.
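One of the simplest eviction policies is a sliding window that keeps only the most recent tokens. The sketch below is an illustrative assumption, not a description of what any specific provider does: the class `SlidingWindowKVCache` is hypothetical, and discarding old entries in this way changes what the model can attend to, so it trades some quality for bounded memory.

```python
from collections import deque
import numpy as np

class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent `window` tokens."""
    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def as_arrays(self):
        return np.stack(list(self.keys)), np.stack(list(self.values))

cache = SlidingWindowKVCache(window=3)
for t in range(5):
    # Stand-in vectors filled with the token index, for easy inspection.
    cache.append(np.full(4, float(t)), np.full(4, float(t)))

K, V = cache.as_arrays()
print(K[:, 0])  # [2. 3. 4.]
```

After five tokens with a window of three, the entries for tokens 0 and 1 have been evicted and memory use stays fixed however long the conversation runs.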

Why it matters

The KV-Cache is why AI models can generate responses in seconds rather than minutes. It is infrastructure that most users never think about, but without it, the economics of deploying AI at scale would be completely different. Almost all work on making AI inference faster and cheaper eventually touches the KV-Cache.

Related concepts

Context Window

The maximum amount of text an AI can read and think about at once - everything you send it, plus the conversation history, has to fit within this limit.

Flash Attention

A faster, memory-efficient way of computing attention in transformer models - the engineering breakthrough that made very long context windows practical.

Grouped Query Attention (GQA)

A more memory-efficient variant of multi-head attention where multiple query heads share a single set of key-value pairs - cutting memory use without meaningfully hurting quality.

Inference

Using a trained AI model to make predictions or generate outputs - the fast, cheap counterpart to training's slow, expensive computation.
