Concept

KV-Cache

A memory buffer that stores the key and value vectors computed during attention so the model does not have to recompute them on every generation step - the key to making AI responses fast.

Added May 18, 2026

A language model generates text one token at a time, and each new token depends on everything that came before it. In a naive, unoptimised implementation, every generation step would reprocess the entire conversation from the beginning: generating a 500-token response would take 500 passes through the model, each one recomputing the prompt plus all previously generated tokens. This would be extraordinarily slow, because almost all of the work in each pass repeats the pass before it.

The KV-Cache is the solution. During the attention computation, every token in the sequence produces, alongside its query, a key vector and a value vector in each layer and attention head. These are what other tokens attend to when computing their contextualised representations. Crucially, in a causal (decoder-style) model, these vectors never change for tokens that have already been processed - they depend only on the token itself and everything before it, not on what comes after. So rather than recomputing them on every generation step, the model simply stores them in a cache and reuses them.
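The claim that earlier keys and values never change can be checked on a toy model. The sketch below is a minimal illustration, assuming a two-layer, single-head causal attention stack with random weights; the function names and sizes are made up for the example and do not come from any particular library.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention; returns (output, keys, values)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so token i only sees tokens 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, k, v

rng = np.random.default_rng(0)
d = 8  # toy embedding size (assumption)
wq1, wk1, wv1, wq2, wk2, wv2 = (rng.normal(size=(d, d)) for _ in range(6))

def second_layer_kv(x):
    """Keys and values of layer 2, which depend on layer 1's causal output."""
    hidden, _, _ = causal_attention(x, wq1, wk1, wv1)
    _, k2, v2 = causal_attention(hidden, wq2, wk2, wv2)
    return k2, v2

tokens = rng.normal(size=(6, d))
k_short, v_short = second_layer_kv(tokens[:5])  # first five tokens only
k_long, v_long = second_layer_kv(tokens)        # same five plus one new token

# Appending a token leaves the keys and values of the first five untouched.
unchanged = bool(np.allclose(k_short, k_long[:5]) and np.allclose(v_short, v_long[:5]))
print(unchanged)
```

Without the causal mask, the first-layer outputs of earlier tokens would shift when a new token arrives, and the check would fail for the second layer.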

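With a KV-Cache in place, generating each new token only requires computing the query, key and value for that one new token, then running attention between it and all the cached keys and values from previous tokens. The work in the rest of the model (the projections and feed-forward layers) becomes roughly constant per new token instead of growing with sequence length. The attention step itself still grows linearly with the number of cached tokens, but that is far cheaper than reprocessing the whole sequence on every step.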
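A minimal sketch of that decode loop, assuming a single attention head with random weights (the `KVCache` class and `attend_full` function are hypothetical names for this example). It compares the cached path against full recomputation at every step.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8  # toy head dimension (assumption)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend_full(x):
    """No cache: recompute keys and values for every token, output for the last one."""
    k, v = x @ wk, x @ wv
    q_last = x[-1] @ wq
    return softmax(q_last @ k.T / np.sqrt(d)) @ v

class KVCache:
    """Stores one key and one value vector per processed token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new):
        # Only the new token is projected; earlier keys and values are reused.
        self.keys.append(x_new @ wk)
        self.values.append(x_new @ wv)
        k, v = np.stack(self.keys), np.stack(self.values)
        q = x_new @ wq
        return softmax(q @ k.T / np.sqrt(d)) @ v

tokens = rng.normal(size=(10, d))
cache = KVCache()
matches = []
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t])
    full_out = attend_full(tokens[: t + 1])
    matches.append(bool(np.allclose(cached_out, full_out)))

outputs_match = all(matches)
print(outputs_match, len(cache.keys))
```

Both paths give the same output. The cached path projects one token per step, while the uncached path projects every token seen so far. The attention product against the cached keys still touches every stored entry.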

The trade-off is memory. KV-Cache entries accumulate as the conversation grows: the cache size scales with the number of layers, the number of key/value heads, the head dimension, the sequence length and the batch size. For long contexts or large batch sizes it can consume enormous amounts of GPU memory. This is one of the primary engineering constraints in deploying large language models at scale - the model weights and the KV-Cache are typically the two dominant consumers of GPU memory during inference, and fitting both within a fixed memory budget is a major focus of inference optimisation work.
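A back-of-the-envelope estimate makes the growth concrete. The sketch below assumes a standard layout with one key and one value vector per token, per layer, per key/value head; the model dimensions are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV-Cache size in bytes (bytes_per_value=2 assumes 16-bit storage)."""
    # Factor of 2: one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print(per_token / 2**10, "KiB per token")  # 512.0 KiB per token
print(total / 2**30, "GiB in total")       # 16.0 GiB in total
```

Under these assumptions, eight concurrent sequences of 4,096 tokens need 16 GiB for the cache alone, before counting the model weights.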

Two widely used techniques target this cost. Grouped query attention lets a group of query heads share a single set of key and value heads, so the cache holds fewer key/value heads than the model has query heads. Paged attention manages KV-Cache memory in fixed-size blocks allocated on demand, much as an operating system manages RAM with virtual memory paging, which reduces the memory wasted on fragmentation and over-reservation. Both make the KV-Cache more memory-efficient at scale.
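The paging idea can be sketched as a small block allocator. This is a toy illustration only: the `PagedKVAllocator` class, its methods and the block size are hypothetical, not the API of any real inference engine, and it tracks only bookkeeping rather than actual tensors.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size blocks of KV slots on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("chat-a")  # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("chat-b")  # 5 tokens -> 1 block

blocks_in_use = 4 - len(alloc.free_blocks)
print(blocks_in_use)  # 3
```

Because blocks are claimed only when a sequence actually needs them, a short conversation never reserves memory for a context length it does not reach, and released blocks can be reused by any other sequence.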

Analogy

A scratch pad of calculations you have already done. When working through a long maths problem, you write down intermediate results rather than rederiving them from scratch each time you need them. The KV-Cache is that scratch pad for attention calculations - everything already computed is saved and reused, so only the new part needs fresh calculation.

Real-world example

When a chatbot handles hundreds of thousands of simultaneous conversations, each with its own KV-Cache growing in memory, the memory requirements become a primary cost driver. This is why AI inference providers invest heavily in KV-Cache compression and eviction strategies - deciding which cached entries to keep, compress, or discard when memory runs low.
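As one simple illustration of an eviction strategy, the sketch below keeps only the most recent entries in a fixed-size window. This is an assumed policy chosen for brevity, not a description of what any particular provider does, and the class name and placeholder keys and values are hypothetical.

```python
from collections import deque

class SlidingWindowKV:
    """Keeps key/value entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen drops its oldest entry automatically when full.
        self.entries = deque(maxlen=window)

    def add(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [position for position, _, _ in self.entries]

cache = SlidingWindowKV(window=4)
for pos in range(10):
    # Strings stand in for real key and value vectors.
    cache.add(pos, key=f"k{pos}", value=f"v{pos}")

kept = cache.positions()
print(kept)  # [6, 7, 8, 9]
```

Memory use stays bounded regardless of conversation length, but the model can no longer attend to evicted tokens, so this kind of policy trades some output quality for memory.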

Why it matters

The KV-Cache is why AI models can generate responses in seconds rather than minutes. It is infrastructure that most users never think about, but without it, the economics of deploying AI at scale would be completely different. Almost all work on making AI inference faster and cheaper eventually touches the KV-Cache.
