latentbrief
Concept

PagedAttention

A memory management technique for AI inference that stores the KV-Cache in non-contiguous memory blocks - the same idea as virtual memory in operating systems, applied to language model serving.

Added May 18, 2026

When serving a language model to many users simultaneously, memory management becomes a critical engineering problem. Each conversation in progress needs its own KV-Cache - a growing block of memory that records the keys and values computed for every token in that conversation so far. The challenge is that these caches are unpredictable in size: you do not know in advance how long each conversation will run.

Traditional memory allocation approaches reserve a large contiguous block of memory per request, sized for the maximum possible context length. Most of this reserved memory goes unused for most of the request's lifetime - a conversation that runs for 500 tokens wastes 99% of the memory reserved for a 50,000-token context. This waste directly limits how many simultaneous conversations a server can handle.
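The 99% figure can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name is made up for this example, and it assumes the reservation is sized for the full maximum context and never resized.

```python
def reserved_waste_fraction(tokens_used: int, max_context: int) -> float:
    """Fraction of a max-length reservation that the request never fills."""
    if not 0 <= tokens_used <= max_context:
        raise ValueError("tokens_used must be between 0 and max_context")
    return 1 - tokens_used / max_context


# The example from the text: 500 tokens used out of a 50,000-token reservation.
print(f"{reserved_waste_fraction(500, 50_000):.0%}")  # 99%
```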

PagedAttention, developed by the team behind vLLM (a widely used inference framework), borrows the core idea from virtual memory in operating systems. Instead of storing each conversation's KV-Cache as a single contiguous block, it divides the cache into small fixed-size pages, each holding the keys and values for a fixed number of tokens, and stores them wherever free space is available in memory - not necessarily adjacent. A per-conversation lookup table (vLLM calls it a block table) maps logical pages to physical memory locations, and the attention computation follows that table to find the pages it needs.
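The bookkeeping behind this idea can be sketched in a few dozen lines. The class below is a toy model, not vLLM's implementation: the name `PagedKVCache`, its methods, and the tiny page size are all assumptions made for illustration, and it tracks only which page each token lands in rather than storing real key and value tensors.

```python
class PagedKVCache:
    """Toy bookkeeping for a paged KV-Cache (no real tensors are stored)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size                # token slots per page
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.page_tables = {}                     # conversation id -> physical page ids, in logical order
        self.lengths = {}                         # conversation id -> tokens stored so far

    def append_token(self, conv_id):
        """Reserve a slot for one more token and return (physical_page, offset)."""
        table = self.page_tables.setdefault(conv_id, [])
        length = self.lengths.get(conv_id, 0)
        if length == len(table) * self.page_size:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free pages left")
            table.append(self.free_pages.pop())    # any free page will do
        self.lengths[conv_id] = length + 1
        return self.locate(conv_id, length)

    def locate(self, conv_id, token_index):
        """Translate a logical token position into (physical_page, offset)."""
        if not 0 <= token_index < self.lengths.get(conv_id, 0):
            raise IndexError("token not in cache")
        logical_page, offset = divmod(token_index, self.page_size)
        return self.page_tables[conv_id][logical_page], offset

    def release(self, conv_id):
        """Return all of a finished conversation's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(conv_id, []))
        self.lengths.pop(conv_id, None)


cache = PagedKVCache(num_pages=4, page_size=2)
cache.append_token("a")
cache.append_token("b")
cache.append_token("a")
cache.append_token("a")          # "a" has filled its first page and takes another
print(cache.page_tables["a"])    # [3, 1]
print(cache.locate("a", 2))      # (1, 0)
```

Conversation "a" ends up on physical pages 3 and 1, with "b" sitting on page 2 between them, yet the page table still resolves its third token to the right slot. A real system performs the same translation inside the GPU attention kernel.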

The result is that memory can be allocated incrementally as a conversation grows, and released page by page when it ends. Fragmentation between requests is eliminated because any free page can serve any conversation, and the only remaining waste is the unfilled part of each conversation's last page. A server running PagedAttention can handle roughly 2-4 times as many simultaneous conversations as one using conventional memory allocation, with the same hardware.
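Paging does leave one small source of waste: the unused slots in a conversation's last, partially filled page. The sketch below puts a number on it. The helper name and the page size of 16 tokens are assumptions chosen for illustration, not values taken from this article.

```python
def paged_waste(tokens: int, page_size: int) -> int:
    """Unused token slots in the last, partially filled page."""
    pages = -(-tokens // page_size)  # ceiling division
    return pages * page_size - tokens


# A 500-token conversation with 16-token pages needs 32 pages (512 slots).
print(paged_waste(500, 16))   # 12
# Reserving a contiguous 50,000-token block instead leaves this many slots idle:
print(50_000 - 500)           # 49500
```

The waste per conversation can never exceed one page minus one slot, regardless of how long the conversation is allowed to grow.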

vLLM, which introduced PagedAttention in 2023, rapidly became one of the most widely deployed inference engines in production AI systems. Its throughput improvements are significant enough that many production serving stacks either use it directly or implement equivalent paged KV-Cache schemes of their own.

Analogy

Picture a library that, instead of reserving one long unbroken shelf for every borrower who might take out books, keeps a map of which books are where and slots them into any available gap on any shelf. A borrower can have 20 books scattered across different shelves, but the catalogue tracks them all. Shelf space is used efficiently because nothing is reserved speculatively.

Real-world example

When the vLLM team benchmarked PagedAttention against existing serving frameworks, they found 2-4x throughput improvements with no change in output quality. For a company paying for GPU server time to handle production AI traffic, that can translate into a 50-75% cost reduction for the same number of users, assuming cost scales inversely with throughput: at 2x throughput the same traffic needs half the GPU time, and at 4x it needs a quarter. This is the kind of infrastructure improvement that quietly reshapes the economics of AI deployment.
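The step from a throughput multiple to a cost percentage is a one-line calculation. This sketch assumes GPU cost is proportional to GPU time and that the hardware stays fully utilized. The function name is hypothetical.

```python
def cost_reduction(throughput_gain: float) -> float:
    """Share of GPU cost saved when the same traffic is served at higher throughput."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return 1 - 1 / throughput_gain


for gain in (2, 4):
    print(f"{gain}x throughput -> {cost_reduction(gain):.0%} lower cost")
# 2x throughput -> 50% lower cost
# 4x throughput -> 75% lower cost
```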

Why it matters

PagedAttention is a case study in how systems engineering advances can matter as much as model architecture advances. Improving model quality by 10% requires enormously expensive training runs. Improving serving efficiency by 3x requires a smarter memory allocation strategy. Both contribute to making AI more practically capable and affordable.

