Context Window Management

The set of strategies for handling inputs that exceed a model's maximum token limit - from sliding windows and summarisation to hierarchical chunking and selective retrieval.

Added May 18, 2026 · 2 min read

Every language model has a context window limit - a maximum number of tokens it can process at once. For many real-world tasks, the content you want to work with exceeds this limit. A 100-page legal contract. A full codebase. A months-long conversation history. Context window management is the collection of techniques for handling these situations intelligently.

The simplest approach is truncation: just cut off text that exceeds the limit. This is often poor practice because the most important information might be in the truncated section. A smarter version is sliding window processing: break the long document into chunks that each fit the window, overlap adjacent chunks enough that context carries across boundaries, process each chunk, then combine the results.
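A minimal sketch of the two approaches just described, assuming the text has already been split into tokens. The whitespace split in the usage lines is only a stand-in for a real model tokenizer, and the names sliding_window_chunks, window and overlap are illustrative rather than taken from any library.

```python
from typing import List


def truncate(tokens: List[str], limit: int) -> List[str]:
    """Naive truncation: keep the first `limit` tokens and drop the rest."""
    return tokens[:limit]


def sliding_window_chunks(tokens: List[str], window: int, overlap: int) -> List[List[str]]:
    """Split tokens into chunks of at most `window` tokens.

    Each chunk repeats the last `overlap` tokens of the previous chunk,
    so text near a boundary appears in both neighbours.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once a chunk has reached the end of the input.
        if start + window >= len(tokens):
            break
    return chunks


# Assumption: whitespace-separated words stand in for model tokens.
tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
chunks = sliding_window_chunks(tokens, window=4, overlap=1)
```

With these inputs the chunks cover t0 to t3, t3 to t6 and t6 to t9, so each boundary token is seen twice. How the per-chunk results are combined depends on the task and is left out here.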

Summarisation-based approaches compress past context rather than discarding it. Instead of keeping the raw text of a long conversation, periodically summarise the earlier portion and replace it with the summary, freeing up context space for new content while preserving key information in compressed form. This is how many long-running AI assistants maintain apparent memory of earlier conversation despite fixed context windows.
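The summarisation idea can be sketched as follows. This is an illustration under stated assumptions, not how any particular assistant is implemented: compact_history, budget and keep_recent are hypothetical names, tokens are approximated by whitespace-separated words, and naive_summarise is a placeholder for what would normally be a call to a language model.

```python
from typing import Callable, List

SUMMARY_PREFIX = "[Summary of earlier conversation] "


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())


def naive_summarise(messages: List[str]) -> str:
    # Placeholder summariser: keeps the first two words of each message.
    # A real system would ask a model to write the summary instead.
    return " / ".join(" ".join(m.split()[:2]) for m in messages)


def compact_history(
    messages: List[str],
    budget: int,
    keep_recent: int,
    summarise: Callable[[List[str]], str] = naive_summarise,
) -> List[str]:
    """Replace older messages with one summary once the history exceeds `budget`."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return list(messages)  # Nothing to compress yet.

    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [SUMMARY_PREFIX + summarise(older)] + recent


history = [
    "alpha beta gamma delta",
    "one two three four",
    "red green blue",
    "final question here",
]
compacted = compact_history(history, budget=8, keep_recent=2)
```

The most recent messages stay verbatim because they are the most likely to matter for the next reply; only the older portion is compressed.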

Retrieval-based approaches (which connect to RAG systems) avoid the problem by not loading all content into context at once. Instead, they retrieve only the most relevant portions based on the current query. For a 1,000-page technical manual, you never load the whole thing - you retrieve the relevant sections based on what the user is asking.
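A small sketch of retrieval over document sections. Production RAG systems usually score relevance with learned embeddings and a vector index; the word-count cosine similarity below is a deliberately simple stand-in, and the function names and sample sections are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, sections: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return the `top_k` sections most similar to the query, best first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(s)), s) for s in sections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical manual sections.
sections = [
    "To reset the pump, hold the red button for five seconds.",
    "The warranty covers parts and labour for two years.",
    "Clean the filter monthly with warm water.",
]
hits = retrieve("how do I reset the pump", sections, top_k=1)
```

Only the sections returned by retrieve would be placed in the model's context; the rest of the manual never has to fit in the window.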

Hierarchical methods process documents at multiple levels of granularity: scan a whole document at a high level to build an index, then load specific sections when their detail is needed. For structured documents like code or legal contracts, this mirrors how humans actually navigate long documents.
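One way to picture the two-level process is an index built from section headings, followed by loading a single section on demand. The sketch assumes sections start with a line beginning "# "; that convention, the sample contract and the function names are all assumptions made for the example.

```python
from typing import Dict, List


def build_index(document: str) -> Dict[str, str]:
    """First pass: map each section title to its body text."""
    bodies: Dict[str, List[str]] = {}
    current = None
    for line in document.splitlines():
        if line.startswith("# "):  # Assumed heading marker.
            current = line[2:].strip()
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return {title: "\n".join(lines).strip() for title, lines in bodies.items()}


def outline(index: Dict[str, str]) -> str:
    """Compact, high-level view that is cheap to place in context."""
    return "\n".join(f"- {title} ({len(body.split())} words)" for title, body in index.items())


def load_section(index: Dict[str, str], title: str) -> str:
    """Second pass: fetch full detail for one section only."""
    return index[title]


contract = (
    "# Definitions\nA party means a signatory.\n"
    "# Termination\nEither party may terminate with 30 days notice.\n"
    "# Fees\nFees are due monthly."
)
index = build_index(contract)
overview = outline(index)
detail = load_section(index, "Termination")
```

The model would first see only the outline, decide which section it needs, and then receive that section's text, much as a reader goes from a table of contents to a single clause.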

Analogy

Navigating a long book with a limited desk. You cannot spread the whole book out at once. Smart strategies: read the table of contents first (indexing), keep a running summary of chapters already read (summarisation), use bookmarks to return to key passages (retrieval), and focus detailed reading only on the sections most relevant to your current question (selective attention).

Real-world example

GitHub Copilot, when suggesting code completions, cannot load an entire large codebase into a single context window. A coding assistant of this kind typically combines several strategies: recently edited or open files are included verbatim, related files are included in compressed or summarised form, and a retrieval step brings in the most semantically relevant function signatures and examples from across the codebase.
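Combining strategies usually comes down to fitting several kinds of content into one token budget. The sketch below is not GitHub Copilot's actual implementation; it is a generic greedy packer, with hypothetical names and a word count standing in for a tokenizer, that fills the budget in priority order.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    label: str
    text: str
    priority: int  # Lower number means more important.


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())


def assemble_context(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Greedily add candidates in priority order, skipping any that do not fit."""
    chosen: List[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.priority):
        cost = count_tokens(cand.text)
        if used + cost <= budget:
            chosen.append(cand)
            used += cost
    return chosen


candidates = [
    Candidate("open file", "def a(): pass def b(): pass", 0),
    Candidate("summary", "utils module offers parsing helpers", 1),
    Candidate("retrieved", "def parse(text): return text.split()", 2),
]
selected = assemble_context(candidates, budget=10)
```

Here the open file takes six of the ten available tokens, the five-token summary no longer fits, and the four-token retrieved snippet fills the remainder. A real system would likely shrink or re-summarise an item rather than drop it outright.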

Why it matters

Context window management is the practical bridge between the theoretical capability of a model (what it can do with the right input) and its real-world performance (what it can do given the constraints of actual documents and conversations). As models grow more capable, thoughtful context management is what allows those capabilities to be applied to real tasks.

Related concepts

Context Window

The maximum amount of text an AI can read and think about at once - everything you send it, plus the conversation history, has to fit within this limit.

KV-Cache

A memory buffer that stores the key and value vectors already computed for earlier tokens so the model does not have to recompute them on every generation step - the key to making AI responses fast.

RAG (Retrieval-Augmented Generation)

A way of making AI smarter by letting it look things up before answering, instead of relying only on what it memorised during training.
