latentbrief

Concept

Context Window Management

The set of strategies for handling inputs that exceed a model's maximum token limit - from sliding windows and summarisation to hierarchical chunking and selective retrieval.

Added May 18, 2026

Every language model has a context window limit - a maximum number of tokens it can process at once. For many real-world tasks, the content you want to work with exceeds this limit. A 100-page legal contract. A full codebase. A months-long conversation history. Context window management is the collection of techniques for handling these situations intelligently.

The simplest approach is truncation: cut off whatever exceeds the limit. This is often poor practice because the discarded portion may contain the most important information. A smarter alternative is sliding window processing: split the long document into overlapping chunks, process each chunk separately, then combine the results. The overlap ensures that content near a chunk boundary appears whole in at least one chunk, so context carries across boundaries.
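As a minimal sketch of the sliding window idea, the function below splits a token list into overlapping windows. The name `sliding_windows` and its parameters are illustrative, and whitespace-separated words stand in for real model tokens.

```python
def sliding_windows(tokens, window_size, overlap):
    """Split a list of tokens into windows that share `overlap` tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    if not tokens:
        return []

    step = window_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Assumption: words approximate tokens for the purpose of this example.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = sliding_windows(tokens, window_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window here repeats the last token of the previous one. In practice the overlap is usually much larger, and how the per-chunk results are combined depends on the task.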

Summarisation-based approaches compress past context rather than discarding it. Instead of keeping the raw text of a long conversation, periodically summarise the earlier portion and replace it with the summary, freeing up context space for new content while preserving key information in compressed form. This is how many long-running AI assistants maintain apparent memory of earlier conversation despite fixed context windows.
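The compaction step can be sketched roughly as follows. Everything here is an assumption made for illustration: `count_tokens` approximates tokens with whitespace-separated words, and `naive_summarise` is a placeholder for what would normally be a call to a model.

```python
def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def naive_summarise(messages):
    # Placeholder summariser: keeps the first five words of each message.
    # A real system would ask a model to write the summary instead.
    return " | ".join(" ".join(m.split()[:5]) for m in messages)


def compact_history(history, budget, keep_recent=2, summarise=naive_summarise):
    """Replace older messages with one summary when the history is over budget."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)  # nothing to compress

    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    summary = "[summary] " + summarise(older)
    return [summary] + recent


history = [
    "user: please review the indemnification clause in section four of the contract",
    "assistant: section four caps liability at the fees paid in the prior year",
    "user: what about termination",
    "assistant: either party may terminate with thirty days notice",
]
compacted = compact_history(history, budget=30)
for message in compacted:
    print(message)
```

The two most recent messages survive verbatim while the earlier ones collapse into a single summary line. The summary is lossy, so details dropped from it cannot be recovered later.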

Retrieval-based approaches (the basis of retrieval-augmented generation, or RAG) avoid the problem by not loading all content into context at once. Instead, they retrieve only the portions most relevant to the current query. For a 1,000-page technical manual, you never load the whole thing - you retrieve the sections relevant to what the user is asking.
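A toy version of selective retrieval is sketched below. It ranks sections by bag-of-words cosine similarity, which is a simple stand-in for the embedding-based search a production system would more likely use. The function names and the sample manual are invented for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Lowercase word counts; a crude stand-in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, sections, top_k=2):
    """Return the top_k sections most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        sections,
        key=lambda section: cosine(query_vec, bag_of_words(section)),
        reverse=True,
    )
    return ranked[:top_k]


manual = [
    "Installation: mount the pump on a level surface and connect the inlet hose.",
    "Maintenance: replace the pump filter every six months.",
    "Warranty: the warranty covers manufacturing defects for two years.",
]
top = retrieve("How often should I replace the filter?", manual, top_k=1)
print(top[0])
```

Only the retrieved section would then be placed in the model's context alongside the question, rather than the whole manual.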

Hierarchical methods process documents at multiple levels of granularity: scan a whole document at a high level to build an index, then load specific sections when their detail is needed. For structured documents like code or legal contracts, this mirrors how humans actually navigate long documents.
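One way to picture the two-level process is the sketch below. It assumes the document is already split into titled sections, uses each section's first sentence as its outline entry, and picks sections by keyword overlap. In a real system the model itself might choose sections from the outline. All names and the sample contract are hypothetical.

```python
def build_index(document):
    """Map each section title to a short outline entry (its first sentence)."""
    return {
        title: body.split(".")[0].strip() + "."
        for title, body in document.items()
    }


def select_sections(index, query, max_sections=1):
    """Pick the sections whose title and outline share the most words with the query."""
    query_words = set(query.lower().split())

    def score(item):
        title, outline = item
        words = set((title + " " + outline).lower().replace(".", " ").split())
        return len(query_words & words)

    ranked = sorted(index.items(), key=score, reverse=True)
    return [title for title, _ in ranked[:max_sections]]


def build_context(document, query, max_sections=1):
    """Combine the full outline with the full text of only the chosen sections."""
    index = build_index(document)
    outline = "\n".join(f"- {title}: {entry}" for title, entry in index.items())
    chosen = select_sections(index, query, max_sections)
    details = "\n\n".join(f"{title}\n{document[title]}" for title in chosen)
    return outline + "\n\n" + details, chosen


contract = {
    "Definitions": "This section defines the terms used in the agreement. Services means the work described in the schedule.",
    "Payment": "Invoices are due within thirty days. Late payment accrues interest at one percent per month.",
    "Termination": "Either party may terminate with thirty days written notice. Termination does not affect accrued fees.",
}
context, chosen = build_context(contract, "what interest applies to late payment")
print(context)
```

The resulting context holds a one-line entry for every section but the full text of only the selected one, so the detail loaded scales with the question rather than with the document.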

Analogy

Navigating a long book with a limited desk. You cannot spread the whole book out at once. Smart strategies: read the table of contents first (indexing), keep a running summary of chapters already read (summarisation), use bookmarks to return to key passages (retrieval), and focus detailed reading only on the sections most relevant to your current question (selective attention).

Real-world example

A code assistant such as GitHub Copilot, when suggesting completions, cannot load an entire large codebase into a single context window. Tools of this kind typically combine several strategies: the most recently edited files are included verbatim, related files are represented in compressed or summarised form, and a retrieval system brings in the most semantically relevant function signatures and examples from across the codebase.
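Combining several sources under one limit amounts to packing a fixed token budget by priority. The sketch below is a generic illustration, not a description of how GitHub Copilot is implemented. The priorities, labels, and word-based token count are all assumptions.

```python
def count_tokens(text):
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def pack_context(candidates, budget):
    """Greedily add whole pieces in priority order, skipping any that do not fit.

    candidates: list of (priority, label, text); a lower priority number
    means the piece is more important.
    """
    packed, used = [], 0
    for _priority, label, text in sorted(candidates, key=lambda c: c[0]):
        cost = count_tokens(text)
        if used + cost <= budget:
            packed.append(label)
            used += cost
    return packed, used


candidates = [
    (1, "current_file", "def total(items): return sum(i.price for i in items)"),
    (2, "related_summary", "cart.py defines Item with a price field"),
    (3, "retrieved_signature", "def apply_discount(total, percent): ..."),
]
packed, used = pack_context(candidates, budget=13)
print(packed, used)
```

With this budget the summary of the related file is skipped because it does not fit, while the smaller retrieved signature still does. A different policy, such as truncating pieces instead of skipping them, would be an equally reasonable choice.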

Why it matters

Context window management is the practical bridge between the theoretical capability of a model (what it can do with the right input) and its real-world performance (what it can do given the constraints of actual documents and conversations). As models grow more capable, thoughtful context management is what allows those capabilities to be applied to real tasks.
