Logit Lens

An interpretability technique that reads the model's "intermediate answers" at each layer of processing - giving researchers a window into how the model's prediction evolves as information flows through the network.

Added May 18, 2026 · 3 min read

A transformer model processes text through many sequential layers - each layer transforms the representations it receives and passes them to the next. From the outside, you only see the final output. The logit lens is a technique for looking at what the model appears to be "thinking" at each intermediate layer, making the step-by-step evolution of the prediction visible.

The technique works by taking the intermediate representation (the hidden state) at any given layer and projecting it directly onto the model's output vocabulary, using the same unembedding matrix that the model applies after its final layer, typically preceded by the model's final layer normalization. This produces a probability distribution over all possible next tokens at that intermediate stage - as if the model had stopped processing at that layer and made its best prediction with what it had so far.
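The projection step can be sketched in a few lines of plain Python. This is a minimal toy illustration, not any particular library's implementation: the vocabulary, the unembedding matrix W_U, and the per-layer hidden states are all made-up values, and the names layer_norm, logit_lens, and HIDDEN_BY_LAYER are hypothetical. It also assumes a layer normalization without learned scale or bias is applied before the projection.

```python
import math

def layer_norm(hidden, eps=1e-5):
    """Rescale a hidden vector to zero mean and unit variance
    (learned scale and bias are omitted in this sketch)."""
    mean = sum(hidden) / len(hidden)
    var = sum((x - mean) ** 2 for x in hidden) / len(hidden)
    return [(x - mean) / math.sqrt(var + eps) for x in hidden]

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one intermediate hidden state onto the vocabulary.

    hidden:  list of d floats (the representation at some layer)
    unembed: one row of d floats per vocabulary token
    """
    normed = layer_norm(hidden)
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    return softmax(logits)

# Toy data (assumed for illustration): 3-token vocabulary, 4-dim hidden states.
VOCAB = ["the", "London", "Paris"]
W_U = [
    [1.0, -1.0, 0.0, 0.0],    # "the"
    [0.0, 0.0, 1.0, -1.0],    # "London"
    [1.0, 1.0, -1.0, -1.0],   # "Paris"
]
HIDDEN_BY_LAYER = [
    [2.0, -2.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, -2.0],
    [2.0, 1.0, -1.0, -2.0],
    [2.0, 2.0, -2.0, -2.0],
]

# Apply the same projection to every layer's hidden state.
lens_probs = [logit_lens(h, W_U) for h in HIDDEN_BY_LAYER]
top_tokens = [VOCAB[max(range(len(p)), key=p.__getitem__)] for p in lens_probs]

for layer, (token, probs) in enumerate(zip(top_tokens, lens_probs)):
    print(f"layer {layer}: top token = {token!r}, p = {max(probs):.2f}")
```

With these toy values the top token moves from "the" to "London" to "Paris" and then stays on "Paris" while its probability rises. In a real model the hidden states would come from a forward pass, and W_U would be the model's own unembedding matrix.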

What researchers find when they apply this is striking. In the early layers, the predictions are often nonsensical or near-random - the model still appears to be processing local, surface-level patterns in the text. In the middle layers, recognisable semantic content starts to emerge, and the model begins to draw on knowledge about the world. By the late layers, the prediction has often already converged to the final answer, with the remaining layers mostly refining confidence rather than changing the core prediction.
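One way to make "converged" concrete is to find the earliest layer after which the top prediction never changes, and to measure how far each layer's distribution is from the final one. The sketch below assumes the per-layer distributions have already been computed. The numbers are invented, the function names are hypothetical, and the KL direction used here (final distribution relative to each layer) is one reasonable choice rather than a fixed convention.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def top1(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def convergence_layer(layer_probs):
    """Earliest layer index from which the top-1 token matches the final
    layer's top-1 token at every later layer."""
    tops = [top1(p) for p in layer_probs]
    final = tops[-1]
    layer = len(tops) - 1
    while layer > 0 and tops[layer - 1] == final:
        layer -= 1
    return layer

# Made-up distributions over a 3-token vocabulary, one per layer.
layer_probs = [
    [0.40, 0.35, 0.25],
    [0.30, 0.45, 0.25],
    [0.10, 0.15, 0.75],
    [0.03, 0.02, 0.95],
]

converged_at = convergence_layer(layer_probs)
kl_to_final = [kl_divergence(layer_probs[-1], p) for p in layer_probs]

print("top-1 prediction is stable from layer", converged_at)
for layer, kl in enumerate(kl_to_final):
    print(f"layer {layer}: KL(final || layer) = {kl:.3f}")
```

In this toy run the top-1 token stabilises at layer 2, and the last layer mainly increases confidence, which is the pattern described above.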

This progression gives researchers insight into where different types of information processing may happen in the model. Syntactic processing appears to be concentrated near the beginning. Factual recall - retrieving a known fact from the model's stored knowledge - tends to happen in the earlier middle layers. Logical reasoning and combining multiple pieces of information tend to happen in later layers. These are tendencies rather than sharp boundaries, and they can vary with the model and the task.

The logit lens also reveals failure modes. When a model hallucinates, researchers can sometimes trace exactly where the prediction went wrong - whether the model retrieved incorrect information early, or started with the right information but a later layer overrode it with something wrong.
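The two failure patterns described above can be told apart by following the correct token through the per-layer distributions. The sketch below is a simplified heuristic under stated assumptions: the distributions are invented, trace_failure and its status labels are hypothetical names, and "retrieved" is approximated as "was the top-1 token at some layer", which is cruder than what a real analysis would use.

```python
def top1(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def trace_failure(layer_probs, correct_id):
    """Classify a prediction by where the correct token was lost.

    Returns a dict with:
      status: "correct", "never_retrieved", or "overridden"
      layer:  for "overridden", the first layer after the correct token
              was last on top; otherwise None
    """
    tops = [top1(p) for p in layer_probs]
    if tops[-1] == correct_id:
        return {"status": "correct", "layer": None}
    held = [i for i, t in enumerate(tops) if t == correct_id]
    if not held:
        # The correct token never led at any layer.
        return {"status": "never_retrieved", "layer": None}
    # The correct token led at some layer, then a later layer replaced it.
    return {"status": "overridden", "layer": held[-1] + 1}

# Made-up runs over a 3-token vocabulary; token 1 is the correct answer.
overridden_run = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],   # correct token is on top here...
    [0.2, 0.3, 0.5],   # ...and is displaced at this layer
]
never_run = [
    [0.5, 0.3, 0.2],
    [0.4, 0.35, 0.25],
    [0.2, 0.3, 0.5],
]

print(trace_failure(overridden_run, correct_id=1))
print(trace_failure(never_run, correct_id=1))
```

A result of "overridden" points to the layer where the displacement first shows up, which is a starting point for closer inspection rather than a full explanation of the error.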

Analogy

Think of reading someone's draft document at multiple stages of editing. After the first pass, the argument is rough and the conclusion unclear. After the third pass, the structure is solid but some details are still off. By the eighth pass, it is nearly final. The logit lens reads the model's "draft" at each layer - watching the prediction sharpen from confusion toward confidence.

Real-world example

Researchers studying factual recall in language models used the logit lens to show that when models answer factual questions, the correct answer is often already present in the model's intermediate representations many layers before the final output. This suggests factual retrieval is a relatively early, concentrated process - and may help explain why models are sometimes overconfident in wrong answers: a wrong answer can be "committed to" early in the network.

Why it matters

The logit lens is one of a handful of tools that make AI interpretability research empirically tractable rather than purely theoretical. It turns the black box of a deep neural network into something you can actually inspect layer by layer. As AI systems become more powerful and are deployed in higher-stakes applications, tools like this become essential for building justified trust in their outputs.

Related concepts

Latent Space

The internal numerical world where an AI model represents meaning - a high-dimensional space where similar concepts cluster together and mathematical operations on numbers produce meaningful semantic results.

Transformer

The AI architecture that powers virtually every major language model today - the underlying design that makes GPT, Claude, Gemini, and most other modern AI systems work.
