Decoder-Only Architecture

The design used by GPT, Claude, and most modern AI assistants - a model that generates text by predicting each next word based only on everything that came before it.

Added May 18, 2026 · 2 min read

When you think of a modern AI language model - ChatGPT, Claude, Gemini, LLaMA - you are thinking of a decoder-only model. This is the dominant architectural choice for large language models in 2025, and understanding why it won out over alternatives tells you a lot about how these systems work.

A decoder-only model is trained on a single, elegant objective: given all the text so far, predict the next token. It reads left to right, and at each position it can see only the tokens that came before it, never the tokens that come after. This restriction is called causal masking, and generating text this way is called autoregressive generation, because each token the model produces is fed back in as input for the next prediction. You start with a prompt, generate one token, append it, generate the next token from the expanded sequence, and repeat until the model emits a stop token or reaches a length limit.
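The "only see what came before" rule can be sketched as a mask over attention weights. This is a minimal illustration in plain Python, not how any particular model implements it: the function names `causal_mask` and `masked_softmax` are hypothetical, and the raw scores are set to zero so that the mask alone determines the result.

```python
import math

def causal_mask(n):
    """Return an n x n grid where [i][j] is True if position i may look at position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out (future) positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Future positions contribute nothing; the current position is always allowed.
        exps = [math.exp(s) if allowed else 0.0
                for s, allowed in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))
# Assumption: all raw scores are equal, so each token spreads its attention
# evenly over itself and the tokens before it.
scores = [[0.0] * len(tokens) for _ in tokens]
attention = masked_softmax(scores, mask)

for token, row in zip(tokens, attention):
    print(f"{token:>5}: {[round(w, 2) for w in row]}")
```

Each printed row has zeros to the right of the diagonal: "The" attends only to itself, while "down" can attend to all four tokens.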

The beauty of this approach is that the training task - predict the next token - can be applied to virtually any text ever written without any labelling. Every book, article, website, and conversation becomes a training example, because the text itself supplies the correct answer at every position. The model picks up grammar, reasoning patterns, world knowledge, and conversational patterns largely by practising this prediction task at enormous scale.
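To see why no labelling is needed, here is a small sketch of how raw text turns into training examples. It assumes whitespace splitting as a stand-in for a real tokenizer, and `next_token_examples` is a hypothetical helper name used only for illustration.

```python
def next_token_examples(text):
    """Turn raw text into (context, target) pairs for next-token prediction."""
    # Assumption: splitting on whitespace stands in for a real subword tokenizer.
    tokens = text.split()
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[:i]  # everything seen so far
        target = tokens[i]    # the token the model is asked to predict
        examples.append((context, target))
    return examples

examples = next_token_examples("the cat sat on the mat")
for context, target in examples:
    print(f"{' '.join(context):<20} -> {target}")
```

A six-token sentence yields five examples, and the "label" for each one is simply the next token already present in the text.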

What makes decoder-only models surprisingly capable is that predicting text well requires the model to capture a great deal about language and the world. A model that has not captured cause and effect will predict the next step of a causal chain poorly. A model that has not captured grammar will produce ungrammatical continuations. So next-token prediction as a training objective ends up teaching far more than surface pattern matching.

The decoder-only design also keeps generation simple at inference time: the same model is applied repeatedly to the growing sequence of tokens. This contrasts with encoder-decoder models, which first run a separate encoder over the input and then have the decoder attend to the encoder's output. For open-ended generation - the dominant use case of AI assistants - this simplicity is a significant advantage.
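The generation loop itself is short. The sketch below uses a hypothetical toy "model" (`TOY_MODEL`, a hand-written lookup table) and greedy decoding; as a simplification it looks only at the last token, whereas a real decoder-only model conditions on the entire sequence.

```python
# Hypothetical toy "model": maps the last token to scores for the next token.
# A real decoder-only model would condition on the whole sequence, not just
# the final token.
TOY_MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}

def predict_next(tokens):
    """Return the highest-scoring next token (greedy decoding)."""
    scores = TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)

def generate(prompt, max_new_tokens=10):
    """Repeatedly apply the same model to the growing token sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":  # stop token ends generation early
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

print(" ".join(generate(["the"], max_new_tokens=6)))
# prints "the cat sat on the cat sat"
```

The same `predict_next` call handles every step; the only thing that changes between iterations is the length of `tokens`. The repetitive output is an artifact of the toy lookup table and greedy decoding, not a property of the loop.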

Analogy

Writing a story by extending it one sentence at a time, where each sentence is shaped entirely by what came before it. The writer never jumps ahead to peek at where the story is going - they build forward from what already exists. A decoder-only model does exactly this, but at the level of individual tokens, repeated billions of times during training until it becomes extraordinarily good at it.

Real-world example

When you type a message to Claude or ChatGPT, the response appears a few words at a time because that is genuinely how it is generated. The model does not compose the entire answer first and then stream it out; it computes one token at a time, each determined by everything that came before, including your prompt and all of its own previous output.

Why it matters

Decoder-only models dominate modern AI because they scale exceptionally well: adding parameters and training data has so far produced reliably better models, with no clear ceiling yet in sight. This scaling property, combined with the simplicity of the training objective, is why GPT, Claude, and the LLaMA family all use this design.

Related concepts

Context Window

The maximum amount of text an AI can read and think about at once - everything you send it, plus the conversation history, has to fit within this limit.

Encoder-Decoder

A two-part neural network design where one half reads and compresses your input, and the other half uses that compressed understanding to generate a new output.

Token

The basic unit of text that AI models actually process - roughly a word or part of a word, and also the unit used to measure cost and limits when using AI.

Transformer

The AI architecture that powers virtually every major language model today - the underlying design that makes GPT, Claude, Gemini, and most other modern AI systems work.
