Concept
Decoder-Only Architecture
The design used by GPT, Claude, and most modern AI assistants - a model that generates text by predicting each next word based only on everything that came before it.
Added May 18, 2026
When you think of a modern AI language model - ChatGPT, Claude, Gemini, LLaMA - you are thinking of a decoder-only model. This is the dominant architectural choice for large language models in 2025, and understanding why it won out over alternatives tells you a lot about how these systems work.
A decoder-only model is trained on a single, elegant objective: given all the text so far, predict the next token. It reads left to right, and at each position it can draw only on the tokens up to that point - never the tokens that come after. This is called causal or autoregressive generation: causal because information flows only from earlier positions to later ones, and autoregressive because each generated token is fed back in as input for the next prediction. You start with a prompt, generate one token, append it, generate the next token from the extended sequence, and repeat until the model emits a stop token or reaches a length limit.
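The left-to-right visibility rule described above is usually enforced with a triangular mask over positions. The following is a minimal sketch in plain Python, not any particular library's implementation; the names causal_mask and visible_context are hypothetical, and it assumes the common convention that a position may see itself and everything earlier in order to predict the token that follows.

```python
def causal_mask(n):
    """Return an n x n grid where mask[i][j] is True if position i may look at position j."""
    # Position i may see positions 0..i (itself and everything earlier), never later ones.
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_context(tokens, mask, i):
    """List the tokens that position i is allowed to use under the mask."""
    return [tok for j, tok in enumerate(tokens) if mask[i][j]]


tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

for i, tok in enumerate(tokens):
    print(tok, "sees", visible_context(tokens, mask, i))
```

Each row of the mask has one more True entry than the row above it, which is the lower-triangular shape that gives the causal mask its usual picture.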
The beauty of this approach is that the training task - predict the next word - can be applied to virtually any text ever written without any labelling. Every book, article, website, and conversation becomes a training example. The model learns grammar, reasoning, world knowledge, and conversational patterns purely by practising this prediction task at enormous scale.
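To make the "no labelling" point concrete, here is a small sketch of how one piece of raw text turns into several (context, next token) training examples. It is illustrative only: the function name next_token_examples is hypothetical, and splitting on whitespace is a simplifying assumption, since real models use subword tokenizers.

```python
def next_token_examples(text):
    """Turn raw text into (context, target) pairs for next-token prediction."""
    # Assumption: whitespace tokens stand in for a real subword tokenizer.
    tokens = text.split()
    # Every prefix of the text is a context; the token right after it is the target.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = next_token_examples("the cat sat on the mat")
for context, target in examples:
    print(context, "->", target)
```

A six-token sentence yields five examples, and the targets come from the text itself, so no human annotation is needed.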
What makes decoder-only models surprisingly capable is that predicting text well requires capturing a great deal about language and the world. A model that has not picked up on cause and effect will struggle to predict what comes next in a causal chain. A model that has not picked up on grammar will struggle to predict grammatically correct continuations. So prediction as a training objective ends up teaching far more than surface-level pattern matching.
The decoder-only design also makes generation simple and efficient at inference time: you just keep applying the same model to the growing sequence of tokens. This contrasts with encoder-decoder models, which require a separate encoding pass first. For open-ended generation - the dominant use case of AI assistants - this simplicity is a significant advantage.
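The "keep applying the same model to the growing sequence" loop can be sketched in a few lines. This is a toy illustration under stated assumptions: TOY_NEXT and toy_next_token are hypothetical stand-ins for a trained model (a real model conditions on the whole sequence and returns a probability distribution over tokens, not a fixed lookup), and "<eos>" is an assumed end-of-sequence marker.

```python
# Hypothetical stand-in for a trained model: maps the last token to a next token.
TOY_NEXT = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}


def toy_next_token(tokens):
    """Pretend forward pass: receives the full sequence so far, returns one next token."""
    return TOY_NEXT.get(tokens[-1], "<eos>")


def generate(prompt_tokens, next_token_fn, max_new_tokens=10, eos="<eos>"):
    """Autoregressive loop: predict one token, append it, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # same model, applied to the growing sequence
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens


result = generate(["the"], toy_next_token)
print(result)
```

The loop itself never changes; only the sequence passed to the model grows, which is the simplicity the paragraph above refers to.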
Analogy
Writing a story by extending it one sentence at a time, where each sentence is shaped entirely by what came before it. The writer never jumps ahead to peek at where the story is going - they build forward from what already exists. A decoder-only model does the same thing at the level of individual tokens, and it practises on an enormous volume of text during training until it becomes extraordinarily good at it.
Real-world example
When you type a message to Claude or ChatGPT, the response appears token by token (roughly word by word) - because that is genuinely how it is generated. The model is not streaming out an answer it composed in advance; it is computing one token at a time, each determined by everything that came before, including your prompt and all of its own previous output.
Why it matters
Decoder-only models dominate modern AI because they scale exceptionally well: more parameters and more training data have so far tended to produce better models, with no clear ceiling yet in sight. This scaling property, combined with the simplicity of the training objective, is why GPT, Claude, and the LLaMA family all use this design.