Concept
Transformer
The AI architecture that powers virtually every major language model today - the underlying design that makes GPT, Claude, Gemini, and most other modern AI systems work.
In the world of AI, an "architecture" is the blueprint that determines how a model is structured - how information flows through it, how it processes input, how it produces output. The transformer architecture, introduced by researchers at Google in 2017, changed everything. Before it, AI language systems processed text one word at a time, left to right, like a person reading slowly. The transformer threw out that approach and replaced it with something fundamentally different.
The key innovation is called "attention." Instead of reading text sequentially, a transformer looks at every word in relation to every other word simultaneously. When processing the sentence "the bank by the river," the word "bank" is understood in relation to "river," which tells the model this is a riverbank, not a financial institution. This happens for every word, all at once, and it gives the model a much richer understanding of language than reading word by word.
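To make that concrete, here is a minimal sketch of the attention calculation in Python using NumPy. The word vectors are invented toy numbers, not a real model's learned values, and a real transformer would first pass them through learned projections; the sketch keeps only the core comparison step.

import numpy as np

def attention(Q, K, V):
    """Compare every word (rows of Q) against every other word (rows of K)
    in one shot -- there is no left-to-right loop anywhere."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # word-to-word similarity
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights                              # each output blends all words

# Made-up 4-number vectors for "the bank by the river".
words = ["the", "bank", "by", "the", "river"]
X = np.array([[0.1, 0.0, 0.2, 0.0],   # the
              [0.9, 0.8, 0.1, 0.0],   # bank
              [0.0, 0.1, 0.7, 0.1],   # by
              [0.1, 0.0, 0.2, 0.0],   # the
              [0.8, 0.9, 0.0, 0.2]])  # river

_, w = attention(X, X, X)
for word, weight in zip(words, np.round(w[1], 2)):   # row 1 = weights for "bank"
    print(word, weight)

On these toy vectors, "bank" puts roughly twice as much weight on "river" (and on itself) as on the other words, which is exactly the relationship described above.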
This parallel processing also turned out to be a perfect match for modern computing hardware: the graphics chips (GPUs) used to train AI models are built to do many calculations simultaneously. Training on this hardware became dramatically faster, which meant researchers could feed transformers vastly more data and build vastly larger models than had been practical before.
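The contrast with the older sequential approach is easy to see in code. Below is a toy sketch with arbitrary sizes and random numbers: the recurrent style must move through the text one position at a time, while the transformer-style comparison of all positions is a single large matrix operation, exactly the shape of work GPUs are built to accelerate.

import numpy as np

seq_len, d = 512, 64
X = np.random.randn(seq_len, d)   # one vector per word, 512 words
W = np.random.randn(d, d)

# Recurrent style: step t cannot begin until step t-1 has finished,
# so the 512 small updates below must run strictly one after another.
h = np.zeros(d)
for x in X:
    h = np.tanh(W @ h + x)

# Transformer style: comparing every position with every other position
# is one big matrix product, computed for all 512 words at once.
scores = X @ X.T / np.sqrt(d)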
What followed was a consistent finding: bigger transformer models, trained on more data, just kept getting better. This held true across a wide range of tasks - writing, summarising, translating, coding, answering questions. The architecture scaled in a way that previous designs had not, which is what kicked off the current era of large AI models.
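The regularity of that finding is worth a small illustration. The constants below are invented purely to show the shape of the trend researchers reported - loss falling along a smooth, predictable curve as parameter counts grow - and are not taken from any published measurement.

# Toy power-law curve (invented constants) showing the general pattern:
# steady, predictable improvement as models get bigger.
def toy_loss(n_params, scale=1e9, alpha=0.08):
    return (scale / n_params) ** alpha

for n in [10**6, 10**8, 10**10, 10**12]:
    print(f"{n:.0e} parameters -> loss {toy_loss(n):.2f}")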
GPT-4, Claude 3, Gemini, Llama - they are all transformer-based. The basic architecture has remained remarkably stable since 2017. The enormous advances since then have come from scaling it up, improving the training data, and layering additional techniques on top of it - not from replacing the core design.
Analogy
Think of a translator who, instead of converting a sentence word by word as they go, first reads the entire source sentence carefully to understand its full meaning, then produces the translation. The transformer's attention mechanism is that full-sentence understanding - the ability to see the whole picture before producing any output.
Real-world example
Every time you use ChatGPT, Claude, or Gemini, you are interacting with a transformer model. The 2017 paper that introduced the design - titled "Attention Is All You Need" - is one of the most cited research papers in AI history, because it laid the foundation for everything that followed in the current wave of AI.
Why it matters
Understanding the transformer architecture helps explain both what current AI is capable of and where it falls short. Its strengths and limitations are not arbitrary - they are direct consequences of the design. The major models that exist today are transformers, and for the foreseeable future, understanding AI will mean understanding this architecture.
Related concepts