LLM Architectures
The building blocks inside large language models - how they store knowledge, process text, and generate responses.
BERT
Google's landmark 2018 language model that introduced bidirectional pre-training - a model that reads text in both directions simultaneously and set new standards for understanding tasks.
Byte-Pair Encoding (BPE)
The algorithm most large language models use to split text into tokens - finding the most efficient vocabulary of word fragments that can represent any text without getting overwhelmed by rare words.
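To make the idea concrete, here is a minimal sketch of a single BPE training step: count adjacent symbol pairs, then merge the most frequent one. The toy corpus and the helper names (`most_frequent_pair`, `merge_pair`) are illustrative assumptions, not any particular tokenizer's API, and real tokenizers add details such as byte-level fallback and pre-tokenisation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word split into characters, with a count.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pair = most_frequent_pair(corpus)        # the pair ("u", "g") occurs 20 times
corpus_after = merge_pair(corpus, pair)  # "hug" is now ("h", "ug")
```

Repeating these two steps until the vocabulary reaches a chosen size gives the full list of merges; frequent fragments become single tokens while rare words stay split into smaller pieces.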
Context Window
The maximum amount of text an AI can read and think about at once - everything you send it, plus the conversation history, has to fit within this limit.
All concepts
D
Decoder-Only Architecture
The design used by GPT, Claude, and most modern AI assistants - a model that generates text by predicting each next token based only on everything that came before it.
DistilBERT
A compressed version of BERT that retains 97% of its language-understanding performance while being 40% smaller and 60% faster at inference - a landmark demonstration of knowledge distillation applied to large language models.
E
Embedding Dimension
The length of the numerical vector used to represent each token in a language model - a fundamental architectural choice that determines how much information the model can encode about any word or piece of text.
Embeddings
A way of turning words and sentences into lists of numbers, so that content with similar meanings ends up mathematically close together and can be found by meaning rather than exact wording.
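"Mathematically close" is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.95]

sim_close = cosine_similarity(cat, kitten)   # related meanings: high score
sim_far = cosine_similarity(cat, invoice)    # unrelated meanings: low score
```

Searching by meaning then amounts to embedding the query and returning the stored items with the highest similarity scores.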
Encoder-Decoder
A two-part neural network design where one half reads and compresses your input, and the other half uses that compressed understanding to generate a new output.
F
Flash Attention
A faster, memory-efficient way of computing exact attention in transformer models without storing the full attention matrix in GPU memory - a key engineering advance that helped make very long context windows practical.
Foundation Model
A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.
L
Latent Space
The internal numerical world where an AI model represents meaning - a high-dimensional space where similar concepts cluster together and mathematical operations on numbers produce meaningful semantic results.
Logit Lens
An interpretability technique that reads the model's "intermediate answers" at each layer of processing - giving researchers a window into how the model's prediction evolves as information flows through the network.
M
Mixture of Experts (MoE)
An architecture that routes each token through only a small fraction of the model's total parameters - enabling massive scale without proportionally massive compute costs.
Multi-Head Attention
A way of running self-attention multiple times in parallel with different learned perspectives, so the model can pick up on several types of relationships in the same sentence at once.
R
RAG (Retrieval-Augmented Generation)
A way of making AI smarter by letting it look things up before answering, instead of relying only on what it memorised during training.
RMSNorm (Root Mean Square Layer Normalization)
A simplified version of layer normalisation used in modern language models - a small architectural detail that improves training stability and slightly reduces computational cost.
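As a rough sketch of what "simplified" means here: RMSNorm divides each vector by its root mean square and applies a learned per-dimension gain, skipping the mean subtraction and bias of standard layer normalisation. The function name, the `eps` value, and the example inputs are assumptions chosen for illustration.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x so its root mean square is about 1, then apply a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * scale for v, g in zip(x, gain)]

x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x, gain=[1.0, 1.0, 1.0, 1.0])  # gain of 1.0 = no learned rescaling

# The output keeps the direction of x but has unit root mean square.
rms_of_y = math.sqrt(sum(v * v for v in y) / len(y))
```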
Rotary Position Embedding (RoPE)
The position encoding method used by most modern language models - it rotates each query and key vector by an angle proportional to its position, so attention scores depend on how far apart two tokens are rather than on their absolute positions.
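The key property of RoPE can be checked numerically: after rotation, the dot product between a query and a key depends only on the distance between their positions. This is a minimal sketch assuming the interleaved pairing of dimensions and a base of 10000; some implementations pair dimensions differently, and the vectors below are arbitrary.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary query and key vectors, chosen for illustration.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]

# Both pairs of positions are 3 apart, so the scores match.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_far_along = dot(rope(q, 105), rope(k, 102))
```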
S
Self-Attention
The mechanism that lets every word in a sentence look at every other word simultaneously - the core innovation that makes transformer models understand context so well.
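A stripped-down sketch of scaled dot-product self-attention follows. It assumes the token vectors are used directly as queries, keys, and values; a real transformer first applies learned projection matrices, and decoder-only models also mask out later positions. The function names and toy inputs are illustrative.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Each output is a weighted average of all value vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how much one token draws on every token in the sequence, including itself.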
Sinusoidal Positional Encoding
The original method for telling transformer models where each token sits in a sequence - using mathematical sine and cosine waves to generate a unique position signal for every token.
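The sine and cosine pattern can be sketched in a few lines. This assumes the formulation from the original transformer design, with a base of 10000 and an even model dimension; the function name is made up for the example.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Position signal: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a progressively longer wavelength.
        angle = position / base ** (i / d_model)
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

pe_first = sinusoidal_encoding(0, 4)   # [0.0, 1.0, 0.0, 1.0]
pe_later = sinusoidal_encoding(7, 4)   # a different signal for position 7
```

The resulting vector is added to the token's embedding, so the same word at different positions enters the model with a slightly different representation.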
SwiGLU
The activation function used inside the feed-forward layers of most modern language models - a small but significant architectural detail that measurably improves model quality.
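As a hedged sketch, a SwiGLU feed-forward layer multiplies a SiLU-activated "gate" projection by a second linear projection, then projects back down. The weight names (`w_gate`, `w_up`, `w_down`) and the tiny matrices are assumptions for illustration; bias terms are omitted, as is common in this design.

```python
import math

def silu(x):
    """SiLU (also called Swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward layer: (SiLU(W_gate x) * (W_up x)) projected by W_down."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)

# Hypothetical weights: model dimension 2, hidden dimension 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w_up = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
w_down = [[0.3, -0.5, 0.2], [0.1, 0.4, -0.6]]

y = swiglu_ffn([1.0, 2.0], w_gate, w_up, w_down)
```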
T
Token
The basic unit of text that AI models actually process - roughly a word or part of a word, and also the unit used to measure cost and limits when using AI.
Toolformer
A language model that teaches itself to use external tools - an early demonstration that language models could learn when and how to call APIs without explicit human labelling of when tool use helps.
Transformer
The AI architecture that powers virtually every major language model today - the underlying design that makes GPT, Claude, Gemini, and most other modern AI systems work.