Concept
Embedding Dimension
The length of the numerical vector used to represent each token in a language model - a fundamental architectural choice that determines how much information the model can encode about any word or piece of text.
Added May 18, 2026
Every token in a language model is represented as a vector - a list of numbers. The embedding dimension is how long that list is. A token in a model with embedding dimension 512 is represented by 512 numbers. A token in a model with embedding dimension 4096 is represented by 4096 numbers.
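As a rough illustration of the lookup described above, the sketch below builds a toy embedding table and fetches one vector per token. The vocabulary size, the token ids, and the random values are all assumptions made for the example; in a trained model the table entries are learned.

```python
import random

# Hypothetical toy sizes, chosen only for illustration.
VOCAB_SIZE = 1000   # number of distinct tokens
D_MODEL = 512       # embedding dimension: the length of every token vector

random.seed(0)
# One row per token id. Real models learn these values during training;
# random numbers stand in for them here.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Return one D_MODEL-long vector per token id (a plain row lookup)."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([17, 256, 3])          # three arbitrary token ids
print(len(vectors), len(vectors[0]))   # 3 512
```

Changing D_MODEL to 4096 would leave the lookup unchanged and only make each row longer.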
Why does the length of this vector matter? Because each number in the vector is a degree of freedom - a dimension along which meaning can vary. A one-dimensional representation can encode something like "how positive or negative" a word is. A two-dimensional representation can independently encode two different aspects. A 4096-dimensional representation can simultaneously encode thousands of independent aspects of meaning, allowing extremely nuanced distinctions to be captured.
The embedding dimension is not just about the initial token representation - it flows through the entire model. The residual stream (the running representation of each token that is updated by each layer) has the same dimension. Attention heads split this into smaller subspaces. Feed-forward networks expand and then compress it. Every computation in the model is shaped by this fundamental dimension.
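To make these shapes concrete, here is a minimal sketch that derives the main widths inside one layer from the embedding dimension. It assumes the heads split the dimension evenly and that the feed-forward network expands by a factor of 4; both are common conventions rather than requirements, and the function name is hypothetical.

```python
def layer_shapes(d_model, n_heads, ffn_multiplier=4):
    """Widths of the main activations in one transformer layer.

    Assumes an even split across heads and a feed-forward expansion of
    ffn_multiplier * d_model (4x is a common choice, not a rule).
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return {
        "residual_stream": d_model,            # same width at every layer
        "per_head": d_model // n_heads,        # subspace each head works in
        "ffn_hidden": ffn_multiplier * d_model,  # expanded width
        "ffn_output": d_model,                 # compressed back to d_model
    }

# GPT-2 small configuration: embedding dimension 768, 12 attention heads.
shapes = layer_shapes(d_model=768, n_heads=12)
print(shapes)
# {'residual_stream': 768, 'per_head': 64, 'ffn_hidden': 3072, 'ffn_output': 768}
```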
Larger embedding dimensions give the model more representational capacity - more ability to encode nuanced distinctions between concepts - but also make the model more expensive to store, train, and run. The parameter count of the attention and feed-forward layers scales roughly with the square of the embedding dimension, while the embedding tables scale linearly with it. The choice of embedding dimension is therefore a key trade-off between quality and cost.
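The scaling can be checked with simple arithmetic. The sketch below counts weights under simplifying assumptions: attention is modelled as four square projection matrices (query, key, value, output) with biases ignored, and the vocabulary size is a hypothetical round number.

```python
def attention_params_per_layer(d_model):
    """Four d_model x d_model projections (Q, K, V, output); biases ignored."""
    return 4 * d_model * d_model

def embedding_table_params(vocab_size, d_model):
    """One d_model-long vector per token in the vocabulary."""
    return vocab_size * d_model

VOCAB_SIZE = 50_000  # hypothetical vocabulary size

for d_model in (1024, 2048):
    attn = attention_params_per_layer(d_model)
    table = embedding_table_params(VOCAB_SIZE, d_model)
    print(f"d_model={d_model}: attention/layer={attn:,}  embedding table={table:,}")
# d_model=1024: attention/layer=4,194,304  embedding table=51,200,000
# d_model=2048: attention/layer=16,777,216  embedding table=102,400,000
```

Doubling the embedding dimension quadruples the attention weights in each layer but only doubles the embedding table.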
Common embedding dimensions: smaller models like GPT-2 small use 768; medium models often use 2048 or 3072; large frontier models use 4096, 8192, or higher. The exact choice interacts with other hyperparameters like number of layers and number of attention heads to determine the overall model capacity.
Analogy
Think of the number of ways to describe a wine. A basic description might have two dimensions - sweetness and bitterness. A sommelier's description might have fifty dimensions: acidity, tannin, body, finish, oak, fruit type, regional characteristics, and more. More dimensions mean more ability to make fine-grained distinctions. Embedding dimension is the number of descriptive dimensions the model can use for every token.
Real-world example
GPT-3's largest version (175 billion parameters) uses an embedding dimension of 12,288. The smallest GPT-2 model uses 768, so every token vector in GPT-3 is 16 times longer. The difference is not just scale - the higher-dimensional representation allows the model to capture more nuanced information about each token's meaning in context, which contributes to the qualitative improvement in capability.
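A back-of-the-envelope estimate shows how far the embedding dimension goes toward explaining these sizes. The sketch below uses a common rule of thumb of about 12 x d_model^2 weights per layer (4 x d_model^2 for attention plus 8 x d_model^2 for a feed-forward network with 4x expansion). The layer counts (12 and 96) come from the published model configurations rather than from the text above, and the estimate deliberately leaves out embedding tables, biases, and normalization weights.

```python
def approx_layer_stack_params(n_layers, d_model):
    """Rule-of-thumb weight count for the layer stack only.

    Per layer: 4*d^2 (attention) + 8*d^2 (feed-forward, 4x expansion).
    Embedding tables, biases, and layer norms are not counted.
    """
    return 12 * n_layers * d_model ** 2

gpt2_small = approx_layer_stack_params(n_layers=12, d_model=768)
gpt3_175b = approx_layer_stack_params(n_layers=96, d_model=12288)

print(12288 // 768)         # 16
print(f"{gpt2_small:,}")    # 84,934,656
print(f"{gpt3_175b:,}")     # 173,946,175,488
```

For GPT-3 the estimate comes to roughly 174 billion, close to the quoted 175 billion. For GPT-2 small the layer stack accounts for about 85 million weights, with most of the remainder of that model sitting in the embedding table, which this estimate does not count.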
Why it matters
Embedding dimension is one of those architectural hyperparameters that shape everything else in a model. It is fixed at design time and cannot easily be changed after training. Understanding it helps explain why models of different sizes have qualitatively different capabilities - it is not just about more parameters, but about richer representations at each step.
Related concepts