latentbrief

Concept

Byte-Pair Encoding (BPE)

The algorithm most large language models use to split text into tokens - finding the most efficient vocabulary of word fragments that can represent any text without getting overwhelmed by rare words.

Added May 18, 2026

Language models do not work with raw characters or whole words - they work with tokens, which are the basic units of text the model processes. The question of how to split text into tokens has a large effect on model efficiency and capability, and Byte-Pair Encoding is the method that won out for most modern systems.

The core problem BPE solves is vocabulary management. If you use whole words as tokens, you need an enormous vocabulary to cover all words in all languages - and you still cannot handle new words, technical jargon, or uncommon proper nouns. If you use individual characters, your vocabulary is tiny but your sequences become extremely long (every word is many tokens) and the model must learn to compose meaning from individual characters, which is slow and wasteful.
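The trade-off between the two extremes can be made concrete with a few lines of Python. This is only an illustrative sketch: the sample sentence and the helper names `word_level` and `char_level` are made up for this example, and real tokenisers handle punctuation and whitespace far more carefully.

```python
def word_level(text):
    """Split on whitespace: short sequences, but a vocabulary entry per word."""
    return text.split()

def char_level(text):
    """Split into characters: tiny vocabulary, but long sequences."""
    return list(text)

text = "the transformer transforms the text"  # made-up sample sentence

word_tokens = word_level(text)
char_tokens = char_level(text)
word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

print(len(word_tokens), "word tokens,", len(word_vocab), "vocabulary entries")
print(len(char_tokens), "character tokens,", len(char_vocab), "vocabulary entries")

# A word that never appeared in the sample text.
unseen = "reformat"
print("word vocabulary covers it:", unseen in word_vocab)
print("character vocabulary covers it:", set(unseen) <= char_vocab)
```

On this tiny sample the word-level sequence is much shorter, but the word vocabulary cannot represent the unseen word, while the character vocabulary can.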

BPE finds a middle path by building a vocabulary of common subword units. It starts with a vocabulary of individual characters (or raw bytes, in the byte-level variants many current models use), then repeatedly finds the most frequent pair of adjacent symbols in the training text and merges it into a single new symbol. "th" is very common in English, so it gets merged early. "ing" is also very common, so it becomes a token after two merges, for example "i" + "n" followed by "in" + "g". "transformer" might become a single token if it appears often enough in the training data, while a rare word like "phenolphthalein" might be split into pieces such as "phen", "olph", "thale", "in". Merging continues until the vocabulary reaches its target size, typically somewhere between about 32,000 and a few hundred thousand tokens for modern models.
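The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation any particular tokeniser uses: the function names `train_bpe` and `merge_pair` and the word list are invented here, words are treated independently (so no merge crosses a word boundary), and ties between equally frequent pairs are broken alphabetically just to keep the result deterministic.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one fused symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Start from single characters: each distinct word is a tuple of symbols with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often its word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

# Made-up training words: "the" x5, "then" x3, "this" x2, "sing" x1.
words = ["the"] * 5 + ["then"] * 3 + ["this"] * 2 + ["sing"]
print(train_bpe(words, 3))
# [('t', 'h'), ('th', 'e'), ('the', 'n')]
```

Here "t" + "h" is merged first because it is the most frequent pair, and the new symbol "th" then takes part in later merges, which is how longer fragments are built up from shorter ones. A real trainer would stop at a target vocabulary size rather than a fixed number of merges, but the two are equivalent up to the size of the starting alphabet.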

The result is a vocabulary where common words and word fragments are single tokens, rare words are covered by combining a small number of subword tokens, and, provided the base vocabulary covers every character or byte, no text is ever impossible to represent. Code, mathematical notation, and text in any language can all be represented, though with varying efficiency: English usually tokenises more compactly than morphologically complex languages or languages that are scarce in the tokeniser's training data.
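One common way to guarantee that nothing is unrepresentable is to run BPE over UTF-8 bytes rather than characters, so the base vocabulary is just the 256 possible byte values. The sketch below assumes that byte-level setup and shows only the starting point before any merges are applied; the helper names and sample strings are made up for illustration.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of integers in the range 0..255."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols):
    """Invert `to_byte_symbols`: turn a list of byte values back into text."""
    return bytes(symbols).decode("utf-8")

# Made-up samples: English, Japanese, a line of code, and an emoji.
samples = ["hello", "こんにちは", "x = y + 1", "🙂"]

# Number of base symbols each sample needs before any merges are applied.
byte_counts = {s: len(to_byte_symbols(s)) for s in samples}

# Every sample is covered by the same 256-entry base vocabulary and round-trips exactly.
for s in samples:
    symbols = to_byte_symbols(s)
    assert all(0 <= b < 256 for b in symbols)
    assert from_byte_symbols(symbols) == s
```

Five English letters start out as five symbols, while five Japanese characters start out as fifteen, so text outside the ASCII range needs more merges to reach the same compactness. That is one source of the varying efficiency, alongside how well each language is represented in the data the merges were learned from.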

BPE directly affects model quality in subtle ways. If a word is always represented as multiple tokens, the model must learn its meaning from those parts, which takes more capacity than if it were a single token. This is one reason models sometimes handle common English words better than equivalent rare words - they see the rare words broken into parts rather than as unified units.

Analogy

Think of a compression algorithm for text that builds a dictionary of common fragments. First it looks at which pairs of letters appear next to each other most often and fuses them into single symbols. Then it looks at which pairs of those new symbols appear together most often and fuses those. After many rounds, it has a compact dictionary of common text fragments that covers the language efficiently.

Real-world example

The string "GPT" is a single token in many tokenisers because it appears frequently enough in their training data to earn its own vocabulary entry. The word "transformative" might be split into "transform" and "ative" because while "transform" is common enough to earn a token, the full word is not. Whether a specific word is one token or several depends mainly on how often it and its fragments appear in the corpus the tokeniser was trained on, together with the target vocabulary size and any rules for pre-splitting text.
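How a learned merge list turns a word into tokens can be sketched as follows. The merge list here is hand-written and hypothetical, chosen so that "transform" and "ative" can each be assembled but never joined to each other; it is not taken from any real tokeniser, and `bpe_segment` is a name invented for this example.

```python
def bpe_segment(word, merges):
    """Split `word` into subword tokens by applying `merges` in priority order."""
    # Earlier merges have higher priority (a lower rank number).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair that was learned earliest.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order the merges were (pretend) learned.
merges = [
    ("t", "r"), ("a", "n"), ("tr", "an"), ("tran", "s"),
    ("o", "r"), ("f", "or"), ("for", "m"), ("trans", "form"),
    ("i", "v"), ("iv", "e"), ("a", "t"), ("at", "ive"),
]

print(bpe_segment("transform", merges))
print(bpe_segment("transformative", merges))
```

With this list, "transform" comes out as one token, while "transformative" stops at "transform" and "ative" because no merge joins those two pieces.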

Why it matters

Tokenisation choices have real effects on what models are good at. Tokenisers built for code-focused models often include common code patterns, such as runs of indentation, as single tokens. Models designed for multilingual use have vocabularies that allocate more entries to non-English subword units. The "grain" of the vocabulary shapes what is easy and what is hard for the model to learn.
