BERT

Google's landmark 2018 language model that introduced bidirectional pre-training - a model that reads text in both directions simultaneously and set new standards for understanding tasks.

Added May 18, 2026 · 2 min read

BERT's influence on NLP was profound and lasting. It demonstrated that scale of pre-training data and bidirectional context could produce representations good enough to transfer well to almost any language understanding task. This insight - that very large pre-trained models are the best starting point for almost any NLP problem - is now fundamental to the field.

Before BERT, most language models processed text in one direction - left to right. They were good at generating text (predicting what comes next) but had a fundamental limitation for understanding tasks: when processing a word, they could only use context from words that came before it, not the words that come after. For many understanding tasks, this is a serious handicap.

BERT (Bidirectional Encoder Representations from Transformers) addressed this by pre-training with a different objective: masked language modelling. Instead of predicting the next word in a sequence, BERT was trained to predict randomly masked words anywhere in a sentence, using context from both before and after the masked position. To predict what word should go in the blank in "The cat [MASK] on the mat," BERT could look at both "The cat" and "on the mat" simultaneously - giving it full context from both directions.

This bidirectional pre-training produced dramatically better representations for language understanding tasks. When fine-tuned on tasks like sentiment analysis, question answering, named entity recognition, or textual inference, BERT models set new records across essentially every benchmark. The representations it learned were so rich that a small amount of task-specific fine-tuning could produce state-of-the-art results with relatively little labelled data.

BERT established the modern paradigm of pre-train on large unlabelled text, fine-tune on small labelled datasets - a paradigm that still dominates NLP for understanding tasks. The family of BERT-style models is enormous: RoBERTa (a robustly trained version), DistilBERT (a compressed version), domain-specific variants like BioBERT (medicine) and FinBERT (finance), and many others.

The trade-off is that BERT is an encoder model - it builds representations but does not generate text. It is excellent for classification, extraction, and understanding tasks, but not for generating responses. Decoder-only models like GPT took over for generation tasks, while BERT-style models remained dominant for classification and retrieval.

Analogy

Reading a document and understanding each paragraph with full access to what came before and after it - rather than being forced to understand each paragraph only based on what you have read so far. BERT's bidirectionality gives it the same advantage a careful re-reader has over someone who can only read forward: full context for every decision.

Real-world example

Google Search incorporated BERT in 2019 to better understand query intent. When someone searches for "can you get medicine for someone pharmacy," BERT helps Google understand this is about picking up medicine for someone else - the "for someone" is crucial context that earlier, unidirectional models sometimes missed. The launch was one of the largest changes to Google's search algorithm in years.

Why it matters

BERT's influence on NLP was profound and lasting. It demonstrated that scale of pre-training data and bidirectional context could produce representations good enough to transfer well to almost any language understanding task. This insight - that very large pre-trained models are the best starting point for almost any NLP problem - is now fundamental to the field.

In the news

California Gas Stations Accused of Using AI to Raise Prices
Forbes · 1w ago

Related concepts

Encoder-Decoder

A two-part neural network design where one half reads and compresses your input, and the other half uses that compressed understanding to generate a new output.

Foundation Model

A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.

Transformer

The AI architecture that powers virtually every major language model today - the underlying design that makes GPT, Claude, Gemini, and most other modern AI systems work.

← Back to concepts