Denoising Objective

A pre-training approach where the model is trained to reconstruct original text from a corrupted version - teaching it to understand both what text means and how it is structured.

Added May 18, 2026 · 2 min read

The denoising objective matters because the choice of pre-training objective fundamentally shapes what capabilities a model develops. Models trained with denoising objectives develop strong bidirectional understanding of text and generalise well to reconstruction tasks. This makes them particularly suitable for tasks like summarisation, translation, and information extraction - the workhorse tasks of practical NLP.

Pre-training language models requires a training objective - a task on which the model can be trained using unlabelled text. The most common objectives are either predicting the next token (used by GPT-style decoder models) or predicting masked tokens (used by BERT-style encoder models). The denoising objective is a generalisation of these: corrupt the input text in some way, then train the model to restore it.

Corruption can take many forms. Masking - replacing some tokens with a special [MASK] token - is the form used in BERT. But the denoising framing supports richer corruption strategies: randomly deleting tokens, shuffling the order of spans, replacing spans with single tokens, rotating sentences. Each corruption type teaches the model something slightly different about language structure. Deletion teaches the model to infer missing content from context. Span shuffling teaches the model about the ordering and coherence of text. Replacement with single tokens ("noising") teaches reconstruction of structure from compressed signals.

T5 (Text-to-Text Transfer Transformer), Google''s highly influential 2019 model, used a span-corruption variant of the denoising objective: randomly sampled spans of tokens were replaced by single sentinel tokens, and the model was trained to predict the original spans in order. This approach produced an encoder-decoder model capable of performing almost any NLP task by framing everything as text-to-text transformation.

The denoising objective has an important advantage over purely generative objectives: the model must understand the full input context to reconstruct corrupted portions effectively, which produces richer bidirectional representations than left-to-right generation. At the same time, it avoids the limitation of masked language modelling (which only trains on the masked positions) by training the model on all tokens in the output.

Denoising pre-training also underpins diffusion models in other domains - particularly image generation, where models are trained to denoise progressively corrupted images. The conceptual link between language denoising and image diffusion has produced interesting cross-domain techniques and unified frameworks.

Analogy

Learning to read by practising on texts where some words have been obscured or rearranged, and your task is to figure out what the original said. This exercise forces you to use context, grammar, and meaning simultaneously - much more actively than simply reading clean text. The denoising objective forces language models into the same active reconstruction process.

Real-world example

T5's span-corruption pre-training produced a model that could perform text summarisation, translation, question answering, and sentiment classification all with the same architecture and the same fine-tuning approach - just by framing each task as a text-to-text problem. Its strong pre-training enabled transfer to a remarkable range of tasks, demonstrating how the training objective shapes what kinds of transfer are possible.

Why it matters

The denoising objective matters because the choice of pre-training objective fundamentally shapes what capabilities a model develops. Models trained with denoising objectives develop strong bidirectional understanding of text and generalise well to reconstruction tasks. This makes them particularly suitable for tasks like summarisation, translation, and information extraction - the workhorse tasks of practical NLP.

In the news

No recent coverage - search for Denoising Objective.

Related concepts

BERT

Google's landmark 2018 language model that introduced bidirectional pre-training - a model that reads text in both directions simultaneously and set new standards for understanding tasks.

Encoder-Decoder

A two-part neural network design where one half reads and compresses your input, and the other half uses that compressed understanding to generate a new output.

Foundation Model

A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.

← Back to concepts