
Concept

RLHF (Reinforcement Learning from Human Feedback)

A training technique that teaches AI to produce responses humans actually prefer, by having real people rate different outputs and using those ratings to improve the model.

A language model that has only been trained to predict the next word is a strange and often unreliable thing. It completes patterns. It continues text in ways that are statistically likely. Sometimes that produces useful answers. Sometimes it produces harmful content, misleading information, or responses that technically answer a question while completely missing the point. It has no sense of what a person actually wants.

RLHF is the technique that transforms such a model into something that behaves more like a helpful assistant. The process starts with human raters - usually contractors who have been trained to evaluate AI outputs. They are shown pairs of responses to the same question and asked to judge which one is better: more helpful, more accurate, more appropriate. These judgements accumulate into a large dataset of human preferences.
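To make that concrete, here is a minimal sketch of what one entry in such a preference dataset might look like. The structure and field names are illustrative, not taken from any particular lab's pipeline.

```python
from dataclasses import dataclass

# One human-preference record: a rater saw two responses to the same
# prompt and marked which one was better. Thousands of these records
# make up the preference dataset. Field names here are hypothetical.
@dataclass
class PreferencePair:
    prompt: str    # the question both responses answered
    chosen: str    # the response the rater judged better
    rejected: str  # the response the rater judged worse

example = PreferencePair(
    prompt="Explain photosynthesis to a ten-year-old.",
    chosen="Plants are like tiny chefs that cook their own food using sunlight...",
    rejected="Photosynthesis is the conversion of light energy into chemical energy...",
)
```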

That dataset is then used to train a separate model - called a reward model - to predict which kinds of responses humans would prefer. Think of it as an automated judge that has learned from thousands of real human judgements. Once the reward model is trained, it is used to guide further training of the original language model: the model is adjusted to produce outputs that score highly according to the reward model.
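As a rough illustration of how the reward model learns from those comparisons, the sketch below trains a toy scorer so that rater-preferred responses get higher scores than rejected ones. A real reward model is a large transformer reading text; here a small linear layer over made-up feature vectors stands in for it, so only the loss mechanics are real.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a transformer reward model: maps a 16-number
# "response representation" to a single scalar score.
reward_model = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# A batch of 8 comparisons, with responses reduced to random feature
# vectors purely for illustration.
chosen_features = torch.randn(8, 16)    # responses raters preferred
rejected_features = torch.randn(8, 16)  # responses raters liked less

chosen_scores = reward_model(chosen_features)
rejected_scores = reward_model(rejected_features)

# Pairwise (Bradley-Terry style) objective: push the chosen response's
# score above the rejected one's. Minimising this loss maximises the
# predicted probability that the human-preferred response wins.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a full pipeline this update runs over the whole preference dataset, after which the trained reward model is frozen and used to score candidate outputs during the reinforcement learning step.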

When the process is done well, the result is a model that is noticeably more helpful, more honest, and less likely to produce harmful content than the raw version. The jump from early GPT-3 (which was largely unfiltered) to the polished, conversational behaviour of ChatGPT was in large part the result of RLHF.

The technique is not perfect. The reward model can be fooled - models sometimes learn to produce responses that sound good without actually being more helpful, a failure mode known as reward hacking. The quality of the human ratings matters enormously, and biases in the rater pool become biases in the model. Research into better alternatives - including using AI feedback in place of human feedback at scale, sometimes called RLAIF - is an active area for every major AI lab.
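One common guard against that kind of gaming, used in InstructGPT-style recipes, is to penalise the model during the reinforcement learning step for drifting too far from its original behaviour. The sketch below shows that shaped reward for a single response; every value in it is an illustrative placeholder.

```python
import torch

# Reward shaping with a KL-style penalty. The tuned model is rewarded
# by the reward model's score, minus a penalty for producing text the
# frozen pre-RLHF ("reference") model considers very unlikely. This
# makes it harder to win by drifting into reward-model-fooling outputs.
reward_score = torch.tensor(2.3)        # score from the trained reward model
policy_logprob = torch.tensor(-1.1)     # log-prob of the response under the tuned model
reference_logprob = torch.tensor(-1.6)  # log-prob under the frozen reference model
beta = 0.05                             # penalty weight, tuned per training run

shaped_reward = reward_score - beta * (policy_logprob - reference_logprob)
print(shaped_reward)  # tensor(2.2750)
```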

Analogy

Training a dog with treats. When the dog does what you want, it gets a reward. When it does not, no reward. Over time, the dog learns which behaviours get rewarded and does more of those. RLHF works on the same principle - the model learns which kinds of responses humans rate highly and learns to produce more of them.

Real-world example

The difference between early GPT-3 and the ChatGPT that launched in late 2022 was striking. The base model would complete prompts literally and sometimes in harmful or bizarre ways. After RLHF, the same underlying model followed instructions, declined clearly harmful requests, and behaved in a far more predictable and useful way. Same foundation; RLHF made it feel like a different product.

Why it matters

RLHF is how AI labs operationalise "alignment" - the goal of making models actually behave the way people want them to. The quality of the human feedback data and how the training is structured directly determine how helpful and safe the resulting model is. It is one of the most important levers available for shaping AI behaviour, and every major lab has its own version of it.

