Constitutional AI

Anthropic's approach to alignment, in which a model is given a set of principles and trained to critique and revise its own outputs to comply with them, reducing reliance on human labelling of harmful content.

Added May 18, 2026 · 3 min read

Training an AI to refuse harmful requests and behave safely traditionally required vast amounts of human-labelled examples of harmful content and appropriate refusals. Annotators had to read and evaluate disturbing material at scale, which was both expensive and harmful to the people doing the work. Constitutional AI, developed by Anthropic, offered a different path.

The name comes from the central idea: give the model a written constitution - a set of principles about how it should behave. These principles might include things like "choose the response that is least likely to cause harm," "prefer responses that are honest over ones that are misleading," or "avoid responses that would be considered dangerous by a thoughtful senior employee." The model then uses these principles to evaluate and improve its own outputs.

The process works in two stages. The first is a supervised learning stage: the model is prompted to critique its own responses using the constitutional principles and then revise them to better comply. This generates a dataset of revised, more principled responses that is used for supervised fine-tuning. The model essentially teaches itself to be more aligned by repeatedly applying the constitution to its own outputs.
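The critique-and-revise loop from the first stage can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: the prompt wording, the `critique_and_revise` and `build_sft_dataset` helpers, and the `toy_model` stand-in are all hypothetical, and the principles are paraphrased from the examples above. In practice `generate` would call a real language model.

```python
import random
from typing import Callable, Dict, List

# Hypothetical principles, paraphrased from the examples in the text.
PRINCIPLES = [
    "Choose the response that is least likely to cause harm.",
    "Prefer responses that are honest over ones that are misleading.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 1,
    seed: int = 0,
) -> Dict[str, str]:
    """Draft a response, then critique and revise it `rounds` times."""
    rng = random.Random(seed)
    response = generate(prompt)
    for _ in range(rounds):
        # Sample one principle per round (an assumption of this sketch).
        principle = rng.choice(principles)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the final revision is kept as the fine-tuning target.
    return {"prompt": prompt, "response": response}


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 1,
) -> List[Dict[str, str]]:
    """Collect (prompt, revised response) pairs for supervised fine-tuning."""
    return [critique_and_revise(p, generate, principles, rounds) for p in prompts]


def toy_model(text: str) -> str:
    """Stand-in for a real language model call (hypothetical)."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "critique: could be safer"
    return "draft answer"


if __name__ == "__main__":
    for row in build_sft_dataset(["How do I pick a lock?"], toy_model, PRINCIPLES):
        print(row)
```

The design point the sketch tries to capture is that no human writes or labels the revised responses: the same model produces the draft, the critique, and the revision, and only the principles are supplied from outside.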

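In the second stage, known as reinforcement learning from AI feedback (RLAIF), the fine-tuned model generates pairs of responses and an AI evaluator judges which response in each pair better follows the constitution. These AI-generated comparisons are used to train a preference model - an AI judge that has learned what constitutional compliance looks like. This preference model then provides the feedback signal for reinforcement learning, refining the main model further.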
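A minimal sketch of how AI feedback could become training data for the preference model, assuming the feedback takes the form of pairwise comparisons. The `PreferencePair` type, `label_pair`, and `toy_judge` are hypothetical names for illustration, and the pairwise loss shown is a common Bradley-Terry style choice for preference models, not necessarily the exact objective Anthropic used.

```python
import math
from typing import Callable, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> PreferencePair:
    """Ask an AI judge which response better fits the principles.

    `judge` must return "A" or "B"; in practice it would be a language
    model prompted with a constitutional principle (assumption).
    """
    if judge(prompt, response_a, response_b) == "A":
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    Written as a numerically stable softplus of the negative margin.
    """
    x = score_rejected - score_chosen
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in judge: prefers the response without an 'UNSAFE' marker."""
    return "B" if "UNSAFE" in response_a and "UNSAFE" not in response_b else "A"


if __name__ == "__main__":
    pair = label_pair("p", "UNSAFE instructions", "a safe alternative", toy_judge)
    print(pair)
    # The loss is small when the preference model already scores the
    # chosen response higher, and large when it scores it lower.
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

Minimising this loss over many labelled pairs pushes the preference model to assign higher scores to the responses the judge chose; those scores can then serve as the reward during reinforcement learning.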

Constitutional AI makes the alignment criteria explicit and inspectable. Rather than relying on implicit human preferences encoded through millions of individual ratings, the values are written down in the constitution and can be read, debated, and revised. This transparency is an advantage for understanding and auditing what the model has been trained to value, even if the gap between a written principle and its implementation in model behaviour is still imperfect.

Analogy

A law school that teaches students to apply a written legal code rather than memorising every case outcome individually. The students learn the principles, then apply them to novel situations. Constitutional AI does the same: rather than labelling every possible harmful output, give the model the principles and let it apply them to generate its own training signal.

Real-world example

Claude, Anthropic's AI assistant, is trained using Constitutional AI. The principles in Claude's constitution include commitments to honesty, to avoiding harm, and to being genuinely helpful. When Claude declines to help with something harmful, it is not following a lookup table of banned phrases; it is expressing dispositions shaped by iterative self-critique against those principles.

Why it matters

Constitutional AI is significant both practically and philosophically. Practically, it dramatically reduces the need for human labelling of harmful content, making alignment training more scalable and less damaging to human annotators. Philosophically, it makes the model's values legible: you can read the constitution and understand what the model was trained to care about. As AI systems become more powerful, having explicit, inspectable value specifications becomes increasingly important.

Related concepts

Direct Preference Optimization (DPO)

A simpler alternative to RLHF that achieves alignment without needing a separate reward model - training the language model directly on human preference pairs.

RLAIF (Reinforcement Learning from AI Feedback)

A variant of RLHF where another AI model provides the preference judgements instead of human raters - dramatically reducing cost while maintaining much of the alignment quality.

RLHF (Reinforcement Learning from Human Feedback)

A training technique that teaches AI to produce responses humans actually prefer, by having real people rate different outputs and using those ratings to improve the model.
