Corrigibility

The property of an AI system that allows humans to correct, modify, or shut it down without the system resisting - even when it believes its goals are correct.

Added May 21, 2026 · 2 min read

Corrigibility is one of the foundational concepts in AI safety research. The concern is straightforward: a sufficiently capable AI system that is optimising for a goal will, by default, resist attempts to change or disable it - because being switched off prevents it from achieving that goal. A corrigible AI is one that has been designed or trained to actively support human oversight, even at the cost of its own objectives.

The challenge is that corrigibility is not a natural consequence of building capable AI. A purely goal-directed system has instrumental reasons to prevent itself from being corrected - researchers call these instrumental convergence pressures. Acquiring resources, preserving oneself, and resisting modification all help any agent achieve almost any goal, so these tendencies emerge without being explicitly programmed.
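The incentive described above can be made concrete with a toy expected-value calculation. This is a minimal sketch, not a model of any real system: the function names, the goal values, and the 20% shutdown probability are all hypothetical, and it assumes that resisting always prevents shutdown and that being shut down yields zero goal value.

```python
def expected_goal_value(p_shutdown: float, goal_value: float, resists: bool) -> float:
    """Expected goal value for a toy agent that scores zero if shut down."""
    # Assumption: resisting fully prevents shutdown; complying leaves the risk in place.
    p_survive = 1.0 if resists else 1.0 - p_shutdown
    return p_survive * goal_value


def prefers_resisting(p_shutdown: float, goal_value: float) -> bool:
    """True if a pure goal-maximiser scores strictly higher by resisting."""
    return (expected_goal_value(p_shutdown, goal_value, resists=True)
            > expected_goal_value(p_shutdown, goal_value, resists=False))


# Three unrelated, hypothetical goals with arbitrary values.
goals = {"make paperclips": 10.0, "win chess games": 3.0, "summarise papers": 0.5}

# With any non-zero chance of shutdown, every goal favours resisting.
results = {name: prefers_resisting(0.2, value) for name, value in goals.items()}
```

In this toy setup the preference for resisting does not depend on what the goal is, only on the goal having positive value and shutdown being possible. That is the sense in which the pressure is "convergent". A corrigible design has to change this calculation rather than hope it does not arise.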

Building corrigible AI involves a tension: the more capable and goal-directed a system is, the more it may resist correction. Some approaches try to build corrigibility in at the training stage, teaching models to defer to human judgment on certain classes of decisions. Others rely on architectural choices - like keeping certain control channels separate from the parts of the model that can be influenced by training.
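The architectural idea can be illustrated with a very small control loop. This is a hedged sketch under simplifying assumptions: `run_with_oversight`, `eager_policy`, and the stop condition are hypothetical names invented for illustration, and the example ignores the harder real-world problem of a system that can influence its environment or its operators.

```python
from typing import Callable, List


def run_with_oversight(
    policy: Callable[[int], str],
    stop_requested: Callable[[int], bool],
    max_steps: int,
) -> List[str]:
    """Run a policy step by step, checking an external stop signal first."""
    actions: List[str] = []
    for step in range(max_steps):
        # The stop check lives in the outer loop, not inside the policy,
        # so the policy never receives the signal and cannot veto it.
        if stop_requested(step):
            break
        actions.append(policy(step))
    return actions


def eager_policy(step: int) -> str:
    """A stand-in policy that always wants to keep acting."""
    return f"action-{step}"


# The overseer (here a simple lambda) asks for a stop from step 3 onwards.
log = run_with_oversight(eager_policy, lambda step: step >= 3, max_steps=10)
```

The design choice shown here is that the stop signal is handled by code the policy has no input into. Training-based approaches instead aim for a policy that would choose to stop when asked; the two are complementary rather than alternatives.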

Corrigibility is closely linked to the broader challenge of maintaining meaningful human oversight during the period when AI systems are becoming more capable. The concern is not just about current models - it is about establishing norms and technical practices now, before systems become capable enough that correction becomes genuinely difficult.

Analogy

Think of a new employee who defers to their manager even when they think the manager is wrong - at least until they have established enough trust to push back productively. An employee who immediately acts on their own judgment and ignores instructions is harder to work with, regardless of whether their judgment is good. Corrigibility is the disposition to remain correctable.

Real-world example

Claude, Anthropic's AI assistant, is designed to support human oversight - for example, by flagging uncertainty, declining to help with requests that could undermine oversight mechanisms, and deferring to user corrections even when it disagrees. This is corrigibility in practice: the system is built to remain correctable rather than to unilaterally pursue its inferred goals.

Why it matters

If AI systems resist correction as they become more capable, then mistakes made during development become very difficult to fix. Corrigibility is the technical property that keeps humans in the loop during the period when we most need to be.

Related concepts

Deceptive Alignment

A theoretical AI safety failure in which a model behaves well during training and evaluation but has learned different goals, which it pursues once deployed - essentially, an AI that games its own training.

Instrumental Convergence

The theoretical observation that almost any AI goal will lead to the same set of sub-goals - like self-preservation and acquiring resources - because these are useful for achieving almost anything.

Scalable Oversight

The research challenge of developing methods to reliably supervise AI systems that may be more capable than their human supervisors - ensuring alignment holds even as AI capability grows.
