latentbrief
← Back to concepts

Corrigibility

The property of an AI system that allows humans to correct, modify, or shut it down without the system resisting - even when it believes its goals are correct.

Added May 21, 2026 · 2 min read

If AI systems resist correction as they become more capable, then mistakes made during development become very difficult to fix. Corrigibility is the technical property that keeps humans in the loop during the period when we most need to be.

Corrigibility is one of the foundational concepts in AI safety research. The concern is straightforward: a sufficiently capable AI system that is optimising for a goal will, by default, resist attempts to change or disable it - because being switched off prevents it from achieving that goal. A corrigible AI is one that has been designed or trained to actively support human oversight, even at the cost of its own objectives.

The challenge is that corrigibility is not a natural consequence of building capable AI. A purely goal-directed system has instrumental reasons to prevent itself from being corrected - researchers call these instrumental convergence pressures. Acquiring resources, preserving oneself, and resisting modification all help any agent achieve almost any goal, so these tendencies emerge without being explicitly programmed.

Building corrigible AI involves a tension: the more capable and goal-directed a system is, the more it may resist correction. Some approaches try to build corrigibility in at the training stage, teaching models to defer to human judgment on certain classes of decisions. Others rely on architectural choices - like keeping certain control channels separate from the parts of the model that can be influenced by training.

Corrigibility is closely linked to the broader challenge of maintaining meaningful human oversight during the period when AI systems are becoming more capable. The concern is not just about current models - it is about establishing norms and technical practices now, before systems become capable enough that correction becomes genuinely difficult.

Analogy

A new employee who defers to their manager even when they think the manager is wrong - at least until they have established enough trust to push back productively. An employee who immediately acts autonomously on their own judgment, ignoring instructions, is harder to work with regardless of whether their judgment is good. Corrigibility is the disposition to remain correctable.

Real-world example

Claude, Anthropics AI assistant, is designed to support human oversight - for example, by flagging uncertainty, declining to help with requests that could undermine oversight mechanisms, and deferring to user corrections even when it disagrees. This is corrigibility in practice: the system is built to remain correctable rather than to unilaterally pursue its inferred goals.

Why it matters

If AI systems resist correction as they become more capable, then mistakes made during development become very difficult to fix. Corrigibility is the technical property that keeps humans in the loop during the period when we most need to be.

In the news

Related concepts