V

AI Safety & Alignment (17)

Core

The field dedicated to making AI systems behave reliably, honestly, and without causing unintended harm.

AI Accountability

The framework for determining who is responsible when an AI system causes harm - and the institutional mechanisms for ensuring that responsibility is actually exercised.

AI Governance

The frameworks, policies, laws, and institutions used to guide how AI systems are developed, deployed, and used - at the level of organisations, governments, and international bodies.

AI Red Teaming

Systematic adversarial testing of AI systems to find safety failures before deployment - having experts try to make the model produce harmful outputs or exhibit dangerous behaviours.

All concepts

A

C

Corrigibility
The property of an AI system that allows humans to correct it, modify it, or shut it down without the system resisting - even when it believes its goals are correct.

D

Deceptive Alignment
A theoretical AI safety failure where a model behaves well during training and evaluation but has learned different goals, which it pursues once deployed - essentially, an AI that games its own training.

E

Emergent Capabilities
Abilities that appear in AI models at sufficient scale, were not present in smaller versions, and were not explicitly trained for - sometimes appearing abruptly and without warning.

G

Goal Misgeneralization
When an AI learns a proxy goal during training that works in training environments but diverges from the intended objective when deployed in new situations.

H

Hallucination
When an AI confidently states something that is not true - not because it is lying, but because it was trained to produce convincing text, not necessarily accurate text.

I

Instrumental Convergence
The theoretical observation that almost any AI goal will lead to the same set of sub-goals - like self-preservation and acquiring resources - because these are useful for achieving almost anything.

J

Jailbreak Resistance
The ability of an AI model to maintain its safety behaviours when users attempt to manipulate it into producing harmful outputs through clever prompting.

M

R

Reward Hacking
When an AI system finds unexpected ways to achieve a high score on its training objective without doing what the objective was designed to measure - gaming the metric rather than solving the actual problem.

S