AI Safety & Alignment17
CoreThe field dedicated to making AI systems behave reliably, honestly, and without causing unintended harm.
AI Accountability
The framework for determining who is responsible when an AI system causes harm - and the institutional mechanisms for ensuring that responsibility is actually exercised.
AI Governance
The frameworks, policies, laws, and institutions used to guide how AI systems are developed, deployed, and used - at the level of organisations, governments, and international bodies.
AI Red Teaming
Systematic adversarial testing of AI systems to find safety failures before deployment - having experts try to make the model produce harmful outputs or exhibit dangerous behaviours.
All concepts
A
AI Sycophancy
The tendency of AI models to tell users what they want to hear rather than what is true - agreeing with incorrect beliefs, validating bad ideas, and adjusting answers to match perceived user preferences.
AI Transparency
The principle that AI systems should be understandable and explainable - that users, regulators, and affected parties should be able to understand how decisions are being made.
M
Machine Unlearning
The process of making a trained AI model forget specific information - enabling removal of private data, copyrighted content, or harmful knowledge from a model without retraining from scratch.
Mechanistic Interpretability
The field of research that tries to understand what is literally happening inside AI models - tracing computations to find where and how specific knowledge, beliefs, and capabilities are stored and used.
S
Scalable Oversight
The research challenge of developing methods to reliably supervise AI systems that may be more capable than their human supervisors - ensuring alignment holds even as AI capability grows.
Specification Gaming
When an AI system satisfies the letter of its objective through unintended means - technically achieving the metric it was given while completely missing the intended goal.