AI Safety & Alignment
The field dedicated to making AI systems behave reliably, honestly, and without causing unintended harm.
AI Accountability
The framework for determining who is responsible when an AI system causes harm - and the institutional mechanisms for ensuring that responsibility is actually exercised.
AI Red Teaming
Systematic adversarial testing of AI systems to find safety failures before deployment - experts deliberately try to make the model produce harmful outputs or exhibit dangerous behaviours.
AI Sycophancy
The tendency of AI models to tell users what they want to hear rather than what is true - agreeing with incorrect beliefs, validating bad ideas, and adjusting answers to match perceived user preferences.
All concepts
M
Machine Unlearning
The process of making a trained AI model forget specific information - enabling removal of private data, copyrighted content, or harmful knowledge from a model without retraining from scratch.
Mechanistic Interpretability
The field of research that tries to understand what is actually happening inside AI models - tracing computations to find where and how specific knowledge, beliefs, and capabilities are stored and used.