Inverse Reinforcement Learning
The problem of inferring the underlying reward function that explains an expert's observed behaviour - learning not just what to do from demonstrations, but why: recovering the goal structure that the expert's actions appear to optimise.
Added May 18, 2026 · 3 min read
Standard reinforcement learning takes a reward function as given and finds the policy that maximises it. Inverse reinforcement learning (IRL) reverses this: given observations of an expert's behaviour (a sequence of state-action pairs), infer the reward function that best explains the expert's decisions.
The motivation: specifying reward functions for complex tasks is difficult and error-prone. Reward hacking (the agent optimising a proxy reward rather than the intended objective) is a persistent problem when humans design rewards manually. If we could instead infer the reward function from expert demonstrations, it would reflect the behaviour we actually want, without anyone having to spell out every aspect of what the agent should care about.
IRL is an ill-posed problem: many different reward functions can explain the same observed behaviour. The trivial example is a constant zero reward, which is consistent with any behaviour because every policy then earns the same return. The key idea in Ng and Russell's foundational work (2000) is to prefer, among the consistent reward functions, one that separates the expert's behaviour from all other policies by the largest margin: the expert's policy should achieve higher expected reward under the inferred reward function than any alternative policy, and by as much as possible.
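The ambiguity can be made concrete on a toy problem. The sketch below is illustrative only: the four-state chain, the candidate reward vectors and the helper names are all assumptions, and the summed margin is a simplified stand-in for Ng and Russell's linear-programming formulation, not a reproduction of it.

```python
# Hypothetical toy chain MDP: states 0..3, actions move left (-1) or right (+1).
N_STATES = 4
ACTIONS = (-1, +1)
GAMMA = 0.9

def step(state, action):
    """Deterministic transition; the ends of the chain act as walls."""
    return min(max(state + action, 0), N_STATES - 1)

def q_values(reward, sweeps=200):
    """Value iteration for a state-based reward; returns Q[s][action_index]."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [max(reward[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [[reward[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS]
            for s in range(N_STATES)]

def is_consistent(reward, expert_policy, tol=1e-9):
    """True if the expert's action is optimal in every state under `reward`."""
    q = q_values(reward)
    return all(q[s][ACTIONS.index(expert_policy[s])] >= max(q[s]) - tol
               for s in range(N_STATES))

def margin(reward, expert_policy):
    """Sum over states of Q(s, expert action) minus the best other action's Q."""
    q = q_values(reward)
    total = 0.0
    for s in range(N_STATES):
        e = ACTIONS.index(expert_policy[s])
        best_other = max(q[s][i] for i in range(len(ACTIONS)) if i != e)
        total += q[s][e] - best_other
    return total

expert = [+1, +1, +1, +1]             # the expert always moves right
zero_reward = [0.0, 0.0, 0.0, 0.0]    # explains any behaviour, margin of zero
goal_reward = [0.0, 0.0, 0.0, 1.0]    # rewards reaching the right-hand end
left_reward = [1.0, 0.0, 0.0, 0.0]    # rewards the opposite end

for name, r in [("zero", zero_reward), ("goal", goal_reward), ("left", left_reward)]:
    print(name, is_consistent(r, expert), round(margin(r, expert), 3))
```

Both the zero reward and the goal reward are consistent with the expert, but only the goal reward makes the expert's choices strictly better than the alternatives, which is the property the margin criterion selects for.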
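Maximum Entropy IRL (Ziebart et al., 2008) resolves the ambiguity by choosing, among all distributions over trajectories that match the expert's observed feature expectations, the one with maximum entropy. Under this distribution the probability of each trajectory is proportional to its exponentiated reward, so higher-reward trajectories are exponentially more likely but suboptimal ones are not ruled out. Fitting the reward then becomes a maximum-likelihood problem, which is convex when the reward is linear in the features, and the approach has connections to probabilistic models of rational action.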
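A minimal sketch of the maximum-entropy idea, assuming a reward that is linear in trajectory features and a trajectory set small enough to enumerate. The three trajectories, their feature vectors and the demonstration counts are made up for illustration. Practical implementations compute expected state visitation frequencies by dynamic programming and do not list every trajectory.

```python
import math

# Hypothetical trajectories, each summarised by [distance, collisions].
TRAJ_FEATURES = [
    [1.0, 0.0],   # short, no collision
    [2.0, 1.0],   # long, one collision
    [2.0, 0.0],   # long, no collision
]
# Hypothetical demonstrations: indices of the trajectories the expert took.
EXPERT_DEMOS = [2, 2, 0, 2, 1]

def trajectory_probs(theta):
    """P(tau) proportional to exp(theta . f(tau))."""
    scores = [sum(t * f for t, f in zip(theta, feats)) for feats in TRAJ_FEATURES]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def feature_expectation(probs):
    """Expected feature vector under a distribution over trajectories."""
    dim = len(TRAJ_FEATURES[0])
    return [sum(p * feats[k] for p, feats in zip(probs, TRAJ_FEATURES))
            for k in range(dim)]

def fit_maxent(steps=2000, lr=0.5):
    """Gradient ascent on the demonstration log-likelihood.

    The gradient is (expert feature expectation) - (model feature expectation).
    """
    n = len(EXPERT_DEMOS)
    empirical = [EXPERT_DEMOS.count(i) / n for i in range(len(TRAJ_FEATURES))]
    expert_fe = feature_expectation(empirical)
    theta = [0.0] * len(expert_fe)
    for _ in range(steps):
        model_fe = feature_expectation(trajectory_probs(theta))
        theta = [t + lr * (e - m) for t, e, m in zip(theta, expert_fe, model_fe)]
    return theta, expert_fe

theta, expert_fe = fit_maxent()
model_fe = feature_expectation(trajectory_probs(theta))
print("reward weights:", theta)
print("expert vs model feature expectations:", expert_fe, model_fe)
```

At convergence the model's feature expectations match the expert's, and the learned weights reward distance and penalise collisions, reflecting how often the expert chose each kind of trajectory.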
Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016) takes a GAN-inspired approach: a discriminator learns to distinguish expert trajectories from agent-generated trajectories, and its output serves as the policy's reward signal, while the generator (the policy) learns to produce trajectories the discriminator cannot tell apart from the expert's. GAIL implicitly performs IRL without ever explicitly recovering the reward function.
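The discriminator half of this loop can be sketched with a one-dimensional logistic classifier. Everything here is an illustrative assumption: the scalar feature standing in for a state-action pair, the sample values, the labelling convention (expert = 1, agent = 0, which differs between implementations) and the particular surrogate reward. The policy-update half, normally a policy-gradient step on this reward, is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_discriminator(expert_x, agent_x, steps=500, lr=0.1):
    """Logistic regression: expert samples are labelled 1, agent samples 0."""
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in expert_x] + [(x, 0.0) for x in agent_x]
    for _ in range(steps):
        # Gradient of the mean binary cross-entropy loss.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def surrogate_reward(x, w, b):
    """Reward handed to the policy: high where the sample looks expert-like."""
    d = sigmoid(w * x + b)
    return -math.log(1.0 - d + 1e-8)  # small constant avoids log(0)

# Hypothetical scalar feature of a (state, action) pair, e.g. normalised speed.
expert_x = [0.9, 1.0, 1.1, 1.0]
agent_x = [0.1, 0.0, 0.2, -0.1]

w, b = train_discriminator(expert_x, agent_x)
print("reward near expert data:", surrogate_reward(1.0, w, b))
print("reward near agent data: ", surrogate_reward(0.0, w, b))
```

In a full training loop the agent's samples would be regenerated after each policy update and the discriminator retrained, so the reward signal keeps shifting as the policy improves.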
IRL and imitation learning are closely related but distinct. Behaviour cloning, the most direct form of imitation learning, learns the policy pi(a|s) from demonstrations by supervised learning, without inferring a reward function. IRL infers the reward function and then solves for a policy. The advantage of IRL is that an inferred reward function can transfer to new environments with different dynamics, where the policy is re-optimised against the same reward, while a behaviour-cloned policy may fail when the dynamics change.
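The transfer argument can be illustrated on a toy chain where the two controls are rewired in the new environment. This is a sketch under stated assumptions: the environment, the helper names and especially `inferred_reward` are hypothetical, and the reward vector is written in by hand as a stand-in for the output of an IRL step; no IRL is actually run here.

```python
N_STATES = 5
ACTIONS = (0, 1)   # two abstract controls
GAMMA = 0.9

def original_dynamics(s, a):
    """Control 1 moves right, control 0 moves left; the ends are walls."""
    delta = 1 if a == 1 else -1
    return min(max(s + delta, 0), N_STATES - 1)

def swapped_dynamics(s, a):
    """New environment: the two controls are wired the other way round."""
    return original_dynamics(s, 1 - a)

def plan(reward, dynamics, sweeps=200):
    """Value iteration followed by a greedy policy, for a state-based reward."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [max(reward[dynamics(s, a)] + GAMMA * v[dynamics(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: reward[dynamics(s, a)] + GAMMA * v[dynamics(s, a)])
            for s in range(N_STATES)]

def rollout(policy, dynamics, start=0, horizon=10):
    """Follow a tabular policy and return the final state."""
    s = start
    for _ in range(horizon):
        s = dynamics(s, policy[s])
    return s

# Demonstrations from the original environment: the expert always uses control 1.
demos = [(s, 1) for s in range(N_STATES)]

# Behaviour cloning in its simplest tabular form: copy the demonstrated action.
cloned_policy = [0] * N_STATES
for s, a in demos:
    cloned_policy[s] = a

# Assumed result of an IRL step (hand-written here): reward for the last state.
inferred_reward = [0.0, 0.0, 0.0, 0.0, 1.0]
replanned_policy = plan(inferred_reward, swapped_dynamics)

print("cloned, original dynamics:  ", rollout(cloned_policy, original_dynamics))
print("cloned, swapped dynamics:   ", rollout(cloned_policy, swapped_dynamics))
print("replanned, swapped dynamics:", rollout(replanned_policy, swapped_dynamics))
```

The cloned policy memorises which control the expert used, so it walks the wrong way once the controls are swapped. The reward describes the destination, not the controls, so re-planning against it still reaches the goal.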
Applications include autonomous driving (inferring human driving reward functions from naturalistic driving data), game playing (inferring strategic objectives from expert game replays), and robot learning from human demonstration.
Analogy
Watching a chess grandmaster play and trying to understand not just what moves they make but what they are trying to achieve - their underlying strategic objectives and priorities. A student can imitate the moves directly (behaviour cloning), but a deeper understanding comes from inferring the principles: the grandmaster values king safety, pawn structure, piece activity. With those principles understood (the inferred reward function), the student can apply them in novel situations the grandmaster never demonstrated.
Real-world example
Researchers used IRL on naturalistic human driving data to infer the reward function that explains human highway driving behaviour. Rather than manually specifying that drivers prefer certain following distances, lane positions, and speeds, IRL recovered these preferences from millions of kilometres of driving footage. The inferred reward function was then used to train an autonomous driving policy that, when transferred to a new vehicle with different dynamics, still drove in a human-like manner - whereas behaviour-cloned policies that directly imitated actions failed to transfer.
Why it matters
IRL addresses a fundamental problem in AI alignment: rather than specifying what we want an agent to do, which is hard and error-prone, we can show it what we want and let it infer the objective. This is the philosophical basis for learning from human feedback: RLHF trains a reward model from human preference comparisons, and that model plays a role analogous to an IRL-derived reward. Understanding IRL provides the theoretical foundation for why learning from demonstrations is a promising path toward aligned AI behaviour.
Related concepts
Imitation Learning
Learning a policy by directly training on expert demonstrations - teaching an agent to behave like an expert by showing it what to do, rather than having it discover behaviours through reward-driven trial and error.
Markov Decision Process
The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and not on how you got there.
Reward Shaping
The practice of adding supplementary reward signals to a reinforcement learning environment to make learning faster and more reliable, guiding the agent toward useful behaviours before sparse natural rewards can be observed.