VII · Reinforcement Learning & Robotics · Advanced

Imitation Learning

Learning a policy by directly training on expert demonstrations - teaching an agent to behave like an expert by showing it what to do, rather than having it discover behaviours through reward-driven trial and error.

Added May 18, 2026 · 3 min read

Reinforcement learning requires the agent to discover good behaviours by trying many actions and learning from the resulting rewards. For tasks where reward is sparse, environments are hard to reset, or mistakes are costly (surgical robotics, autonomous driving), extensive trial and error is impractical. Imitation learning addresses this by using expert demonstrations - recordings of an expert performing the task - as the primary learning signal.

Behaviour cloning (BC), the simplest form of imitation learning, treats the problem as supervised learning: the state is the input, the expert's action is the label, and the policy is trained to predict the expert's action in each state. On demonstrated state-action pairs, BC can achieve very high accuracy. In practice, however, behaviour-cloned policies often fail when deployed in the real environment. The issue is compounding errors: BC trains on the distribution of states the expert visited, but the learned policy is slightly imperfect and sometimes reaches states the expert never encountered. With no training signal for these off-distribution states, the policy may take bad actions, leading to worse states, leading to worse actions. Small per-step mistakes therefore accumulate over the trajectory: in the worst case, the total cost of a behaviour-cloned policy grows quadratically with the task horizon (on the order of εT² for a per-step error rate ε over T steps), rather than linearly as it would if errors did not feed back into the state distribution.
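The supervised-learning framing of behaviour cloning can be made concrete with a small sketch. Everything here is an illustrative assumption rather than a reference implementation: the expert is a hypothetical linear controller with a made-up gain, the states are random 2-D vectors, and the "policy" is a linear map fitted by least squares with NumPy.

```python
import numpy as np

def expert_policy(states: np.ndarray) -> np.ndarray:
    """Hypothetical expert: a linear controller that pushes the state toward zero."""
    gain = np.array([[0.8, 0.3]])  # assumed gain, 1 action dim x 2 state dims
    return -states @ gain.T

def behaviour_clone(states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Fit a linear policy a = s @ W by least squares on (state, action) pairs."""
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

rng = np.random.default_rng(0)

# Demonstrations: states the expert visited, labelled with (slightly noisy) expert actions.
demo_states = rng.normal(size=(200, 2))
demo_actions = expert_policy(demo_states) + 0.01 * rng.normal(size=(200, 1))

W = behaviour_clone(demo_states, demo_actions)

# Error is measured only on demonstrated states; it says nothing about
# states the learner might reach on its own after deployment.
train_mse = float(np.mean((demo_states @ W - demo_actions) ** 2))
```

A low `train_mse` here reflects accuracy on the expert's state distribution only, which is exactly why it can be a misleading predictor of deployed performance.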

DAgger (Dataset Aggregation) addresses distribution shift by interleaving policy execution with expert annotation. The learner executes its current policy, the expert annotates what action should have been taken in each encountered state (including off-distribution states the learner visited), and the annotated data is added to the training set before the policy is retrained. Repeating this builds a dataset covering the states the learner actually visits, not just the states the expert visits. Under its no-regret analysis, DAgger yields a policy whose error grows roughly linearly with the task horizon rather than quadratically. The price is that the expert must remain available to label states on demand throughout training.
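The execute, annotate, aggregate, retrain loop can be sketched on a toy problem. The setup below is entirely hypothetical: a one-dimensional corridor of 11 cells, an expert that always steps toward a goal cell, noisy dynamics that occasionally replace the chosen move, and a tabular learner that defaults to "stay" in states it has never seen labelled. As a simplification, the learner alone drives every rollout after the first round; the original algorithm also allows mixing expert and learner actions during data collection.

```python
import random
from collections import Counter, defaultdict

GOAL, N_STATES, HORIZON = 5, 11, 20

def expert_action(state: int) -> int:
    """Hypothetical expert: step toward GOAL, stay once there."""
    return (GOAL > state) - (GOAL < state)

def step(state: int, action: int, rng: random.Random, slip: float = 0.2) -> int:
    """Toy dynamics: with probability `slip` the move is replaced by a random one."""
    if rng.random() < slip:
        action = rng.choice([-1, 0, 1])
    return min(max(state + action, 0), N_STATES - 1)

def fit(dataset):
    """Tabular supervised learning: majority expert label per state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 0)  # unseen states default to "stay"

def rollout(policy, rng, start=3):
    """Run `policy` for HORIZON steps and return the states it visited."""
    state, visited = start, []
    for _ in range(HORIZON):
        visited.append(state)
        state = step(state, policy(state), rng)
    return visited

def dagger(n_iters=5, rollouts_per_iter=10, seed=0):
    rng = random.Random(seed)
    # Round 0: plain behaviour cloning on rollouts driven by the expert.
    dataset = [(s, expert_action(s))
               for _ in range(rollouts_per_iter)
               for s in rollout(expert_action, rng)]
    policy = fit(dataset)
    for _ in range(n_iters):
        # The learner drives; the expert only labels the states it reached.
        for _ in range(rollouts_per_iter):
            dataset += [(s, expert_action(s)) for s in rollout(policy, rng)]
        policy = fit(dataset)  # retrain on the aggregated dataset
    return policy, dataset

policy, dataset = dagger()
```

The key design point is in the inner loop: states come from the learner's own rollouts while labels come from the expert, so the training distribution tracks the states the learner actually reaches.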

Inverse RL-based imitation approaches take a different route: they recover the underlying reward function from demonstrations and then solve for a policy that optimises it, potentially generalising better to new environments with different dynamics.

In the context of large language models, instruction tuning and RLHF include an imitation learning phase: supervised fine-tuning (SFT) trains the model on human-written demonstrations of desired behaviour (the SFT step in InstructGPT). This is behaviour cloning applied to text generation: the expert's written responses are the demonstrations, and the model learns to imitate their distribution.
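To make the correspondence with behaviour cloning explicit: in SFT the "state" is the prompt plus the tokens written so far, the "action" is the next token, and the training loss is the negative log-likelihood of the token the human demonstrator actually wrote. The following is a minimal NumPy sketch of that loss under assumed shapes; the function name `sft_loss` is hypothetical, and real training code would compute the logits with a language model rather than take them as an input.

```python
import numpy as np

def sft_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean negative log-likelihood of the demonstrated tokens.

    logits: (sequence_length, vocab_size) next-token scores at each position.
    target_ids: (sequence_length,) token ids the demonstrator actually wrote.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())

# A model with no preference among 4 tokens pays log(4) per demonstrated token.
uniform_loss = sft_loss(np.zeros((3, 4)), np.array([0, 1, 2]))

# A model that confidently predicts each demonstrated token pays almost nothing.
confident_loss = sft_loss(50.0 * np.eye(4)[[0, 1, 2]], np.array([0, 1, 2]))
```

As with behaviour cloning in robotics, this loss is evaluated only on prefixes written by the demonstrator, not on prefixes the model itself generates.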

Robotics applications of imitation learning have advanced significantly: learning from human teleoperation demonstrations (where a human controls a robot arm to demonstrate a task), learning from video of humans performing tasks, and one-shot imitation (learning a new task from a single demonstration by generalising across task structure).

Analogy

Learning to cook from a mentor who cooks a dish while you watch. You note what ingredients they use, how they cut them, what temperature they cook at. Initially you imitate their exact process (behaviour cloning). When you cook the dish yourself and things go slightly differently - the pan is hotter, the onions are larger - you improvise based on general understanding rather than exactly replicating their movements (generalisation beyond the demonstration). The richer your understanding of the underlying objective (the dish should taste like this, the onions should be translucent), the better you can adapt to novel situations.

Real-world example

Training a robotic arm to pack grocery bags: a human teleoperator demonstrates 50 packing sessions across different sets of grocery items. Behaviour cloning trains the robot policy on these demonstrations. Initial deployment fails when the robot encounters novel item combinations not in the demonstrations - the policy has no training signal for those states. DAgger is applied: the robot's policy executes, a human annotator labels actions for every state the robot visits (including novel states), and the expanded dataset is used to retrain. After two DAgger rounds, the policy handles 95% of novel item combinations successfully.

Why it matters

Imitation learning is one of the primary practical approaches for training policies when RL is impractical - when environments are hard to simulate, rewards are difficult to specify, or mistakes are dangerous. It is also the conceptual basis for the SFT phase of LLM fine-tuning, connecting robotics and NLP through a common learning paradigm. Understanding its limitations (distribution shift, compounding errors) helps explain why SFT alone is generally considered insufficient for aligning LLMs and why it is typically followed by RLHF.

Related concepts

Inverse Reinforcement Learning

The problem of inferring the underlying reward function that explains an expert's observed behaviour - learning not just what to do from demonstrations, but why: recovering the goal structure that the expert's actions appear to optimise.

Reward Shaping

The practice of adding supplementary reward signals to a reinforcement learning environment to make learning faster and more reliable, guiding the agent toward useful behaviours before sparse natural rewards can be observed.

Sim-to-Real Transfer

The technique of training a robot control policy in simulation - where data is cheap and mistakes are safe - and then deploying it on a real physical robot, bridging the performance gap caused by imperfect simulation of real-world physics.
