VII · Reinforcement Learning & Robotics · Advanced

World Models

Learned neural network representations of an environment's dynamics - enabling an RL agent to simulate future outcomes in its "mind" and plan ahead without additional real-world experience, dramatically improving sample efficiency.

Added May 18, 2026 · 3 min read

Model-free RL agents learn directly from interaction with the real environment - they make no assumptions about the environment's structure and learn purely from the experience they collect. This requires enormous amounts of data: the original DQN was trained on tens of millions of frames per Atari game, and later Atari benchmarks commonly use 200 million. World models take a different approach: learn an internal model of the environment's dynamics, then use that model to generate synthetic experience, plan ahead, or improve policy learning.

A world model consists of two components. A representation model encodes observations (images, sensor readings) into a compact latent state. A dynamics model predicts the next latent state given the current state and action. Together, these allow the agent to mentally simulate future trajectories: start from a current observation, encode it, and then predict next states by applying sequences of hypothetical actions through the dynamics model - all in the agent's own learned representation, without interacting with the real environment.
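The encode-then-predict loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any published architecture: the dimensions, the single-layer tanh "networks", and the randomly initialised weights are all assumptions standing in for trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 16, 4, 2  # hypothetical sizes

# Random weights stand in for trained networks (assumption for illustration).
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, OBS_DIM))
W_dyn_z = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
W_dyn_a = rng.normal(scale=0.1, size=(LATENT_DIM, ACTION_DIM))


def encode(observation):
    """Representation model: observation -> compact latent state."""
    return np.tanh(W_enc @ observation)


def predict_next(latent, action):
    """Dynamics model: (latent state, action) -> next latent state."""
    return np.tanh(W_dyn_z @ latent + W_dyn_a @ action)


def imagine(observation, actions):
    """Apply a sequence of hypothetical actions entirely in latent space."""
    latent = encode(observation)
    trajectory = [latent]
    for action in actions:
        latent = predict_next(latent, action)
        trajectory.append(latent)
    return np.stack(trajectory)


observation = rng.normal(size=OBS_DIM)
plan = rng.normal(size=(5, ACTION_DIM))  # five hypothetical actions
trajectory = imagine(observation, plan)
print(trajectory.shape)  # (6, 4): the encoded start state plus one state per action
```

Note that the environment is consulted only once, to obtain the starting observation; every later state comes from the model.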

The Dreamer family of algorithms (Hafner et al., 2019-2023) is among the most influential world model architectures. DreamerV3 uses a Recurrent State Space Model (RSSM) as the world model: a recurrent neural network that maintains a latent state, predicts transitions stochastically, and decodes the latent state to predicted observations and rewards. The policy is trained entirely within the world model's "imagination", on large numbers of imagined rollouts, while real-world interaction is used only to collect data for fitting the world model. DreamerV3 reports strong results across diverse domains (continuous control from pixels, Atari, Minecraft) with a single set of hyperparameters, often from far fewer real environment steps than model-free methods; the related DayDreamer work (Wu et al., 2022) applies Dreamer to physical robots.
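To make the RSSM idea more concrete, here is a rough sketch of a single imagination step with a deterministic recurrent state `h` and a stochastic latent `z`. It is a simplification under stated assumptions, not DreamerV3's implementation: a plain tanh recurrence replaces the GRU, a Gaussian latent replaces DreamerV3's categorical latents, the weights are random rather than trained, and the observation decoder and the posterior used during training are omitted. All names and sizes are hypothetical.

```python
import numpy as np

H_DIM, Z_DIM, A_DIM = 8, 3, 2  # hypothetical sizes


def init_params(rng, scale=0.3):
    """Random weights standing in for a trained world model (assumption)."""
    return {
        "W_h": rng.normal(scale=scale, size=(H_DIM, H_DIM + Z_DIM + A_DIM)),
        "W_mu": rng.normal(scale=scale, size=(Z_DIM, H_DIM)),
        "W_std": rng.normal(scale=scale, size=(Z_DIM, H_DIM)),
        "w_reward": rng.normal(scale=scale, size=H_DIM + Z_DIM),
    }


def softplus(x):
    return np.log1p(np.exp(x))


def imagine_step(params, h, z, action, rng):
    # Deterministic path: the recurrent state summarises the history.
    h_next = np.tanh(params["W_h"] @ np.concatenate([h, z, action]))
    # Stochastic path: sample the next latent from a prior conditioned on h_next.
    mean = params["W_mu"] @ h_next
    std = softplus(params["W_std"] @ h_next) + 0.1
    z_next = mean + std * rng.normal(size=Z_DIM)
    # Reward head: predict the reward from the combined state.
    reward = float(params["w_reward"] @ np.concatenate([h_next, z_next]))
    return h_next, z_next, reward


def imagine_rollout(params, policy, horizon, rng):
    h, z = np.zeros(H_DIM), np.zeros(Z_DIM)
    rewards = []
    for _ in range(horizon):
        action = policy(h, z)
        h, z, reward = imagine_step(params, h, z, action, rng)
        rewards.append(reward)
    return np.array(rewards)


def constant_policy(h, z):
    """Placeholder policy; a real agent would use a learned actor here."""
    return np.ones(A_DIM)


params = init_params(np.random.default_rng(0))
rewards_a = imagine_rollout(params, constant_policy, 15, np.random.default_rng(1))
rewards_b = imagine_rollout(params, constant_policy, 15, np.random.default_rng(2))
```

Because transitions are sampled, two rollouts from the same start state and policy generally diverge, which lets imagined experience reflect uncertainty about the future.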

Monte Carlo Tree Search (MCTS) is a planning algorithm that uses a model of the environment, either known or learned, to simulate future states and select actions by tree search. AlphaGo and AlphaZero run MCTS over the known game rules, guided by a learned policy network and value function. MuZero goes further and learns the dynamics model itself from data, achieving superhuman performance in Go, chess, and shogi and state-of-the-art results on Atari without being given the game rules.
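The four MCTS phases (selection, expansion, simulation, backpropagation) can be sketched on a toy single-player problem. This is plain UCT with random rollouts over a known model, not the AlphaZero or MuZero variant, which replace rollouts with a value network and add a learned policy prior. The toy environment (a walk along a line for `HORIZON` steps, scored by final position) and all names are assumptions made for illustration.

```python
import math
import random

HORIZON = 4
ACTIONS = (-1, +1)  # step left or right


def step(state, action):
    """Toy model of the environment: state is (position, steps_taken)."""
    position, t = state
    return (position + action, t + 1)


def is_terminal(state):
    return state[1] >= HORIZON


def terminal_reward(state):
    return state[0] / HORIZON  # in [-1, 1]; always stepping right is optimal


class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}  # action -> Node
        self.visits = 0
        self.total_value = 0.0


def ucb_child(node, c):
    """Pick the child maximising mean value plus an exploration bonus."""
    def score(child):
        mean = child.total_value / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def rollout(state, rng):
    while not is_terminal(state):
        state = step(state, rng.choice(ACTIONS))
    return terminal_reward(state)


def mcts(root_state, n_simulations, rng, c=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, [root]
        # 1. Selection: descend while the node is fully expanded.
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = ucb_child(node, c)
            path.append(node)
        # 2. Expansion: add one untried child.
        if not is_terminal(node.state):
            untried = [a for a in ACTIONS if a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(step(node.state, action))
            node = node.children[action]
            path.append(node)
        # 3. Simulation: random rollout using the model only.
        value = rollout(node.state, rng)
        # 4. Backpropagation: update statistics along the visited path.
        for visited in path:
            visited.visits += 1
            visited.total_value += value
    # Act on the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)


best_action = mcts((0, 0), n_simulations=500, rng=random.Random(0))
print(best_action)
```

Every call to `step` here queries the model rather than a real environment, which is why the same search works whether the model is given (AlphaZero) or learned (MuZero).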

World models connect RL to generative modelling: a good world model is essentially a conditional generative model of future observations. Large generative models (video generation models, latent diffusion models) can serve as world models for robotics, predicting future video frames given actions and enabling visual planning. Genie (Google DeepMind, 2024) and similar systems learn interactive world models from unlabelled internet video, inferring a latent action space so that the generated world can be controlled by action inputs.

The key advantage of world models: sample efficiency. Rather than needing millions of real environment interactions, an agent with a good world model can simulate millions of rollouts internally and learn from them, requiring far less real-world data.
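A small, deterministic illustration of this effect is a Dyna-style update on a toy chain, sketched below. It is not how Dreamer or MuZero work: the "world model" is just a table of remembered transitions, the environment is a hypothetical six-state chain with a reward at the right end, and planning sweeps over the table in a fixed order rather than sampling from it. Both agents take the same five real steps; only one of them also replays imagined transitions.

```python
import numpy as np

N_STATES, RIGHT = 6, 1  # chain of states 0..5, goal at state 5 (toy assumption)
ALPHA, GAMMA = 0.5, 0.9


def real_step(state, action):
    """The 'real' environment: the expensive thing we want to query rarely."""
    next_state = min(state + 1, N_STATES - 1) if action == RIGHT else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_update(Q, s, a, r, s_next):
    """One Q-learning update; the goal state is terminal."""
    terminal = s_next == N_STATES - 1
    target = r if terminal else r + GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (target - Q[s, a])


def train(n_planning_sweeps):
    Q = np.zeros((N_STATES, 2))
    model = {}  # learned tabular model: (state, action) -> (next_state, reward)
    state, real_steps = 0, 0
    # Real experience: walk right once along the chain.
    while state != N_STATES - 1:
        next_state, reward = real_step(state, RIGHT)
        real_steps += 1
        q_update(Q, state, RIGHT, reward, next_state)
        model[(state, RIGHT)] = (next_state, reward)
        state = next_state
    # Imagined experience: replay transitions from the model, no real steps.
    for _ in range(n_planning_sweeps):
        for (s, a), (s_next, r) in model.items():
            q_update(Q, s, a, r, s_next)
    return Q, real_steps


q_real_only, steps_real_only = train(n_planning_sweeps=0)
q_with_model, steps_with_model = train(n_planning_sweeps=10)
print(steps_real_only, steps_with_model)          # same real-data budget
print(q_real_only[0, RIGHT], q_with_model[0, RIGHT])
```

With real data alone, a single pass only teaches the state next to the goal; replaying the model lets the reward signal propagate back to the start state at no additional real-world cost.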

Analogy

Chess grandmasters are estimated to mentally calculate 5-20 moves ahead during play. They do not need to physically move pieces to evaluate the consequences of a move sequence - they use a mental model of chess dynamics to simulate future positions and evaluate them. This internal simulation is far more sample-efficient than learning only from actually played games. World models give RL agents the same capability: an internal model to simulate and plan rather than requiring all learning to come from real interaction.

Real-world example

Consider a Dreamer-style agent learning a robotic manipulation policy from on the order of 100,000 real robot interactions (roughly 12 hours of data). During training, each real experience is stored and used to train the world model. The policy is then trained by generating millions of imagined rollouts inside the world model - the robot's policy network sees imagined observations and imagined rewards rather than real ones. Despite training primarily on imagination, the resulting policy transfers to the real robot and can be competitive with model-free methods that would need far more real interaction, illustratively around 10 million steps, a 100-fold difference.

Why it matters

World models are central to the quest for sample-efficient RL that can learn complex skills without requiring impractical amounts of real-world experience. They also connect RL to the broader field of generative modelling, suggesting that powerful generative models of the future could serve as world models for decision-making. Understanding world models explains why MuZero works without knowing game rules, why Dreamer can learn from so little real data, and why video generation models are increasingly relevant to robotics.

Related concepts

Actor-Critic

A reinforcement learning architecture that combines a policy network (the actor, which decides which actions to take) with a value network (the critic, which evaluates how good the current state is) - reducing gradient variance and enabling more stable learning than pure policy gradient.

Markov Decision Process

The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and not on how you got there.

Monte Carlo Tree Search

A planning algorithm that builds a search tree by simulating random rollouts from each candidate action, using the aggregate results to estimate action values - the algorithm that powered AlphaGo's superhuman performance in the ancient game of Go.
