Offline Reinforcement Learning
A variant of reinforcement learning that learns a policy entirely from a static dataset of pre-collected experience, without any environment interaction during training - enabling RL from historical logs when real-world exploration is impossible or dangerous.
Added May 18, 2026 · 3 min read
Standard reinforcement learning requires the agent to interact with an environment to collect training data. This interaction is the source of its learning signal but also its primary limitation: for many real-world applications, online exploration is dangerous (medical treatments, financial markets), expensive (physical robots, large language model API calls), or simply impossible (the data collection process has ended). Offline RL (also called batch RL) removes this constraint by learning from a fixed dataset of logged transitions collected by some other policy (or policies) without any further environment interaction.
The dataset in offline RL may have been collected by a human expert, a suboptimal prior policy, a mixture of policies, or historical system logs. The offline RL agent's task is to learn a policy that ideally outperforms the policies that generated the data. It does so by identifying which state-action sequences led to good outcomes, recombining the best parts of different trajectories, and generalising to states similar to those in the data.
Offline RL is fundamentally harder than online RL because of distributional shift: the learned policy may prefer state-action pairs that the dataset does not cover, and the learned Q-function or value function can produce wildly inaccurate estimates for these out-of-distribution actions. Because value targets typically take a maximum over actions, an overestimated out-of-distribution action can propagate its error into the values of states that are well covered. Online RL can correct these errors by collecting new experience in poorly understood regions; offline RL cannot.
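The effect can be illustrated with a minimal tabular sketch. Everything here is a hypothetical toy: the two-state dataset, the helper `fitted_q`, and the initial value of 5.0, which stands in for the arbitrary number a function approximator might output for an action it never saw.

```python
GAMMA = 0.9

# Logged transitions: (state, action, reward, next_state, done).
# Action 1 is never taken anywhere in the data, so Q(s, 1) is never corrected.
DATASET = [
    (0, 0, 1.0, 1, False),
    (1, 0, 0.0, 1, True),
]


def fitted_q(dataset, n_states, n_actions, init, in_data_only=False,
             sweeps=200, lr=0.5):
    """Tabular Q-learning that only ever replays the fixed dataset."""
    q = [[init] * n_actions for _ in range(n_states)]
    seen = {(s, a) for s, a, _, _, _ in dataset}
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            if done:
                target = r
            else:
                # Naive variant: max over ALL actions, including unseen ones.
                # Restricted variant: max only over actions logged in s2.
                candidates = [q[s2][b] for b in range(n_actions)
                              if not in_data_only or (s2, b) in seen]
                target = r + GAMMA * max(candidates, default=0.0)
            q[s][a] += lr * (target - q[s][a])
    return q


naive = fitted_q(DATASET, n_states=2, n_actions=2, init=5.0)
restricted = fitted_q(DATASET, n_states=2, n_actions=2, init=5.0,
                      in_data_only=True)
print(naive[0][0], restricted[0][0])
```

In the naive run the never-observed value Q(1, 1) stays at its initial 5.0 and leaks into Q(0, 0) through the max, even though the data only supports a return of 1.0 from state 0. Restricting the backup to logged actions removes the leak in this toy case; practical algorithms achieve a similar effect with softer constraints.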
Conservative Q-Learning (CQL) addresses this with explicit pessimism: it adds a regulariser that pushes down Q-values for actions outside the dataset while pushing up Q-values for actions that appear in it, so the learned Q-function tends to underestimate the value of out-of-distribution actions rather than overestimate them. The resulting policy stays close to the data distribution, where the Q-function estimates are reliable. Implicit Q-Learning (IQL) takes a different approach: it never queries the Q-function on out-of-distribution actions, instead fitting its value function by expectile regression using only actions from the dataset.
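A minimal sketch of the two ideas for a single state with discrete actions follows. The function names and the default `tau=0.7` are illustrative assumptions, not taken from any particular library; in a full algorithm each term would be weighted and added to (CQL) or combined with (IQL) a standard temporal-difference loss.

```python
import math


def cql_penalty(q_values, data_action):
    """CQL-style regulariser for one state with discrete actions.

    log-sum-exp over all actions (a soft maximum) minus the Q-value of the
    action actually logged. Minimising it lowers Q for unlogged actions
    relative to the logged one.
    """
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss of the kind IQL uses to fit V(s).

    diff = Q(s, a) - V(s) for a dataset action a. With tau > 0.5,
    positive errors are weighted more, so V leans toward the better
    dataset actions without evaluating any unlogged action.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2


# An inflated estimate for the unlogged action 1 raises the penalty.
print(cql_penalty([0.0, 0.0], data_action=0))
print(cql_penalty([0.0, 5.0], data_action=0))
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value, and it grows as unlogged actions are valued above the logged one.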
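Decision Transformer reframes offline RL as a sequence modelling problem: train a Transformer to predict actions given the sequence of past states, actions, and returns-to-go (the sum of rewards from the current step to the end of the trajectory). At inference time, condition the model on a high target return, and it generates actions predicted to achieve that return. Because training is supervised prediction over logged trajectories, there is no bootstrapped value function to be corrupted by out-of-distribution actions, although conditioning on returns far beyond those seen in the data still asks the model to extrapolate.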
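The data preparation behind this framing can be sketched in a few lines. The helper names `returns_to_go` and `interleave`, and the tuple-based token format, are hypothetical and for illustration only; a real implementation would embed each element as a vector before passing it to the Transformer.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the episode."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(rtg, states, actions):
    """Lay a trajectory out as (return-to-go, state, action) triples in time order."""
    sequence = []
    for g, s, a in zip(rtg, states, actions):
        sequence.extend([("return", g), ("state", s), ("action", a)])
    return sequence


rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)          # [3.0, 2.0, 2.0]
tokens = interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
print(tokens[:3])
```

At inference time the first return token is replaced by the desired target return, and after each step the reward actually received is subtracted from it to form the next return token.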
Applications include healthcare (learning treatment policies from electronic health records without conducting clinical trials), recommendation systems (learning from historical click logs), and robotics (learning from human teleoperation data without additional robot time). Offline RL also plays a role in LLM fine-tuning: DPO (Direct Preference Optimization) can be viewed as an offline RL algorithm, because it trains on a fixed dataset of preference pairs and so avoids the online sample collection that PPO requires.
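To make the offline character of DPO concrete, here is a minimal sketch of its per-pair loss. It assumes the four inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model; the function name and the default `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favours the chosen
    response over the rejected one, relative to the reference model.
    All inputs come from a fixed preference dataset; nothing is sampled
    from the policy during training.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives log(2); raising the chosen
# response's log-probability lowers the loss.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```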
Analogy
Picture a new surgeon learning to perform a procedure by studying the detailed records of thousands of past operations - reading surgical notes, studying outcomes, analysing what decisions were made under what conditions - without being allowed to perform any operations during training. The challenge is learning a skill from historical data rather than practice, while avoiding conclusions that generalise poorly beyond the documented cases. The surgeon must be appropriately conservative, sticking to approaches well-represented in the historical record rather than extrapolating to untested territory.
Real-world example
A hospital wants to train an RL policy to recommend sepsis treatment without conducting potentially harmful real-world exploration. It collects 5 years of electronic health records: patient states at each 4-hour interval (vital signs, lab results, treatments administered), treatment decisions made by physicians, and patient outcomes. CQL trains on this dataset, learning that certain treatment sequences for certain patient presentations led to better outcomes than others. The resulting policy is evaluated retrospectively, in simulation and on held-out patient cases, before being considered for clinical decision support.
Why it matters
Offline RL unlocks reinforcement learning for the many domains where online exploration is impractical. Healthcare, finance, robotics, and recommendation systems all have rich historical datasets, and in each of them the risk of letting an untrained agent explore makes online RL infeasible. Understanding offline RL - and its key challenge of distributional shift - explains why algorithms like CQL and DPO are designed differently from online RL algorithms, and why the field is increasingly important as RL moves from game environments to real-world applications.
Related concepts
Direct Preference Optimization (DPO)
A simpler alternative to RLHF that achieves alignment without needing a separate reward model - training the language model directly on human preference pairs.
Imitation Learning
Learning a policy by directly training on expert demonstrations - teaching an agent to behave like an expert by showing it what to do, rather than having it discover behaviours through reward-driven trial and error.
Markov Decision Process
The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and not on how you got there.