Concept

Markov Decision Process

The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and not on how you got there.

Added May 18, 2026

Almost every reinforcement learning algorithm is grounded in the Markov Decision Process (MDP). An MDP is a mathematical framework for modelling sequential decision-making under uncertainty. It consists of four elements: a set of states S (all possible situations the agent might be in), a set of actions A (all possible choices the agent can make), a transition function T(s, a, s') that specifies the probability of reaching state s' after taking action a in state s, and a reward function R(s, a, s') that specifies the immediate reward received for each transition.

The Markov property - the defining feature that gives MDPs their name - states that the probability of transitioning to a future state depends only on the current state and action, not on the history of how the agent arrived at the current state. This "memorylessness" is what makes MDPs mathematically tractable. The full history of past states is irrelevant once you know the current state.

The agent's goal in an MDP is to find a policy - a mapping from states to actions - that maximises the expected cumulative reward over time. Two formulations are common. Episodic MDPs have a terminal state; the agent maximises total reward within an episode. Continuing MDPs run indefinitely; future rewards are discounted by a discount factor gamma (typically 0.95 to 0.99), making immediate rewards more valuable than distant future rewards and ensuring the sum converges.

The value function V(s) represents the expected cumulative discounted reward starting from state s and following the optimal policy. The action-value function Q(s, a) represents the expected cumulative reward starting from state s, taking action a, and thereafter following the optimal policy. The Bellman equations express the recursive relationship between these values: the value of a state equals the immediate reward plus the discounted value of the next state. Most RL algorithms solve for the value function or policy by exploiting this recursive structure.

Partially Observable MDPs (POMDPs) extend the framework to situations where the agent cannot directly observe the full state - only a partial, noisy observation. POMDPs are significantly harder to solve but better model real-world problems like robotic perception (the robot does not know its exact position in the world, only what its sensors report).

MDPs with large or continuous state spaces require function approximation (neural networks) rather than exact table-based solutions, which is the foundation of deep reinforcement learning.

Analogy

Chess as an MDP: the state is the current board position, actions are legal moves, the transition function is deterministic (the board position after a move is fully determined), and rewards are sparse (only at game end: +1 for win, -1 for loss, 0 for draw). The Markov property holds: to play optimally from a given board position, you do not need to know the sequence of moves that led to it - only the position itself. A grandmaster evaluating a mid-game position needs only the current position, not the game history, to determine the best move.

Real-world example

A robot navigating a warehouse can be modelled as an MDP: states are grid positions, actions are movements (up, down, left, right), transitions specify where the robot ends up after each move (deterministic, or stochastic if wheel slip is modelled), and rewards are defined as -1 per step (encouraging efficiency) and +100 for reaching the delivery destination. Solving this MDP produces a policy mapping every possible position to the optimal next action.

Why it matters

MDPs are the conceptual foundation of reinforcement learning. Every RL algorithm - Q-learning, policy gradient, actor-critic - is solving an MDP or a variant of one. Understanding MDPs explains why RL algorithms are structured the way they are, why the Bellman equations appear everywhere in RL theory, and why certain problem structures (large state spaces, partial observability, reward sparsity) make RL hard. It is the vocabulary needed to understand any serious RL work.

In the news

AI Agents Learn When to Act Safely and Efficiently
arXiv CS.LG · 6d ago

Related concepts

Policy Gradient Q-Learning Reward Shaping

← Back to concepts