Concept

Q-Learning

A foundational reinforcement learning algorithm that learns the value of state-action pairs directly from experience, without needing a model of the environment - allowing an agent to discover optimal policies through trial and error.

Added May 18, 2026

Q-learning is one of the most important algorithms in reinforcement learning. It learns a Q-function: a mapping from (state, action) pairs to their expected long-term value. Q(s, a) answers the question: "if I'm in state s, take action a, and act optimally from then on, what discounted cumulative reward can I expect from this point forward?" Once the Q-function is learned, the optimal policy is simple: in any state, take the action with the highest Q-value.
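As a small illustration of that last step, the sketch below reads a greedy policy out of a Q-table stored as a NumPy array. The table values and the function name greedy_policy are made up for the example; rows are assumed to be states and columns actions.

```python
import numpy as np

def greedy_policy(q_table: np.ndarray) -> np.ndarray:
    """Return, for each state (row), the index of the action with the highest Q-value."""
    return np.argmax(q_table, axis=1)

# Hypothetical Q-table: 3 states (rows) x 2 actions (columns).
q_table = np.array([
    [0.1, 0.5],   # state 0: action 1 looks better
    [0.7, 0.2],   # state 1: action 0 looks better
    [0.0, 0.9],   # state 2: action 1 looks better
])

policy = greedy_policy(q_table)
print(policy)  # [1 0 1]
```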

Q-learning is model-free: it learns Q-values directly from experience without needing to know the transition probabilities or reward function of the environment. The agent observes the current state, takes an action, receives a reward, and observes the next state. It then updates its Q estimate with a rule derived from the Bellman optimality equation: Q(s,a) is moved a small step toward the target r + gamma * max_a' Q(s', a'), that is, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s,a)). Here r is the immediate reward, gamma is the discount factor, alpha is the learning rate, and max_a' Q(s', a') is the best Q-value currently estimated for the next state.
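The update can be sketched as a single function on a tabular Q array. This is a minimal sketch, not a reference implementation: the function name q_update, the default values of alpha and gamma, and the done flag (used to stop bootstrapping at terminal states) are illustrative choices.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # A terminal next state has no future value to bootstrap from.
    best_next = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + gamma * best_next           # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error       # move Q(s, a) toward the target
    return td_error

# Worked example: 2 states, 2 actions, next state already valued at 1.0.
q = np.zeros((2, 2))
q[1] = [0.0, 1.0]
q_update(q, state=0, action=1, reward=0.5, next_state=1, done=False, alpha=0.5, gamma=0.9)
# target = 0.5 + 0.9 * 1.0 = 1.4, so Q[0, 1] moves halfway from 0 to 1.4, i.e. to 0.7
```

Only the entry for the visited (state, action) pair changes; every other entry is left as it was.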

This update rule makes Q-learning an off-policy algorithm: the agent can learn about the optimal policy while following a different (exploratory) policy. The exploratory policy tries actions that might not be optimal in order to gather information about their Q-values. Because the target uses the maximum over next actions rather than the action the agent actually takes next, the update still converges toward the optimal Q-values regardless of which policy collects the data, provided that policy keeps visiting every state-action pair.
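A toy demonstration of the off-policy property, under stated assumptions: a hypothetical five-state deterministic chain where only reaching the right-most state pays +1, and a behaviour policy that picks actions uniformly at random. The environment, constants, and seed are invented for illustration.

```python
import numpy as np

N_STATES, GOAL = 5, 4          # states 0..4; state 4 is terminal
LEFT, RIGHT = 0, 1
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Deterministic chain: move left or right, reward +1 only on reaching GOAL."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, 2))

for _ in range(500):
    state, done = 0, False
    while not done:
        action = int(rng.integers(2))            # behaviour policy: uniformly random
        next_state, reward, done = step(state, action)
        best_next = 0.0 if done else q[next_state].max()
        # The target uses the max over next actions, not the action the agent takes next.
        q[state, action] += ALPHA * (reward + GAMMA * best_next - q[state, action])
        state = next_state

# Greedy action in each non-terminal state, read from the learned table.
learned_policy = q[:GOAL].argmax(axis=1)
```

In this chain the optimal value of moving right from state s is gamma^(3 - s), so the learned table can be checked against 0.729, 0.81, 0.9 and 1.0 even though the agent never behaved greedily while learning.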

For environments with small, discrete state and action spaces, Q-learning can be implemented as a lookup table (a Q-table) that stores one Q-value per state-action pair. For Tic-Tac-Toe or small grid worlds, Q-tables work well. For larger state spaces (Atari games, where states are raw pixel images), a Q-table is infeasible - there are too many states to enumerate. Deep Q-Networks (DQN) replace the Q-table with a neural network that takes a state as input and outputs Q-values for all actions.
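Some back-of-the-envelope arithmetic shows why the table stops being practical. The numbers below are assumptions chosen for illustration: a 4x4 grid with 4 moves, an upper bound of 3^9 Tic-Tac-Toe boards with up to 9 moves, and an 84x84 grayscale image with 256 intensity levels per pixel.

```python
import math

def q_table_entries(n_states: int, n_actions: int) -> int:
    """Number of values a Q-table must store: one per state-action pair."""
    return n_states * n_actions

# Small grid world: 16 cells, 4 moves.
grid_entries = q_table_entries(16, 4)            # 64

# Tic-Tac-Toe: each of 9 cells is empty, X, or O (upper bound, includes unreachable boards).
ttt_entries = q_table_entries(3 ** 9, 9)         # 177147

# Pixel input: every distinct image is a distinct state.
n_pixels = 84 * 84
log10_states = n_pixels * math.log10(256)        # exponent of 10 in the state count

print(grid_entries, ttt_entries, round(log10_states))
```

The first two tables fit comfortably in memory; the third would need a number of rows with roughly seventeen thousand digits, which is why a function approximator is used instead.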

Convergence of Q-learning to the optimal Q-function is guaranteed in the tabular setting if all state-action pairs are visited infinitely often and the learning rate decays appropriately (the step sizes sum to infinity while their squares sum to a finite value). The exploration-exploitation tradeoff determines how the agent balances trying new actions (to discover better strategies) against exploiting actions already known to be good. Epsilon-greedy exploration - taking a random action with probability epsilon, otherwise taking the greedy action - is the simplest and most widely used strategy.
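Epsilon-greedy selection fits in a few lines. This is a minimal sketch; the function name and the use of a NumPy random generator passed in by the caller are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform over all actions
    return int(np.argmax(q_row))               # exploit: highest current Q-value

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9, 0.3])              # hypothetical Q-values for one state
action = epsilon_greedy(q_row, epsilon=0.1, rng=rng)
```

With epsilon set to 0 the agent is purely greedy, and with epsilon set to 1 it acts uniformly at random; in practice epsilon is often started high and reduced as training progresses.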

Double Q-learning addresses a systematic overestimation bias in standard Q-learning: taking the maximum over noisy estimates, with the same Q-function used to both select and evaluate the action, produces optimistic value estimates. Double Q-learning maintains two Q-functions (two tables in the tabular form, two networks in deep variants such as Double DQN): one selects the best action, the other evaluates its value, decorrelating the two operations.
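A tabular sketch of the idea, assuming two NumPy tables of the same shape. The function names, the fair coin flip that decides which table learns, and the default hyperparameters are illustrative choices rather than a definitive implementation.

```python
import numpy as np

def double_q_update(q_learn, q_other, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """Update q_learn in place: q_learn selects the next action, q_other evaluates it."""
    if done:
        target = reward
    else:
        best_action = int(np.argmax(q_learn[next_state]))            # selection
        target = reward + gamma * q_other[next_state, best_action]   # evaluation
    q_learn[state, action] += alpha * (target - q_learn[state, action])

def double_q_step(q_a, q_b, transition, rng, **kwargs):
    """Flip a fair coin to decide which table learns from this transition."""
    if rng.random() < 0.5:
        double_q_update(q_a, q_b, *transition, **kwargs)
    else:
        double_q_update(q_b, q_a, *transition, **kwargs)

# Worked example: q_a thinks action 0 is best in state 1, q_b values that action at 0.2.
q_a, q_b = np.zeros((2, 2)), np.zeros((2, 2))
q_a[1] = [1.0, 0.0]
q_b[1] = [0.2, 5.0]
double_q_update(q_a, q_b, 0, 0, 1.0, 1, False, alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 0.2 = 1.18, so q_a[0, 0] becomes 0.59
```

A single-table update on q_a alone would have used its own maximum of 1.0 in the target; here the second table's more modest estimate of the same action is used instead.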

Analogy

Learning a board game by playing many games and, after each move, updating your mental note of how good that move was from that position. If a move from a certain board position eventually led to a win, you update your estimate upward. If it led to a loss, you update it downward. After thousands of games, your mental Q-table contains accurate estimates of how good each move is from each position, and you can play optimally by always choosing the move with the best estimated value.

Real-world example

A Q-learning agent learning to play Frozen Lake (a small grid world where the agent must reach a goal while avoiding holes) starts with all Q-values set to zero. Over thousands of episodes, it tries various paths. When it falls in a hole, the episode ends; in this setup it receives a reward of -1 (the standard Gymnasium FrozenLake environment gives 0 for a hole and +1 only at the goal) and lowers the Q-value of the action that led there. When it reaches the goal, it receives +1 and raises the Q-value of the final move; over repeated episodes that value propagates backward through the earlier moves on the path. After enough exploration, the Q-table converges to values that guide the agent reliably to the goal.
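The training run described above can be sketched end to end with a small hand-written grid instead of a library environment. Everything here is an assumption for illustration: the 4x4 map, deterministic (non-slippery) moves, the +1 / -1 reward scheme from the description, random tie-breaking among equally valued actions, and the hyperparameters.

```python
import numpy as np

# Hypothetical 4x4 map: S = start, F = frozen, H = hole, G = goal.
GRID = "SFFF" "FHFH" "FFFH" "HFFG"
SIZE, GOAL = 4, 15
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]   # left, down, right, up as (d_row, d_col)

def step(state, action):
    """Deterministic move; walking into a wall leaves the agent in place."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    next_state = row * SIZE + col
    tile = GRID[next_state]
    reward = 1.0 if tile == "G" else -1.0 if tile == "H" else 0.0
    return next_state, reward, tile in "GH"   # episode ends on goal or hole

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((SIZE * SIZE, len(MOVES)))   # all Q-values start at zero
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = int(rng.integers(len(MOVES)))             # explore
            else:
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))                     # exploit, ties at random
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else q[next_state].max()
            q[state, action] += alpha * (reward + gamma * best_next - q[state, action])
            state = next_state
            if done:
                break
    return q

def greedy_path(q, max_steps=50):
    """Follow the greedy policy from the start and return the visited states."""
    path, state = [0], 0
    for _ in range(max_steps):
        state, _, done = step(state, int(np.argmax(q[state])))
        path.append(state)
        if done:
            break
    return path

q = train()
path = greedy_path(q)
print(path)
```

The printed path lists the cell indices (0 to 15, row by row) that the greedy policy visits from the start; after training it should end at cell 15 without passing through a hole.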

Why it matters

Q-learning is the conceptual ancestor of DQN and a large family of value-based deep RL algorithms. Understanding it provides the foundation for understanding why deep RL works, what the Bellman equation is doing, and what the off-policy distinction means. It is also directly applicable to smaller-scale RL problems where tabular methods work, and appears frequently in ML interviews and course curricula as the canonical introduction to RL.

Related concepts

Deep Q-Network
Exploration vs Exploitation
Markov Decision Process
