Concept
Policy Gradient
A family of reinforcement learning algorithms that directly optimise a parameterised policy by computing gradients of expected reward with respect to the policy parameters. This makes RL practical on continuous action spaces, where value-based methods struggle.
Added May 18, 2026
Value-based RL methods like Q-learning learn the value of actions and derive a policy from those values. Policy gradient methods take a fundamentally different approach: they directly represent the policy as a parameterised function (typically a neural network) and optimise it by gradient ascent on expected reward. Rather than learning how good each action is and then acting greedily, policy gradient methods directly learn which actions to take.
The parameterised policy takes a state as input and outputs a probability distribution over actions: for discrete actions, a softmax distribution; for continuous actions, a Gaussian distribution with learned mean and variance. The agent samples actions from this distribution, receives rewards, and updates the policy parameters to make higher-reward actions more probable.
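The two output heads described above can be sketched in a few lines of standard-library Python. This is only an illustration: the logits, mean, and standard deviation below are hypothetical stand-ins for what a policy network would produce, and the function names are made up for this example.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over discrete actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Sample an action index from the softmax distribution over the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_gaussian(mean, log_std, rng):
    """Sample a continuous action from a Gaussian with the given mean and log std."""
    return rng.gauss(mean, math.exp(log_std))

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # hypothetical network outputs for three discrete actions
probs = softmax(logits)
action = sample_discrete(logits, rng)
torque = sample_gaussian(0.3, math.log(0.1), rng)  # hypothetical mean 0.3, std 0.1
print(probs, action, torque)
```

Parameterising the standard deviation through its logarithm is a common choice (assumed here, not stated in the text above) because it keeps the standard deviation positive without any constraint on the learned parameter.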
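The policy gradient theorem provides the mathematical basis for computing the gradient of expected reward with respect to policy parameters: the gradient is an expectation, under the current policy, of the gradient of the log-probability of each action taken, weighted by how good that action turned out to be. The REINFORCE algorithm (Williams, 1992) is the canonical policy gradient algorithm: collect a full episode of experience under the current policy, compute the return (cumulative discounted reward) from each step onward, and update the policy parameters to raise the log-probability of each action in proportion to the return that followed it. When a baseline is subtracted from the return, this amounts to reinforcing actions that led to higher-than-average returns.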
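A minimal sketch of one REINFORCE update, assuming a tabular softmax policy where `theta[s]` holds the action logits for state `s`. The table layout, the `(state, action, reward)` episode format, and the function names are illustrative assumptions, not part of any particular library.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce_update(theta, episode, gamma=0.99, lr=0.1):
    """One gradient-ascent step on a full episode of (state, action, reward) tuples."""
    rewards = [r for _, _, r in episode]
    returns = discounted_returns(rewards, gamma)
    new_theta = [row[:] for row in theta]
    for (s, a, _), g in zip(episode, returns):
        probs = softmax(theta[s])  # gradients are evaluated at the old parameters
        for j in range(len(probs)):
            # For a softmax policy, d log pi(a|s) / d logit_j = 1[j == a] - pi(j|s)
            grad_log_pi = (1.0 if j == a else 0.0) - probs[j]
            new_theta[s][j] += lr * g * grad_log_pi
    return new_theta

theta = [[0.0, 0.0], [0.0, 0.0]]       # two states, two actions, uniform policy
episode = [(0, 1, 0.0), (1, 0, 1.0)]   # hypothetical episode
theta = reinforce_update(theta, episode, gamma=0.5, lr=0.1)
print(theta)
```

In this example the only reward arrives at the last step, yet the action taken in state 0 is also reinforced, because its discounted return includes that later reward.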
Policy gradient estimates have high variance: because reward signals are noisy and the return depends on the full sequence of future actions, the gradient estimate from any single episode can differ substantially from the true gradient. Baselines reduce this variance: instead of reinforcing actions by their absolute return, reinforce them by how much better their return was than a baseline (typically the current state value estimate). As long as the baseline does not depend on the action taken, the expected gradient is unchanged, so the estimator remains unbiased while its variance drops significantly.
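The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a two-armed bandit with deterministic rewards (a hypothetical setup chosen so the expectation can be enumerated rather than sampled) and computes the mean and variance of the single-sample gradient estimate for one logit, with and without a baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_stats(logits, rewards, baseline):
    """Exact mean and variance of the one-sample gradient estimate for logit 0."""
    probs = softmax(logits)
    # Estimate when action a is sampled: (R(a) - b) * d log pi(a) / d logit_0
    estimates = [
        (rewards[a] - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, estimates))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, estimates))
    return mean, var

logits = [1.0, 0.0]
rewards = [10.0, 11.0]  # hypothetical rewards: large offset, small difference
value = sum(p * r for p, r in zip(softmax(logits), rewards))  # expected reward

mean_raw, var_raw = gradient_stats(logits, rewards, baseline=0.0)
mean_bl, var_bl = gradient_stats(logits, rewards, baseline=value)
print(mean_raw, var_raw)
print(mean_bl, var_bl)
```

Both calls give the same mean, which is the true gradient, but the variance with the baseline is far smaller, because the large shared offset in the rewards no longer scales every estimate.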
Policy gradient methods are naturally suited to continuous action spaces. A robot arm with joints that can take any angle in a continuous range cannot use standard Q-learning directly, because choosing the greedy action would mean maximising over infinitely many actions. A policy gradient method instead has the policy output a Gaussian distribution over joint angles and optimises its mean and variance directly, which has made these methods the standard approach for continuous control tasks.
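As a rough illustration of optimising a Gaussian's mean and variance directly, the sketch below trains a one-dimensional Gaussian policy on a made-up task where the reward is the negative squared distance to a target action. The task, the hyperparameters, and the batch-mean baseline are all assumptions chosen for simplicity (the batch-mean baseline in particular introduces a small bias that is ignored here).

```python
import math
import random

def train_gaussian_policy(target=2.0, steps=2000, batch=64, lr=0.01, seed=0):
    """Score-function gradient ascent on the mean and log std of a 1-D Gaussian policy."""
    rng = random.Random(seed)
    mu, log_std = 0.0, 0.0
    for _ in range(steps):
        sigma = math.exp(log_std)
        actions = [rng.gauss(mu, sigma) for _ in range(batch)]
        rewards = [-(a - target) ** 2 for a in actions]
        baseline = sum(rewards) / batch  # batch-mean baseline (an assumption)
        grad_mu, grad_log_std = 0.0, 0.0
        for a, r in zip(actions, rewards):
            adv = r - baseline
            z = (a - mu) / sigma
            grad_mu += adv * z / sigma            # d log p / d mu = (a - mu) / sigma^2
            grad_log_std += adv * (z * z - 1.0)   # d log p / d log_std = z^2 - 1
        mu += lr * grad_mu / batch
        log_std += lr * grad_log_std / batch
    return mu, math.exp(log_std)

mu, sigma = train_gaussian_policy()
print(mu, sigma)
```

The mean should drift toward the target while the standard deviation shrinks from its initial value of 1, since on this task exploration noise only costs reward once the mean is close to the target.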
Trust region methods constrain how much the policy can change in each update, preventing destructively large policy updates. TRPO (Trust Region Policy Optimisation) enforces a constraint on the KL divergence between the old and new policy. PPO (Proximal Policy Optimisation) achieves a similar effect with a simpler clipped objective that can be optimised with ordinary first-order gradient steps, making it more computationally practical. PPO has become the workhorse policy gradient algorithm in both robotics and LLM fine-tuning (RLHF).
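The clipped objective can be sketched as follows. The function names and the default clip range of 0.2 are choices made for this example; the ratio is the probability of the taken action under the new policy divided by its probability under the old one.

```python
import math

def probability_ratio(new_log_prob, old_log_prob):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_log_prob - old_log_prob)

def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximised)."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum keeps the more pessimistic of the two estimates
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)

# Good action whose probability already rose by 50%: the gain is capped at 1.2
print(ppo_clip_objective([1.5], [1.0]))
# Bad action whose probability already fell by 50%: the objective stops at -0.8
print(ppo_clip_objective([0.5], [-1.0]))
# Good action whose probability fell: not clipped, so the update can still correct it
print(ppo_clip_objective([0.5], [1.0]))
```

Once the ratio leaves the clip range in the direction the advantage favours, the objective becomes flat, so there is no gradient pushing the policy further from the old one on that sample.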
Analogy
Training a competitive diver by giving performance scores after each dive. The coach does not tell the diver exactly which muscle to flex or which movement to correct, only whether this dive scored higher or lower than usual. The diver adjusts their technique to make higher-scoring dive patterns more likely and lower-scoring ones less likely. Over many dives, the technique converges toward consistently high scores. Policy gradient works in the same way: the algorithm adjusts policy parameters to make actions that led to higher rewards more probable, without explicitly modelling which state-action transitions were individually valuable.
Real-world example
A bipedal walking robot can be trained in simulation with a policy gradient method, typically PPO. The policy network takes the robot's joint angles, velocities, and contact forces as input and outputs a Gaussian distribution over joint torques. Initially the robot falls immediately. The PPO algorithm collects rollouts, computes advantages (how much better than the baseline each action turned out to be), and updates the policy to make torque sequences that kept the robot upright more probable. After millions of simulation steps, the policy produces smooth, stable walking gaits.
Why it matters
Policy gradient methods are essential for reinforcement learning in continuous action spaces and have become a dominant RL approach for real-world applications including robot control, game playing, and large language model fine-tuning. PPO, a policy gradient algorithm, is the algorithm most commonly used in RLHF to train language models from human feedback. Understanding policy gradients explains how RL training works for LLMs: why the training can be unstable, why clipping is needed, and what the reward signal is actually optimising.
Related concepts