VII · Reinforcement Learning & Robotics · Advanced

Policy Gradient

A family of reinforcement learning algorithms that directly optimise a parameterised policy by estimating the gradient of expected return with respect to the policy parameters - enabling RL in continuous action spaces where value-based methods struggle.

Added May 18, 2026 · 3 min read

Value-based RL methods like Q-learning learn the value of actions and derive a policy from those values. Policy gradient methods take a fundamentally different approach: they directly represent the policy as a parameterised function (typically a neural network) and optimise it by gradient ascent on expected reward. Rather than learning how good each action is and then acting greedily, policy gradient methods directly learn which actions to take.

The parameterised policy takes a state as input and outputs a probability distribution over actions: for discrete actions, a softmax distribution; for continuous actions, a Gaussian distribution with learned mean and variance. The agent samples actions from this distribution, receives rewards, and updates the policy parameters to make higher-reward actions more probable.
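The two kinds of policy output described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a full policy network: the logits, mean, and log standard deviation are hypothetical values standing in for what a network would produce for one state, and the function names are chosen here for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Sample an action index from a softmax policy head."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_gaussian(mean, log_std, rng):
    """Sample a continuous action from a Gaussian policy head.

    The spread is parameterised as log_std so that the standard
    deviation exp(log_std) is always positive.
    """
    return rng.gauss(mean, math.exp(log_std))

rng = random.Random(0)
# Hypothetical network outputs for a single state (assumed values)
discrete_action = sample_discrete([2.0, 0.5, -1.0], rng)
continuous_action = sample_gaussian(mean=0.3, log_std=-1.0, rng=rng)
```

Learning the log of the standard deviation rather than the variance itself is one common design choice, assumed here because it keeps the spread positive without extra constraints.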

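The policy gradient theorem provides the mathematical basis: it expresses the gradient of expected return with respect to policy parameters as an expectation, over trajectories sampled from the policy, of the gradient of each action's log-probability weighted by the return that followed it. The REINFORCE algorithm (Williams, 1992) is the canonical policy gradient algorithm: collect a full episode of experience under the current policy, compute the return (cumulative discounted reward) from each step onward, and move the policy parameters in the direction that raises the log-probability of each action taken, scaled by that return. Actions followed by high returns therefore become more probable.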
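The REINFORCE loop can be sketched on a toy problem. The example below assumes a hypothetical stateless environment with two actions paying fixed rewards of 0 and 1, a softmax policy whose only parameters are two logits, and arbitrary values for the horizon, discount factor, and learning rate. It is meant to show the shape of the update, not a practical implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def run_episode(theta, arm_rewards, horizon, rng):
    """Roll out one episode in the toy environment (hypothetical)."""
    probs = softmax(theta)
    actions = [rng.choices(range(len(theta)), weights=probs, k=1)[0]
               for _ in range(horizon)]
    rewards = [arm_rewards[a] for a in actions]
    return actions, rewards

def reinforce_update(theta, actions, rewards, gamma, lr):
    """One REINFORCE step: theta += lr * sum_t G_t * grad log pi(a_t)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, g in zip(actions, discounted_returns(rewards, gamma)):
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[a] = 1[k == a] - probs[k]
            grad[k] += g * ((1.0 if k == a else 0.0) - probs[k])
    return [t + lr * gk for t, gk in zip(theta, grad)]

rng = random.Random(0)
theta = [0.0, 0.0]  # start with a uniform policy
for _ in range(3000):
    actions, rewards = run_episode(theta, arm_rewards=[0.0, 1.0],
                                   horizon=3, rng=rng)
    theta = reinforce_update(theta, actions, rewards, gamma=0.9, lr=0.05)
final_probs = softmax(theta)  # probability mass should shift to action 1
```

Note that an action with zero immediate reward can still be reinforced if later steps in the same episode paid off, which is one source of the noise in this estimator.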

Policy gradient estimates have high variance: because rewards are noisy and the return depends on the whole sequence of future actions, the gradient estimated from any single episode can differ substantially from the true gradient. Baselines reduce this variance: instead of weighting actions by their absolute return, weight them by how much better their return was than a baseline (typically an estimate of the current state's value). As long as the baseline does not depend on the action taken, subtracting it leaves the expected gradient unchanged (the estimator stays unbiased) while often reducing variance substantially.
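A small worked example makes the effect of a baseline concrete. The sketch below assumes a hypothetical two-action problem with rewards 10 and 11 and a uniform softmax policy, and computes the exact mean and variance of the single-sample gradient estimate by enumerating both actions rather than sampling. The numbers are chosen purely for illustration.

```python
def gradient_stats(probs, rewards, baseline):
    """Exact mean and variance of the single-sample gradient estimate
    for the logit of action 1 in a two-action softmax policy."""
    samples = []
    for a, p_a in enumerate(probs):
        # Score function: d/d theta_1 of log pi(a) = 1[a == 1] - probs[1]
        score = (1.0 if a == 1 else 0.0) - probs[1]
        samples.append((p_a, (rewards[a] - baseline) * score))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var

probs = [0.5, 0.5]
rewards = [10.0, 11.0]  # assumed rewards: large offset, small difference
value_baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward

mean_raw, var_raw = gradient_stats(probs, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(probs, rewards, baseline=value_baseline)
```

In this toy case both estimators have the same expected gradient, while the variance drops from about 27.6 without a baseline to zero with the expected reward as baseline. Real problems rarely see the variance vanish entirely; the example only shows that the large shared offset in the rewards contributes noise but no signal.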

Policy gradient methods are naturally suited to continuous action spaces. A robot arm whose joints can take any angle in a continuous range cannot use standard Q-learning directly: selecting an action requires a maximum over all actions, which is intractable over infinitely many actions unless they are discretised. A policy gradient method instead has the policy output a Gaussian distribution over joint angles and optimises its mean and variance directly, which makes these methods a standard approach for continuous control tasks.
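For a one-dimensional Gaussian policy, the quantities a policy gradient update needs have a simple closed form. The sketch below assumes the policy is parameterised by a mean and a log standard deviation, and uses a hypothetical sampled action, advantage, and learning rate to show a single gradient-ascent step.

```python
import math

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)

def gaussian_score(action, mean, log_std):
    """Gradient of log pi(action) with respect to (mean, log_std)."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return z / std, z * z - 1.0

# One illustrative gradient-ascent step for a single sampled action
mean, log_std, lr = 0.0, 0.0, 0.1
action, advantage = 0.8, 2.0  # hypothetical sample and its advantage
d_mean, d_log_std = gaussian_score(action, mean, log_std)
new_mean = mean + lr * advantage * d_mean
new_log_std = log_std + lr * advantage * d_log_std
```

With a positive advantage the mean moves toward the sampled action. Because this sample lies within one standard deviation of the mean, the spread also narrows slightly, which is how exploration tends to shrink as the policy becomes confident.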

Trust region methods constrain how much the policy can change in each update, preventing destructively large policy updates. TRPO (Trust Region Policy Optimisation) enforces a constraint on the KL divergence between the old and new policy. PPO (Proximal Policy Optimisation) achieves a similar effect with a simpler clipped objective that can be optimised with ordinary first-order gradient methods, making it cheaper to compute and easier to implement. PPO has become a workhorse policy gradient algorithm in both robotics and LLM fine-tuning (RLHF).
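The clipped objective is short enough to write out for a single action. The sketch below assumes the log-probabilities of the action under the old and new policy and an advantage estimate are already available; the function name is hypothetical, and the clip range of 0.2 is a commonly used value taken here as an assumption.

```python
import math

def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximised)."""
    # Probability ratio between the new and old policy for this action
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum gives a pessimistic bound: the objective stops
    # improving once the ratio leaves the clip range in the direction
    # the advantage favours, so there is no incentive to move further.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio of 1.5 with a positive advantage: the gain is capped at 1.2
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio of 1.5 with a negative advantage: the penalty is not capped
uncapped = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is deliberate: clipping removes the reward for pushing the policy far from the old one, but it never hides the penalty when such a move has made things worse.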

Analogy

Training a competitive diver by giving performance scores after each dive. The coach does not tell the diver exactly which muscle to flex or which movement to correct - just whether this dive scored higher or lower than usual. The diver adjusts their technique to make higher-scoring dive patterns more likely and lower-scoring ones less likely. Over many dives, the technique converges toward consistently high scores. Policy gradient works in much the same way: the algorithm adjusts policy parameters to make actions that led to higher rewards more probable, without explicitly modelling which state-action transitions were individually valuable.

Real-world example

A bipedal walking robot can be trained in simulation with policy gradient methods (commonly PPO). The policy network takes the robot's joint angles, velocities, and contact forces as input and outputs a Gaussian distribution over joint torques. Initially the robot falls immediately. The PPO algorithm collects rollouts, computes advantages (how much better than the baseline each action turned out to be), and updates the policy to make the torques that kept the robot upright more probable. After millions of simulation steps, the policy can produce smooth, stable walking gaits.

Why it matters

Policy gradient methods are central to reinforcement learning in continuous action spaces and are widely used in applications including robot control, game playing, and large language model fine-tuning. PPO, a policy gradient algorithm, is the optimiser most commonly associated with RLHF, which trains language models from human feedback. Understanding policy gradients explains how RL training works for LLMs - why training can be unstable, why clipping is needed, and what the reward signal is actually optimising.

Related concepts

Actor-Critic

A reinforcement learning architecture that combines a policy network (the actor, which decides which actions to take) with a value network (the critic, which evaluates how good the current state is) - reducing gradient variance and enabling more stable learning than pure policy gradient.

Markov Decision Process

The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and not on how you got there.

Proximal Policy Optimization

A widely used policy gradient algorithm in modern RL and LLM fine-tuning - stabilising training by clipping the probability ratio between the new and old policy to prevent destructively large policy updates.
