VII · Reinforcement Learning & Robotics · Advanced

Actor-Critic

A reinforcement learning architecture that combines a policy network (the actor, which decides which actions to take) with a value network (the critic, which evaluates how good the current state is) - reducing gradient variance and enabling more stable learning than pure policy gradient.

Added May 18, 2026 · 3 min read

Actor-critic methods occupy the middle ground between value-based RL (learning Q-values) and pure policy gradient (directly optimising the policy). They maintain two separate but coordinated components: an actor network that represents the policy and selects actions, and a critic network that estimates the value of states and provides feedback to guide the actor's learning.

The motivation for this combination is a core weakness of policy gradient methods: high gradient variance. REINFORCE-style policy gradients weight each action by the complete Monte Carlo return that followed it, which is a noisy estimate of how good that action actually was. An action taken at step 10 of a 100-step episode is reinforced by the total remaining reward from step 10 onward - but much of that reward depends on later stochastic actions and environment transitions, not on the action at step 10. The resulting gradient estimates have high variance and need many samples to converge.
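As a minimal sketch of the quantity REINFORCE uses, the function below computes the return that follows each step of an episode. The name `rewards_to_go` and the reward sequences are illustrative assumptions, not taken from any particular library.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical five-step episode (undiscounted for readability).
returns = rewards_to_go([0.0, 0.0, 1.0, 0.0, 2.0], gamma=1.0)

# Two hypothetical episodes that start with the same action and reward but
# differ in what happens afterwards: the first action is credited with
# very different returns, which is the source of the variance.
g0_quiet = rewards_to_go([1.0, 0.0, 0.0], gamma=1.0)[0]
g0_noisy = rewards_to_go([1.0, 5.0, -2.0], gamma=1.0)[0]
```

The first action is identical in both short episodes, yet `g0_quiet` and `g0_noisy` differ because of rewards collected later.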

The critic addresses this by learning a value function V(s) that estimates the expected future return from a state. The actor is then reinforced by the advantage, A(s,a) = Q(s,a) - V(s), where in practice Q(s,a) is estimated by the observed return (or a bootstrapped target) and V(s) is the critic's baseline. If an action led to a return higher than the critic predicted, it was a good action (positive advantage); if it led to a return lower than predicted, it was bad (negative advantage). Because V(s) depends only on the state, subtracting it leaves the expected gradient unchanged, so variance is substantially reduced without introducing bias. Bootstrapped targets lower variance further but do introduce some bias.
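The effect of the baseline can be checked exactly on a toy problem. The sketch below assumes a single-state, two-action bandit with a sigmoid policy and deterministic rewards; the function name `grad_stats` and the reward values are hypothetical and chosen only for illustration.

```python
def grad_stats(p, rewards, baseline):
    """Exact mean and variance of the one-sample policy-gradient estimate
    g = d/dtheta log pi(a) * (r_a - baseline)
    for a two-action bandit with pi = [p, 1 - p] and p = sigmoid(theta)."""
    probs = [p, 1.0 - p]
    score = [1.0 - p, -p]  # d/dtheta log pi(a) for a = 0 and a = 1
    samples = [score[a] * (rewards[a] - baseline) for a in range(2)]
    mean = sum(probs[a] * samples[a] for a in range(2))
    var = sum(probs[a] * (samples[a] - mean) ** 2 for a in range(2))
    return mean, var


p = 0.5
rewards = [10.0, 11.0]  # both actions pay well; action 1 is slightly better
value = p * rewards[0] + (1.0 - p) * rewards[1]  # V(s) under the current policy

mean_raw, var_raw = grad_stats(p, rewards, baseline=0.0)    # no baseline
mean_adv, var_adv = grad_stats(p, rewards, baseline=value)  # critic baseline
```

In this setup both estimators have the same expected gradient, while the variance falls sharply once V(s) is subtracted. In a real problem the critic is only an approximation of V(s), so the reduction is smaller than in this idealised case.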

The standard implementation trains both networks simultaneously from the same experience. The actor update raises the probability of actions with positive advantages and lowers it for actions with negative advantages. The critic loss is the mean squared error between its value predictions and the observed returns (or bootstrapped TD targets). When the two networks share parameters, both losses are typically combined into a single weighted objective and trained jointly.
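A framework-free sketch of the two losses is shown below. The function names, the `value_coef` weight and the sample numbers are assumptions made for illustration; in an autodiff framework the advantage would additionally be treated as a constant (no gradient) inside the actor term.

```python
def td_target(reward, next_value, done, gamma=0.99):
    """One-step bootstrapped target: r + gamma * V(s'), with no bootstrap at episode end."""
    return reward + gamma * next_value * (0.0 if done else 1.0)


def actor_critic_loss(log_probs, values, targets, value_coef=0.5):
    """Combine the actor and critic losses over a batch of transitions.

    log_probs: log pi(a_t | s_t) of the actions that were taken
    values:    critic predictions V(s_t)
    targets:   observed returns or TD targets for each step
    """
    n = len(log_probs)
    advantages = [g - v for g, v in zip(targets, values)]
    # Actor: minimising this raises log-probability where the advantage is positive.
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    # Critic: mean squared error against the targets.
    critic_loss = sum((g - v) ** 2 for g, v in zip(targets, values)) / n
    total = actor_loss + value_coef * critic_loss
    return total, actor_loss, critic_loss


total, actor_loss, critic_loss = actor_critic_loss(
    log_probs=[-0.5, -1.0], values=[1.0, 2.0], targets=[2.0, 1.0]
)
```

The weighting between the two terms is a tuning choice; implementations that keep separate actor and critic networks often optimise the two losses with separate optimisers instead.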

A2C (Advantage Actor-Critic) and A3C (Asynchronous Advantage Actor-Critic) are, respectively, synchronous and asynchronous implementations of the basic actor-critic concept. A3C runs multiple actor-critic workers in parallel (each with its own copy of the environment), computing gradients independently and applying them asynchronously to a shared model - which made efficient use of multicore CPUs for RL training before GPU-based RL was common. A2C instead waits for all workers to finish a rollout and applies a single batched update.
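The asynchronous update pattern can be illustrated with a deliberately simplified sketch. It is not an A3C implementation: the "model" is a single number, the gradient comes from a toy quadratic objective rather than an actor-critic loss, and a lock is used for simplicity. All names and constants are hypothetical.

```python
import random
import threading

shared = {"theta": 0.0}   # stand-in for the shared model parameters
lock = threading.Lock()   # keeps the sketch simple and safe


def worker(seed, steps=200, lr=0.05, target=3.0):
    rng = random.Random(seed)  # each worker has its own randomness ("its own environment")
    for _ in range(steps):
        local_theta = shared["theta"]  # snapshot; may be stale by the time we update
        # Toy noisy gradient of (theta - target)^2, standing in for an actor-critic gradient.
        grad = 2.0 * (local_theta - target) + rng.gauss(0.0, 0.1)
        with lock:
            shared["theta"] -= lr * grad  # apply without waiting for the other workers


threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_theta = shared["theta"]
```

Each worker pushes its update as soon as it is ready, so some gradients are computed from slightly outdated parameters. A synchronous variant would instead collect all four gradients, average them, and apply one update per round.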

Soft Actor-Critic (SAC) extends actor-critic to maximum entropy RL: the policy is optimised not just for high reward but also for high entropy (diversity of actions), encouraging exploration and producing more robust policies that do not collapse to a single deterministic strategy. SAC is one of the most sample-efficient RL algorithms for continuous control and is widely used in robot learning.
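The entropy-augmented objective is easiest to see with a handful of discrete actions. The sketch below is a simplification (SAC itself targets continuous actions), and the Q-values, the temperature `alpha` and the function names are assumptions for illustration. It compares a greedy policy with the softmax policy that maximises expected value plus `alpha` times entropy.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(probs, q_values, alpha):
    """Expected Q-value plus an entropy bonus weighted by the temperature alpha."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


def softmax_policy(q_values, alpha):
    """Policy proportional to exp(Q / alpha), the maximiser of soft_objective."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 1.2, 0.2]  # hypothetical action values in one state
alpha = 0.5

greedy = [0.0, 1.0, 0.0]    # deterministic: always pick the best action
soft = softmax_policy(q_values, alpha)

j_greedy = soft_objective(greedy, q_values, alpha)
j_soft = soft_objective(soft, q_values, alpha)
```

The softmax policy keeps non-zero probability on every action and scores higher on the entropy-augmented objective than the greedy one. Lowering `alpha` moves it back towards the greedy choice.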

Most practical deep RL algorithms - PPO, SAC, TD3 - are variants of actor-critic. What distinguishes them from pure value-based methods (DQN) is the explicit policy network; what distinguishes them from pure policy gradient methods is the learned critic.

Analogy

A sports team's coach-athlete relationship. The athlete (actor) decides what moves to make on the field. The coach (critic) observes the situation and provides an assessment of how well-positioned the team is - not rewarding specific moves with points, but evaluating the state: "we're in a strong position" or "we're in trouble." The athlete uses the coach's assessment to calibrate whether their decisions were above or below expectation, enabling more targeted feedback than waiting until the final score.

Real-world example

A typical use of soft actor-critic is training a robotic arm to pick and place objects. The actor network takes the arm's current joint states and target object position as input and outputs a Gaussian distribution over joint torques. The critic network takes the same state plus the sampled torque action and estimates the Q-value (expected future reward). Training alternates between three steps: collect experience with the current actor policy, update the critic using Bellman targets, and update the actor along the critic's gradient so that it places more probability on high-value torques. The entropy bonus in SAC prevents the policy from collapsing to a single rigid motion pattern, producing more general grasping behaviours.
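Two pieces of this loop can be sketched without a neural-network library: sampling a torque vector from the actor's Gaussian, and forming the entropy-regularised Bellman target for the critic. The sketch assumes the twin-critic minimum commonly used in SAC implementations, which the description above does not specify, and omits the tanh squashing that usually bounds the actions. All names and numbers are hypothetical.

```python
import math
import random


def sample_gaussian_action(means, log_stds, rng):
    """Sample one torque per joint from a diagonal Gaussian and return its log-probability."""
    action, log_prob = [], 0.0
    for mu, log_std in zip(means, log_stds):
        eps = rng.gauss(0.0, 1.0)
        action.append(mu + math.exp(log_std) * eps)  # reparameterised sample
        # Log-density of a 1-D Gaussian evaluated at the sample.
        log_prob += -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob


def sac_critic_target(reward, done, next_q1, next_q2, next_log_prob,
                      gamma=0.99, alpha=0.2):
    """Bellman target: r + gamma * (min(Q1, Q2) - alpha * log pi(a' | s'))."""
    soft_next_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (0.0 if done else 1.0) * soft_next_value


rng = random.Random(0)
# Hypothetical actor output for a 3-joint arm in the next state.
next_action, next_log_prob = sample_gaussian_action(
    means=[0.1, -0.3, 0.0], log_stds=[-1.0, -1.0, -1.0], rng=rng
)

target = sac_critic_target(reward=1.0, done=False, next_q1=5.0, next_q2=4.0,
                           next_log_prob=-1.0, gamma=0.9, alpha=0.5)
```

The `- alpha * log pi` term is where the entropy bonus enters the critic's target: unlikely actions (very negative log-probability) raise the target, which rewards the policy for staying spread out.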

Why it matters

Actor-critic is the foundational architecture for modern deep RL, with virtually all state-of-the-art algorithms implementing some variant of it. Understanding the actor-critic distinction - policy (what to do) versus value (how good is this state) - is essential for reading RL papers, understanding why PPO uses the specific architecture it does, and reasoning about the tradeoffs between different RL algorithm families.

Related concepts

Policy Gradient

A family of reinforcement learning algorithms that directly optimise a parameterised policy by computing gradients of expected reward with respect to policy parameters - enabling RL on continuous action spaces where value-based methods struggle.

Proximal Policy Optimization

The dominant policy gradient algorithm in modern RL and LLM fine-tuning - achieving stable, sample-efficient training by clipping the policy update ratio to prevent destructively large parameter changes.

Q-Learning

A foundational reinforcement learning algorithm that learns the value of state-action pairs directly from experience, without needing a model of the environment - allowing an agent to discover optimal policies through trial and error.
