Reinforcement Learning & Robotics (17 concepts)
Advanced
AI that learns by doing - and the systems that let it operate in the physical world.
Actor-Critic
A reinforcement learning architecture that combines a policy network (the actor, which decides which actions to take) with a value network (the critic, which estimates how good the current state or action is) - reducing gradient variance and enabling more stable learning than pure policy gradient methods.
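As a rough illustration of how the two parts interact, the sketch below performs a single one-step actor-critic update in a tabular setting. It assumes a softmax policy over per-state action preferences and a table of state values in place of neural networks; the function name `actor_critic_step`, its parameters, and the learning rates are hypothetical choices for this example.

```python
import math

def actor_critic_step(prefs, values, state, action, reward, next_state, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """One tabular actor-critic update (illustrative sketch, not a full agent).

    prefs[state]  -> list of action preferences (the actor, softmax policy)
    values[state] -> estimated state value (the critic)
    """
    # Critic: the TD error measures how much better or worse the outcome was
    # than the critic expected. No bootstrapping past the end of an episode.
    bootstrap = 0.0 if done else gamma * values[next_state]
    td_error = reward + bootstrap - values[state]
    values[state] += critic_lr * td_error

    # Actor: softmax probabilities for the current state.
    top = max(prefs[state])
    exps = [math.exp(p - top) for p in prefs[state]]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Gradient of log pi(action | state) for a softmax policy is
    # (indicator - probability); the TD error scales the step.
    for a in range(len(probs)):
        indicator = 1.0 if a == action else 0.0
        prefs[state][a] += actor_lr * td_error * (indicator - probs[a])
    return td_error
```

Using the critic's TD error in place of the raw return is what lowers the variance of the actor's gradient estimate.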
Deep Q-Network
The breakthrough algorithm from DeepMind (2013) that combined Q-learning with deep neural networks, enabling RL agents to learn directly from raw pixel inputs to play Atari games, matching or exceeding human-level play on many of them - and igniting the field of deep reinforcement learning.
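The Q-learning part of the idea can be sketched as the regression target the network is trained towards. The function below is a minimal illustration, assuming the Q-values for the next state have already been computed by some network; `td_target` and its parameters are hypothetical names, and a real DQN adds further machinery not shown here.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    next_q_values is a list of estimated Q-values for every action in the
    next state (assumed to come from a neural network in a DQN).
    """
    if done:
        # No future reward after the episode ends.
        return reward
    return reward + gamma * max(next_q_values)

# The network's prediction Q(s, a) is then regressed towards this target,
# for example with a squared-error loss.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.5)
```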
Exploration vs Exploitation
The central dilemma of reinforcement learning: whether to exploit currently known good strategies to collect reward, or explore unknown actions that might reveal even better strategies - a tradeoff with no universally correct answer.
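One simple and widely used way to manage this tradeoff is epsilon-greedy action selection, sketched below. It assumes the agent already holds a list of estimated action values; the name `epsilon_greedy` and the `rng` parameter are illustrative choices, and this is only one of several possible strategies.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        # Explore: try any action, including ones that currently look bad.
        return rng.randrange(len(q_values))
    # Exploit: take the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon = 0.0 is pure exploitation, epsilon = 1.0 is pure exploration.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

In practice epsilon is often started high and decayed over time, so the agent explores early and exploits later.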
All concepts
I
Imitation Learning
Learning a policy by directly training on expert demonstrations - teaching an agent to behave like an expert by showing it what to do, rather than having it discover behaviours through reward-driven trial and error.
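The simplest form of this idea is behavioural cloning, which treats demonstrations as supervised training data. The sketch below assumes a small discrete state space, so the "policy" is just the expert's most frequent action in each state; `behaviour_clone` and the traffic-light data are hypothetical, and real systems would fit a classifier or neural network instead.

```python
from collections import Counter, defaultdict

def behaviour_clone(demonstrations):
    """Build a lookup-table policy from (state, expert_action) pairs."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For each state, copy the action the expert chose most often.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical demonstrations from a cautious driver.
demos = [("red", "stop"), ("red", "stop"), ("red", "go"), ("green", "go")]
policy = behaviour_clone(demos)
```

Note that no reward signal appears anywhere: the agent only ever sees what the expert did.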
Inverse Reinforcement Learning
The problem of inferring the underlying reward function that explains an expert's observed behaviour - learning not just what to do from demonstrations, but why: recovering the goal structure that the expert's actions appear to optimise.
M
Markov Decision Process
The mathematical framework that underpins reinforcement learning - formalising sequential decision-making as states, actions, transition probabilities, and rewards, where the future depends only on the current state and the action taken, not on how you got there.
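To make the four ingredients concrete, the sketch below encodes a tiny two-state MDP as plain dictionaries and solves it with value iteration. The data layout, the function name `value_iteration`, and the example MDP are all assumptions made for illustration; a discount factor `gamma` is also assumed, as is usual for this algorithm.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-8):
    """Compute optimal state values for a small, fully known MDP.

    transitions[(s, a)] -> list of (probability, next_state)
    rewards[(s, a)]     -> immediate reward for taking a in s
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best expected reward-plus-future-value.
            best = max(
                rewards[(s, a)]
                + gamma * sum(p * values[s2] for p, s2 in transitions[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

# Hypothetical MDP: staying in B pays 1 per step, everything else pays 0.
states = ["A", "B"]
actions = ["stay", "move"]
transitions = {
    ("A", "stay"): [(1.0, "A")], ("A", "move"): [(1.0, "B")],
    ("B", "stay"): [(1.0, "B")], ("B", "move"): [(1.0, "A")],
}
rewards = {("A", "stay"): 0.0, ("A", "move"): 0.0,
           ("B", "stay"): 1.0, ("B", "move"): 0.0}
optimal_values = value_iteration(states, actions, transitions, rewards)
```

With `gamma=0.9` the values converge to about 10 for B (1 per step, discounted forever) and about 9 for A (one step away from B).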
Monte Carlo Tree Search
A planning algorithm that incrementally builds a search tree by running simulated rollouts from candidate actions and using the aggregate results to estimate action values - the search method that, combined with deep neural networks, powered AlphaGo's superhuman performance in the ancient game of Go.
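A full implementation is beyond a glossary entry, but the step that decides which branch to simulate next can be sketched briefly. The code below shows the commonly used UCT selection rule, assuming each child node stores a running sum of rollout results and a visit count; `uct_score`, `select_child`, and the exploration constant `c` are illustrative names and defaults, and the expansion, rollout, and backup steps are not shown.

```python
import math

def uct_score(value_sum, visits, parent_visits, c=math.sqrt(2)):
    """Average rollout result plus a bonus for rarely visited children."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    mean = value_sum / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """children is a list of (value_sum, visits); returns the chosen index."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits),
    )

# Two children with equal visits: the one with better rollout results wins.
chosen = select_child([(9.0, 10), (1.0, 10)], parent_visits=20)
```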
Motion Planning
The algorithmic problem of computing a collision-free path for a robot or agent from a start configuration to a goal configuration through an environment with obstacles - the core of autonomous robot navigation and manipulation.
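In its simplest discrete form, the problem reduces to graph search. The sketch below assumes the environment is a small occupancy grid where `#` marks an obstacle and the robot moves in four directions, and uses breadth-first search to find a shortest collision-free path; `plan_path` and the grid are hypothetical, and practical planners for continuous spaces use other methods.

```python
from collections import deque

def plan_path(grid, start, goal):
    """Return a shortest obstacle-free path of (row, col) cells, or None.

    Assumes start and goal are free cells inside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through parents to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is unreachable

grid = ["..#",
        ".#.",
        "..."]
path = plan_path(grid, start=(0, 0), goal=(2, 2))
```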
Multi-Armed Bandit
A simplified reinforcement learning setting where an agent must choose repeatedly between several options (arms) with unknown reward distributions, balancing exploration of uncertain options with exploitation of known good ones.
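One way to balance the two is Thompson sampling, sketched below for arms that pay out 0 or 1. The agent keeps a Beta distribution over each arm's success rate, samples from each, and pulls the arm with the highest sample. The function name `thompson_bandit`, the simulated payout probabilities, and the fixed seed are assumptions for the example.

```python
import random

def thompson_bandit(true_probs, pulls, seed=0):
    """Simulate Thompson sampling on Bernoulli arms; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(pulls):
        # Sample a plausible success rate for each arm from its posterior.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Pull the chosen arm in the simulated environment.
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[a] + losses[a] for a in range(n_arms)]

# Hypothetical arms paying out 20% and 80% of the time.
pull_counts = thompson_bandit([0.2, 0.8], pulls=2000)
```

Uncertain arms produce widely spread samples and so still get tried occasionally, while arms that are clearly worse are pulled less and less.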
P
Policy Gradient
A family of reinforcement learning algorithms that directly optimise a parameterised policy by computing gradients of expected reward with respect to policy parameters - enabling RL on continuous action spaces where value-based methods struggle.
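The core update can be sketched with the simplest member of the family, REINFORCE, applied to a softmax policy over a handful of discrete actions. The code below is an illustrative single update step under those assumptions; `reinforce_step`, the learning rate, and the use of raw action preferences as parameters are choices made for this example.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(prefs, action, ret, lr=0.1):
    """Return updated preferences after observing return `ret` for `action`.

    Uses grad log pi(action) = indicator - probability for a softmax policy.
    """
    probs = softmax(prefs)
    return [
        p + lr * ret * ((1.0 if i == action else 0.0) - probs[i])
        for i, p in enumerate(prefs)
    ]

# A positive return makes the chosen action more likely next time.
new_prefs = reinforce_step([0.0, 0.0], action=0, ret=1.0)
```

For continuous actions the same gradient formula applies, with the softmax replaced by, for example, a Gaussian whose mean is a function of the parameters.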
Proximal Policy Optimization
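One of the most widely used policy gradient algorithms in modern RL and LLM fine-tuning - achieving stable training by clipping the probability ratio between the new and old policies, which prevents destructively large policy updates and lets each batch of experience be reused for several gradient steps.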
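The clipping idea fits in a few lines. The sketch below evaluates the clipped surrogate objective for a single sample, where `ratio` is the new policy's probability of the action divided by the old policy's, and `advantage` says how much better the action was than expected. The function name and the default `clip_eps=0.2` are illustrative; a full implementation averages this over a batch and adds other loss terms.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximised)."""
    # Keep the ratio within [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # further than the clip range in the direction that helps.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good action whose probability already rose 50%: the gain is capped.
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
```

When the change would make things worse (for example `ratio=1.5` with a negative advantage), the minimum keeps the unclipped, more pessimistic value, so the penalty is not hidden.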
S
Sim-to-Real Transfer
The technique of training a robot control policy in simulation - where data is cheap and mistakes are safe - and then deploying it on a real physical robot, where the main challenge is bridging the performance gap caused by imperfect simulation of real-world physics.
SLAM
Simultaneous Localisation and Mapping - the robotics problem of building a map of an unknown environment and determining the robot's position within it at the same time, without any external reference like GPS.