Multi-Armed Bandit
A simplified reinforcement learning setting where an agent must choose repeatedly between several options (arms) with unknown reward distributions, balancing exploration of uncertain options with exploitation of known good ones.
Added May 18, 2026 · 3 min read
Multi-armed bandits are one of the most broadly applicable ideas in decision-making under uncertainty. They provide the theoretical foundation for A/B testing, personalised recommendation, adaptive clinical trials, and online advertising. Understanding bandits - especially the UCB algorithm and the regret framework - provides the intellectual vocabulary for reasoning about any sequential decision problem where you must learn while doing. They also make exploration-exploitation concrete before tackling the full complexity of RL.
The multi-armed bandit problem takes its name from a metaphorical casino scenario: imagine a slot machine (a one-armed bandit) with K arms, each paying out with a different probability distribution. You can pull one arm per round and receive the payout. The question: which arm should you pull each round to maximise your total winnings over N rounds?
Bandits are simpler than full MDPs in one key way: there are no state transitions. Each round, you choose an arm, receive a reward, and the situation is identical next round. The only challenge is learning which arm is best while collecting as much reward as possible during the learning process. This makes bandits the cleanest setting in which to study the exploration-exploitation tradeoff.
The regret framework measures bandit algorithm performance. Regret is the total reward you would have earned by always pulling the optimal arm minus the reward you actually collected. An algorithm that achieves sublinear regret (regret grows slower than the number of rounds) eventually learns to exploit the optimal arm almost exclusively - it regrets less and less relative to the optimal as N grows.
UCB1 (Upper Confidence Bound) achieves optimal regret for the stochastic bandit: select the arm with the highest upper confidence bound UCB_i = mu_i_hat + sqrt(2 * log(t) / n_i), where mu_i_hat is the estimated mean reward of arm i, t is the total rounds elapsed, and n_i is how many times arm i has been pulled. Arms pulled fewer times have wider confidence intervals and larger bonuses, encouraging their selection. As any arm is pulled more, its confidence interval narrows and the bonus decreases.
Contextual bandits extend the setting to include context: before each round, you observe some features (the "context") about the situation, and the reward distributions of arms depend on the context. In online advertising, the context might be a user's demographic features and browsing history; the arms are different ads; the reward is whether the user clicks. LinUCB and neural contextual bandit algorithms handle this setting efficiently.
Bandit algorithms appear widely outside pure RL research. A/B testing in product development is a multi-armed bandit problem: each product variant is an arm, and you want to identify the best variant while minimising the total traffic sent to suboptimal variants. Adaptive clinical trials use bandit algorithms to allocate more patients to better-performing treatments as evidence accumulates. Hyperparameter optimisation with successive halving is bandit-inspired. Recommendation systems use contextual bandits to select content for users.
Analogy
Managing a team of salespeople, each with different performance you do not yet know. You must decide how much time to allocate to each salesperson (arm) to maximise total sales (reward). Spending all time with the top performers you know about exploits your current knowledge but might miss a high-potential performer you have not given enough opportunities. Giving everyone equal time explores but wastes time on demonstrably weak performers. UCB-style allocation gives each person time proportional to their potential (current performance + uncertainty bonus), automatically prioritising both known performers and untested high-upside salespeople.
Real-world example
Netflix uses a bandit-style algorithm to select which thumbnail image to show for a title to maximise click-through rate. Each candidate thumbnail is an arm. For each user view, they show a thumbnail (action), observe whether the user clicked (reward), and update their estimate of each thumbnail's click rate. UCB-style exploration ensures under-tested thumbnails get shown occasionally, while the best-performing thumbnails are shown most often. This allows Netflix to discover optimal thumbnails quickly without excessive testing of poor ones.
Why it matters
Multi-armed bandits are one of the most broadly applicable ideas in decision-making under uncertainty. They provide the theoretical foundation for A/B testing, personalised recommendation, adaptive clinical trials, and online advertising. Understanding bandits - especially the UCB algorithm and the regret framework - provides the intellectual vocabulary for reasoning about any sequential decision problem where you must learn while doing. They also make exploration-exploitation concrete before tackling the full complexity of RL.
In the news
No recent coverage - search for Multi-Armed Bandit.
Related concepts
Exploration vs Exploitation
The central dilemma of reinforcement learning: whether to exploit currently known good strategies to collect reward, or explore unknown actions that might reveal even better strategies - a tradeoff with no universally correct answer.
Hyperparameter Optimization
The automated search for the best configuration settings for an ML model - learning rate, batch size, architecture choices, regularisation strength - that cannot be learned during training and must be specified in advance.
Q-Learning
A foundational reinforcement learning algorithm that learns the value of state-action pairs directly from experience, without needing a model of the environment - allowing an agent to discover optimal policies through trial and error.