Multi-Armed Bandits

The Concept

  • k different options, or actions, to choose from at each step
  • A reward received after each action, drawn from a stationary probability distribution that depends on the chosen action
  • Objective: maximize the total reward over a given number of steps, say 500 (see the sketch below)

The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine.

- Wikipedia
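
To make the setup concrete, here is a minimal Python sketch of a stationary k-armed bandit. The class name `Bandit`, the choice of k = 10, and the Gaussian arm means and reward noise are illustrative assumptions, not part of the definition above.

```python
import random

class Bandit:
    """A stationary k-armed bandit: each arm pays out a noisy reward
    centred on a fixed, hidden mean."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # Hidden true value of each arm, drawn once and never changed
        # afterwards (this is what "stationary" means here).
        self.means = [self.rng.gauss(0, 1) for _ in range(k)]

    def pull(self, action):
        """Pull arm `action`: reward is the arm's true mean plus unit
        Gaussian noise."""
        return self.rng.gauss(self.means[action], 1)
```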


Exploration vs Exploitation

| Exploitation | Exploration |
| --- | --- |
| Using known-good strategies/actions | Exploring new, unknown strategies/actions |
| Known amount of rewards | Rewards unknown |

If the agent only exploits the known strategies, it can get stuck in a local optimum.

There may be unknown strategies which yield a higher total reward in the long run.


ϵ-greedy strategy

Exploit known good strategies most of the time, but for an ϵ fraction of the steps, explore new strategies at random.
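
A minimal sketch of this selection rule follows, assuming sample-average value estimates and ϵ = 0.1; the class name `EpsilonGreedyAgent` and the incremental update are illustrative choices, not fixed by the strategy itself.

```python
import random

class EpsilonGreedyAgent:
    """Keeps a running average reward per arm; exploits the best-looking
    arm with probability 1 - epsilon, explores uniformly with
    probability epsilon."""

    def __init__(self, k, epsilon=0.1, seed=1):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.q = [0.0] * k  # estimated value of each arm
        self.n = [0] * k    # pull count of each arm

    def act(self):
        if self.rng.random() < self.epsilon:
            # Explore: pick any arm uniformly at random.
            return self.rng.randrange(len(self.q))
        # Exploit: pick the arm with the highest estimated value.
        return max(range(len(self.q)), key=lambda a: self.q[a])

    def update(self, action, reward):
        # Incremental sample average: Q <- Q + (R - Q) / n.
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]
```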

As the number of steps grows, every action gets sampled more and more often, so in the limit of a very large number of steps the value estimates converge and we find the optimal strategy.
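
Putting the two sketches above together, a short run over the 500 steps mentioned earlier shows this in miniature; the exact numbers depend on the seeds, but the most-pulled arm typically ends up being the best one:

```python
bandit = Bandit(k=10)
agent = EpsilonGreedyAgent(k=10, epsilon=0.1)

total = 0.0
for step in range(500):
    action = agent.act()
    reward = bandit.pull(action)
    agent.update(action, reward)
    total += reward

best_arm = max(range(bandit.k), key=lambda a: bandit.means[a])
print(f"total reward: {total:.1f}")
print(f"best arm: {best_arm}, most-pulled arm: {agent.n.index(max(agent.n))}")
```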