Exploration
So far, we have mostly followed simple $\epsilon$-greedy exploration with discounting, which has more or less worked. There are many other possible exploration strategies. How do we measure their quality?
Regret
Given an optimal policy $\pi^*$, the regret of a policy $\pi$ is given by:
Regret per time step $=$ value of the optimal policy $-$ value of our policy $= V^{\pi^*} - V^{\pi}$
- Finite Horizon: $\text{Regret}(T) = \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_t} \right)$
- Infinite Horizon: $\text{Regret} = \sum_{t=1}^{\infty} \left( V^{\pi^*} - V^{\pi_t} \right)$
where $\pi_t$ is the policy being followed at time step $t$.
Exploration Strategies
We again run into the same issue: if our agent always takes the 'best' action from its current point of view, how will it ever learn that other actions may be better than the current best one?
In other words, this is the trade-off between exploring new actions and exploiting the current best ones for reward.
There are multiple strategies.
$\epsilon$-greedy
We take a random action with probability $\epsilon$, otherwise the current optimal action.
The regret of this strategy grows linearly with time. This is because $\epsilon$ is constant at every time step: even once we have learned the optimal strategy, we still keep taking random exploratory actions with probability $\epsilon$, and so the regret keeps growing.
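As a minimal sketch (assuming tabular action values stored in a NumPy array; the names and values here are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: current best estimate

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.2, 0.5, 0.1, 0.4]), epsilon=0.1, rng=rng)
```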
$\epsilon$-greedy with discounting
We don't always want to keep exploring. After a sufficient number of iterations, we have a pretty good idea of the optimal actions, and so we discount $\epsilon$ with each iteration, so that we take fewer exploration turns over time.
Hence, $\epsilon$ is decayed multiplicatively, e.g. $\epsilon_{t+1} = \gamma \, \epsilon_t$ with $0 < \gamma < 1$, so that $\epsilon_t = \gamma^t \epsilon_0 \to 0$.
This strategy converges to the optimal policy, so regret eventually stops growing (theoretically).
An alternative strategy is to reduce $\epsilon$ linearly: for example, subtract, say, 0.05 from $\epsilon$ at each time step, so that $\epsilon$ eventually becomes zero. This linear reduction does not guarantee convergence to the optimal policy (we may not have explored the optimal actions before $\epsilon$ became zero); however, in practice, with proper tuning, it generally works.
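A small sketch of the two decay schedules (the decay constants are arbitrary illustrative choices):

```python
def multiplicative_decay(eps: float, gamma: float = 0.99) -> float:
    """epsilon_{t+1} = gamma * epsilon_t: decays towards zero but never reaches it."""
    return eps * gamma

def linear_decay(eps: float, step: float = 0.05) -> float:
    """Subtract a fixed amount each step; clipped so epsilon reaches exactly zero."""
    return max(0.0, eps - step)
```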
Softmax / Boltzmann
The softmax function is often used to take an input vector of real numbers and normalize it into a probability distribution proportional to the exponentials of the inputs.
Here, it can be used to convert action values into action probabilities. The function commonly used is:
$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{a'} e^{Q(a')/\tau}}$$
The parameter $\tau$ is called the temperature. For high temperatures ($\tau \to \infty$), all actions have nearly the same probability; the lower the temperature, the more probability is concentrated on actions with higher expected rewards. For a low temperature ($\tau \to 0^+$), the probability of the action with the highest expected reward tends to 1.
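As a small sketch (the Q-values and temperatures below are arbitrary illustrative numbers):

```python
import numpy as np

def softmax_probs(q_values: np.ndarray, tau: float) -> np.ndarray:
    """Boltzmann action probabilities: P(a) proportional to exp(Q(a) / tau)."""
    z = q_values / tau
    z -= z.max()                       # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

q = np.array([1.0, 2.0, 1.5])
print(softmax_probs(q, tau=10.0))      # high temperature: nearly uniform
print(softmax_probs(q, tau=0.05))      # low temperature: nearly greedy
```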
Uncertainty Based Exploration
Any $\epsilon$-greedy exploration strategy is going to break down terribly in a problem like this: it will take an astronomically large number of samples to learn a policy with any algorithm.
This is more or less because $\epsilon$-greedy causes a lot of repeated exploration, which we don't really need.
Instead, we want to try actions when we believe there is a chance they could turn out to be optimal.
Thompson Sampling
Suppose we somehow know a probability distribution over the value of each action $a$. Thompson Sampling simply does the following:
- Sample once from each distribution
- Take argmax over the samples
We pick the action whose sampled value is the largest.
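A minimal sketch for a Bernoulli bandit, assuming a Beta posterior over each arm's success probability (the Beta-Bernoulli model and the counts below are illustrative choices, not fixed by the text above):

```python
import numpy as np

def thompson_sample(successes: np.ndarray, failures: np.ndarray,
                    rng: np.random.Generator) -> int:
    """Sample once from each arm's Beta posterior and play the argmax."""
    samples = rng.beta(successes + 1, failures + 1)  # Beta(1, 1) uniform prior
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
successes = np.array([5, 2, 8])   # number of reward-1 outcomes per arm
failures = np.array([3, 6, 4])    # number of reward-0 outcomes per arm
action = thompson_sample(successes, failures, rng)
```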
Continuing from that, we also want to explore actions that haven't been explored properly yet. So, one idea is to prioritize actions with uncertain outcomes. This is also known as the optimism in the face of uncertainty principle.
UCB (Upper Confidence Bounds)
Upper Confidence Bound (UCB) is a good way to model this.
- Compute a 95% upper confidence bound (UCB) on each action's value distribution
- Take the action with the highest UCB
A lower UCB for an action means we are fairly certain that it leads to lower reward in most cases.
Frequentist Approach
We do not really need to know the actual distributions. We can simply use a concentration inequality that gives us an upper confidence bound.
Hoeffding's Inequality
Let $X_1, \ldots, X_n$ be independent random variables bounded by the interval $[0, 1]$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be their sample mean. Then,
$$P\left( \mathbb{E}[X] > \bar{X}_n + u \right) \leq e^{-2 n u^2}$$
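Inverting the inequality gives an explicit upper confidence bound: setting the right-hand side equal to a small failure probability $\delta$ (a symbol introduced here just for this derivation), $e^{-2 n u^2} = \delta$ gives $u = \sqrt{\ln(1/\delta) / (2n)}$, so with probability at least $1 - \delta$ the true mean is at most $\bar{X}_n + \sqrt{\ln(1/\delta)/(2n)}$. This is exactly the kind of upper confidence bound used below.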
We could also use other concentration inequalities, like Chebyshev's inequality, for example.
UCB-1 for Bandits
Let's revisit the Multi-Armed Bandit, this time a bit more formally.
Multi Armed Bandits
A bandit can be thought of as a slot machine with a lever. Pulling the lever can give you a reward, or nothing.
The multi-armed bandit can be seen as a set of reward distributions $R_1, \ldots, R_K$, each distribution being associated with the rewards delivered by one of the $K$ levers.
Let $\mu_1, \ldots, \mu_K$ be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of collected rewards.
The bandit problem is formally equivalent to a one-state MDP. The regret $\rho$ after $T$ rounds is defined as the expected difference between the reward sum of an optimal strategy and the sum of the collected rewards:
$$\rho = T\mu^* - \sum_{t=1}^{T} \hat{r}_t$$
where $\mu^* = \max_k \mu_k$ is the maximal reward mean, and $\hat{r}_t$ is the reward in round $t$.
A common formulation is the Binary multi-armed bandit, or the Bernoulli multi-armed bandit, which issues a reward of 1 with probability $p$, and a reward of 0 otherwise.
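A minimal simulation sketch of a Bernoulli bandit and the regret defined above (the arm probabilities and the uniform-random policy are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.5, 0.7])             # success probability of each arm (illustrative)
T = 1000

rewards = []
for t in range(T):
    a = rng.integers(len(p))              # placeholder policy: play a random arm
    rewards.append(rng.random() < p[a])   # Bernoulli reward: 1 with probability p[a]

regret = T * p.max() - np.sum(rewards)    # rho = T * mu_star - sum of collected rewards
print(regret)
```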
Let $\bar{x}_a$ be the average reward obtained so far by taking action $a$. Then, UCB-1 takes the action that maximizes
$$\bar{x}_a + \sqrt{\frac{2 \ln t}{N_t(a)}}$$
where
- $t$ = number of time steps so far
- $N_t(a)$ = number of times action $a$ has been taken
Clearly, the UCB-1 strategy will first focus on exploring actions that haven't been taken at all. [This is because $N_t(a) = 0$ for such an action, which makes its confidence bound unbounded.]
As both $t$ and all the $N_t(a)$ grow, $\bar{x}_a$ approaches the true mean reward of action $a$ and the confidence bonus shrinks. Then, we are just left to exploit the optimal actions.
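A minimal self-contained sketch of UCB-1 on a Bernoulli bandit (the arm probabilities are illustrative; untried arms are played first, which handles the $N_t(a) = 0$ case):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.3, 0.5, 0.7])          # true arm probabilities (illustrative)
K, T = len(p), 5000

counts = np.zeros(K)                    # N_t(a): times each arm has been played
sums = np.zeros(K)                      # total reward collected from each arm

for t in range(1, T + 1):
    if np.any(counts == 0):
        a = int(np.argmin(counts))      # play each untried arm once first
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))         # pick the arm with the highest UCB
    r = float(rng.random() < p[a])      # Bernoulli reward
    counts[a] += 1
    sums[a] += r

print(counts)                           # most pulls should go to the best arm
```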
Bayesian UCB
- Start from a prior over each action's value (or reward) distribution
- Learn the posterior from the observed rewards
- Take a high percentile (e.g. the 95th) of the posterior as the UCB and act on it
We could learn the posterior in the standard Bayesian way, for example by starting from a normal prior (or any other distribution that suits the situation).
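A minimal sketch for a Bernoulli bandit with a Beta prior (the Beta-Bernoulli conjugate pair is one convenient choice, not the only one; the 95th percentile matches the confidence level used earlier):

```python
import numpy as np
from scipy.stats import beta

def bayesian_ucb(successes: np.ndarray, failures: np.ndarray,
                 percentile: float = 0.95) -> int:
    """Act greedily on the chosen percentile of each arm's Beta posterior."""
    upper = beta.ppf(percentile, successes + 1, failures + 1)  # Beta(1, 1) prior
    return int(np.argmax(upper))

successes = np.array([5, 2, 8])
failures = np.array([3, 6, 4])
action = bayesian_ucb(successes, failures)
```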
There is a much more complex method which uses Bayesian Neural Networks: instead of weights that are just constant numbers, we have a distribution over each weight. We do not cover it here.
The advantage of Bayesian UCB over frequentist UCB is that we can choose any distribution we want and learn an actual posterior, not just an upper bound (which comes from a generic inequality anyway).
However, it could happen that the prior does not fit the given situation at all. Without proper domain knowledge, Bayesian UCB may not perform as well; for example, we might choose a unimodal prior when in reality the distribution is bimodal.
Comparing Regret
[Figure: regret compared across three strategies on a Bernoulli bandit.]