Cross Entropy

The Method

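A tabular method for finding a good policy: the policy is stored as a table of probabilities $\pi(a | s)$, one entry per (state, action) pair, and is improved iteratively.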

1. Sample $N$ sessions with the current policy $\pi$.
2. Pick the $M$ best sessions (those with the highest total reward). We call them elite.
3. Compute new probabilities from the elite (state, action) pairs:
$$\pi'(a | s) = \frac{\sum\limits_{s_t, a_t \in \text{Elite}}[s_t=s][a_t=a]}{\sum\limits_{s_t\in \text{Elite}}[s_t=s]} = \frac{\text{Took $a$ at $s$}}{\text{Was at $s$}}\ (\text{among best $M$})$$

$$\pi_{new}(a | s) = \alpha\ \pi'(a | s) + (1 - \alpha)\ \pi(a | s)$$
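where $\alpha$ can be thought of as a learning rate. The updated policy is a weighted combination of the old policy and the one estimated from the elite (state, action) pairs: $\alpha = 1$ discards the old policy entirely, while a small $\alpha$ changes it only slightly.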
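
The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `update_policy` and its arguments are hypothetical, and the policy is assumed to be a `(n_states, n_actions)` array whose rows sum to 1. One detail the formula leaves open is a state that never appears in an elite session (the ratio is 0/0); the sketch assumes the old probabilities are kept for such states.

```python
import numpy as np

def update_policy(policy, elite_states, elite_actions, alpha=0.5):
    """Blend the old policy with action frequencies counted over elite pairs.

    policy: array of shape (n_states, n_actions), each row sums to 1.
    elite_states, elite_actions: equal-length sequences holding the
    (s_t, a_t) pairs pooled from all elite sessions.
    """
    counts = np.zeros_like(policy, dtype=float)
    # Numerator: how often action a was taken in state s among elite pairs.
    np.add.at(counts, (np.asarray(elite_states), np.asarray(elite_actions)), 1)
    # Denominator: how often state s was visited among elite pairs.
    visits = counts.sum(axis=1, keepdims=True)
    # Assumption: states never visited by an elite session keep their old row.
    pi_prime = np.where(visits > 0, counts / np.maximum(visits, 1), policy)
    return alpha * pi_prime + (1 - alpha) * policy

# Two states, two actions, uniform starting policy.
old_policy = np.full((2, 2), 0.5)
new_policy = update_policy(old_policy, [0, 0, 0], [1, 1, 0], alpha=0.5)
print(new_policy)
```

Here state 0 was visited three times by elite sessions (action 1 twice, action 0 once), so $\pi'(\cdot | 0) = (1/3, 2/3)$ and the blended row is $(5/12, 7/12)$. State 1 was never visited and stays at $(0.5, 0.5)$.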

Repeat steps 1-3 for a given number of iterations.
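
The whole loop might look like the sketch below. Everything environment-specific here is an assumption made for illustration: a hypothetical 5-state chain where action 1 moves right, action 0 moves left, each step costs 1, and a session ends at the rightmost state or after 20 steps. The uniform starting policy, the ranking of sessions by total reward, and the default values of `n_sessions` ($N$), `n_elite` ($M$), `alpha`, and `n_iters` are also choices of this sketch rather than part of the method's definition.

```python
import numpy as np

N_STATES, N_ACTIONS, MAX_STEPS = 5, 2, 20  # toy chain: action 0 = left, 1 = right

def play_session(policy, rng):
    """Run one session on the toy chain; return states, actions, total reward."""
    s, states, actions, total = 0, [], [], 0.0
    for _ in range(MAX_STEPS):
        a = int(rng.choice(N_ACTIONS, p=policy[s]))
        states.append(s)
        actions.append(a)
        s = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        total -= 1.0  # each step costs 1, so shorter sessions score higher
        if s == N_STATES - 1:  # reached the goal state
            break
    return states, actions, total

def cross_entropy_method(n_sessions=100, n_elite=20, alpha=0.5, n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    policy = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)  # assumed uniform start
    for _ in range(n_iters):
        # Step 1: sample N sessions with the current policy.
        sessions = [play_session(policy, rng) for _ in range(n_sessions)]
        # Step 2: keep the M sessions with the highest total reward.
        rewards = np.array([r for _, _, r in sessions])
        elite = [sessions[i] for i in np.argsort(rewards)[-n_elite:]]
        # Step 3: count elite (state, action) pairs and blend with the old policy.
        counts = np.zeros_like(policy)
        for states, actions, _ in elite:
            np.add.at(counts, (np.asarray(states), np.asarray(actions)), 1)
        visits = counts.sum(axis=1, keepdims=True)
        # States absent from the elite sessions keep their old probabilities.
        pi_prime = np.where(visits > 0, counts / np.maximum(visits, 1), policy)
        policy = alpha * pi_prime + (1 - alpha) * policy
    return policy

policy = cross_entropy_method()
print(np.round(policy, 3))
```

On this toy chain the probability of moving right should grow in every non-goal state, since the shortest sessions are the ones that move right most often. The goal state's row never changes because no action is ever taken there.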