
Cross Entropy

The Method

A tabular method for finding a good policy.

  • Policy is a matrix: $\pi(a \mid s) = A_{s, a}$
  • Initialize the policy with equal probabilities for all actions in each state
  1. Sample $N$ sessions with this policy

  2. Pick the $M$ best sessions. We call them the elite.

  3. Compute new probabilities from the (state, action) pairs in the elite sessions:

    $$\pi'(a \mid s) = \frac{\sum\limits_{(s_t, a_t) \in \text{Elite}} [s_t = s]\,[a_t = a]}{\sum\limits_{(s_t, a_t) \in \text{Elite}} [s_t = s]} = \frac{\text{took } a \text{ at } s}{\text{was at } s}\ (\text{among the best } M \text{ sessions})$$

    $$\pi_{\text{new}}(a \mid s) = \alpha\, \pi'(a \mid s) + (1 - \alpha)\, \pi(a \mid s)$$

    where $\alpha$ can be thought of as a learning rate. The updated policy is a mixture of the old policy and the one constructed from the elite state-action pairs.
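
    For concreteness, suppose the elite sessions visit state $s$ 10 times and take action $a$ in 4 of those visits, so $\pi'(a \mid s) = 4/10 = 0.4$. With $\alpha = 0.5$ and an old probability $\pi(a \mid s) = 0.2$, the update gives $\pi_{\text{new}}(a \mid s) = 0.5 \cdot 0.4 + 0.5 \cdot 0.2 = 0.3$.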

Repeat steps 1-3 for a fixed number of iterations, as in the sketch below.
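
Below is a minimal NumPy sketch of the whole loop, assuming a discrete-state environment with the Gymnasium-style `reset()`/`step()` interface. The names `run_session`, `n_sessions`, `n_elite`, and `n_iterations` are illustrative choices, not part of the method itself.

```python
import numpy as np

def run_session(env, policy, t_max=1000):
    """Play one episode with the given tabular policy; return trajectory and total reward."""
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()  # Gymnasium convention: reset() returns (obs, info)
    for _ in range(t_max):
        # Sample an action from the probability row for the current state.
        a = np.random.choice(policy.shape[1], p=policy[s])
        next_s, r, terminated, truncated, _ = env.step(a)
        states.append(s)
        actions.append(a)
        total_reward += r
        s = next_s
        if terminated or truncated:
            break
    return states, actions, total_reward

def cross_entropy_method(env, n_states, n_actions, n_sessions=250,
                         n_elite=25, alpha=0.5, n_iterations=100):
    # Initialization: uniform policy -- every action equally likely in every state.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)

    for _ in range(n_iterations):
        # Step 1: sample N sessions with the current policy.
        sessions = [run_session(env, policy) for _ in range(n_sessions)]

        # Step 2: keep the M highest-reward sessions (the elite).
        sessions.sort(key=lambda sess: sess[2], reverse=True)
        elite = sessions[:n_elite]

        # Step 3: count elite (state, action) pairs and renormalize per state.
        counts = np.zeros((n_states, n_actions))
        for states, actions, _ in elite:
            for s, a in zip(states, actions):
                counts[s, a] += 1
        visits = counts.sum(axis=1, keepdims=True)
        # States the elite never visited keep their old probabilities.
        new_policy = np.where(visits > 0, counts / np.maximum(visits, 1), policy)

        # Smoothed update: mix the elite-based policy with the old one.
        policy = alpha * new_policy + (1 - alpha) * policy

    return policy
```

Keeping the old row for states the elite never visited avoids zeroing out their probabilities, and the final $\alpha$-mixture is exactly the smoothed update from step 3.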