The Method
A tabular method for finding a good policy.
- The policy is a matrix: $\pi(a \mid s) = A_{s,a}$
- Initialize the policy so that all actions have equal probability in each state
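To make the matrix representation concrete, here is a minimal sketch of a uniform initial policy and of drawing one action from it. The sizes `n_states` and `n_actions`, the example `state`, and the random seed are assumptions chosen for illustration only.

```python
import numpy as np

n_states, n_actions = 5, 3  # hypothetical sizes, not from the source

# policy[s, a] holds pi(a | s); every row is a probability distribution.
policy = np.full((n_states, n_actions), 1.0 / n_actions)

# Draw an action for one state according to that state's row of the matrix.
rng = np.random.default_rng(0)  # arbitrary seed
state = 2                       # hypothetical current state
action = rng.choice(n_actions, p=policy[state])
```

Storing one row per state keeps sampling a single indexed lookup, which is what makes the method practical only for small, discrete state spaces.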
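1. Sample $N$ sessions with the current policy.
2. Pick the $M$ best sessions, that is, those with the highest total reward. We call them elite.
3. Compute new probabilities from the elite (state, action) pairs:

   $$\pi'(a \mid s) = \frac{\sum_{s_t, a_t \in \text{Elite}} [s_t = s][a_t = a]}{\sum_{s_t \in \text{Elite}} [s_t = s]} = \frac{\text{took } a \text{ at } s}{\text{was at } s} \quad \text{(among best } M\text{)}$$

   $$\pi_{\text{new}}(a \mid s) = \alpha \, \pi'(a \mid s) + (1 - \alpha) \, \pi(a \mid s)$$

   where $\alpha$ can be thought of as a learning rate. The updated policy is a combination of the old policy and the one constructed from the elite state/action pairs.
4. Repeat steps 1-3 a given number of times.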
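The elite selection and the smoothed update can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function names `select_elites` and `update_policy`, the session format `(states, actions, total_reward)`, and the default `alpha=0.5` are assumptions. The source also does not say what to do with states that no elite session visited; this sketch keeps their old probabilities.

```python
import numpy as np

def select_elites(sessions, m):
    """Return all states and actions from the m highest-reward sessions.

    Each session is assumed to be a tuple (states, actions, total_reward).
    """
    best = sorted(sessions, key=lambda s: s[2], reverse=True)[:m]
    elite_states = [s for states, _, _ in best for s in states]
    elite_actions = [a for _, actions, _ in best for a in actions]
    return elite_states, elite_actions

def update_policy(policy, elite_states, elite_actions, alpha=0.5):
    """Blend the old policy with empirical action frequencies of the elites."""
    s_idx = np.asarray(elite_states, dtype=int)
    a_idx = np.asarray(elite_actions, dtype=int)

    # counts[s, a] = number of times action a was taken at state s by elites.
    counts = np.zeros_like(policy)
    np.add.at(counts, (s_idx, a_idx), 1.0)

    # visits[s] = number of times state s was seen among elites.
    visits = counts.sum(axis=1, keepdims=True)

    # pi'(a | s); unvisited states keep their old row (an assumption).
    elite_policy = np.where(visits > 0, counts / np.maximum(visits, 1.0), policy)

    # pi_new = alpha * pi' + (1 - alpha) * pi
    return alpha * elite_policy + (1.0 - alpha) * policy
```

For example, with a uniform 2x2 policy and a single elite session that takes action 0 twice at state 0 and action 1 once at state 1, `alpha=0.5` moves each visited row halfway from uniform toward the observed frequencies. Because both terms of the blend are row-stochastic, every row of the result still sums to 1.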