# Generalized Policy Iteration

Generalized Policy Iteration is the general idea of letting policy evaluation and policy improvement processes interact.
## Model-based setup and value-based approach

### Model-based setup

- The model of the world is known: $p(r, s' | s, a)$ is known for all $r, s', s, a$

### Value-based approach

- Build or estimate a value function
- Extract a policy from the value function

## Policy Evaluation (Prediction)

Policy Evaluation, also called Prediction, is the task of predicting the value function for a particular policy.
The Bellman expectation equation is a system of linear equations in which the number of unknowns and the number of equations both equal the number of states.
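Since it is just a linear system, for small problems it can also be solved directly. Below is a minimal sketch; the two-state transition matrix `P_pi` and reward vector `r_pi` are invented for illustration and are not part of these notes:

```python
import numpy as np

# Hypothetical 2-state example: P_pi[s, s'] is the state-transition matrix
# induced by the policy, and r_pi[s] is the expected immediate reward from s.
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([1.0, -1.0])
gamma = 0.9

# Bellman expectation equation in matrix form: v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi  --  |S| equations in |S| unknowns.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)
```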
$$
\begin{aligned}
v_{\pi}(s) &= \sum\limits_{a} \pi(a | s) \sum\limits_{r, s'} p(r, s' | s, a)[r + \gamma\ v_{\pi}(s')] \\
&= \mathbb{E}_{\pi}[R_t + \gamma v_{\pi}(S_{t+1}) | S_t = s]
\end{aligned}
$$

### Iterative Policy Evaluation

- Input $\pi$, the policy to be evaluated
- Initialize an array $v(s) = 0$, for all $s \in S$
- Repeat until $\Delta < \theta$ (a small positive threshold):
  - $\Delta \gets 0$
  - For each $s \in S$:
    - $v_{old}(s) \gets v(s)$
    - $\boxed{v(s) \gets \sum\limits_{a} \pi(a | s) \sum\limits_{s', r} p(s', r | s, a) [r + \gamma v(s')]}$ (Bellman expectation equation)
    - $\Delta \gets \max (\Delta, |v_{old}(s) - v(s)|)$
- Output $v \approx v_{\pi}$
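As a concrete illustration of the iterative procedure above, here is a minimal NumPy sketch. The tabular representation (`P[s][a]` as a list of `(prob, next_state, reward)` transitions, `policy[s][a]` as action probabilities) is an assumption made for this example, not something fixed by these notes:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iteratively evaluate `policy` under the assumed dynamics model `P`.

    P[s][a] is a list of (prob, next_state, reward) transitions, and
    policy[s][a] is the probability of taking action a in state s.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = v[s]
            # Bellman expectation update: expectation over actions and transitions
            v[s] = sum(
                policy[s][a] * prob * (r + gamma * v[s_next])
                for a in range(len(P[s]))
                for prob, s_next, r in P[s][a]
            )
            delta = max(delta, abs(v_old - v[s]))
        if delta < theta:
            return v  # v is approximately v_pi
```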
## Policy Improvement

Once we know the value function for a particular policy, we can improve on it by acting greedily.

From some state $s$, we want to know whether it would be better to follow the current policy or to take an action it does not recommend. One way to find out is to take an action $a$ not recommended by the current policy, i.e. $a \neq \pi(s)$, and then follow the policy $\pi$ thereafter.
$$
\pi'(s) \gets \argmax_{a} \overbrace{\sum\limits_{r, s'} p(r, s' | s, a)[r + \gamma v_{\pi}(s')]}^{q_{\pi}(s, a)}
$$

This procedure is guaranteed to produce a policy at least as good as the original, and strictly better unless the original policy was already optimal, owing to the Policy Improvement Theorem.
### Policy Improvement Theorem

Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in S$,

$$
q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s).
$$

Then, for all $s \in S$,

$$
v_{\pi'}(s) \geq v_{\pi}(s),
$$

and hence $\pi' \geq \pi$.
If the new policy after improvement is the same as the old one, i.e. $\pi' = \pi$, then it is optimal, since $v_{\pi}$ then satisfies the Bellman optimality equation.
For an idea of the proof, refer to Section 4.9 of the book by Sutton and Barto, listed in Resources.
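For intuition, the standard argument applies the hypothesis repeatedly and unrolls the expectation; a compressed sketch of that chain, written with the reward-indexing convention used in the Bellman equation above:

$$
\begin{aligned}
v_{\pi}(s) &\leq q_{\pi}(s, \pi'(s)) \\
&= \mathbb{E}[R_t + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = \pi'(s)] \\
&\leq \mathbb{E}_{\pi'}[R_t + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) | S_t = s] \\
&\leq \mathbb{E}_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 v_{\pi}(S_{t+2}) | S_t = s] \\
&\;\;\vdots \\
&\leq \mathbb{E}_{\pi'}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots | S_t = s] = v_{\pi'}(s)
\end{aligned}
$$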
The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
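Continuing the hypothetical tabular setup from the policy-evaluation sketch above, the greedy improvement step might look as follows (again a sketch, with `P[s][a]` as assumed `(prob, next_state, reward)` transitions):

```python
import numpy as np

def policy_improvement(P, v, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to v."""
    n_states = len(P)
    greedy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        q = [
            sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in P[s][a])
            for a in range(len(P[s]))
        ]
        greedy[s] = int(np.argmax(q))  # pi'(s) = argmax_a q(s, a)
    return greedy
```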
### Recovering the Optimal Policy from Known Values

If $q_{*}(s, a)$ is known:

$$
\pi_{*}(s) \gets \argmax_{a} q_{*}(s, a)
$$

If $v_{*}(s)$ is known:

$$
\pi_{*}(s) \gets \argmax_{a} \overbrace{\sum\limits_{r, s'} p(r, s' | s, a)[r + \gamma v_{*}(s')]}^{q_{*}(s, a)}
$$

So, if the model dynamics $p(r, s' | s, a)$ are unknown, we cannot recover an optimal policy from $v_{*}$, and we need to rely on $q_{*}$.
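For example, if an optimal action-value table is available (assumed here to be a NumPy array of shape `(n_states, n_actions)` with invented numbers), extracting a greedy optimal policy requires no model at all:

```python
import numpy as np

# Hypothetical optimal action-value table of shape (n_states, n_actions).
q_star = np.array([[0.0, 1.0],
                   [0.5, 0.2],
                   [0.3, 0.9]])

# pi_*(s) = argmax_a q_*(s, a): one argmax per state, no dynamics needed.
pi_star = q_star.argmax(axis=1)
print(pi_star)  # [1 0 1]
```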
## Generalized Policy Iteration

**Generalized Policy Iteration**

- Evaluate the given policy
- Improve the policy by acting greedily with respect to its value function

**Policy Iteration**

- Evaluate the policy until convergence (within some tolerance)
- Improve the policy

**Value Iteration**

- Evaluate the policy with only a single iteration
- Improve the policy
## Policy Iteration

1. Initialize $v(s)$ and $\pi(s)$ arbitrarily for all $s \in S$
2. Perform policy evaluation
3. Policy improvement:
   - `policy_stable` $\gets$ true
   - For each $s \in S$:
     - `old_action` $\gets \pi(s)$
     - $\pi(s) \gets \argmax\limits_{a} \overbrace{\sum\limits_{s', r} p(s', r | s, a)[r + \gamma v(s')]}^{q(s, a)}$
     - If `old_action` $\neq \pi(s)$, then `policy_stable` $\gets$ false
   - If `policy_stable`, then stop and return $v \approx v_*$ and $\pi \approx \pi_*$; else go to Step 2
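Putting the pieces together, a sketch of the full loop, reusing the hypothetical `policy_evaluation` and `policy_improvement` helpers sketched earlier and starting from a uniform-random policy:

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states = len(P)
    # Arbitrary starting point: a uniform-random stochastic policy.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        v = policy_evaluation(P, policy, gamma, theta)   # Step 2: evaluate
        greedy = policy_improvement(P, v, gamma)         # Step 3: improve
        new_policy = np.eye(n_actions)[greedy]           # deterministic, one-hot form
        if np.array_equal(new_policy, policy):           # policy_stable
            return v, greedy                             # v ~ v_*, pi ~ pi_*
        policy = new_policy                              # else go back to Step 2
```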
## Value Iteration

- Initialize array $v$ arbitrarily, for example $v(s) = 0$ for all $s \in S$
- Repeat until $\Delta < \theta$ (a small positive threshold):
  - $\Delta \gets 0$
  - For each $s \in S$:
    - $v_{old}(s) \gets v(s)$
    - $\boxed{v(s) \gets \max\limits_{a} \sum\limits_{s', r} p(s', r | s, a) [r + \gamma v(s')]}$ (Bellman optimality equation)
    - $\Delta \gets \max (\Delta, |v_{old}(s) - v(s)|)$
- Output a deterministic policy, $\pi \approx \pi_*$, such that

$$
\pi(s) = \argmax_{a} \sum\limits_{s', r} p(s', r | s, a)[r + \gamma v(s')]
$$
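A corresponding sketch of value iteration, under the same assumed tabular representation (`P[s][a]` as `(prob, next_state, reward)` transitions):

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Sweep with the Bellman optimality update, then read off a greedy policy."""
    n_states = len(P)
    v = np.zeros(n_states)

    def q(s, a):
        # q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in P[s][a])

    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = v[s]
            # Bellman optimality update: max over actions instead of an expectation
            v[s] = max(q(s, a) for a in range(len(P[s])))
            delta = max(delta, abs(v_old - v[s]))
        if delta < theta:
            break

    # Deterministic greedy policy with respect to the final value function.
    policy = np.array([int(np.argmax([q(s, a) for a in range(len(P[s]))]))
                       for s in range(n_states)])
    return v, policy
```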
## Comparison of Policy Iteration and Value Iteration

**Policy Iteration (PI)**

- PI is slower per cycle: $O(|A|\ |S|^2 + |S|^3)$
- PI requires few cycles

**Value Iteration (VI)**

- VI is faster per cycle: $O(|A|\ |S|^2)$
- VI requires many cycles