Dynamic Programming is a method for solving complex problems by breaking them into smaller subproblems and solving those recursively.
In Reinforcement Learning, we want to find an optimal policy, that is, a policy $\pi$ that maximizes the expected return.
We use discounted rewards with a discount factor $\gamma$, so the return is $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma G_{t+1}$.
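To make the return concrete, here is a minimal Python sketch (illustrative only; the function name and the example rewards are made up, not from the text) that computes a discounted return for a finite episode using the recursion $G_t = R_t + \gamma G_{t+1}$:

```python
# A minimal sketch (illustrative): compute the discounted return
#   G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...
# for a finite reward sequence, using the recursion G_t = R_t + gamma * G_{t+1}.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Work backwards so each step applies G_t = R_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```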
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ given $S_t = s$.
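For a small, finite problem, such a policy can be stored as a table. The array shapes and names below are assumptions for illustration, not something defined in the text:

```python
import numpy as np

# A minimal sketch: a stochastic policy for a small tabular problem, stored as
# an (n_states, n_actions) array whose rows are probability distributions,
# so policy[s, a] == pi(a | s).
n_states, n_actions = 4, 2
rng = np.random.default_rng(seed=0)
policy = rng.random((n_states, n_actions))
policy /= policy.sum(axis=1, keepdims=True)   # normalize each row to sum to 1

s = 1
a = rng.choice(n_actions, p=policy[s])        # sample A_t ~ pi(. | S_t = s)
print(policy[s], "-> sampled action", a)
```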
State-Value Function $v_{\pi}(s)$

The value function of a state $s$ under a policy $\pi$, denoted by $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.
$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\
&= \mathbb{E}_{\pi}[R_t + \gamma G_{t+1} | S_t = s] \\
&= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s']\right] \\
&= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right]
\end{aligned}
$$

This is called the Bellman expectation equation.
Bellman Expectation Equation for State-Value Function

$$
\begin{aligned}
v_{\pi}(s) &= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right] \\
&= \mathbb{E}_{\pi}[R_t + \gamma v_{\pi}(S_{t+1}) | S_t = s]
\end{aligned}
$$

Backup Tree Diagram
Source: Practical RL, HSE University
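The Bellman expectation equation can be applied directly as a one-step backup. The sketch below assumes a tabular model where `p[s][a]` is a list of `(prob, reward, next_state)` triples and `policy[s][a]` is $\pi(a|s)$; that data layout and the helper name are illustrative assumptions, not an API from the text:

```python
# A minimal sketch of one Bellman expectation backup for v_pi:
#   v_pi(s) = sum_a pi(a|s) * sum_{r,s'} p(r,s'|s,a) * [r + gamma * v_pi(s')]
# Assumed model layout: p[s][a] is a list of (prob, reward, next_state) triples,
# and policy[s] is a sequence with policy[s][a] = pi(a|s).

def bellman_backup_v(s, v, policy, p, gamma=0.9):
    total = 0.0
    for a, pi_a in enumerate(policy[s]):          # outer sum over actions, weighted by pi(a|s)
        for prob, reward, s_next in p[s][a]:      # inner sum over (r, s') outcomes
            total += pi_a * prob * (reward + gamma * v[s_next])
    return total
```

Sweeping this backup over every state and repeating until the values stop changing is exactly the policy evaluation procedure sketched at the end of this section.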
Action-Value Function $q_{\pi}(s, a)$

The action-value function for a policy $\pi$, denoted by $q_{\pi}(s, a)$, is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$ thereafter.
$$
\begin{aligned}
q_{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\
&= \mathbb{E}_{\pi}[R_t + \gamma\, G_{t+1} | S_t = s, A_t = a] \\
&= \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s']\right] \\
&= \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right]
\end{aligned}
$$

Relation between $v_{\pi}(s)$ and $q_{\pi}(s, a)$

We already have $q_{\pi}(s, a)$ in terms of $v_{\pi}(s)$.
$$
q_{\pi}(s, a) = \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right]
$$

We can now try writing $v_{\pi}(s)$ in terms of $q_{\pi}(s, a)$.
$$
\begin{aligned}
v_{\pi}(s) &= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right] \\
&= \sum_{a} \pi(a|s)\, q_{\pi}(s, a)
\end{aligned}
$$

So, by substituting this expression for $v_{\pi}(s')$, we can write a recurrence relation for the action-value function.
$$
q_{\pi}(s, a) = \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma \sum_{a'} \pi(a'|s')\, q_{\pi}(s', a')\right]
$$

Bellman Expectation Equation for Action-Value Function

$$
\begin{aligned}
q_{\pi}(s, a) &= \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma\, v_{\pi}(s')\right] \\
&= \sum_{r, s'} p(r, s' | s, a)\left[r + \gamma \sum_{a'} \pi(a'|s')\, q_{\pi}(s', a')\right]
\end{aligned}
$$

Backup Tree Diagram
Source: Practical RL, HSE University
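Both relations translate directly into code. The sketch below reuses the same assumed `p[s][a]` layout of `(prob, reward, next_state)` triples from the earlier snippet; the helper names are illustrative:

```python
# Minimal sketches of the two relations, under the same assumed model layout
# (p[s][a] is a list of (prob, reward, next_state) triples).

# q_pi(s, a) = sum_{r,s'} p(r,s'|s,a) * [r + gamma * v_pi(s')]
def q_from_v(s, a, v, p, gamma=0.9):
    return sum(prob * (reward + gamma * v[s_next])
               for prob, reward, s_next in p[s][a])

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
def v_from_q(s, q, policy):
    return sum(pi_a * q[s][a] for a, pi_a in enumerate(policy[s]))
```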
With the Bellman expectation equations, we can assess how well a given policy performs (policy evaluation).
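One standard way to do this is iterative policy evaluation: treat the Bellman expectation equation as an update rule and sweep it over all states until the values stop changing. The sketch below is an illustration under the same assumed tabular model as the earlier snippets; the tiny two-state MDP at the end is made up purely for demonstration:

```python
# A minimal sketch of iterative policy evaluation: repeatedly apply the Bellman
# expectation backup to every state until no value changes by more than theta.
def evaluate_policy(policy, p, n_states, gamma=0.9, theta=1e-8):
    v = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup for state s (same computation as above)
            new_v = sum(pi_a * prob * (reward + gamma * v[s_next])
                        for a, pi_a in enumerate(policy[s])
                        for prob, reward, s_next in p[s][a])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

# Tiny made-up MDP: two states, one action; state 1 is absorbing with zero reward.
p = {0: {0: [(1.0, 1.0, 1)]},       # from state 0: reward 1, move to state 1
     1: {0: [(1.0, 0.0, 1)]}}       # from state 1: reward 0, stay in state 1
policy = {0: [1.0], 1: [1.0]}       # the single action is chosen with probability 1
print(evaluate_policy(policy, p, n_states=2, gamma=0.9))  # approximately [1.0, 0.0]
```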