
Value Functions

Dynamic Programming is a method for solving complex problems by breaking them into smaller subproblems and solving those recursively.

In Reinforcement Learning, we want to find an optimal policy, one that gives the maximum total reward. In other words, we need to find a policy $\pi$ with the maximum expected return.

We use discounted rewards with a discount factor $\gamma$.

A policy is a mapping from states to the probability of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ given $S_t = s$.

State-Value Function $v_{\pi}(s)$

The value function of a state $s$ under a policy $\pi$, denoted by $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s] \\
&= \mathbb{E}_{\pi}[R_t + \gamma G_{t+1} | S_t = s] \\
&= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s'] \right] \\
&= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right]
\end{aligned}
$$

This is called the Bellman expectation equation.

Bellman Expectation Equation for State-Value Function
$$
\begin{aligned}
v_{\pi}(s) &= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right] \\
&= \mathbb{E}_{\pi}[R_t + \gamma\, v_{\pi}(S_{t+1}) | S_t = s]
\end{aligned}
$$

Backup Tree Diagram

Source: Practical RL, HSE University
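
As a quick illustration, here is a minimal Python sketch of a single Bellman expectation backup for $v_{\pi}$. The two-state MDP `P`, the policy, and the discount factor below are made-up values for illustration only; `P[(s, a)]` enumerates the `(probability, next_state, reward)` outcomes of $p(r, s'|s, a)$.

```python
# Hypothetical 2-state MDP, chosen only for illustration.
# P[(s, a)] is a list of (probability, next_state, reward) triples,
# i.e. an enumeration of p(r, s' | s, a).
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
policy = {  # pi(a | s), an arbitrary stochastic policy
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.9, "go": 0.1},
}
gamma = 0.9  # discount factor

def bellman_v_backup(s, v):
    """One Bellman expectation backup:
    v_pi(s) = sum_a pi(a|s) sum_{r,s'} p(r,s'|s,a) [r + gamma * v_pi(s')]."""
    return sum(
        pi_a * sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[(s, a)])
        for a, pi_a in policy[s].items()
    )

v = {"s0": 0.0, "s1": 0.0}         # current estimate of v_pi
print(bellman_v_backup("s0", v))   # -> 0.4 for this made-up MDP
```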


Action-Value Function $q_{\pi}(s, a)$

The action-value function for a policy $\pi$, denoted by $q_{\pi}(s, a)$, is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$ thereafter.

$$
\begin{aligned}
q_{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] \\
&= \mathbb{E}_{\pi}[R_t + \gamma\, G_{t+1} | S_t = s, A_t = a] \\
&= \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s'] \right] \\
&= \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right]
\end{aligned}
$$

Relation between $v_{\pi}(s)$ and $q_{\pi}(s, a)$

We already have $q_{\pi}(s, a)$ in terms of $v_{\pi}(s)$.

$$
q_{\pi}(s, a) = \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right]
$$

We can now try writing $v_{\pi}(s)$ in terms of $q_{\pi}(s, a)$.

$$
\begin{aligned}
v_{\pi}(s) &= \sum_{a} \pi(a|s) \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right] \\
&= \sum_{a} \pi(a|s)\, q_{\pi}(s, a)
\end{aligned}
$$
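
These two formulas map directly onto code. Below is a small sketch, reusing the hypothetical `P`, `policy`, and `gamma` from the state-value example above: `q_from_v` applies the first identity, `v_from_q` the second, and composing them reproduces the Bellman expectation backup for $v_{\pi}$.

```python
def q_from_v(s, a, v, P, gamma):
    """q_pi(s, a) = sum_{r,s'} p(r,s'|s,a) [r + gamma * v_pi(s')]."""
    return sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[(s, a)])

def v_from_q(s, q, policy):
    """v_pi(s) = sum_a pi(a|s) q_pi(s, a)."""
    return sum(pi_a * q[(s, a)] for a, pi_a in policy[s].items())

# With P, policy, gamma from the previous sketch and any estimate v:
v = {"s0": 1.0, "s1": 5.0}
q = {(s, a): q_from_v(s, a, v, P, gamma) for (s, a) in P}
print(v_from_q("s0", q, policy))   # same value as one Bellman backup of v at s0
```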

Substituting this back into the expression for $q_{\pi}(s, a)$, we get a recurrence relation for the action-value function.

$$
q_{\pi}(s, a) = \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s')\, q_{\pi}(s', a') \right]
$$
Bellman Expectation Equation for Action-Value Function
$$
\begin{aligned}
q_{\pi}(s, a) &= \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma\, v_{\pi}(s') \right] \\
&= \sum_{r, s'} p(r, s'|s, a) \left[ r + \gamma \sum_{a'} \pi(a'|s')\, q_{\pi}(s', a') \right]
\end{aligned}
$$
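
In code, this recursion becomes a backup that only needs the action-value table itself. A sketch, again assuming the hypothetical `P`, `policy`, and `gamma` from the first example:

```python
def bellman_q_backup(s, a, q, P, policy, gamma):
    """q_pi(s, a) = sum_{r,s'} p(r,s'|s,a)
                    [r + gamma * sum_{a'} pi(a'|s') q_pi(s', a')]."""
    return sum(
        p * (r + gamma * sum(pi_a * q[(s_next, a2)]
                             for a2, pi_a in policy[s_next].items()))
        for p, s_next, r in P[(s, a)]
    )

# With P, policy, gamma from the first sketch:
q = {(s, a): 0.0 for (s, a) in P}                         # initial guess for q_pi
print(bellman_q_backup("s0", "go", q, P, policy, gamma))  # -> 0.8 here
```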

Backup Tree Diagram

Source: Practical RL, HSE University


With the Bellman expectation equations, we can assess how well a given policy performs, i.e., evaluate the policy.
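
For example, repeatedly sweeping the Bellman expectation backup over all states until the values stop changing yields $v_{\pi}$; this is iterative policy evaluation. A minimal sketch, once more using the hypothetical MDP from the first example:

```python
def evaluate_policy(P, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: apply the Bellman expectation
    backup to every state until the largest change falls below tol."""
    states = sorted({s for (s, _) in P})
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * sum(p * (r + gamma * v[s2]) for p, s2, r in P[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# With P and policy from the first sketch:
print(evaluate_policy(P, policy, gamma=0.9))   # approximate v_pi for each state
```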