
Optimality

Value functions define a partial ordering over policies.

Definition

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states.

In other words,

$$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s) \;\; \forall s$$
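As a quick illustration, here is a minimal Python sketch that checks this state-wise dominance condition for two hypothetical value functions (the numbers below are made up, not taken from any particular MDP):

```python
import numpy as np

# Hypothetical value functions of two policies pi and pi', evaluated on the
# same four-state MDP (values are made up for illustration only).
v_pi = np.array([1.0, 2.5, 0.7, 3.2])
v_pi_prime = np.array([0.9, 2.5, 0.7, 3.0])

def better_or_equal(v_a, v_b):
    """pi_a >= pi_b  iff  v_a(s) >= v_b(s) for every state s."""
    return bool(np.all(v_a >= v_b))

print(better_or_equal(v_pi, v_pi_prime))  # True: pi dominates pi'
print(better_or_equal(v_pi_prime, v_pi))  # False

# The ordering is only partial: with v_a = [1, 0] and v_b = [0, 1],
# neither policy is better than or equal to the other.
```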

An optimal policy is one that is better than or equal to all other policies; there may be more than one. All optimal policies share the same optimal state-value function, denoted $v_{*}$.

Optimal policies also share the same optimal action-value function, denoted $q_{*}$.

For a state-action pair $(s, a)$, this gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

$$v_{*}(s) = \max_{\pi} v_{\pi}(s)$$

$$q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)$$

In any finite MDP, there is always at least one deterministic optimal policy.
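Both of these points can be checked by brute force on a small example: evaluate every deterministic policy exactly and take the state-wise maximum of their value functions. Below is a sketch in Python, using a made-up two-state, two-action MDP purely for illustration:

```python
import itertools
import numpy as np

# Made-up MDP: P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def evaluate(policy):
    """Solve v = r_pi + gamma * P_pi v exactly for a deterministic policy."""
    P_pi, r_pi = np.zeros((n_states, n_states)), np.zeros(n_states)
    for s in range(n_states):
        for prob, s_next, r in P[s][policy[s]]:
            P_pi[s, s_next] += prob
            r_pi[s] += prob * r
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# v_*(s) = max_pi v_pi(s), taken over all deterministic policies.
policies = list(itertools.product(range(n_actions), repeat=n_states))
values = {pi: evaluate(pi) for pi in policies}
v_star = np.max(np.stack(list(values.values())), axis=0)

# One deterministic policy attains v_* in every state simultaneously.
best = max(policies, key=lambda pi: values[pi].sum())
assert np.allclose(values[best], v_star)
print("v_* =", v_star, "| optimal deterministic policy:", best)
```

Restricting the enumeration to deterministic policies is enough here: since some deterministic policy is optimal, the maximum over them already equals $v_{*}(s)$ in every state.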

Bellman optimality equation for $v_{*}(s)$

Source: Practical RL, HSE University

Equations
$$
\begin{aligned}
v_{*}(s) &= \max_{a} \sum_{r, s'} p(r, s' \mid s, a)\left[r + \gamma v_{*}(s')\right] \\
&= \max_{a}\ \mathbb{E}\left[R_t + \gamma v_{*}(S_{t+1}) \mid S_t = s,\, A_t = a\right]
\end{aligned}
$$
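This equation is typically solved by iteration rather than in closed form: apply the backup above as an update rule until it reaches its fixed point, which is $v_{*}$. A minimal value-iteration sketch in Python, reusing the made-up two-state MDP from the previous example:

```python
import numpy as np

# Same made-up two-state MDP: (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

# Value iteration: v(s) <- max_a sum_{r,s'} p(r,s'|s,a) [r + gamma v(s')],
# applied as a synchronous sweep over all states.
v = np.zeros(n_states)
for _ in range(1000):
    v = np.array([
        max(sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
print("v_* approx:", v)
```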

Bellman optimality equation for $q_{*}(s, a)$

Source: Practical RL, HSE University

Equations
$$
\begin{aligned}
q_{*}(s, a) &= \mathbb{E}\left[R_t + \gamma \max_{a'} q_{*}(S_{t+1}, a') \mid S_t = s,\, A_t = a\right] \\
&= \sum_{r, s'} p(r, s' \mid s, a)\left[r + \gamma \max_{a'} q_{*}(s', a')\right]
\end{aligned}
$$
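The same fixed-point iteration works for $q_{*}$, and once $q_{*}$ is available a deterministic optimal policy can be read off by acting greedily in each state. A sketch on the same made-up MDP:

```python
import numpy as np

# Same made-up two-state MDP: (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

# Q-value iteration: q(s,a) <- sum_{r,s'} p(r,s'|s,a) [r + gamma max_a' q(s',a')].
q = np.zeros((n_states, n_actions))
for _ in range(1000):
    q = np.array([
        [sum(prob * (r + gamma * np.max(q[s_next])) for prob, s_next, r in P[s][a])
         for a in range(n_actions)]
        for s in range(n_states)
    ])

print("q_* approx:\n", q)
print("greedy policy:", np.argmax(q, axis=1))  # action per state
print("v_* approx:", np.max(q, axis=1))        # v_*(s) = max_a q_*(s, a)
```

The greedy policy obtained from $q_{*}$ is deterministic and optimal, consistent with the earlier claim that every finite MDP has at least one deterministic optimal policy.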