
Approximate Solution Methods

In our methods so far, we stored the state-value function $v_\pi(s)$ for every state (or the action-value function $q_\pi(s, a)$ for every state-action pair).

In situations with a very large state space (which is most real situations), this quickly becomes infeasible.

Now we look at methods that approximate the state-value function. We no longer require it to give the exact value for every state; an approximate value is enough.

This falls under the purview of Function Approximation, a central theme in Supervised Learning: we approximate a function from training data points and expect it to perform reasonably well on new, unseen points.

Value Function Approximation

The approximate value function is represented not as a table but as a parameterized functional form with weight vector $\bold{w} \in \mathbb{R}^{d}$, where $d \ll |S|$, the size of the state space.

We denote

$$\large \hat{v}(s, \bold{w}) \approx v_{\pi}(s)$$

for the approximate value of state $s$ given weight vector $\bold{w}$.

Changing one weight changes the estimated value of many states. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states. This generalization makes the learning potentially more powerful, but also potentially more difficult to manage and understand.
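
To make this concrete, here is a minimal sketch of a linear approximator in Python. The dimension `d`, the feature function `x(s)`, and the names `v_hat`/`w` are illustrative assumptions, not anything prescribed above.

```python
import numpy as np

d = 4  # number of features; the whole point is that d << |S|

def x(s):
    """Hypothetical feature vector for a (scalar) state s, for illustration only."""
    return np.array([1.0, s, s ** 2, np.sin(s)])

def v_hat(s, w):
    """Linear value-function approximation: v_hat(s, w) = w . x(s)."""
    return w @ x(s)

w = np.zeros(d)  # one weight per feature, shared across all states

# Nudging a single component of w changes v_hat(s, w) for every state whose
# feature vector is non-zero in that position -- the generalization above.
```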

Similarly, we can also approximate the action-value function:

$$\large \hat{q}(s, a, \bold{w}) \approx q_{\pi}(s, a)$$

We also refer to an individual update by the notation $s \mapsto g$, where $s$ is the state updated and $g$ is the update target, or goal.

$$\large \begin{aligned} s &\mapsto v_{\pi}(s) \\ s, a &\mapsto q_{\pi}(s, a) \end{aligned}$$

In a sense, the update $s \mapsto g$ means that the estimated value for state $s$ should be more like the goal $g$.

Monte Carlo

Ideal World Goals

$$\large \begin{aligned} s &\mapsto \mathbb{E}_{\pi}[G_t \mid S_t = s] \\ s, a &\mapsto \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \end{aligned}$$

Sample-based Estimates of Goals $g(s)$, $g(s, a)$

$$\large \begin{aligned} s &\mapsto R(s, \pi(s)) + \gamma G_{t+1} \\ s, a &\mapsto R(s, a) + \gamma G_{t+1} \end{aligned}$$
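
As a hedged sketch, the sampled Monte Carlo goals can be computed once an episode has finished; the reward-list format below is an assumption for illustration.

```python
def monte_carlo_goals(rewards, gamma=0.99):
    """Return the sampled goal G_t for every time step of one episode.

    `rewards` is assumed to hold R_{t+1} for t = 0, ..., T-1.
    Uses G_t = R_{t+1} + gamma * G_{t+1}, swept backwards with G_T = 0.
    """
    goals = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        goals[t] = g
    return goals

# monte_carlo_goals([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
```
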
Temporal Difference (TD)

Ideal World Goals

$$\large \begin{aligned} s &\mapsto \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] \\ s, a &\mapsto \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a] \end{aligned}$$

Sample-based Estimates of Goals $g(s)$, $g(s, a)$

$$\large \begin{aligned} s &\mapsto R(s, \pi(s)) + \gamma \hat{v}_{\pi}(S_{t+1}, \bold{w}) \\ s, a &\mapsto R(s, a) + \gamma \hat{v}_{\pi}(S_{t+1}, \bold{w}) \end{aligned}$$
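
The corresponding one-step TD goal only needs the next state, so it can be formed online. This sketch reuses the hypothetical `v_hat` and feature setup from the earlier snippet.

```python
def td_goal(reward, next_state, w, gamma=0.99):
    """One-step TD goal: R + gamma * v_hat(S_{t+1}, w).

    Bootstraps from the current approximation instead of waiting for the
    rest of the episode, unlike the Monte Carlo goal above.
    """
    return reward + gamma * v_hat(next_state, w)
```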

Loss Function

Now that we have goals $g(s)$ and $g(s, a)$ (in general notation), we can look for an objective to optimize: the Loss Function. Many loss functions are available; we go with the Mean Squared Error (MSE), the standard loss function for a regression problem in Supervised Learning.

$$\large \mathcal{L}(\bold{w}) = \frac{1}{2} \sum\limits_{s, a} \rho_{\pi}(s, a) \left[g(s, a) - \hat{q}_{\pi}(s, a, \bold{w})\right]^2$$

where ฯฯ€(s,a)\rho_{\pi}(s, a) denotes the weights of importance; simply a way to denote which (s,a)(s, a) pairs we care more about.

The weight of importance $\rho_{\pi}(s, a)$ can be thought of as the fraction of time for which the policy encounters state $s$ and, in that state, takes action $a$.

Of course, we want $\sum\limits_{s, a} \rho_{\pi}(s, a) = 1$.
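
A small sketch of this loss in code; the sample format `(rho, s, a, g)` and the `q_hat(s, a, w)` callable are assumptions chosen to mirror the formula above.

```python
def loss(w, samples, q_hat):
    """L(w) = 1/2 * sum over (s, a) of rho(s, a) * (g(s, a) - q_hat(s, a, w))^2.

    `samples` is assumed to be a list of (rho, s, a, g) tuples whose rho
    values sum to 1; `q_hat(s, a, w)` is the current action-value estimate.
    """
    return 0.5 * sum(rho * (g - q_hat(s, a, w)) ** 2
                     for rho, s, a, g in samples)
```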

Some definitions again:

Online, Offline, On-Policy, Off-Policy

Off-policy is more general

  • Behaviour policy: collects data (makes actions)
  • Target policy: subject to evaluation and improvement

On-policy is less general but easier

  • Behaviour policy is the same as Target policy

Minimizing Loss with Gradient Descent

$$\large \mathcal{L}(\bold{w}) = \sum\limits_{s, a} \rho_{\pi}(s, a) \overbrace{\frac{1}{2}\left[g(s, a) - \hat{q}_{\pi}(s, a, \bold{w})\right]^2}^{\mathcal{L}_{s, a}(\bold{w})}$$

Gradient Descent (GD)

wโ†wโˆ’ฮฑโˆ‡wL(w)\large \bold{w} \gets \bold{w} - \alpha \nabla_{\bold{w}} \mathcal{L}(\bold{w})

where

$$\large \nabla_{\bold{w}} \mathcal{L}(\bold{w}) = \left(\frac{\partial \mathcal{L}(\bold{w})}{\partial w_1}, \frac{\partial \mathcal{L}(\bold{w})}{\partial w_2}, \dots, \frac{\partial \mathcal{L}(\bold{w})}{\partial w_d} \right)^\intercal$$
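
In code, plain gradient descent is just this update applied repeatedly. A minimal sketch, where `grad_loss` stands in for whatever computes the full gradient vector (an assumption, not something defined in these notes):

```python
def gradient_descent(w, grad_loss, alpha=0.01, num_steps=1000):
    """Repeatedly apply w <- w - alpha * grad_L(w).

    `grad_loss(w)` is assumed to return the d partial derivatives of the
    loss with respect to each weight, stacked into one vector.
    """
    for _ in range(num_steps):
        w = w - alpha * grad_loss(w)
    return w
```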

Stochastic Gradient Descent (SGD)

  • On-policy: $s, a \sim \rho_{\pi}$
  • Off-policy: $s, a \sim \rho_{b}$ ($b$ = behaviour policy)

$$\large \bold{w} \gets \bold{w} - \alpha \nabla_{\bold{w}} \mathcal{L}_{s, a}(\bold{w})$$

Gradient Descent can become computationally heavy, since every step requires computing squares and derivatives over all the data. It is slow on large datasets.

In Stochastic Gradient Descent (SGD), we reduce some of this computational cost by computing the gradient on a sample rather than on the full dataset. For example, in the case of regression there is a variant called mini-batch SGD, where at each iteration we compute the gradient over a small random batch of points instead of all of them. This makes the gradient estimate noisier, but gives a good speed boost (see the sketch below).
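
For the supervised analogue, one mini-batch SGD step for linear regression might look like this sketch; the dataset `X`, targets `y`, and batch size are all illustrative assumptions.

```python
import numpy as np

def minibatch_sgd_step(w, X, y, alpha=0.01, batch_size=32, rng=None):
    """One mini-batch SGD step for linear regression.

    Uses the gradient of the squared error over a random mini-batch rather
    than over the full dataset: a noisier gradient, but far cheaper per step.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size  # gradient of 1/2 * mean squared error
    return w - alpha * grad
```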

Here, in RL, we simply assume that the $(s, a)$ pairs show up in our samples with the distribution $\rho_\pi(s, a)$ (on-policy) or $\rho_b(s, a)$ (off-policy), so the weighting is handled implicitly by how often each pair is sampled. In other words, we let go of the explicit $\rho_\pi(s, a)$ term and update on one sampled pair at a time.

Stochastic Semi-Gradient

We further simplify computation by treating the goals as fixed:

$$\large \nabla_{\bold{w}} g(s, a) = 0$$

So,

wโ†wโˆ’ฮฑโˆ‡wLs,a(w)wโ†w+ฮฑ[g(s,a)โˆ’q^ฯ€(s,a,w)]โˆ‡wq^ฯ€(s,a,w)\large \begin{aligned} \bold{w} &\gets \bold{w} - \alpha \nabla_{\bold{w}} \mathcal{L}_{s, a}(\bold{w}) \\ \\ \bold{w} &\gets \bold{w} + \alpha\left[ g(s, a) - \hat{q}_{\pi}(s, a, \bold{w}) \right] \nabla_{\bold{w}} \hat{q}_{\pi}(s, a, \bold{w}) \end{aligned}

Stochastic Semi-Gradient Descent

  • Treats goals $g(s, a)$ as fixed numbers
  • Changes parameters to move estimates closer to targets
  • Is not a proper gradient
    • Does not have SGD's theoretical convergence guarantees
    • Converges reliably in most cases
    • More computationally efficient than SGD

Targets

SARSA

$$\large g(s, a) = R(s, a) + \gamma \hat{q}_{\pi}(S_{t+1}, A_{t+1}, \bold{w})$$

Expected SARSA

$$\large g(s, a) = R(s, a) + \gamma \sum_{a'} \pi(a' \mid S_{t+1}) \, \hat{q}_{\pi}(S_{t+1}, a', \bold{w})$$

Q-Learning

$$\large g(s, a) = R(s, a) + \gamma \max\limits_{a'} \hat{q}_{\pi}(S_{t+1}, a', \bold{w})$$
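
To make the differences concrete, here is a hedged sketch of the three goals side by side; `q_hat(s, a, w)`, the policy probabilities `pi(a, s)`, and the action set `actions` are all assumptions for illustration.

```python
def sarsa_goal(r, s_next, a_next, w, q_hat, gamma=0.99):
    """SARSA: bootstrap from the action actually taken in the next state."""
    return r + gamma * q_hat(s_next, a_next, w)

def expected_sarsa_goal(r, s_next, w, q_hat, pi, actions, gamma=0.99):
    """Expected SARSA: bootstrap from the expectation over the policy.

    `pi(a, s)` is assumed to return the probability of taking action a in state s.
    """
    return r + gamma * sum(pi(a, s_next) * q_hat(s_next, a, w) for a in actions)

def q_learning_goal(r, s_next, w, q_hat, actions, gamma=0.99):
    """Q-learning: bootstrap from the greedy (maximal) action value."""
    return r + gamma * max(q_hat(s_next, a, w) for a in actions)
```

Only the bootstrap term differs: SARSA and Expected SARSA evaluate the policy being followed, while the max in Q-learning makes it an off-policy method.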