In our methods till now, we tried to store the state-value function $v_\pi(s)$ for every state (or the action-value function $q_\pi(s,a)$ for every state-action pair).
In situations with a very large state space (which is the case for most real problems), this approach quickly becomes infeasible.
Now we look at methods that approximate the state-value function. We no longer require it to give the exact value of every state; an approximate value is enough.
This falls under the purview of Function Approximation, a central theme in Supervised Learning: we approximate a function from training data points and expect it to perform reasonably well on new, unseen points.
The approximate value function is represented not as a table, but as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$, where $d \ll |\mathcal{S}|$, the size of the state space.
We denote
$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$
for the approximate value of state $s$ given weight vector $\mathbf{w}$.
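As a small illustration, here is a minimal sketch of one such parameterized form: a linear approximator over a hand-crafted feature vector. The feature map and example state below are assumptions made just for this sketch, not something fixed by the notes.

```python
import numpy as np

def feature_vector(s):
    # Hypothetical feature map: encode a scalar state as a few simple features.
    # In practice, features are hand-crafted or learned for the problem at hand.
    return np.array([1.0, s, s**2, np.sin(s)])

def v_hat(s, w):
    # Linear value-function approximation: v_hat(s, w) = w . x(s).
    # A single weight vector of size d summarizes values for the whole state space.
    return np.dot(w, feature_vector(s))

w = np.zeros(4)          # d = 4, far smaller than |S|
print(v_hat(2.5, w))     # approximate value of state s = 2.5
```

With such a representation each weight contributes to the value of every state whose features are non-zero, which is exactly the generalization discussed next.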
Changing one weight changes the estimated value of many states. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states. This generalization makes the learning potentially more powerful, but also potentially more difficult to manage and understand.
Similar to above, we can also approximate the action value function:
$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$
We also refer to an individual update by the notation $s \mapsto g$, where $s$ is the state updated and $g$ is the update target, or the goal. For example:
$s \mapsto v_\pi(s)$
$s,a \mapsto q_\pi(s,a)$
In a sense, the update $s \mapsto g$ means that the estimated value for state $s$ should be more like the goal $g$.
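Concretely, under a gradient-based method, "make the estimate more like the goal" becomes a weight update of the following standard form (the step-size parameter $\alpha$ is an assumption here, not something introduced above):

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\big[g - \hat{v}(s,\mathbf{w})\big]\,\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w})$$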
Now that we have goals $g(s)$ and $g(s,a)$ (in general notation), we can look for an objective to optimize: the Loss Function. Many loss functions are available; we go with the Mean Squared Error (MSE), the loss function for a Regression Problem in Supervised Learning:

$$\text{MSE}(\mathbf{w}) = \sum_{s,a} \mu_\pi(s,a)\,\big[g(s,a) - \hat{q}(s,a,\mathbf{w})\big]^2$$
where $\mu_\pi(s,a)$ denotes the weights of importance; simply a way to denote which $(s,a)$ pairs we care more about.
The weight of importance $\mu_\pi(s,a)$ can be thought of as the fraction of time for which the policy encounters state $s$ and, in that state, takes action $a$.
Of course, we want $\sum_{s,a} \mu_\pi(s,a) = 1$.
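To make the weighting concrete, here is a small sketch of the loss computation; the toy numbers for $\mu_\pi$, the goals, and the estimates are made up purely for illustration:

```python
import numpy as np

# Toy example with 3 (s, a) pairs.
mu   = np.array([0.6, 0.3, 0.1])   # fraction of time spent in each (s, a); sums to 1
g    = np.array([1.0, 0.0, 5.0])   # goals g(s, a)
qhat = np.array([0.8, 0.2, 2.0])   # current estimates q_hat(s, a, w)

# Weighted MSE: errors on frequently visited pairs matter more.
loss = np.sum(mu * (g - qhat) ** 2)
print(loss)  # the large error on the rarely visited third pair contributes relatively little
```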
Some definitions again:
Online, Offline, On-Policy, Off-Policy
Off-policy is the more general setting:
Behaviour policy: collects the data (selects the actions)
Target policy: the policy being evaluated and improved
Gradient Descent can become computationally heavy, since each step requires computing squared errors and their derivatives over the entire dataset; it is slow on large data.
In Stochastic Gradient Descent (SGD), we reduce this computational cost by estimating the gradient from a sample of the data rather than the full dataset. In the case of regression, for example, there is a method called mini-batch SGD, where we sample a small batch of points (rather than a single point, as in pure SGD) at each iteration. The sampled gradient is less accurate than the full-batch gradient, but each step is much cheaper, which gives a good speed boost.
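A minimal sketch of mini-batch SGD for ordinary linear regression; the synthetic data and hyperparameters are my own choices, just to illustrate the sampling idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ true_w + noise.
X = rng.normal(size=(1000, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(4)
alpha, batch_size = 0.05, 32

for step in range(500):
    # Sample a mini-batch instead of using the full dataset.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error on the mini-batch only.
    grad = -2.0 / batch_size * Xb.T @ (yb - Xb @ w)
    w -= alpha * grad

print(w)  # should end up close to true_w
```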
Here, in RL, we simply assume that the states (and actions) appear in our training examples with the same distribution $\mu_\pi(s,a)$ that we want to minimize the error over. Sampling updates from the policy's own experience then weights the errors by $\mu_\pi(s,a)$ automatically; in other words, we can drop the explicit $\mu_\pi(s,a)$ term.
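Putting this together, here is a sketch of the resulting sample-based update for a linear $\hat{v}$: states are taken as they occur under the policy, so no explicit $\mu_\pi$ weighting appears. The feature map, the (state, goal) pairs, and the step size are illustrative assumptions.

```python
import numpy as np

def features(s):
    # Hypothetical feature map for a scalar state.
    return np.array([1.0, s, s**2])

def sgd_update(w, s, g, alpha=0.1):
    # Sample-based gradient step toward the goal g for the visited state s.
    # For a linear v_hat, the gradient with respect to w is just the feature vector.
    x = features(s)
    v = np.dot(w, x)
    return w + alpha * (g - v) * x

w = np.zeros(3)
# (state, goal) pairs as they might be encountered while following the policy;
# frequently visited states simply show up more often, which replaces the mu weighting.
experience = [(0.5, 1.0), (0.5, 0.8), (1.0, 2.0), (0.5, 1.2)]
for s, g in experience:
    w = sgd_update(w, s, g)
print(w)
```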