Wednesday, November 14, 2018

Some Basic Reinforcement Learning (RL) Notes

Basics

First, let's understand the basics of Reinforcement Learning (RL). We have two main processes:
  • \textbf{Prediction} -- estimating what's going to happen in the future, based on the current state and the current action taken \footnote{($V(s), Q(s,a)$)}
  • \textbf{Control} -- figuring out how to act in a given context, with the information available \footnote{($\pi(a)$)}
These two processes are always layered over a Markov Decision Process (MDP), where ``the future only depends on the past through the present''. Let's write down a few of the key mathematical elements for such a system (we won't see these explicitly while solving Reinforcement Learning problems, but they are the objects that are being implicitly estimated). First, for every MDP we have a state transition matrix, \begin{eqnarray} P^{s^\prime}_{sa} &=& P\left(S_{t+1}=s^\prime \vert S_t=s, A_t=a \right) \end{eqnarray} which is the probability that an agent will move from state $s \to s^\prime$ after taking action $a$. Pretty simple at face value, but when the number of states is very large -- and the number of actions too -- this can be an unwieldy matrix, far too large to enumerate. We will rarely construct it explicitly when solving RL problems. What comes naturally with a state transition matrix is a set of possible states at every timestep: \begin{eqnarray} S_t \in S \end{eqnarray} and a set of actions available at each of those states: \begin{eqnarray} A_t \in A(S_t) \end{eqnarray} Similar to the MDP transition matrix $P^{s^\prime}_{sa}$, we also have the expected reward matrix, which is the expected immediate reward, given some state transition and the action taken: \begin{eqnarray} R^{s^\prime}_{sa} &=& \mathbf{E}\left[R_t \vert s, s^\prime, a \right] \end{eqnarray} Fundamental to all RL contexts is also a concept of total \textbf{value} over time, $G$, subject to some discount factor $\gamma$: \begin{eqnarray} G &=& \sum_{t} \gamma^t R_t \end{eqnarray} We can also calculate only the future realized value from some point in time $t$: \begin{eqnarray} G_t &=& \sum_{k=0}^{\infty} \gamma^k R_{t+k} \end{eqnarray} More on this later, but $G_t$ is essentially the net present value of future rewards $R_t, R_{t+1}, \dots,$ etc.
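To make these objects a little more concrete, here is a minimal sketch (my own toy example, not taken from any RL library) of a made-up two-state, two-action MDP, with $P^{s^\prime}_{sa}$ and $R^{s^\prime}_{sa}$ stored as plain arrays, plus the discounted return $G$ computed from a realized sequence of rewards:

```python
# Toy sketch of the MDP objects above, for a hypothetical 2-state, 2-action
# problem.  P[s, a, s'] is the transition probability P^{s'}_{sa};
# R[s, a, s'] is the expected immediate reward.  All numbers are made up.
import numpy as np

n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

# From state 0, action 0 usually stays put; action 1 usually moves to state 1.
P[0, 0] = [0.9, 0.1];  R[0, 0] = [0.0, 1.0]
P[0, 1] = [0.2, 0.8];  R[0, 1] = [0.0, 1.0]
P[1, 0] = [0.5, 0.5];  R[1, 0] = [0.0, 0.0]
P[1, 1] = [0.0, 1.0];  R[1, 1] = [0.0, 2.0]

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k} for a realized reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```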


Main Pieces for Prediction and Control: $V(s)$, $Q(s,a)$, $\pi(a)$

Just like one has an estimator for the mean of a Gaussian distribution: \begin{eqnarray} \hat{\mu} &=& \frac{1}{N} \sum_{i=1}^N x_i \end{eqnarray} which will converge to the true mean $\mu$ as the sample size $N$ increases: \begin{eqnarray} \lim_{N\to\infty} \hat{\mu} &=& \mu \end{eqnarray} in reinforcement learning we use estimators for the \textbf{value}, $G$, conditioned on a certain state $s$ and a certain policy $\pi$. The \textbf{Value Function} is the expected value of future returns, conditioned on the current state: \begin{eqnarray} V^\pi(s) &=& \mathbf{E}\left[G_t \vert S_t=s, \pi \right] \end{eqnarray} When we are ``learning'' in RL, what we're actually doing is refining our Value Function estimator $\hat{V}(s)$, which allows us to more accurately \textbf{predict} the rewards that are going to come downstream if we follow our policy $\pi$. Another estimator of note is the \textbf{Action-Value} function, which is the expected sum of future rewards -- or \textbf{value} -- given some state-action pair at time $t$ (note that a policy is not needed in this case; we assume that after the next step everything goes ``as well as it possibly could, in terms of decisions/actions $a$'', so we have no need for $\pi$): \begin{eqnarray} Q(s,a) &=& \mathbf{E}\left[ G_t \vert S_t=s, A_t=a \right] \end{eqnarray} And, as we learn/update our agents in RL, what we're really doing is updating our estimator $\hat{Q}(s,a)$.
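As a toy illustration (again, my own sketch, not from any particular RL package), here is what tabular estimators $\hat{V}(s)$ and $\hat{Q}(s,a)$ can look like when built by simply averaging observed returns $G_t$ over episodes -- the Monte Carlo analogue of the sample mean above. The state and action names are made up:

```python
# Tabular estimators of V(s) and Q(s,a) built by averaging observed returns.
from collections import defaultdict

V_hat = defaultdict(float)        # state           -> estimated value
Q_hat = defaultdict(float)        # (state, action) -> estimated action-value
counts_V = defaultdict(int)
counts_Q = defaultdict(int)

def update_from_episode(episode, gamma=0.9):
    """episode is a list of (state, action, reward) tuples, in time order."""
    G = 0.0
    # Walk backwards so G accumulates the discounted future rewards G_t.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        counts_V[state] += 1
        V_hat[state] += (G - V_hat[state]) / counts_V[state]
        counts_Q[(state, action)] += 1
        Q_hat[(state, action)] += (G - Q_hat[(state, action)]) / counts_Q[(state, action)]

update_from_episode([("s0", "a1", 1.0), ("s1", "a0", 0.0), ("s1", "a1", 2.0)])
print(V_hat["s0"], Q_hat[("s1", "a1")])
```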


Online Estimators


Example: Stationary Mean

Let's say we are trying to estimate the average height of males in the state of Pennsylvania. One way to do this is to take a random poll of $N$ males from voter registries (where $N \ll M$, the total number of males in PA). From all the measured heights of these men, $x_n$, we can construct a sample mean: \begin{eqnarray} \hat{\mu} &=& \frac{\sum_{n=1}^N x_n}{N} \end{eqnarray} and a sample variance: \begin{eqnarray} \hat{\sigma_x^2} &=& \frac{\sum_{n=1}^N \left(x_n - \hat{\mu}\right)^2}{N-1} \end{eqnarray} which is slightly different from the variance of our sample mean (the error bar): \begin{eqnarray} \mathrm{Var}\left[\hat{\mu}\right] &=& \frac{\hat{\sigma_x^2}}{N} \end{eqnarray} Note that the variance of the sample mean goes like $1/N$, so the bigger our survey becomes, the smaller our error bars become and the more sure we are of our ``height'' model. In Reinforcement Learning, we are constantly adjusting estimators, but in an online fashion (one data point at a time). Let's see what this looks like for a sample mean: \begin{eqnarray} \hat{\mu}_N &=& \frac{1}{N}\left( \sum_{n=1}^N x_n \right)\\ &=& \frac{1}{N}\left( x_N + \sum_{n=1}^{N-1} x_n \right)\\ &=& \frac{1}{N}\left( x_N + (N-1)\frac{\sum_{n=1}^{N-1} x_n }{N-1}\right)\\ &=& \frac{1}{N}\left( x_N + (N-1) \hat{\mu}_{N-1}\right)\\ \hat{\mu}_N &=& \hat{\mu}_{N-1} + \frac{1}{N}\left(x_N -\hat{\mu}_{N-1}\right) \end{eqnarray} We can see the form that this is ultimately taking: the change in our estimator, $\delta \hat{\mu}_{N}=\hat{\mu}_{N}-\hat{\mu}_{N-1}$, is the difference between the most recent observation and the most recent estimator value, weighted by a decreasing factor of $\frac{1}{N}$. In reinforcement learning, we will see the same online structure over and over again: \begin{eqnarray} \mathrm{new} \ \mathrm{estimator} \ \mathrm{value} &=& \mathrm{old} \ \mathrm{estimator} \ \mathrm{value} + \ \alpha \left( \mathrm{new} \ \mathrm{observation} - \mathrm{old} \ \mathrm{estimator} \ \mathrm{value} \right) \end{eqnarray} where $\alpha$ is some step size for adaptation -- be it static, or changing over time (such as $1/N$).
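Here is the same incremental mean as a few lines of Python (the heights are drawn from a made-up Gaussian, just for illustration):

```python
# Online (incremental) mean update derived above:
# mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1}).
import random

mu, n = 0.0, 0
for _ in range(10000):
    x = random.gauss(175.0, 7.0)   # hypothetical male heights in cm
    n += 1
    mu += (x - mu) / n             # new = old + step * (observation - old)

print(mu)  # converges toward the true mean (175) as n grows
```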


A Brief Fast-Forward: Online Updates of the Value Function

Let us take a brief fast-forward and look at the update equation for the value function under value iteration: \begin{eqnarray} V_{k+1}(S_t) &=& V_k(S_t) + \alpha \left[ R_{t+1} + \gamma V_k(S_{t+1})-V_k(S_t) \right] \end{eqnarray} To draw some parallels -- $k$ corresponds to $N$, our $k^\mathrm{th}$ and $(k+1)^\mathrm{th}$ sample step of the game/survey. The estimator in this case is $V(S)$, but we are not estimating only one value -- such as the average height of males in Pennsylvania -- but many different values -- such as the height of males in $s=$ PA, NY, AL, CA, etc. Because of the nature of the value function, we can always write our estimator as: \begin{eqnarray} V_k(S_t) &=& R_{t+1} + \gamma V_k(S_{t+1}) \end{eqnarray} and so what we are truly doing between the square brackets above is taking the empirical, newly observed value and its difference with our old estimate. (Ignore the discount factor $\gamma$ for now.)
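A minimal sketch of this online update in code (my own illustration; the state names and numbers are made up, and I'm writing it in the tabular, TD-style form shown above):

```python
# One online update of a tabular value-function estimator:
# V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)].
from collections import defaultdict

def online_value_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update V[s] after observing reward r and next state s_next."""
    observed = r + gamma * V[s_next]    # newly observed value of state s
    V[s] += alpha * (observed - V[s])   # new = old + alpha * (new obs - old)
    return V

V = defaultdict(float)
online_value_update(V, s="PA", r=1.0, s_next="NY")
print(V["PA"])   # 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```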


Stationarity and Non-Stationarity for Online Estimators

One very important point to note, in our example of estimating height from a population of people, is that with the step weight: \begin{eqnarray} \alpha_N &=& \frac{1}{N} \end{eqnarray} we are making an implicit assumption that the distribution of male heights \textbf{does not change while we collect samples of data} (i.e., that the distribution of heights is stationary over time). For many games this is not the case, and one way to allow an estimator to ``adapt'' over time is to set $\alpha$ to be some constant. One can think of $\alpha$ in units of one over time, or \begin{eqnarray} \alpha &=& \frac{1}{\tau} \end{eqnarray} where $\tau$ is the expected number of timesteps/datapoints needed to have a \textbf{major} influence on the estimator's conclusions/values. Our difference equation above actually ties in perfectly with an exponentially damped system: \begin{eqnarray} \Delta V(s,t) &=& \alpha \left(f(t) - V(s,t) \right)\\ \frac{\partial V(s,t)}{\partial t} &=& \alpha^\prime \left(f(t) - V(s,t) \right) \\ \left(\frac{\partial}{\partial t} + \alpha^\prime \right)V(s,t) &=& \alpha^\prime f(t) \end{eqnarray} This is the canonical example of an exponentially damped system with Green's function: \begin{eqnarray} G(s,t) &=& G(s)e^{-\alpha^\prime t} \end{eqnarray} driven by some forcing function $f(t)$, to which our solution will converge within an expected time $1/\alpha^\prime = \tau$. So the larger our step size $\alpha$, the sooner we learn things and the sooner we can ``re-estimate'' a new system. The lower $\alpha$, the higher/longer $\tau$ and the longer our estimator takes to adapt to a new system. Many reinforcement learning implementations use a ``step-down'' strategy for $\alpha$: starting relatively high -- short memory, fast adaptation -- and then decreasing slowly -- longer memory -- for assured convergence. (Much like simulated annealing, which parallels statistical mechanical systems, where one fits a model by ``turning down the temperature'' slowly, causing the model to be chaotic/exploratory at first, and then eventually still/convergent.)
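To see the difference between the $1/N$ step size and a constant $\alpha$, here is a small made-up experiment (my own sketch) where the true mean shifts halfway through the data stream -- the constant-$\alpha$ estimator forgets the old regime with time constant $\tau = 1/\alpha$, while the $1/N$ estimator keeps weighting all of its history equally:

```python
# Comparing the 1/N step size (stationarity assumed) with a constant alpha
# (exponential forgetting) on a non-stationary stream.  All numbers made up.
import random

mu_1overN, n = 0.0, 0
mu_const, alpha = 0.0, 0.05       # tau = 1/alpha ~ 20 samples of memory

for t in range(2000):
    true_mean = 0.0 if t < 1000 else 5.0    # the distribution shifts halfway
    x = random.gauss(true_mean, 1.0)
    n += 1
    mu_1overN += (x - mu_1overN) / n        # weights all history equally
    mu_const += alpha * (x - mu_const)      # forgets old data exponentially

print(mu_1overN, mu_const)  # ~2.5 vs. ~5: only the constant-alpha estimate tracks the shift
```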
