Wednesday, November 21, 2018

Re-Post: D-optimal Experiment Design and its Statistical Mechanics Interpretation (minimum Entropy of Posterior)

Reposting the below notes from a while ago.

As I've been thinking more and more about Reinforcement Learning and optimal sampling/experimental design for parametric regression models, the concept of "posterior volume" keeps coming up. There are many ways to frame the problem, but suppose we want to minimize the variance of the regression coefficient estimates for a linear model,


\begin{eqnarray}
y &=& \vec{\beta}\cdot \vec{x} + \epsilon \\
\mathrm{Cov}(\epsilon) &=& \Sigma \\
\mathrm{Cov}\left(\beta_i, \beta_j \right) &=& (X_{ni}\Sigma^{-1}_{nm} X_{mj})^{-1}
\end{eqnarray}

One often wants to `design' the design matrix -- or data matrix -- $X_{ni}$ such that our sampled observation points yield a Fisher information matrix $F_{ij} = X_{ni}\Sigma^{-1}_{nm} X_{mj}$ with the:


  • Largest determinant (D-optimal design) -- which also means minimum posterior entropy, as you'll see below
  • Smallest trace of its inverse (A-optimal design) -- minimizes the average variance of the $\beta$ estimates
  • Largest minimum eigenvalue (E-optimal design) -- minimizes the worst-case (maximum) variance of any linear combination of the $\beta$'s

There's also G-optimality, I-optimality, V-optimality, etc. -- the list goes on, with varying levels of computational feasibility and relation to the more "physical" concepts below. One of the most popular, and the most intuitive to me, is D-optimality, which for a Gaussian amounts to minimizing the volume of "theory" space inhabited by the posterior (equivalently, minimizing its entropy -- the log of phase space volume in the statistical mechanics interpretation).
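As a concrete illustration, here is a minimal numpy sketch (my own, not tied to any design-of-experiments library) that scores a candidate design matrix under the three criteria above for the linear model written earlier:

import numpy as np

def design_criteria(X, noise_cov):
    """Score a design matrix X (n samples x p features) under D-, A-, E-optimality.

    For y = X beta + eps with Cov(eps) = noise_cov, the Fisher information is
    F = X^T noise_cov^{-1} X and Cov(beta_hat) = F^{-1}.
    """
    F = X.T @ np.linalg.solve(noise_cov, X)      # Fisher information matrix
    cov_beta = np.linalg.inv(F)                  # covariance of the estimates
    return {
        "D": np.linalg.det(F),                   # maximize det(F): minimum posterior volume/entropy
        "A": np.trace(cov_beta),                 # minimize trace(F^{-1}): average variance of beta
        "E": np.linalg.eigvalsh(F).min(),        # maximize smallest eigenvalue of F: worst-case variance
    }

# Example: 20 candidate observation points in 3 dimensions, unit noise
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print(design_criteria(X, np.eye(20)))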


There isn't much literature on choosing the "next best" observation point -- solving a convex program for the next N samples, based on prior information, in an online way. Most of the literature concerns batch optimization, designing the experiment from the get-go, but extending to an online/Reinforcement Learning setting seems like a straightforward extension of the same language.
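To make the online flavor concrete, here is a hedged sketch of a greedy ``next best observation'' rule under assumptions I'm choosing purely for illustration (Gaussian prior, unit observation noise, a finite candidate pool): at each step, pick the candidate point that most increases the log-determinant of the posterior precision.

import numpy as np

def next_best_point(candidates, precision):
    """Greedy D-optimal step: choose the candidate x maximizing log det(precision + x x^T),
    i.e. the biggest collapse in posterior volume.

    candidates : (m, p) array of possible observation points
    precision  : (p, p) current posterior precision (prior precision + info collected so far),
                 assuming unit observation noise. Candidates may be re-picked in this toy version.
    """
    _, base = np.linalg.slogdet(precision)
    gains = []
    for x in candidates:
        _, logdet = np.linalg.slogdet(precision + np.outer(x, x))
        gains.append(logdet - base)
    best = int(np.argmax(gains))
    return best, precision + np.outer(candidates[best], candidates[best])

# Sequentially pick 5 points from a random candidate pool
rng = np.random.default_rng(1)
pool = rng.normal(size=(50, 3))
prec = np.eye(3)                      # Gaussian prior N(0, I) -> prior precision I
for _ in range(5):
    idx, prec = next_best_point(pool, prec)
    print("picked candidate", idx)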


--------------------------------------------------------------------------------------------------------------------
Begin Re-Post
--------------------------------------------------------------------------------------------------------------------
The starting point for all inference is the likelihood function:

\begin{eqnarray}
\mathcal{L} \left(X \vert \vec{\theta}\right)
\end{eqnarray}

where $\vec{\theta}$ is some set of parameters in our model and $X$ is the data matrix, the aggregate of all our experimental data. We can write the posterior of our experiment as:

\begin{eqnarray}
P(\vec{\theta} \vert X) &=& \frac{\mathcal{L}(X \vert \vec{\theta})P(\vec{\theta})}{P(X)}
\end{eqnarray}

Here the term in the denominator is called the evidence, and $P(\vec{\theta})$ is our prior, filled with hyperparameters that encode our initial beliefs about the model. David MacKay likes to talk about something called the ``Occam Factor'', which he gets by Taylor expanding the posterior in $\vec{\theta}$ around its maximum while keeping the prior constant:

\begin{eqnarray}
P(\vec{\theta} \vert X) &\approx & \frac{\mathcal{L}(X \vert \hat{\theta})P(\hat{\theta})}{P(X)}e^{-\frac{1}{2}(\vec{\theta}-\hat{\theta})_i F_{ij}(\vec{\theta}-\hat{\theta})_j}
\end{eqnarray}

Notice that the gradient of the log posterior vanishes at the maximum likelihood, or Maximum a Posteriori (MAP), point, but the Hessian does not. We take the Hessian of the negative log likelihood, whose expectation value is the Fisher information matrix:

\begin{eqnarray}
F_{ij} &=& \left\langle \frac{-\partial^2 \log \left(\mathcal{L}(X\vert \vec{\theta})\right)}{\partial \theta_i \partial \theta_j}\right\rangle
\end{eqnarray}

One might ask, from an information theory point of view: how much has our experiment, given the model and the data set $X$, really helped? How much have we learned relative to our prior? The answer lies in the information content (entropy) of our likelihood function, defined for any probability distribution as:

\begin{eqnarray}
H\left[ p(x) \right] &=& -\int dx p(x) \log p(x)
\end{eqnarray}

For a multidimensional Gaussian, $x \in \mathbb{R}^d$, $x \sim \mathcal{N}(0,A_{ij})$, this is:

\begin{eqnarray}
H \left[p(x) \right] &=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}x_i A^{-1}_{ij} x_j +\frac{1}{2}\log \vert A_{ij} \vert \right)
\end{eqnarray}

The first and third terms are constants, and the expectation of the second term is set by the covariance of our Gaussian distribution, $\langle x_i x_j \rangle = A_{ij}$, so we get half the trace of a $d \times d$ identity matrix:


\begin{eqnarray}
H \left[p(x) \right] &=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}A_{ij} A^{-1}_{ij}+\frac{1}{2}\log \vert A_{ij} \vert \right)\\
&=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}\mathrm{Tr}(I_d)+\frac{1}{2}\log \vert A_{ij} \vert \right)\\
&=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \vert A_{ij} \vert
\end{eqnarray}
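As a quick numerical sanity check of this closed form (a throwaway numpy sketch, nothing more), we can compare the formula against a Monte Carlo estimate of $-\langle \log p(x) \rangle$:

import numpy as np

rng = np.random.default_rng(2)
d = 3
A = np.cov(rng.normal(size=(d, 500)))            # a random positive-definite covariance

# Closed form: H = d/2 * log(2*pi*e) + 1/2 * log det(A)
H_formula = 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(A)[1]

# Monte Carlo: H = -E[log p(x)] over samples x ~ N(0, A)
samples = rng.multivariate_normal(np.zeros(d), A, size=200_000)
A_inv = np.linalg.inv(A)
log_p = (-0.5 * np.einsum("ni,ij,nj->n", samples, A_inv, samples)
         - 0.5 * d * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(A)[1])
print(H_formula, -log_p.mean())                  # the two agree to ~2 decimal places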

This is the information content of any Gaussian with arbitrary mean $\mu_i$ and covariance matrix $A_{ij}$. If we assume a Gaussian prior $P(\vec{\theta})=\mathcal{N}(0,\Sigma_{ij})$, then we can write the information content of our likelihood function and our prior as:

\begin{eqnarray}
H\left[\mathcal{L} \right] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert F_{ij}^{-1} \vert \right) \\
H\left[P(\vec{\theta})\right] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert \Sigma_{ij} \vert \right)
\end{eqnarray}

This is quite interesting, because in statistical mechanics entropy goes like the logarithm of the accessible volume in phase space. What we have here, this logarithm of a determinant, is precisely the same thing, because:

\begin{eqnarray}
\sqrt{\vert A_{ij} \vert} & \sim & \int d\vec{x} \ \ \mathrm{s.t.} \ x_i A^{-1}_{ij}x_j  \le 1
\end{eqnarray}

Or, the approximate volume of an ``ellipsoid'' in phase space (not allowing the negative log likelihood to go above unity). If we compare the information content of the prior and the likelihood, we get:

\begin{eqnarray}
H \left[\mathcal{L} \right]-H\left[P(\vec{\theta})\right] &=& \frac{1}{2} \log  \frac{\vert F_{ij}^{-1}\vert}{\vert \Sigma_{ij} \vert }
\end{eqnarray}

Exponentiating the difference in Entropies, we get a ratio of the Volumes in parameter space:

\begin{eqnarray}
e^{H \left[\mathcal{L} \right]-H\left[P(\vec{\theta})\right]} &=& \frac{\sqrt{\vert F_{ij}^{-1}\vert}}{\sqrt{\vert \Sigma_{ij}\vert}} = \frac{\mathrm{Vol}(\mathcal{L})}{\mathrm{Vol}(P)}
\end{eqnarray}

This ratio of volumes is the ``true'' Occam factor, in my opinion, as it represents the ``collapse'' of our beliefs in parameter space. But we have not taken the full posterior into account, only our learning from the likelihood function, separate from the prior. It would perhaps be more consistent to deal with the full posterior, whose negative log has the (expected) Hessian:

\begin{eqnarray}
P(\vec{\theta} \vert X) &=& \frac{\mathcal{L}(X \vert \vec{\theta})P(\vec{\theta})}{P(X)}\\
P(\vec{\theta}) &=& \mathcal{N}(0, \Sigma_{ij})\\
\mathcal{L}(X \vert \vec{\theta}) & \approx & \mathcal{N}(\hat{\theta}, F_{ij}^{-1}) \\
\langle -\frac{\partial^2 \log P(\vec{\theta}\vert X)}{\partial \theta_i \partial \theta_j} \rangle &=& F_{ij}+\Sigma_{ij}^{-1}
\end{eqnarray}

So the information content of the full posterior goes like:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right] &\sim & \frac{1}{2}\log \left( \mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})^{-1} \vert \right)\\
&=& \frac{1}{2}\log \left( \frac{1}{\mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert }\right)
\end{eqnarray}

and if we ask for the information difference, we get:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &=& \frac{1}{2}\log \left( \frac{1}{\mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert  \mathrm{det} \vert\Sigma_{ij} \vert}\right) \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &=& \frac{-1}{2}\log \left( \mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert \right) - \frac{1}{2}\log  \mathrm{det} \vert\Sigma_{ij} \vert
\end{eqnarray}

and since $\mathrm{det}\vert A+B \vert \ge \mathrm{det}\vert A \vert+\mathrm{det}\vert B \vert$ for positive semi-definite matrices, so that $-\log \mathrm{det}\vert A+B \vert \le -\log\left(\mathrm{det}\vert A \vert+\mathrm{det}\vert B \vert\right)$, we can write the inequality:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\le & \frac{-1}{2}\log \left( \mathrm{det} \vert F_{ij}\vert+ \mathrm{det} \vert\Sigma_{ij}^{-1}\vert \right) - \frac{1}{2}\log  \mathrm{det} \vert\Sigma_{ij} \vert \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\le & \frac{-1}{2}\log \left(1+ \mathrm{det} \vert F_{ij}\vert \mathrm{det} \vert\Sigma_{ij}\vert \right) \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\le & \frac{-1}{2}\log \left(1+ \mathrm{det} \vert F_{ij}\Sigma_{jl}\vert \right) \\
\mathrm{entropy} \ \mathrm{``destroyed''}& \gtrsim & \frac{1}{2}\frac{\mathrm{det} \vert\Sigma \vert }{ \mathrm{det}\vert F^{-1} \vert}
\end{eqnarray}

where in the last line I've Taylor expanded the log (valid when $\mathrm{det}(F\Sigma)$ is small) and flipped the sign to express the bound as entropy ``destroyed''. This is a lower bound on how much entropy we ``kill'' by doing our experiment $X$. We are guaranteed to learn something by observing $X$ and putting it into our model, but the important consideration is the determinant of the Fisher matrix: if it is very large, our parameter space will ``collapse'' substantially, as we saw before.
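Numerically the collapse is easy to see. The sketch below is my own toy setup (a standard-normal prior and batches of unit-noise linear observations): it computes the prior and posterior Gaussian entropies directly, and the reduction grows with the size of the Fisher matrix.

import numpy as np

def gaussian_entropy(cov):
    d = cov.shape[0]
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(cov)[1]

rng = np.random.default_rng(3)
p = 4
Sigma = np.eye(p)                                 # prior covariance
for n in (5, 50, 500):                            # more observations -> bigger Fisher matrix
    X = rng.normal(size=(n, p))
    F = X.T @ X                                   # Fisher information, unit observation noise
    post_cov = np.linalg.inv(F + np.linalg.inv(Sigma))
    delta_H = gaussian_entropy(post_cov) - gaussian_entropy(Sigma)
    print(n, round(delta_H, 3))                   # increasingly negative: entropy "destroyed"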

Wednesday, November 14, 2018

Some Basic Reinforcement Learning (RL) Notes

Basics

First, let's understand the basics of Reinforcement Learning (RL). We have two main processes:
  • \textbf{Prediction} -- estimating what's going to happen in the future, based on the current state and the current action taken (via $V(s)$ and $Q(s,a)$)
  • \textbf{Control} -- figuring out how to act in a given context, with given information (via the policy $\pi(a)$)
These two processes are always layered over a Markov Decision Process (MDP), where ``the future only depends on the past through the present''. Let's write down a few of the key mathematical elements of such a system (we won't see these explicitly while solving Reinforcement Learning problems, but they are the objects being implicitly estimated). First, for every MDP we have a state transition matrix,

\begin{eqnarray} P^{s^\prime}_{sa} &=& P\left(S_{t+1}=s^\prime \vert S_t=s, A_t=a \right) \end{eqnarray}

which is the probability that some agent will move from state $s \to s^\prime$ after taking action $a$. Pretty simple at face value, but when the number of states -- and the number of actions -- is very large, this can be an unwieldy, practically unenumerable matrix; we will often never see it explicitly when solving RL problems. What comes naturally with a state transition matrix is a set of possible states at every timestep,

\begin{eqnarray} S_t \in S \end{eqnarray}

and a set of actions available at each of those states:

\begin{eqnarray} A_t \in A(S_t) \end{eqnarray}

Similar to the MDP transition matrix $P^{s^\prime}_{sa}$, we also have the expected reward matrix, which is the expected immediate reward, given some state transition and the action taken:

\begin{eqnarray} R^{s^\prime}_{sa} &=& \mathbf{E}\left[R_t \vert s, s^\prime, a \right] \end{eqnarray}

Fundamental to all RL contexts is also a notion of total \textbf{value} over time, $G$, subject to some discount factor $\gamma$:

\begin{eqnarray} G &=& \sum_{t} \gamma^t R_t \end{eqnarray}

We can also count only the future, realized value from some point in time $t$:

\begin{eqnarray} G_t &=& \sum_{k=0} \gamma^k R_{t+k} \end{eqnarray}

More on this later, but $G_t$ is essentially the net present value of future rewards $R_t, R_{t+1}, \dots$, etc.
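To ground the notation, here is a hedged toy example -- a two-state, two-action MDP whose numbers I made up -- storing $P^{s^\prime}_{sa}$ and $R^{s^\prime}_{sa}$ as arrays and computing a discounted return $G_t$ from a reward sequence:

import numpy as np

# Toy MDP with 2 states and 2 actions (numbers are arbitrary, purely for illustration)
# P[s, a, s'] = probability of moving s -> s' after action a
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# R[s, a, s'] = expected immediate reward for that transition
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k R_{t+k} -- the 'net present value' of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0, 3.0]))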


Main Pieces for Prediction and Control: $V(s)$, $Q(s,a)$, $\pi(a)$

Just like one has an estimator for the mean of a Gaussian distribution,

\begin{eqnarray} \hat{\mu} &=& \frac{1}{N} \sum_{i=1}^N x_i \end{eqnarray}

which will converge to the true mean $\mu$ as the sample size $N$ increases,

\begin{eqnarray} \lim_{N\to\infty} \hat{\mu} &=& \mu \end{eqnarray}

in reinforcement learning we use estimators for the \textbf{value} $G$, conditioned on a certain state $s$ and a certain policy $\pi$. The \textbf{Value Function} is the expected value of future returns, conditioned on the current state:

\begin{eqnarray} V^\pi(s) &=& \mathbf{E}\left[G_t \vert S_t=s, \pi \right] \end{eqnarray}

When we are ``learning'' in RL, what we're actually doing is refining our value function estimator $\hat{V}(s)$, which allows us to more accurately \textbf{predict} the rewards that are going to come downstream if we follow our policy $\pi$. Another estimator of note is the \textbf{Action-Value} function, which is the expected sum of future rewards -- or \textbf{value} -- given some state-action pair at time $t$ (in the optimal-control setting one can drop the explicit policy by assuming that, after this first action $a$, everything goes as ``well as it possibly could'' in terms of subsequent decisions):

\begin{eqnarray} Q(s,a) &=& \mathbf{E}\left[ G_t \vert S_t=s, A_t=a \right] \end{eqnarray}

And, as we learn/update our agents in RL, what we're really doing is updating our estimator $\hat{Q}(s,a)$.
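A Monte Carlo flavor of this, in my own toy framing (a uniform-random policy standing in for $\pi$, reusing the made-up P and R arrays from the MDP sketch above): average the sampled returns from a given start state to get $\hat{V}^\pi(s)$.

import numpy as np

def mc_value_estimate(P, R, start_state, gamma=0.9, episodes=2000, horizon=30, seed=0):
    """Estimate V^pi(s) for a uniform-random policy pi by averaging sampled returns G_t."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    returns = []
    for _ in range(episodes):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):                       # truncate each episode at a fixed horizon
            a = rng.integers(n_actions)                # uniform-random policy
            s_next = rng.choice(n_states, p=P[s, a])   # sample s' from P[s, a, :]
            g += discount * R[s, a, s_next]
            discount *= gamma
            s = s_next
        returns.append(g)
    return np.mean(returns)                            # V_hat(start_state): average observed return

# Usage, with the toy P and R arrays defined in the MDP sketch above:
# print(mc_value_estimate(P, R, start_state=0))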


Online Estimators


Example: Stationary Mean

Let's say we are trying to estimate the average height of males in the state of Pennsylvania. One way to do this is to take a random poll of $N$ males from voter registries (where $N \ll M$, the total number of males in PA). From the measured heights of these men, $x_n$, we can construct a sample mean,

\begin{eqnarray} \hat{\mu} &=& \frac{\sum_{n=1}^N x_n}{N} \end{eqnarray}

and a sample variance,

\begin{eqnarray} \hat{\sigma_x^2} &=& \frac{\sum_{n=1}^N \left(x_n - \hat{\mu}\right)^2}{N-1} \end{eqnarray}

which is slightly different from the variance of our sample mean (the error bar):

\begin{eqnarray} \mathrm{Var}\left[\hat{\mu}\right] &=& \frac{\hat{\sigma_x^2}}{N} \end{eqnarray}

Note that the variance of the sample mean goes like $1/N$, so the bigger our survey becomes, the smaller our error bars become and the more sure we are of our "height" model. In Reinforcement Learning, we are constantly adjusting estimators, but in an online fashion (one data point at a time). Let's see what this looks like for a sample mean:

\begin{eqnarray} \hat{\mu}_N &=& \frac{1}{N}\left( \sum_{n=1}^N x_n \right)\\ &=& \frac{1}{N}\left( x_N + \sum_{n=1}^{N-1} x_n \right)\\ &=& \frac{1}{N}\left( x_N + (N-1)\frac{\sum_{n=1}^{N-1} x_n }{N-1}\right)\\ &=& \frac{1}{N}\left( x_N + (N-1) \hat{\mu}_{N-1}\right)\\ \hat{\mu}_N &=& \hat{\mu}_{N-1} + \frac{1}{N}\left(x_N -\hat{\mu}_{N-1}\right) \end{eqnarray}

We can see the form this is ultimately taking: the change in our estimator, $\delta \hat{\mu}_{N}=\hat{\mu}_{N}-\hat{\mu}_{N-1}$, is the difference between the most recent observation and the most recent estimator value, weighted by a decreasing factor of $\frac{1}{N}$. In reinforcement learning, we will see the same online structure over and over again:

\begin{eqnarray} \mathrm{new} \ \mathrm{estimator} \ \mathrm{value} &=& \mathrm{old} \ \mathrm{estimator} \ \mathrm{value} + \ \alpha \left( \mathrm{new} \ \mathrm{observation} - \mathrm{old} \ \mathrm{estimator} \ \mathrm{value} \right) \end{eqnarray}

where $\alpha$ is some step size for adaptation -- be it static, or changing over time (such as $1/N$).
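The derivation above collapses to a one-line update in code. A quick sketch (heights drawn from a distribution I made up) that checks the incremental mean against the batch mean:

import numpy as np

rng = np.random.default_rng(4)
heights = rng.normal(178.0, 7.0, size=1000)       # pretend survey data in cm, invented

mu_hat = 0.0
for n, x in enumerate(heights, start=1):
    mu_hat += (x - mu_hat) / n                    # mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1})

print(mu_hat, heights.mean())                     # identical up to floating point error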


A Brief Fastforward: Online updates of Value Function

Let us take a brief fast-forward and look at the online update equation for the value function (the TD(0) update):

\begin{eqnarray} V_{k+1}(S_t) &=& V_k(S_t) + \alpha \left[ R_{t+1} + \gamma V_k(S_{t+1})-V_k(S_t) \right] \end{eqnarray}

To draw some parallels: $k$ corresponds to $N$, our $k^\mathrm{th}$ and $(k+1)^\mathrm{th}$ sample step of the game/survey. The estimator in this case is $V(S)$, but we are not estimating only one value -- such as the average height of males in Pennsylvania -- but many different values -- such as the height of males in $s=$ PA, NY, AL, CA, etc. Because of the recursive nature of the value function, we can always write a newly observed, bootstrapped estimate of the value as:

\begin{eqnarray} V_k(S_t) &=& R_{t+1} + \gamma V_k(S_{t+1}) \end{eqnarray}

and so what we are truly doing between the square brackets above is taking this empirical, newly observed value and differencing it with our old estimate. (Ignore the discount factor $\gamma$ for now.)
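In code, this update is one line on a tabular estimator. A minimal sketch (the five-state array and the single hand-fed transition are invented for illustration):

import numpy as np

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_target = r_next + gamma * V[s_next]        # newly observed (bootstrapped) value
    V[s] += alpha * (td_target - V[s])            # move the estimator toward it
    return V

V = np.zeros(5)                                   # 5 states, values initialized to 0
V = td0_update(V, s=0, r_next=1.0, s_next=1)
print(V)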


Stationarity and Non-Stationarity for Online Estimators

One very important point to note in our example of estimating height from a population of people is that with the step weight

\begin{eqnarray} \alpha_N &=& \frac{1}{N} \end{eqnarray}

we are making an implicit assumption that the distribution of male heights \textbf{does not change while we collect samples of data} (i.e. that the distribution of heights is not changing over time). For many games this is not the case, and one way to allow an estimator to ``adapt'' over time is to set $\alpha$ to be some constant. One can think of $\alpha$ in units of one over time, or

\begin{eqnarray} \alpha &=& \frac{1}{\tau} \end{eqnarray}

where $\tau$ is the expected number of timesteps/datapoints needed to have a \textbf{major} influence on the estimator's conclusions/values. Our difference equation above actually ties in perfectly with an exponentially damped system:

\begin{eqnarray} \Delta V(s,t) &=& \alpha \left(f(t) - V(s,t) \right)\\ \frac{\partial V(s,t)}{\partial t} &=& \alpha^\prime \left(f(t) - V(s,t) \right) \\ \left(\frac{\partial}{\partial t} + \alpha^\prime \right)V(s,t) &=& \alpha^\prime f(t) \end{eqnarray}

This is the canonical example of an exponentially damped system, with Green's function

\begin{eqnarray} G(s,t) &=& G(s)e^{-\alpha^\prime t} \end{eqnarray}

driven by some forcing function $f(t)$, to which our solution will converge over a characteristic time $1/\alpha^\prime =\tau$. So the larger our step size $\alpha$, the sooner we learn things and the sooner we can ``re-estimate'' a new system; the lower $\alpha$, the longer $\tau$ and the longer our estimator takes to adapt to a new system. Many reinforcement learning implementations use a ``step-down'' strategy for $\alpha$: starting relatively high -- short memory, fast adaptation -- and then decreasing slowly -- longer memory -- for assured convergence. (Much like simulated annealing, which parallels statistical mechanical systems: one fits a model by ``turning down the temperature'' slowly, so that it is chaotic/exploratory at first, and eventually still/convergent.)
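To make the stationarity point concrete, here is a small invented example: the true mean jumps halfway through the data stream, and a constant $\alpha$ tracks the jump while the $1/N$ schedule averages over both regimes.

import numpy as np

rng = np.random.default_rng(5)
true_mean = np.concatenate([np.full(500, 0.0), np.full(500, 5.0)])   # mean jumps halfway through
data = true_mean + rng.normal(size=1000)

est_const, est_inv_n = 0.0, 0.0
for n, x in enumerate(data, start=1):
    est_const += 0.05 * (x - est_const)           # constant alpha = 1/tau, tau ~ 20 steps of memory
    est_inv_n += (x - est_inv_n) / n              # alpha = 1/N, assumes a stationary distribution

print(est_const, est_inv_n)                       # ~5.0 vs. ~2.5 after the regime change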