Thursday, January 17, 2019

Active Learning Through Fisher Information - a more thorough overview

Introduction

Active Learning is a very popular subject: it comes up in circles of sequential Optimal Experimental Design (sOED), Reinforcement Learning, and Bayesian Optimization. The goal is to identify the best sequence of samples, $\left[\mathbf{x}_n, y_n \right]_{n=1}^N$, to teach a model $y \sim f_\theta(\mathbf{x})$ how to maximize an objective function or maximize its own confidence in predictions. Gaussian processes succeed wildly at this in Bayesian Optimization by providing, after every sample $(x, y)$, both the mean and variance of their predictions

\begin{eqnarray}
\mathrm{E}\left[f_\theta(x)\right] \\
\mathrm{Var}\left[f_\theta(x)\right]
\end{eqnarray}

so that one can pick where to sample next based on upper confidence bounds, expected improvement, etc. -- thereby allowing the controller to swing between exploration (for learning) and exploitation (sampling from the highest estimated target value regions). Reinforcement learning can also be tuned to explore and learn by placing "optimistic"/high initial values into the value function $V(x(s_t))$ and keeping the discount factor $\gamma \approx 1$, so that an agent thinks long-term and samples more aggressively from the feature space. In sequential optimal design, the problem is often articulated as a linear/convex program, where one tries to maximize the significance -- i.e. minimize the $p$-values -- of model coefficients $\theta$ while not over-running experimental costs (since acquiring a sample from some population has an associated expense).

All of these problems center around different formulations of an acquisition function, which tells a learner/model where to look next. We're going to take a hard look at the Fisher information matrix and the entropy of priors and posteriors, so that, for any Generalized Linear Model that is trying to 'learn', one can clearly stitch together the best possible "path" of samples. Immediate applications arise in experimental design, and for those RL agents that are composed of linear models, this represents a clear way of killing uncertainty.

Shannon Entropy and Laplace Approximation


As always in model settings, we can articulate our prior beliefs on the problem through some prior distribution on our model parameters, $p(\theta)$, and that prior can be associated with a level of information. Let's look at the definition of Shannon entropy on a probability distribution for our prior:

\begin{eqnarray}
H[p(\theta)] &=& - \int d\theta \ p(\theta) \mathrm{log}\left( p(\theta) \right)
\end{eqnarray}

Now, this will return a single number that represents the "informativeness" of the prior: low entropy means a highly informative, opinionated prior; high entropy means a highly ambivalent, flexible prior. The most extreme examples are a Dirac delta function (where only a single value of $\theta$ has any chance) and a uniform distribution (where all $\theta$ values have equal chance), respectively.

Now, priors take many forms, so in order to unify things let's just say we can approximate the probability distribution $p(\theta)$ as a Gaussian, using a second-order Taylor expansion of the log-density around its mode -- the Laplace approximation (check it out on Wikipedia) -- which puts our prior into Gaussian form:

\begin{eqnarray}
p(\theta) &\approx & \sqrt{\frac{\mathrm{det}\vert F \vert}{(2\pi)^d}} \ e^{-\frac{(\theta_i - \mu_i) F_{ij} (\theta_j - \mu_j)}{2}}
\end{eqnarray}

where the inverse covariance matrix is given by the negative Hessian of the log-prior, evaluated at the mode $\mu$:

\begin{eqnarray}
F_{ij} &=& -\frac{\partial^2 \mathrm{log}(p(\theta))}{\partial \theta_i \partial \theta_j}
\end{eqnarray}

This matrix is called the information matrix - and it tells us about the curvature of the prior, or approximately how much volume that probability distribution takes up in "theory space". Let's take the entropy of this distribution:

\begin{eqnarray}
H[p(\theta)] &=& \frac{d}{2}\mathrm{log}(2\pi e) - \frac{1}{2}\mathrm{log}(\mathrm{det}\vert F \vert)
\end{eqnarray}

Notice that we have two terms: a dimension-dependent constant, and the log-determinant of the Fisher information.\footnote{Now take a hard look at this equation, and then look at the Boltzmann probability distribution and entropy from statistical mechanics:

\begin{eqnarray}
p(E) &=& \frac{1}{Z}e^{-\beta E} \\
S[p] &=& \mathrm{log}(Z) + \beta \langle E \rangle
\end{eqnarray}

There is a perfect correspondence between $S$ and $H$, and between $Z$ and $\mathrm{det}\vert F \vert$.} The larger the determinant of the information matrix, the lower the entropy of our distribution.
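This Gaussian entropy formula is easy to sanity-check numerically. Here's a quick sketch (the matrix $F$ below is just an arbitrary positive-definite example) comparing it against scipy's built-in multivariate normal entropy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Check H = (d/2) log(2*pi*e) - (1/2) log det(F) against scipy,
# using an arbitrary 3x3 positive-definite information matrix F.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
F = A @ A.T + 3 * np.eye(3)
d = F.shape[0]

H_formula = 0.5 * d * np.log(2 * np.pi * np.e) - 0.5 * np.linalg.slogdet(F)[1]
H_scipy = multivariate_normal(mean=np.zeros(d), cov=np.linalg.inv(F)).entropy()

print(H_formula, H_scipy)  # the two agree to numerical precision
```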

Entropy of Posterior

Now, once data observations $X$ come into play, we update our beliefs based on what we've seen, using the likelihood $\mathcal{L}(X \vert \theta)$, which represents the frequentist question "how likely is the data, given my model?". Using Bayes' theorem we can write down an expression for the posterior:

\begin{eqnarray}
p(\theta \vert X) &=& \frac{\mathcal{L}(X\vert\theta)p(\theta)}{p(X)}
\end{eqnarray}

And, if we do some math -- take the log of both sides and differentiate twice -- it's fairly easy to show that the Fisher information matrices of the prior and the likelihood are additive:

\begin{eqnarray}
F^{(\mathrm{posterior})}_{ij} &=& F^{(\mathrm{observations})}_{ij}+F^{(\mathrm{prior})}_{ij}
\end{eqnarray}

\paragraph{Information Gain} If we want our "updated" information -- the entropy of the posterior, $H[p(\theta \vert X)]$ -- note that the dimensionality $d$ hasn't changed, so the constant cancels and we're left with the difference:

\begin{eqnarray*}
H[p(\theta)] - H[p(\theta \vert X)] &=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert F^{(\mathrm{observations})}_{ij}+F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray*}

This expression is called the information gain (written as prior entropy minus posterior entropy, so that it is positive), and it is precisely the quantity we want to maximize in D-optimal design, so common in sequential optimal experimental design circles. In that case, however, one typically ignores the information from the prior.
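In code the gain is a two-liner; here's a minimal helper (the name information_gain is mine) built on numpy's slogdet, which is more numerically stable than computing the determinant directly:

```python
import numpy as np

def information_gain(F_obs, F_prior):
    """Entropy drop (in nats) from adding observation information F_obs
    to prior information F_prior, both d x d positive-definite matrices."""
    _, logdet_post = np.linalg.slogdet(F_obs + F_prior)
    _, logdet_prior = np.linalg.slogdet(F_prior)
    return 0.5 * (logdet_post - logdet_prior)
```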

Convex Optimization and D-optimal Experimental Design


Now, the quantity mentioned above is of the form $\mathrm{log}(\mathrm{det}|X|)$, which is a concave function over positive-definite matrices -- so maximizing it is a convex optimization problem, which gives a lot of advantages. The information gain is a function of the datapoints $\mathbf{x}$, and so we have a function that we would clearly like to maximize:

\begin{eqnarray}
\mathrm{EIG}(X) &=& H[p(\theta)] - H[p(\theta \vert X)] \\
&=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert F^{(\mathrm{observations})}_{ij}(X)+F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray}

Chaining Experiments $X_1, X_2, \dots$

The input to this information gain function can be a single vector $\vec{x}$ or a set of $N$ vectors of dimension $D$ -- a design matrix $X_{nd}$ -- just like a normal batch training set. And, going beyond even this, we can chain experiments together $X_1, X_2, X_3$ by calculating the incremental information after the first set of observations,

\begin{eqnarray}
\mathrm{EIG}(X_1) &=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert F_{ij}(X_1)+F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray}

the second,

\begin{eqnarray}
\mathrm{EIG}(X_2 \vert X_1) &=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert F_{ij}(X_2)+ F_{ij}(X_1) +F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F_{ij}(X_1) + F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray}

and the third,

\begin{eqnarray}
\mathrm{EIG}(X_3 \vert X_1, X_2) &=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert F_{ij}(X_3) +F_{ij}(X_2)+ F_{ij}(X_1) +F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F_{ij}(X_2) + F_{ij}(X_1) + F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray}

and so on. What we're doing here is chaining together sequential observations, so that at each step we take the optimal design step forward.
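As a sketch of the bookkeeping -- using the linear-regression form $F(X) = X^T X$ derived in the example below, with toy random design matrices standing in for $X_1, X_2, X_3$ -- chaining just means accumulating Fisher matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
F_prior = np.eye(3)
F_running = F_prior.copy()              # all information conditioned on so far

# Three toy experiments X_1, X_2, X_3, each a 5 x 3 design matrix
for t, X_t in enumerate([rng.normal(size=(5, 3)) for _ in range(3)], start=1):
    F_t = X_t.T @ X_t                   # F_ij(X_t) for linear regression
    gain = 0.5 * (np.linalg.slogdet(F_t + F_running)[1]
                  - np.linalg.slogdet(F_running)[1])
    print(f"EIG(X_{t} | history) = {gain:.3f}")
    F_running += F_t                    # condition on the new batch
```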

Solving for optimal $\mathrm{EIG}(X)$

Let's say we are actually trying to solve for the optimal next $K$ samples. And, let's say that we can choose from some very large array of possible feature vectors, which we can call the population:

\begin{eqnarray}
\mathrm{population} &=& \left[\vec{\mu}_m \right]_{m=1}^{M}\\
&=& U_{md}
\end{eqnarray}

Out of all the $M$ samples, which $K$ do we choose? How do we go about optimizing $\mathrm{EIG}$? The answer is that if we want to do this with a program it will have to be an integer one, since the decision variables will all have to be zero or one -- to sample or not to sample -- which is not great news for speed or scalability.

\paragraph{Example} For an explicit example, take linear regression (with unit noise variance), whose Fisher information matrix is the sum of outer products of the feature vectors -- or, in plainer English, $N$ times the (uncentered) covariance estimator (summation over the repeated index $n$ is implied in the last line):

\begin{eqnarray}
F_{ij}(X) &=& N \mathrm{Cov}(i, j) \\
F_{ij}(X) &=& \sum_{n=1}^N \vec{x}_n \vec{x}_n ^T \\
F_{ij}(X) &=& X_{ni}X_{nj}
\end{eqnarray}
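These three lines are the same object written three ways; a quick numpy check with a toy design matrix makes the index gymnastics concrete:

```python
import numpy as np

X = np.random.default_rng(2).normal(size=(100, 3))     # toy design matrix
F_einsum = np.einsum('ni,nj->ij', X, X)                # F_ij = X_ni X_nj
F_outer = sum(np.outer(x, x) for x in X)               # sum of outer products
assert np.allclose(F_einsum, F_outer)
assert np.allclose(F_einsum, X.T @ X)
```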

If we frame this in terms of our population, with decision variables $\lambda_m \in \{0,1\}$ indicating whether or not to sample that member of the population, we have for the 'population poll' Fisher information, based on whom we choose:

\begin{eqnarray}
F_{ij}(\lambda) &=& \sum_{m=1}^M \lambda_m \vec{\mu}_m \vec{\mu}_m ^T \\
F_{ij}(\lambda) &=& \lambda_m U_{mi}U_{mj}
\end{eqnarray}

which one can then plug into our expected information gain:

\begin{eqnarray}
EIG(\lambda) &=& \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert \lambda_m U_{mi}U_{mj} + F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right)
\end{eqnarray}

The quantity above is very difficult to optimize, and the program would look something like this:

\begin{eqnarray}
\mathrm{maximize} \ && \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert \lambda_m U_{mi}U_{mj} + F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right) \\
\mathrm{subject} \ \mathrm{to} && \sum_m \lambda_m = K \\
&& \lambda_m \in \{0, 1\} \ \forall \ m
\end{eqnarray}
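For tiny populations one can brute-force this integer program by enumerating all $\binom{M}{K}$ subsets. A sketch with a made-up population $U$ and an identity prior follows -- mostly to make the combinatorial explosion obvious:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
M, d, K = 12, 3, 4
U = rng.normal(size=(M, d))                  # candidate population U_{md}
F_prior = np.eye(d)                          # stand-in prior information
logdet_prior = np.linalg.slogdet(F_prior)[1]

def eig_of(subset):
    X = U[list(subset)]                      # design matrix of the chosen K rows
    return 0.5 * (np.linalg.slogdet(X.T @ X + F_prior)[1] - logdet_prior)

best = max(combinations(range(M), K), key=eig_of)
print(best, eig_of(best))                    # best K-subset and its gain
```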


\paragraph{Relaxed Design for $K$ additional samples} One solution to this computational difficulty, often used in D-optimal design, is to relax the integer constraints and just ask ourselves: "what's the best proportion of samples from my population if I'm going to pick $K$ more samples?" That can be solved much more quickly:

\begin{eqnarray}
\mathrm{maximize} \ && \frac{1}{2}\mathrm{log}\left(\frac{\mathrm{det}\vert K \lambda_m U_{mi}U_{mj} + F^{(\mathrm{prior})}_{ij} \vert} {\mathrm{det}\vert F^{(\mathrm{prior})}_{ij} \vert} \right) \\
\mathrm{subject} \ \mathrm{to} && \sum_m \lambda_m = 1 \\
&& 0 \le \lambda_m \le 1 \ \forall \ m
\end{eqnarray}

Notice the difference in the constraints. This relaxed design problem applies equally well if we've already conditioned on previous observations $X_1, X_2, \dots$
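The relaxed program is a textbook log-det maximization, so a modeling tool like cvxpy can handle it directly. Here's a sketch, again with a made-up population and identity prior:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
M, d, K = 50, 4, 10
U = rng.normal(size=(M, d))                  # candidate population U_{md}
F_prior = np.eye(d)                          # stand-in prior information

lam = cp.Variable(M, nonneg=True)            # relaxed sampling proportions
F_design = K * (U.T @ cp.diag(lam) @ U) + F_prior
problem = cp.Problem(cp.Maximize(cp.log_det(F_design)),
                     [cp.sum(lam) == 1])
problem.solve()

print(lam.value.round(3))                    # the optimal design proportions
```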

\paragraph{Note} One needs to think carefully about derivatives of determinants here, but it's quite possible that the optimal \textit{next} sampling proportion $\lambda$ changes direction as a function of $K$. This is interesting indeed.

Sampling Towards Optimal Design Balance $\tilde{\lambda}$


Now, we've helped ourselves a little with the above by relaxing the integer constraints on the problem. But what if we want to sample toward the optimal design proportion $\tilde{\lambda}$ over the long term, without ever having to update determinants or Fisher matrices? One answer is to compare, at any point in time, the current proportion $\lambda$ to the solved ideal proportion of samples from the population, $\tilde{\lambda}$. The difference of these vectors tells us the optimal "path" along which to sample:

\begin{eqnarray}
\nu_m &=& \tilde{\lambda}_m-\lambda_m
\end{eqnarray}

But this vector has some strange properties -- namely, its negative components will tell us to \textbf{unsample} formerly sampled points! How do we account for this? One answer is to use log odds instead. Let's take the log odds of the optimal to current proportion:

\begin{eqnarray}
\nu_m &=& \mathrm{log}\left(\frac{\tilde{\lambda}_m}{\lambda_m}\,\frac{1-\lambda_m}{1-\tilde{\lambda}_m} \right)
\end{eqnarray}

Now, we can use these log odds to generate sampling probabilities for each member of the population, either by perturbing them with random noise drawn from a type I extreme value (Gumbel) distribution and taking the argmax, or by using a softmax.
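A sketch of both options (function names are mine): softmax the log odds into a probability vector, or, equivalently, perturb them with Gumbel noise and take the argmax:

```python
import numpy as np

def sampling_probs(nu, temperature=1.0):
    """Softmax of the log odds nu -> per-member sampling probabilities."""
    z = nu / temperature
    z -= z.max()                       # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()

def gumbel_max_draw(nu, rng):
    """Equivalent draw: perturb log odds with Gumbel noise, take the argmax."""
    return int(np.argmax(nu + rng.gumbel(size=nu.shape)))
```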

Tuesday, January 15, 2019

Active Learning through Fisher Information Gain

Typical Active Learning/Expensive Observation Setup

Typically when people are in 'active learning' data environments, there are a few issues:
  • You have a tiny training set, but can expensively review further points (x,y)
  • You can never hope to label all the data points out there -- i.e. total test set is huge
  • The labels themselves (y) are highly skewed / rare
  • The data is highly biased in feature (x) space
And in reaction, people decide to do one of a few fairly easy things:
  • Uncertainty sampling: building a simple model p(y|x) and then prioritizing further data point review using the Gini impurity, gini = p(y|x)(1-p(y|x)). Essentially, this is the variance of a Bernoulli random variable (a minimal scoring sketch follows this list)
  • Stratified sampling: getting rid of bias in some dimension -- be it labels (y) or some categorical combination of features (x) -- such that further expensive (x,y) reviews are well spread across the feature space
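As referenced above, a minimal uncertainty-sampling score -- assuming any fitted binary classifier with an sklearn-style predict_proba method:

```python
import numpy as np

def gini_uncertainty(model, X_pool):
    """Gini impurity p(1-p) for each unlabeled point; peaks at p = 0.5."""
    p = model.predict_proba(X_pool)[:, 1]
    return p * (1 - p)

# next_idx = np.argmax(gini_uncertainty(model, X_pool))
```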

Trouble is, uncertainty sampling has no notion of x-space bias, and stratified sampling has no notion of prediction uncertainty!  How do we combine these? Answer: Fisher Information and sequential optimal experimental design (sOED).

Fisher Information Matrix for GLMs (Generalized Linear Models)



For all Generalized Linear Models, you can write the Fisher information matrix in the form:

\begin{eqnarray}
F_{ij}(X, \theta) &=& -\frac{\partial^2 \mathrm{log}\,\mathcal{L}(X \vert \theta)}{\partial \theta_i \partial \theta_j} \\
&=& \sum_n X_{ni} D_{nn}X_{nj}
\end{eqnarray}

where $D$ is most often some diagonal matrix representing the variance of the predictions -- for logistic regression, $D_{nn} = p_n(1-p_n)$. This means that it's incredibly easy to evaluate the Fisher information gain for a single new observation, $x$, which would be something like:

\begin{eqnarray}
\mathrm{eig} &=& \frac{1}{2}\mathrm{log} \left( \mathrm{det} \vert F_{ij}(X, \theta) + F_{ij}(X^\prime, \theta) \vert \right) - \frac{1}{2}\mathrm{log} \left( \mathrm{det} \vert F_{ij}(X, \theta) \vert \right) \\
&=& \frac{1}{2}\mathrm{log} \left( \frac{\mathrm{det} \vert F_{ij}(X, \theta) + F_{ij}(X^\prime, \theta) \vert}{\mathrm{det} \vert F_{ij}(X, \theta) \vert} \right)
\end{eqnarray}

where $X$ is the training data and $X^\prime$ is the newly observed data. The above equation is exactly the posterior -- $p(\theta \vert X)$ -- volume collapse outlined in this earlier post.

Because I've been thinking more about experimental design these days, below is a simple implementation of active learning for Logistic Regression, using the Fisher information matrix to sequentially pick the next samples.

Python Jupyter Notebook on Gist: https://gist.github.com/rspeare/fa2f152f0df5f5cd339a94e6f6f71d15


Example Code
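The notebook itself lives at the gist above; below is a condensed sketch of the same idea -- greedily picking the pool point with the largest Fisher information gain for logistic regression -- with hypothetical names (fisher_info, next_sample, and a ridge strength alpha standing in for the prior information):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fisher_info(X, p):
    """F = X^T D X with D_nn = p_n (1 - p_n), the logistic-regression case."""
    return (X * (p * (1 - p))[:, None]).T @ X

def next_sample(model, X_train, X_pool, alpha=1e-3):
    """Index of the pool point with the largest Fisher information gain."""
    d = X_train.shape[1]
    F = fisher_info(X_train, model.predict_proba(X_train)[:, 1])
    F += alpha * np.eye(d)                      # alpha*I plays the prior's role
    logdet_F = np.linalg.slogdet(F)[1]
    p_pool = model.predict_proba(X_pool)[:, 1]
    gains = [0.5 * (np.linalg.slogdet(F + p * (1 - p) * np.outer(x, x))[1]
                    - logdet_F)
             for x, p in zip(X_pool, p_pool)]
    return int(np.argmax(gains))

# Usage: fit, pick the next point, label it, repeat.
# model = LogisticRegression().fit(X_train, y_train)
# m = next_sample(model, X_train, X_pool)
```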



Next - Study Design

Next up: relaxed study design, and how to sample towards the 'optimal' configuration $\lambda$, most often obtained by solving the convex program:

\begin{eqnarray}
\mathrm{minimize} && -\mathrm{log}\left( \mathrm{det} \left\vert \sum_c \lambda_c \mathbf{x}_c D_{cc} \mathbf{x}_c^T \right\vert \right) \\
\mathrm{subject} \ \mathrm{to}&& \ \sum_c \lambda_c = 1 \\
&& \ \lambda_c \ge 0 \ \forall \ c
\end{eqnarray}