Tuesday, January 15, 2019

Active Learning through Fisher Information Gain

Typical Active Learning/Expensive Observation Setup

Typically, when people find themselves in 'active learning' data environments, there are a few common issues:
  • You have a tiny training set, but can expensively review further points (x,y)
  • You can never hope to label all the data points out there -- i.e., the total test set is huge
  • The labels themselves (y) are highly skewed / rare
  • The data is highly biased in feature (x) space
And in reaction, people tend to do one of a few fairly easy things:
  • Uncertainty sampling: building a simple model p(y|x) and then prioritizing further data point review using the Gini impurity, which is gini = p(y|x)(1 - p(y|x)). Essentially, this is the variance of a Bernoulli random variable (see the short sketch after this list)
  • Stratified sampling: getting rid of bias in some dimension -- be it labels (y) or some categorical combination of features (x) -- such that further expensive (x,y) reviews are well spread across the feature space
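
As a quick illustration of the first approach, here's a minimal sketch of uncertainty sampling with scikit-learn. The helper names (gini_impurity, pick_next) and the synthetic data are mine, purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

def gini_impurity(p):
    # uncertainty of a Bernoulli prediction: p(1 - p), maximized at p = 0.5
    return p * (1.0 - p)

def pick_next(model, X_pool, n=10):
    # rank unlabeled pool points by prediction uncertainty and return the top n
    p = model.predict_proba(X_pool)[:, 1]
    return np.argsort(-gini_impurity(p))[:n]

rng = np.random.RandomState(0)
X_train = rng.randn(20, 3)                 # tiny labeled set
y_train = (X_train[:, 0] > 0).astype(int)
X_pool = rng.randn(1000, 3)                # large unlabeled pool

model = LogisticRegression().fit(X_train, y_train)
next_indices = pick_next(model, X_pool)    # send these rows for expensive review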

Trouble is, uncertainty sampling has no notion of x-space bias, and stratified sampling has no notion of prediction uncertainty!  How do we combine these? Answer: Fisher Information and sequential optimal experimental design (sOED).

Fisher Information Matrix for GLMs (Generalized Linear Models)



For all generalized linear models, you can write the Fisher information matrix in the form:

\begin{eqnarray}
F_{ij}(X, \theta) &=& -\mathbb{E}\left[ \frac{\partial^2 \mathcal{L}(X \vert \theta)}{\partial \theta_i \partial \theta_j} \right] \\
&=& \sum_n X_{ni} D_{nn} X_{nj}
\end{eqnarray}

Where $D$ is most often a diagonal matrix representing the variance of the predictions -- for logistic regression, $D_{nn} = p_n(1 - p_n)$, where $p_n$ is the predicted probability for row $n$. This means that it's incredibly easy to evaluate the Fisher information gain for a single new observation, $x^\prime$, which would be something like

\begin{eqnarray}
\mathrm{eig} &=& \mathrm{log} \ \mathrm{det} \left( F(X, \theta) + F(X^\prime, \theta) \right) - \mathrm{log} \ \mathrm{det} \left( F(X, \theta) \right) \\
&=& \mathrm{log} \left( \frac{\mathrm{det} \left( F(X, \theta) + F(X^\prime, \theta) \right)}{\mathrm{det} \left( F(X, \theta) \right)} \right)
\end{eqnarray}

Where $X$ is training data and $X^\prime$ is newly observed data. The above equation is exactly the posterior -- $p(\theta \vert X)$ -- volume collapse outlined in this earlier post.
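
As a concrete example (assuming logistic regression with the canonical link, so that $D_{nn} = p_n(1 - p_n)$), a single new point $x^\prime$ with predicted probability $p^\prime$ contributes the rank-one update $p^\prime(1 - p^\prime) \, x^\prime x^{\prime T}$, and the matrix determinant lemma collapses the gain to a single quadratic form:

\begin{eqnarray}
\mathrm{eig}(x^\prime) &=& \mathrm{log} \left( \frac{\mathrm{det} \left( F(X, \theta) + p^\prime (1 - p^\prime) \, x^\prime x^{\prime T} \right)}{\mathrm{det} \left( F(X, \theta) \right)} \right) \\
&=& \mathrm{log} \left( 1 + p^\prime (1 - p^\prime) \, x^{\prime T} F(X, \theta)^{-1} x^\prime \right)
\end{eqnarray}

So scoring a large pool of candidate points only requires inverting the training-set information matrix once.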

Because I've been thinking more about experimental design these days, below is a simple implementation of active learning for logistic regression that uses the Fisher information matrix to sequentially pick the next samples.

Python Jupyter Notebook on Gist: https://gist.github.com/rspeare/fa2f152f0df5f5cd339a94e6f6f71d15


Example Code
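
The full notebook is in the gist above; below is a minimal, self-contained sketch of the same idea for logistic regression. The function names (fisher_information, information_gain) and the synthetic data are my own placeholders, not taken from the notebook:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fisher_information(X, p):
    # F = X^T D X with D = diag(p * (1 - p)) -- the GLM form above, for logistic regression
    d = p * (1.0 - p)
    return (X * d[:, None]).T @ X

def information_gain(F_train, X_pool, p_pool):
    # eig = log det(F + F') - log det(F) for each candidate row of X_pool (a rank-one update)
    logdet_F = np.linalg.slogdet(F_train)[1]
    gains = np.empty(len(X_pool))
    for i, (x, p) in enumerate(zip(X_pool, p_pool)):
        F_new = F_train + p * (1.0 - p) * np.outer(x, x)
        gains[i] = np.linalg.slogdet(F_new)[1] - logdet_F
    return gains

rng = np.random.RandomState(42)
X_train = rng.randn(30, 3)
y_train = (X_train @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.randn(30) > 0).astype(int)
X_pool = rng.randn(2000, 3)

model = LogisticRegression().fit(X_train, y_train)
F_train = fisher_information(X_train, model.predict_proba(X_train)[:, 1])
gains = information_gain(F_train, X_pool, model.predict_proba(X_pool)[:, 1])

# the next point to send for expensive review: the one with the largest information gain
next_idx = int(np.argmax(gains))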



Next - Study Design

Next up: relaxed study design and how to sample towards the 'optimal' configuration $\lambda$, which is most often obtained by solving the convex program:

\begin{eqnarray}
\mathrm{minimize} && -\mathrm{log} \ \mathrm{det} \left( \sum_c \lambda_c \mathbf{x}_c D_{cc} \mathbf{x}_c^T \right) \\
\mathrm{subject} \ \mathrm{to} && \ \sum_c \lambda_c = 1 \\
&& \ \lambda_c \geq 0 \ \forall c
\end{eqnarray}
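
For what it's worth, the relaxed design above can be handed straight to an off-the-shelf convex solver. Here is a sketch using cvxpy; the candidate pool X and the variance weights d are invented purely for illustration:

import cvxpy as cp
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3)       # candidate design points x_c, one per row
d = np.full(20, 0.25)      # GLM variance weights D_cc, e.g. p(1 - p) at p = 0.5

lam = cp.Variable(20, nonneg=True)

# information matrix of the relaxed design: sum_c lambda_c * D_cc * x_c x_c^T
M = sum(lam[c] * (d[c] * np.outer(X[c], X[c])) for c in range(20))

# minimizing -log det is the same as maximizing log det
problem = cp.Problem(cp.Maximize(cp.log_det(M)), [cp.sum(lam) == 1])
problem.solve()

weights = lam.value  # how to spread the labeling budget across the candidates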

