Tuesday, December 29, 2015

EM Algorithm and restricted K-means for the M-Traveling Salesman Problem

One very popular and very useful algorithm that it's taken me a while to get around to is something called the Expectation-Maximization algorithm, or EM. A lot of people treat this thing as a black box, but because I wanted to implement my own handwritten constraints in a K-means and Gaussian mixture model, I had to sit down and read through things.

Turns out the EM algorithm is all about introducing auxiliary variables to your problem -- auxiliary data to be exact. Physicists solve things all the time by ``enriching'' and then integrating out, which is exactly what EM does. Let's say we have a sequence of observations $\vec{x}_i$ for $i=1...N$, and we would like to estimate their density. Problem is, they seem to form some awful distribution, something that would be impossible to model with a Gaussian distribution or a Poisson or any ``sane'' thing. We could resort to non-parametrics, such as kernel density estimation, but another idea may be to say ``Hey, I bet the first data point came from a Gaussian distribution, but one that was centered over THERE. Let's call it Gaussian A. And I bet the second was drawn from the same Gaussian A. But the third, which is really far away, was drawn from an entirely different Gaussian distribution, which we can call B \dots" etc.

What we're doing here is introducing ``membership'', which of course has a close relation to clustering and space segmentation, but more on that later. We now have our auxiliary variables in the problem.

\begin{eqnarray}
\vec{x}_i, \vec{z}_i
\end{eqnarray}

For each $i$, $\vec{x}$ is the observable, living in let's say $D$ dimensions, since each data point has $D$ features, while $\vec{z}$ is the membership. For the Gaussian mixture case $\vec{z}$ lives in $K$ dimensions, where $K$ is the number of Gaussians -- or clusters -- we allow to build the space. Now it's pretty simple to write down the log likelihood, but we have to think about what parameters we are conditioning on. We've got the mean and variance of EACH Gaussian, let's call them $\mathbf{\mu}_k, \mathbf{\Sigma}_k$, and we've also got the probability of membership $\vec{z}_i$, for every $i$, essentially its PDF. Let's call this $\vec{\phi}$, which note, has $K$ components and is in general a multinomial for the whole data set, multinoulli for a single draw. (The attentive reader will note that I'm taking notation directly from Andrew Ng and Kevin Murphy, who both have great notes on this.)

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \log \left(P(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

Let's take this a step at a time. If we assume that every data point $\vec{x}_i$ is independent, then we can write
\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left(P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

Now, if we want to include our auxiliary variables, we have to include them explicitly inside the log but marginalize over them:

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

And now, we can use a very interesting trick. My first intuition at this point was to expand the joint inside the log like so

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} P(\vec{x}_i \vert \mathbf{\mu}_k,\mathbf{\Sigma}_k)P(z_i \vert \phi) \right)
\end{eqnarray}

and keep working, but apparently there's a more useful way to do things. If we divide and multiply by some UNKNOWN PDF in $z_i$, $Q(z_i)$, we can use Jensen's inequality to write:

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} Q(z_i) \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)}\right)\\
& \ge & \sum_i \sum_{z_i} Q(z_i)  \log \left( \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)}\right)
\end{eqnarray}

We can do this because log is a concave function, and for any concave function we have:

\begin{eqnarray}
E\left[f(x) \right] & \le & f(E\left[x \right])
\end{eqnarray}

Our expectation is over $z_i$, and so now we see that our log likelihood is bounded below by this somewhat easier expression on the RHS. But it kind of looks familiar... it's the negative KL divergence between the two distributions in $z_i$! If we want to make this lower bound tight -- to make the gap between it and the true log likelihood as small as possible -- we need to minimize that KL divergence, essentially setting:

\begin{eqnarray}
P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &= & \mathrm{const} Q(z_i) \ \ \forall \ z_i
\end{eqnarray}

So if we sum both sides in $z_i$ we get the marginalized distribution on the left and a constant on the right:

\begin{eqnarray}
\sum_{z_i}P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &= & \sum_{z_i} \mathrm{const} Q(z_i) \\
P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \mathrm{const}
\end{eqnarray}

i.e. the best candidate for $Q(z_i)$ is the conditional membership, based on the datapoint $x_i$ itself!

\begin{eqnarray}
Q(z_i) &=& \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}\\
&=& P( z_i \vert \vec{x}_i, \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)
\end{eqnarray}

So now, with this $Q$ in hand we have a tight lower bound on the log likelihood, which we can then maximize in $\mathbf{\mu}_k, \mathbf{\Sigma}_k$ (the $Q(z_i)$ in the denominator doesn't depend on those parameters, so it drops out of the maximization):

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)&\ge & \sum_i \sum_{z_i} P(z_i \vert \vec{x}_i, \phi, \mu_k ,\Sigma_k )  \log \left( \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)} \right)
\end{eqnarray}

The EM algorithm simply consists of the following two steps:

(1) Given $\phi, \mu_k,\Sigma_k$ and the data $X$, calculate the memberships $P(z_i \vert \vec{x}_i, \phi, \mu_k ,\Sigma_k )$. (E-step)
(2) Given the above, maximize the lower bound w/r/t $\phi, \mu_k,\Sigma_k$. (M-step)
(*) repeat


So, we are essentially calculating data point memberships -- the E step -- and then optimizing the log likelihood, as we always would -- the M step. The difference here is that we don't set hard memberships in our model. We let there be fractional or ``soft'' memberships in this likelihood expression.
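To make the two steps concrete, here is a minimal numpy sketch of EM for a Gaussian mixture -- my own variable names and seeding choices, not any particular library's API:

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]          # seed means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    phi = np.full(K, 1.0 / K)                        # membership probabilities
    for _ in range(n_iter):
        # E-step: soft memberships Q(z_i = k) = P(z_i = k | x_i, parameters)
        R = np.column_stack([phi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound w.r.t. phi, mu_k, Sigma_k
        Nk = R.sum(axis=0)
        phi = Nk / N
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return phi, mu, Sigma, R
\end{verbatim}

The returned $R$ matrix holds exactly the soft memberships discussed above; replacing it with a hard argmax (and freezing the covariances) is the K-means limit.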

In algorithms like K-means, which actually works on the same EM principle with some drastic assumptions, we have hard memberships, and identical, diagonal covariance matrices across all $K$ Gaussians -- or clusters. These assumptions make things super fast and scalable, but if the data has any spatial skewness or inhomogeneity, we could be in trouble. [Differing sizes of clusters are a problem too, but I haven't come to understand this yet.]

The beautiful thing about EM, and I don't have time to get into it now, is that it's guaranteed to monotonically increase the log likelihood iteration by iteration, and this is simply due to the convexity properties we talked about earlier. Local maxima can be reached, but as long as things are smartly seeded we can expect to find a good fit.

----------------------------------------------------------------------

The reason I was using all this stuff was to segment deliveries in a traveling salesman problem. It's actually called the M-traveling salesman problem, where multiple salesmen try to visit many different locations with the least amount of effort -- however that is defined. There are some restrictions on how much each salesman can carry, and perhaps how far he can go, and these constitute the restrictions in my K-means/GMM models (a toy version of the restricted assignment step is sketched below).

Oh, and by the way, those men might be Santa making multiple trips to deliver gifts at Christmas.
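Here is roughly what I mean by a restricted K-means: the assignment step is done greedily, and a cluster (a trip) stops accepting points once its total ``weight'' hits a cap. The capacity rule and the visiting order are my own toy choices, purely for illustration:

\begin{verbatim}
import numpy as np

def constrained_assign(X, centroids, weights, cap):
    K, load = len(centroids), np.zeros(len(centroids))
    labels = np.full(len(X), -1)                       # -1 = could not be placed
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # handle the points with the biggest spread between best/worst option first
    order = np.argsort(d.min(axis=1) - d.max(axis=1))
    for i in order:
        for k in np.argsort(d[i]):                     # nearest admissible cluster
            if load[k] + weights[i] <= cap:
                labels[i], load[k] = k, load[k] + weights[i]
                break
    return labels

def constrained_kmeans(X, K, weights, cap, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = constrained_assign(X, centroids, weights, cap)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
\end{verbatim}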

Tuesday, November 24, 2015

A Brief Note on the Bias-Variance Decomposition


When you train a model that is highly flexible, highly ``articulate'', on a training set, you often get a great training score -- be it AUC, accuracy, MSE, etc. But such models -- as I've talked about before -- often have trouble generalizing to ``test'' sets, or, the real world. One of the easiest ways to see this is by a simple FOIL operation on the following quantity:

\begin{eqnarray}
Y&=& f(X)+\epsilon
\end{eqnarray}

Here $X$ is a random variable -- the independent inputs -- $f(X)$ is the generative process that creates our object of interest $Y$ -- be it categorical or $\in \mathcal{R}$ -- and $\epsilon$ is drawn from some noise distribution, say a Gaussian process with zero mean, $\mathcal{N}(0,K(X,X^\prime))$. Let $g(X)$ be our chosen model for $Y$. (Normally people write $\hat{f}(X)=g(X)$ but I chose $g$ to avoid confusion.) If we take a look at the mean squared error, we get

\begin{eqnarray}
MSE &=& \langle \vert f(X)+\epsilon-g(X) \vert^2 \rangle\\
&=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle + 2 \langle \epsilon \rangle \langle f(X) \rangle - 2 \langle \epsilon \rangle \langle g(X) \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Where I've assumed the noise and our $f,g$ are uncorrelated. We see the terms that are linear in $\epsilon$ fall away and we can write:


\begin{eqnarray}
MSE &=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Adding and subtracting the mean of our model squared $\langle g(X) \rangle^2$ we get:

\begin{eqnarray}
MSE &=& \langle \left(f(X)-\langle g(X)\rangle \right)^2 \rangle +\mathrm{Var}\left(g(X) \right)+\mathrm{Var}\left(\epsilon \right)
\end{eqnarray}

So now, term by term, we see we have: the squared difference between the underlying signal and our average model -- the bias, which quantifies how much our model is ``off'' (a quantity that will be quite low for complex models and high for simple ones); the variance of the model itself, which quantifies how much our $g(X)$ changes given different training inputs (a quantity that will be high for complex models and low for simple ones); and the variance of the noise variable $\epsilon$, which is an ineluctable contribution to our error.
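A quick way to see the decomposition in action is to fit many independent, noisy training sets with a simple and a complex model and measure the two terms directly. The sin curve, noise level, and polynomial degrees below are arbitrary choices for illustration:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0, 1, 50)
sigma_eps = 0.3

for degree in (1, 9):
    preds = []
    for _ in range(500):                         # many independent training sets
        x_tr = rng.uniform(0, 1, 20)
        y_tr = f(x_tr) + rng.normal(0, sigma_eps, 20)
        preds.append(np.polyval(np.polyfit(x_tr, y_tr, degree), x_test))
    preds = np.array(preds)
    bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)   # squared bias
    var = np.mean(preds.var(axis=0))                         # model variance
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
\end{verbatim}

The degree-1 fit shows high bias and low variance; the degree-9 fit the reverse.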

Recall from last post we had a very similar bias-variance decomposition:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

The bias is the first term and the variance the second, which goes like the square root of the logarithm of the number of candidate models $K$. This decomposition illustrates the balancing act we have to play, as model builders, between simplicity and goodness of fit. Refer to these when explaining why your highly un-regularized gradient boosted decision tree -- or boosted anything -- did a poor job of generalizing! (This doesn't always happen, of course, just a word of caution.)


Tuesday, November 17, 2015

Learning Theory and an XGboost Failure on Kaggle

After a somewhat embarrassing moment on Kaggle.com, where I blatantly overfit a training set $X$ with a boosted decision tree -- XGBoost -- and then got a horrible, horrible generalization result, I thought it'd be time to consider model complexity and learning theory, and in general the debate that has been bubbling beneath the Machine Learning and Statistics community.

Recently, there has been a lot of success using highly flexible models, such as boosted regressors, classifiers, decision trees, random forests, etc. to model very complex data sets. (One should lump neural networks in here, too, since they perform highly non-trivial, non-linear transformations of inputs.) Leo Breiman wrote a very interesting paper in 2001 called "Statistical Modeling: The Two Cultures" which highlights -- and insults -- the attachment of old school statisticians to generative models, where the data observed is characterized/explained as being generated by some "process". For instance, we say that the obesity in the United States is generated by a linear superposition of many factors, say income, age, sex, etc.:

\begin{eqnarray}
\mathrm{obesity} = f(\mathbf{x}) &=& \sum_n \theta_n x_n
\end{eqnarray}

Which is of course a very stupid assumption, since we know there are non-linear interactions between these features of a person that create complex results. More flexible models can capture this, by multiplying the different features together in some polynomial. Let us, for instance, include all quadratic -- second order -- terms in our ``generative process'':

\begin{eqnarray}
f(x) &=& \theta \cdot \mathbf{x} + \mathbf{x} \cdot A \cdot \mathbf{x}
\end{eqnarray}

Where $A$ is some feature mixing matrix. There are way more parameters in our model now, and some will prove to be important, while others will not. These days, even the MOST old school statisticians -- it seems to me -- expand features as polynomials, like above, and then use tried and true methods to cut down on the model complexity, getting rid of combinations and/or linear features that are not predictive of the outcome. This is all well and good. One can use the precise framework of $p$-values and hypothesis tests, etc. to characterize fit, but Leo Breiman says that by using such methods -- and I don't pretend to have exhausted all such models he pointed a finger at, or properly explained them -- we really limit ourselves. Essentially, the gist is that if we get away from this generative framework and look more towards flexible algorithms, such as random forests and boosted decision trees, we can have greater predictive power in the real world.

If Kaggle competitions are treated as valid evidence, this is certainly true, since virtually every winner of these competitions -- outside computer vision -- has used gradient boosted decision trees or random forests, either explicitly for their solution or as part of a larger ensemble of models. The trouble is, how do we tune these models to not overfit? How do we interpret and regularize them? Constraining a linear or logistic regression to have small weights with a quadratic penalty term is essentially the same thing as imposing entropy -- some simplicity measure -- on the model (which turns out to be a Gaussian prior); but for trees and forests, how does one characterize the degrees of freedom? This is the trouble I had over the weekend, trying to fit a very complex algorithm that I didn't understand to a dataset, and then getting burned for it. The nice thing about logistic regression is that those linear weights $\theta$ can be read as importance values: in absolute value they tell you how much each feature -- which, hopefully, has been properly scaled -- contributes to a decision, or pushes towards a decision in feature space $\mathbf{x}$. (I've talked about this before.) Trees are much harder. Although the choice of split in classification problems, based on the Gini index or information gain, is useful, it is only a piece of the story. The first split of a decision tree can be viewed as the most important feature, of course, but one clever comment I read in an AI book is that pruning trees is much better than limiting their depth. What often happens, for instance with an XOR classification problem, is that you split on $x_1$, getting not-so-great results, but then the following split on $x_2$ is highly deterministic, implying that the operator $x_1 \ \mathrm{AND} \ x_2$ is a very important feature -- which would have been missed had the tree not been able to ``grow''. (A tiny illustration of this is sketched below.) Clever as this comment is when it comes to training and making practical decisions with decision trees, it also highlights the fact that feature importance in a decision tree can be LATENT. Sometimes the logical conjunction of features or splits is what's important, and that will not show up so easily to the eye when looking at a decision tree.
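To see how the XOR point plays out in code (assuming scikit-learn is available), compare a depth-1 stump with a depth-2 tree on a tiny synthetic XOR dataset:

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]              # XOR target: neither feature helps alone

for depth in (1, 2):
    clf = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    print(f"max_depth={depth}: training accuracy = {clf.score(X, y):.2f}")
\end{verbatim}

The stump sits near 50 percent accuracy, while the depth-2 tree, whose second level of splits exposes the interaction between $x_1$ and $x_2$, classifies perfectly -- the importance of each individual feature was latent.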

So, just for fun and practice with the Hoeffding, Chernoff inequalities, and the Union bound, let's take a look at some elementary learning theory (I'm lifting this directly from Andrew Ng's notes at Stanford). Say we are in a binary classification problem, where our data is a sequence of features $\mathbf{x}$ and corresponding classes $y$:

\begin{eqnarray}
D &=& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^N
\end{eqnarray}

We would like to estimate our generalization error -- which is just a specific type of loss function that we're going to treat as a conditional random variable:

\begin{eqnarray}
\epsilon(h(x)) &= & 1_{h(x) \neq y}
\end{eqnarray}

Where $1$ is the indicator function. The expectation value of the generalization error is characterized by a probability distribution, which we know nothing about:

\begin{eqnarray}
\epsilon(h(x)) &\sim & P(\epsilon(h(x)) \vert D)
\end{eqnarray}

But, we can estimate the mean of this distribution, given some samples in $D$ by using the Hoeffding bound. The greatest $\epsilon(h(x))$ can be is unity for a given $x$, and the lowest it can be is zero. This means we can write:

\begin{eqnarray}
P( \vert \epsilon(h(x)) - \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N} \vert \ge \gamma ) & \le & 2 e^{-2N\gamma^2}
\end{eqnarray}

where the nasty term in the argument of our probability distribution is just the mean of our errors, what's called the Empirical risk:

\begin{eqnarray}
\hat{\epsilon}(h(x)) &=& \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N}
\end{eqnarray}

Writing this a bit more simply, then, this is really just the standard problem of estimating the mean of a binomial distribution, for example -- but we haven't specified the PDF, only that all of the samples $x_n, y_n$ are i.i.d.!

\begin{equation}
P( \vert \epsilon(h(x)) -\hat{\epsilon}(h(x)) \vert \ge \gamma )  \le  2 e^{-2N\gamma^2}
\end{equation}

We can see that as $N \to \infty$ -- an infinite training set -- we are going to be spot on: our training error $\hat{\epsilon}$ will be precisely the same as our generalization error $\epsilon$. But, recall that when fitting models, $h(x)$ is actually parametrized by some $\theta$. So what we are really worried about is the probability that ANY of our models' train set error, $\hat{\epsilon}(h_\theta(x))$ deviate far from their generalization error. For such a thing we use the union bound:

\begin{eqnarray}
P(A \cup B \cup C \dots ) &\le & P(A)+P(B)+P(C)+\dots
\end{eqnarray}

Now, to make life simpler, let's say our model space is discrete, and we really only have a choice of $K$ $\theta$s. Then we've got:

\begin{eqnarray}
P( \cup_{k=1}^K \vert \epsilon(h_k(x))-\hat{\epsilon}(h_k(x))\vert \ge \gamma) &\le & 2K e^{-2N \gamma^2}
\end{eqnarray}

So, as we've seen before, things aren't that much worse. Let's write the probability that our training error is more than $\gamma$ different from our generalization error as $\delta$, and then we can state:

\begin{eqnarray}
\gamma &\le & \sqrt{\frac{\log(2K/\delta)}{2N}} \\
N & \ge & \frac{1}{2\gamma^2}\log \left(\frac{2K}{\delta}\right)
\end{eqnarray}

What do these two statements mean? The first, that our estimation of generalization error gets tighter as $N$ gets large. More specifically, it scales at worst like $\frac{1}{\sqrt{N}}$, just like Monte Carlo models. We also see that accuracy goes logarithmic with confidence, which is just a remnant of the usual Hoeffding/Chernoff bound statements. In the second line, we see that the requisite number of training examples, for some given confidence $\delta$ and accuracy on empirical risk $\gamma$, goes logarithmic with the degrees of freedom. A priori, this seems very nice, but let's develop things further.
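As a toy illustration of the second line, here is the sample-size bound as a one-liner in Python; the particular values of $\gamma$, $\delta$ and $K$ are arbitrary:

\begin{verbatim}
import numpy as np

def required_N(gamma, delta, K):
    # N >= log(2K/delta) / (2 gamma^2)
    return int(np.ceil(np.log(2 * K / delta) / (2 * gamma ** 2)))

for K in (10, 10 ** 6, 2.0 ** 64):        # sizes of the (discrete) model class
    print(f"K = {K:.3g}: N >= {required_N(gamma=0.05, delta=0.05, K=K)}")
\end{verbatim}

Going from ten candidate models to $2^{64}$ of them multiplies the required training set size by only a modest factor, which is the logarithmic scaling at work.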

When we're given a training set, we're not guaranteed to fit the optimal model, but we know the generalization risk of our fitted model is at most $2\gamma$ away from that of the OPTIMAL model in the class. Let's show this. First, call our fitted model $h_{\hat{\theta}}$, and the optimal one $h_{\theta^*}$.  We have, from the above Hoeffding/Chernoff inequality, for a single fitted model:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\hat{\theta}})+\gamma
\end{eqnarray}

This is just comparing generalization risk to training risk. Now let's compare to the optimal, by noting that, if we fit to the wrong model,  $\hat{\epsilon}(h_{\hat{\theta}}) \le \hat{\epsilon}(h_{\theta^*})$ -- otherwise we would have chosen something else! Then we can write:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\theta^*})+\gamma
\end{eqnarray}

and, now converting the training risk to the generalization risk on the optimal model, we get:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) & \le & \epsilon(h_{\theta^*})+2\gamma
\end{eqnarray}

This is a very nice inequality, because what it's basically saying is that our generalization risk is at most that of the optimal fit, plus some accuracy term:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

With highly complex models, the first term on the right hand side will be quite small -- low bias -- since we can fit our data almost exactly. But conversely, in such a case $K$ becomes large and we have no good hold on how our model will extend into the real world -- high variance. How bad does this scaling of complexity go? A rough argument from Andrew Ng's notes is that for floating point numbers, for $D$ parameters, we have approximately:

\begin{eqnarray}
K &=& 2^{64D}
\end{eqnarray}

and so our inequality goes like:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2D\log \left( \frac{2^{65}}{\delta} \right)}{N}}
\end{eqnarray}

and so the requisite number of training examples -- for a given accuracy $\gamma$ and confidence $\delta $ -- goes linearly with parameter dimension.

But this is not entirely correct. One can use something called the Vapnik-Chervonenkis dimension, which I won't get into here, maybe in a later post, to characterize the complexity of a model. This is all a very active part of research, and it obviously is quite difficult to get a handle on how well an algorithmic solution will generalize, once trained on a training set.

For instance, what is $K$ for a gradient boosted decision tree? How does one characterize its Vapnik-Chervonenkis dimension? This question is very important if you want to fit such models to data, rather than being a blockhead and just using randomized grid search for hyper-parameters in Python.  (WHICH, is exactly what I did over the weekend...)

Wednesday, November 11, 2015

Cumulants of the Sample Mean

Let a random variable $X$ be drawn from some probability distribution, $X \sim P(x)$. Then, the sum of $N$ i.i.d. samples, or realizations is drawn from:

\begin{eqnarray}
P(s_N=\sum_{n=1}^N x_n) &=& (P*P*P\cdots P)(s_N)\\
&=& \int dk e^{iks_N}\phi(k)^N = \int dk e^{iks_N+N\log \phi(k)}\\
&=& \int dk e^{iks_N+N\psi(k)}
\end{eqnarray}

This means that the equivalent cumulants for the sample mean -- which is simply $\frac{s_N}{N}$ -- are:

\begin{eqnarray}
\bar{x} &=& \sum_n^N \frac{x_n}{N} \\
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^k}\langle \left(\sum_n x_n \right)^k \rangle_c
\end{eqnarray}
Expanding out the term inside the brackets using the multinomial theorem, we get:

\begin{eqnarray}
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^k}\langle \sum_{\vec{k}} \frac{k!}{k_1!k_2! \cdots k_N!} \prod_{n=1}^N x_n^{k_n}\rangle_c \\
&=& \frac{1}{N^k} \sum_{\vec{k}} \left(\begin{array}{c}
k \\
k_1 \cdots k_N
\end{array}\right) \langle  \prod_{n=1}^N x_n^{k_n} \rangle_c
\end{eqnarray}

Any dependence between the samples $x_n$ makes this expression nontrivial, but if we assume i.i.d. samples we just get:

\begin{eqnarray}
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^{k-1}} \langle x^k \rangle_c
\end{eqnarray}

Where the term in the brackets on the RHS is just the cumulant of a single realization of our random variable $x$. Why is this important? Because it tells us that the variance, which is the $k=2$ cumulant, scales as $\frac{1}{N}$, and the skewness, which is defined as:

\begin{eqnarray}
\mathrm{skew}&=&\frac{\langle \bar{x}^3 \rangle_c}{\left(\langle \bar{x}^2 \rangle_c \right)^{3/2}} \sim \frac{1}{\sqrt{N}}
\end{eqnarray}

scales as one over square root $N$. This means that our distribution on the mean collapses to a Gaussian AT LEAST as fast as $\frac{1}{\sqrt{N}}$, given some initial asymmetry in the distribution of $x$. (If we have symmetry, things are even better and we need only worry about the kurtosis, which goes like $\frac{1}{N}$. )
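A quick Monte Carlo check of that $\frac{1}{\sqrt{N}}$ claim, using the exponential distribution (skewness 2) as a conveniently asymmetric example -- sample sizes and draw counts are arbitrary:

\begin{verbatim}
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
for N in (10, 100, 400):
    # 20000 independent sample means, each over N exponential draws
    means = rng.exponential(scale=1.0, size=(20000, N)).mean(axis=1)
    print(f"N = {N:4d}: skewness = {skew(means):.3f}, "
          f"2/sqrt(N) = {2 / np.sqrt(N):.3f}")
\end{verbatim}

The measured skewness of the sample mean tracks $2/\sqrt{N}$ as expected.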

Such considerations are important when you ask yourself: at what point can I consider the estimator of the mean to be drawn from a Gaussian? How does my approximation scale with sample size? These are very important questions in the real world, and it's nice to have a sense of what's holding you back from being exact -- namely, that $\frac{1}{\sqrt{N}}$ for skewness, which goes along with a Hermite polynomial correction, and the $\frac{1}{N}$ for kurtosis, which goes with another.

Monday, November 2, 2015

Differential Regularizers as Priors over Functions $f(x)$: Gaussian Process Kernels as Green's Functions

Often, when we are solving a regression problem, we are given the following functional:

\begin{eqnarray}
J[f] &=& l\left[f,y\right]+\Omega \left[f \right]
\end{eqnarray}

Where $l$ is some loss functional and $\Omega$ is some regularizer. Given some finite dataset, this might look like, in the case of squared loss:

\begin{eqnarray}
X &=& \lbrace \mathbf{x}_n \rbrace_{n=1}^N \\
J &=& \sum_{n=1}^N \left( f(\mathbf{x}_n)-y_n \right)^2 + \Omega(f)
\end{eqnarray}

For a linear basis expansion model, we have the following:

\begin{eqnarray}
f(\mathbf{x}) &=& \mathbf{w}\cdot \phi(x) \\
J(\mathbf{w}) &=& \sum_{n=1}^N \left(  \mathbf{w}\cdot \phi(x_n) -y_n \right)^2 + \lambda \vert \mathbf{w} \vert^2
\end{eqnarray}

where $\lambda \vert \mathbf{w} \vert^2$ plays the role of a prior over functions. The cost function in this example is proportional to the negative log posterior; what we have is essentially:

\begin{eqnarray}
-\log P(\mathbf{w} \vert X, \mathbf{y}, \phi) & \sim & J(\mathbf{w}) \\
-\log \mathcal{L}(X, \mathbf{y} \vert \mathbf{w}, \phi) & \sim & \sum_{n=1}^N \left(  \mathbf{w}\cdot \phi(x_n) -y_n \right)^2 \\
-\log P(\mathbf{w} \vert \phi) & \sim &  \Omega(f) = \lambda \vert \mathbf{w} \vert^2
\end{eqnarray}

(So that's a Gaussian likelihood and a Gaussian prior). Minimizing the cost with respect to $\mathbf{w}$ is the same thing as finding the mode of the posterior, or Maximum a Posteriori (MAP). We've already talked about how, for such a regularized regression problem, we can write the solution as a linear combination of kernels, centered at the data:

\begin{eqnarray}
f(\mathbf{x}) &=& \sum_{n=1}^N \alpha_n K(\mathbf{x},\mathbf{x}_n )
\end{eqnarray}

a manifestation of the representer theorem. But one important question to ask, given some regularization functional, is, what's the ``best'' Kernel? Let's take for example the regularizer:

\begin{eqnarray}
J &=& \int dx \left(f(x)-y \right)^2 + \lambda \int dx \vert\frac{\partial^2 f(x)}{\partial x^2}\vert^2
\end{eqnarray}

This $\Omega$ functional penalizes curvature in our fitting function $f(x)$, and we can note that what such a regularizer really is, is a prior over functions, since:

\begin{eqnarray}
P(f  \vert f(X), \mathbf{y} ) &=&\frac{P(X,y \vert f) P(f)}{P(X, \mathbf{y})} \\
&=& \frac{\mathrm{exp}\left[-\int dx \delta^D(y-f(X)) \left(f(x)-y \right)^2 - \lambda \int dx \vert\frac{\partial^2 f(x)}{\partial x^2}\vert^2 \right] }{P(X, \mathbf{y})}
\end{eqnarray}

We see that the prior on our function is:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\lambda \int \vert f^{\prime \prime}(x)\vert^2 dx \right)
\end{eqnarray}

where $\lambda$ controls the "strength" of our penalty for curvature. To be more general, we could have written the prior on our function as a superposition of differential operators:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\int \sum_{m=1}^\infty a_m \vert \frac{\partial^m f(x)}{\partial x^m}\vert^2 dx \right)
\end{eqnarray}

If we integrate by parts, we note that this prior functional can be put into the form:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\int dx dx^\prime f(x) K(x,x^\prime)^{-1} f(x^\prime) \right)
\end{eqnarray}

Which of course gives us the familiar prior assumptions that $f(x)$ has mean and covariance:

\begin{eqnarray}
\langle f(x) \rangle &=& 0 \\
\mathrm{Var}\left(f(x) \right) &=& K(x,x^\prime)
\end{eqnarray}

But, for a given differential regularizer, how do we find the associated Kernel? The answer is simple, it's just the Green's function of the operator:

\begin{eqnarray}
\hat{L} &=& \sum_m a_m \frac{\partial^m}{\partial x^m}\\
\hat{L}K &=& \sum_m a_m \frac{\partial^m}{\partial x^m} K(x,x^\prime) = \delta(x-x^\prime)
\end{eqnarray}

An easy way to get the Green's function -- or in this case kernel -- is to Fourier transform. We can re-write our prior in $s$ space:

\begin{eqnarray}
f(s) &=& \int dx e^{-isx}f(x) \\
P[f] &\sim & \mathrm{exp}\left(-\int ds\sum_{m=1}^\infty a_m \vert\mathbf{s}\cdot \mathbf{s}\vert^m\vert f(s)\vert^2 \right)
\end{eqnarray}

We see now the Fourier transform of our inverse kernel is:

\begin{eqnarray}
\frac{1}{K(s,s^\prime)} &=& \sum_m a_m (-1)^m \vert \mathbf{s}\cdot \mathbf{s} \vert^m \delta^D(\mathbf{s}+\mathbf{s}^\prime)
\end{eqnarray}

We see that the kernel is diagonal in $s$ space and semi-positive definite. Which means that we are translationally invariant in $x$ space. We have:

\begin{eqnarray}
K(x,x^\prime) &=& \int ds ds^\prime e^{isx+is^\prime x^\prime} \frac{1}{\sum_m a_m (-1)^m \vert \mathbf{s}\cdot \mathbf{s}^\prime \vert^m}\delta^D(\mathbf{s}+\mathbf{s}^\prime)\\
K(x-x^\prime) &=& \int ds  e^{is(x-x^\prime)} \frac{1}{\sum_m a_m \vert \mathbf{s}\cdot \mathbf{s} \vert^m}
\end{eqnarray}

We see that, indeed, our Kernel will be translationally invariant, and when we put a prior over functions, what is essentially a Lagrangian in physics:

\begin{eqnarray}
\Omega[f] \sim L &=& \int dx f(x) \left(\sum_m a_m \frac{\partial^m}{\partial x^m}\right) f(x)
\end{eqnarray}

We find that the kernel is just the correlation function -- or, the propagator -- for the resulting stochastic process. One example is the massive free particle in higher dimensions -- or the free process in higher dimensional feature space -- for which we get the Yukawa potential:

\begin{eqnarray}
\Omega[f] \sim L &=& \int dx \frac{m^2}{2}f(x)^2+\frac{1}{2}\vert \nabla f(x) \vert^2 \\
K(x-x^\prime) &\sim & \frac{1}{\vert x - x^\prime \vert }e^{-\vert x-x^\prime \vert m}
\end{eqnarray}

So, the Kernel we use when interpolating some field, or, specifying the prior of our Gaussian process, can be viewed as the Green's function to our penalizing "regularizer". If we want smooth functions up to order $m=M$, we know precisely how to formulate the associated $K(x-x^\prime)$. Note that all differential regularizers, such as discussed here will lead to stationary kernels and thus stationary processes.
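As a numerical sanity check of the Green's function picture, take the one-dimensional analogue of the massive free particle, $\hat{L} = m^2 - \partial^2/\partial x^2$, whose Green's function is the exponential (Ornstein-Uhlenbeck) kernel $e^{-m\vert x - x^\prime \vert}/(2m)$; the grid and the value of $m$ below are arbitrary:

\begin{verbatim}
import numpy as np

m = 2.0
s = np.linspace(-1000, 1000, 2_000_001)          # frequency grid
ds = s[1] - s[0]
for r in (0.1, 0.5, 1.0, 2.0):
    # K(r) = (1/2pi) * integral ds e^{isr} / (m^2 + s^2); kernel is even, so cos
    K_num = np.sum(np.cos(s * r) / (m ** 2 + s ** 2)) * ds / (2 * np.pi)
    K_exact = np.exp(-m * r) / (2 * m)
    print(f"r = {r}: numeric {K_num:.5f}, exact e^(-mr)/(2m) = {K_exact:.5f}")
\end{verbatim}

The two columns agree to several decimal places, i.e. inverting the differential regularizer in Fourier space really does hand back the covariance kernel of the process.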

Tuesday, October 27, 2015

Dual Lagrangians and ordinary Least Squares

The ordinary least squares problem can be set up as just a norm minimization problem:

\begin{eqnarray}
\mathrm{minimize} \ \ \frac{\vert \vert Ax-t \vert \vert^2}{2}+l \frac{x^T x}{2}
\end{eqnarray}

Where $x$ is some vector that lives in $D$-dimensional ``feature'' space, and the matrix $A$ is $N \times D$ -- essentially the design matrix $\Phi(X)$ common in the Machine learning literature. We can write down the ``primal objective'' of this problem by writing down the ``Lagrangian'':

\begin{eqnarray}
L(x)&=& \frac{(Ax-t)^T (Ax-t)}{2}+l \frac{x^T x}{2} \ \ \ \mathrm{or,}\\
L(z,x,\nu) &=& \frac{z^T z}{2}+\nu^T \left(Ax-t-z \right)+l \frac{x^Tx}{2}
\end{eqnarray}

Where $\nu$ is a vector-valued Lagrange multiplier, that \textbf{imposes} our constraint $z=Ax-t$. This reformulation of the problem is just one of the many simple tricks you can do using Lagrange multipliers. The canonical form of such a Lagrangian in optimization theory is the following:
\begin{eqnarray}
L(x,\lambda,\nu)&=& f_0(x)+ \lambda_i f_i(x)+\nu_j h_j(x)
\end{eqnarray}

where Einstein summation convention is used. Pardon the confusion of $\lambda$'s: normally this Greek letter is used for regularization, but in this problem I am going to use $l$ as the regularization parameter and $\lambda$ as the Lagrange multiplier that imposes some inequality constraint. We require, for $x$ to be a feasible solution to the problem:

\begin{eqnarray}
h_i(x) &=& 0 \ \ \forall \ i \\
f_i(x) &\le & 0 \ \ \forall \ i
\end{eqnarray}

Notice that in this example we have no inequality constraints, so $\lambda$ will be irrelevant. If we were to minimize this ordinary least squares problem directly -- analytically solving what's called the primal objective -- we get:

\begin{eqnarray}
\partial_x L(x) &=& A^T(Ax-t)+l x =0 \\
x_\mathrm{min} &=& (A^TA+l)^{-1} A^T t
\end{eqnarray}

We see that the solution involves the inversion of a $D \times D$ matrix -- quite difficult if the length of our feature vector is large. This is what's called the ordinary least squares solution. We have let $l$ be non-zero in order to create something called a ``ridge'', which smooths out our solution.

Now, we can define the dual lagrangian as:

\begin{eqnarray}
g(\lambda, \nu)&=& \mathrm{min}_{x,z}\left(L(x,z,\lambda,\nu) \right)
\end{eqnarray}

which we can find by taking the proper derivatives and plugging in:

\begin{eqnarray}
\partial_x L(x,z,\lambda,\nu) &=& \nu^T A + l x^T \\
\Rightarrow x&=& -\frac{A^T \nu}{l}\\
\partial_z L(x,z,\lambda,\nu) &=& z^T-\nu^T \\
\Rightarrow z &=& \nu
\end{eqnarray}

So we get

\begin{eqnarray}
g(\lambda, \nu) &=& \frac{\nu^T \nu}{2}+\nu^T \left(-AA^T \nu/l - t - \nu \right) + \frac{\nu^T A A^T \nu}{2l}\\
&=& -\frac{1}{2}\nu^T \left(\frac{AA^T}{l}+1\right)\nu-\nu^Tt
\end{eqnarray}

Now, we define $\mathrm{max}(g)=d^*$ as the maximum of the dual, which, is in general less than the minimum of the primal, $\mathrm{min}(L)=p^*$. When $d^*=p^*$, we have what's called strong duality. The conditions for this to be the case are called the KKT conditions, and boil down to essentially all derivatives being equal to zero at the solution:

\begin{eqnarray}
\partial_{x,z,\lambda,\nu}L=0
\end{eqnarray}

and all lagrange multiplier terms being equal to zero evaluated at the solution. If we maximize our dual, we find:

\begin{eqnarray}
\partial_\nu g &=& - \left(\frac{AA^T}{l}+1 \right)\nu-t=0 \\
\Rightarrow \nu &=& -\left(AA^T/l+1 \right)^{-1} t
\end{eqnarray}

and, using our expression before, we find the dual $x$ is:

\begin{eqnarray}
x &=& -\frac{A^T \nu}{l} = A^T \left(AA^T+l \right)^{-1}t
\end{eqnarray}

These two solutions are equivalent if $d^*=p^*$, and in this case, they are! Note that the second solution involves the inversion of an $N \times N$ matrix, and may be a much better route if the number of features outweighs the number of data points, $N << D$. This dual solution is precisely the kernel ridge regression solution, because $AA^T$ is our kernel ``Gram Matrix''. The model becomes:

\begin{eqnarray}
\mathrm{model} &=& Ax = AA^T \left(AA^T+l \right)^{-1}t\\
AA^T &=& K \\
\mathrm{model} &=& Ax = K \left(K+l \right)^{-1}t
\end{eqnarray}
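A quick numpy check that the primal and dual routes give the same weights (an illustrative sketch; sizes and the ridge parameter are arbitrary):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, D, l = 50, 200, 0.5                                # more features than data points
A = rng.normal(size=(N, D))
t = rng.normal(size=N)

x_primal = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)   # D x D solve
x_dual = A.T @ np.linalg.solve(A @ A.T + l * np.eye(N), t)     # N x N solve
print(np.allclose(x_primal, x_dual))                           # True: strong duality
\end{verbatim}

With $N=50$ and $D=200$ the dual route only ever factors a $50 \times 50$ matrix, which is the whole appeal of the kernel form.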

Friday, October 23, 2015

Gaussian Process and Defining a Stochastic Process (Connection with Rasmussen)

About a week ago I posted about how to define a stochastic process, and the conclusion was, if we wanted to write some function:

\begin{eqnarray}
f(x) \sim P[f(x)] &=& \frac{1}{Z}e^{-\int dx dx^\prime f(x) K^{-1}(x,x^\prime)f(x^\prime)/2}
\end{eqnarray}

with mean zero and covariance $\mathrm{Var}[f(x)]=K(x,x^\prime)$, we can decompose the kernel into basis functions through Mercer's theorem and write

\begin{eqnarray}
K(x,x^\prime) &=& \sum_n \lambda_n \phi_n(x) \phi_{n}(x^\prime) \\
f(x) &=& \sum_m \phi_m(x) \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m)
\end{eqnarray}

This is precisely the basis function regression we see so often in Rasmussen:

\begin{eqnarray}
f(x) &=& \vec{\mathbf{w}}\cdot \vec{\phi}(x) \\
\vec{\mathbf{w}}_m &=& \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m)\\
\mathrm{Var}(ww^T) &=& 2 \sqrt{\lambda_m \lambda_n} \langle  \mathrm{erf}^{-1}(X_m)  \mathrm{erf}^{-1}(X_n)\rangle
\end{eqnarray}

Since $\sqrt{2}\,\mathrm{erf}^{-1}$ of a uniform variable on $[-1,1]$ is a unit normal, we have $\langle \mathrm{erf}^{-1}(X_m)  \mathrm{erf}^{-1}(X_n)\rangle = \frac{1}{2}\delta^K_{mn}$ -- a Kronecker delta in $n,m$ -- and

\begin{eqnarray}
\mathrm{Var}(w_m w_n)  &=& \lambda_m \delta^K_{mn}
\end{eqnarray}

(That's where the mysterious factor of two goes.) The basic point comes across. We can define an arbitrary Gaussian process with arbitrary kernel by expanding in basis functions $\phi(x)$, much like we did when moving from the univariate Gaussian to the multivariate Gaussian, with non-trivial covariance between the components of our random vector.
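On a finite grid this construction is easy to check numerically: eigendecompose the Gram matrix of a squared exponential kernel, build the weights as $w_m = \sqrt{2\lambda_m}\,\mathrm{erf}^{-1}(X_m)$ with $X_m$ uniform on $[-1,1]$, and compare the empirical covariance of many draws to $K$ itself. Grid, length scale, and draw count below are arbitrary:

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))   # squared exponential
lam, phi = np.linalg.eigh(K)                 # K = phi diag(lam) phi^T
lam = np.clip(lam, 0, None)                  # clip tiny negative numerical eigenvalues

draws = []
for _ in range(5000):
    X_m = rng.uniform(-1, 1, size=len(x))    # one uniform draw per basis function
    w = np.sqrt(2 * lam) * erfinv(X_m)       # w_m ~ N(0, lambda_m)
    draws.append(phi @ w)                    # f(x) = sum_m w_m phi_m(x)
draws = np.array(draws)
print(np.abs(np.cov(draws.T) - K).max())     # small: empirical covariance matches K
\end{verbatim}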

In the case of the squared exponential kernel, the set of basis functions is formally infinite, but ultimately this doesn't matter for the conditional distribution,

\begin{eqnarray}
f(x^*) \vert X &=& \sum_{i} \alpha_i K(x^*, X_i)
\end{eqnarray}

which is a piece of wisdom from the representer theorem. For any regularized regression problem with quadratic cost -- which, is the same thing as putting a prior on the weights $\vec{w}$ in our model and maximizing the Gaussian posterior -- we can represent our smooth solution completely in terms of the input data $X$.

This may have a simple connection, in the process view, to the ``family of functions'' picture: when we observe our stochastic process at discrete inputs $X$ with values $\vec{f}$, we are really hitting our original PDF with a series of Dirac delta functions.


Wednesday, October 14, 2015

Chernoff Bound, Union Bound, Political Polling and Base rate estimation


Let's say we are trying to estimate the base rates of some classes $C=\lbrace 1,2,3 \dots K \rbrace$. After polling $N$ people, each of whom will elect one of these classes, how confident are we in our results, $P(c)$? The first step to answering this question is to simplify the problem and treat it as a sequence of binomials.

Let us at first restrict ourselves to two classes. $c=1$ means ``no'' or ``Democrat'' and $c=0$ means ``yes'', or ``Republican''. If we use a Bernoulli random variable $X=0,1$ to represent a single ``vote'' we get

\begin{eqnarray}
P(X \vert p ) &=& p^{I_{x=1}}(1-p)^{I_{x \ne 1}}\\
q &=& 1-p
\end{eqnarray}

Where $I$ is the indicator function. The expectation of this distribution is:

\begin{eqnarray}
\langle X \rangle &=& p \\
\langle X^2 \rangle_c = \mathrm{Var}(X)&=& pq
\end{eqnarray}

Given these first two cumulants, we can write the Markov bound:

\begin{eqnarray}
P(X > a) & \le & \frac{\langle X \rangle}{a}\\
& \le & \frac{p}{a}
\end{eqnarray}

and the improved Chebyshev bound:

\begin{eqnarray}
P( \vert X-p \vert \ge m\sqrt{pq}) & \le & \frac{pq}{m^2pq}=\frac{1}{m^2}
\end{eqnarray}

Similarly, for a binomial random variable, a sequence of votes:

\begin{eqnarray}
X_N &=& \sum_{i=1}^N X_i \\
X_N & \sim & \left(\begin{array}{c}
N \\
X_N
\end{array}\right) p^{X_N}(1-p)^{N-X_N}
\end{eqnarray}

we get expectation and variance

\begin{eqnarray}
\langle X_N \rangle &=& Np \\
\langle X_N^2 \rangle_c = \mathrm{Var}(X)&=& Npq
\end{eqnarray}

So our bound becomes

\begin{eqnarray}
P(\vert X_N-Np \vert \ge m\sqrt{Npq}) \le \frac{1}{m^2}
\end{eqnarray}

But that seems repetitive. Let's define a new statistic, the estimator of the mean:

\begin{eqnarray}
\hat{\mu}_N &=& \frac{X_N}{N} = \frac{\sum_{i=1}^N X_i}{N} \\
\langle \hat{\mu}_N \rangle &=& p \\
\langle \hat{\mu}_N^2 \rangle_c &=& \frac{pq}{N}
\end{eqnarray}

Now we have an inequality that scales with the number of votes, or the survey size, $N$:

\begin{eqnarray}
P(\vert \frac{X_N}{N}-p \vert \ge a ) \le \frac{pq}{Na^2}
\end{eqnarray}

More generally the estimator of any mean, can be written

\begin{eqnarray}
P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon) &\le & \frac{\sigma^2}{N \epsilon^2}
\end{eqnarray}

So, we can now make a few helpful statements. Our confidence that the estimator of the mean is within plus or minus $\epsilon$ of the true value is precisely $1-P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon)$, for a ``survey'' of $N$ participants. So let's see how accuracy $\epsilon$ scales with survey size and confidence. Writing confidence as:

\begin{eqnarray}
\mathrm{confidence}=1-\delta &=& 1-P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon)\\
\delta &\le & \frac{\sigma^2}{N\epsilon^2}\\
\epsilon &\le & \frac{\sigma}{\sqrt{N \delta}}
\end{eqnarray}

So we see that accuracy scales like one over the square root of survey size, and that if we want to be twice as confident with a given accuracy, we'd better poll twice as many people. Sometimes in the real world, this implies a cost -- at least in terms of time! -- and so we'd like to do even better, if at all possible.

It turns out we can do much, much better by using something called the Chernoff Bound, which assumes not only pair-wise independence between our random votes $X_i$ -- as we assumed by writing $\mathrm{Var}(\frac{X_N}{N})=\frac{\sigma^2}{N}$ -- but independence across all the variables, for all possible combinations.

How do we do this? Re-write the Markov bound with some free parameter:

\begin{eqnarray}
P(x \ge a ) &=& P(e^{sx} \ge e^{as}) \le \frac{\langle e^{sx}\rangle }{e^{sa}}
\end{eqnarray}

We see that the numerator is simply the moment generating function, evaluated at $s$. If we want to swap the inequality, we just send $s \to -s$:
\begin{eqnarray}
P(x \le a ) &=& P(e^{-sx} \ge e^{-as}) \le \frac{\langle e^{-sx} \rangle}{e^{-sa}}
\end{eqnarray}

For our Bernoulli distribution on a single ``vote'', we have:

\begin{eqnarray}
M(s)=\langle e^{sX_1} \rangle &=& \sum_{X} e^{sX} P(X) \\
&=& e^sp+1-p \\
M(s)&=& p(e^s-1)+1
\end{eqnarray}

Now, if we write down the moment generating function of the sum, we find, if the $X_i$'s are all independent:
\begin{eqnarray}
\langle e^{s X_N} \rangle &=& \langle e^{s \sum_{i} X_i} \rangle \\
&=& \langle \prod_{i=1}^N e^{s X_i} \rangle \\
M_N(s) &=& M(s)^N \\
&=& \left(1+p(e^{s}-1)\right)^N
\end{eqnarray}

Recall that we want to bound the probability of our random variable $X_N$ being outside some range, so we want to minimize the right hand side of this equation:

\begin{eqnarray}
P(X_N \ge a) &\le & M_N(s)e^{-sa}
\end{eqnarray}

Minimizing this equation is the same as minimizing the log, since both sides are positive definite:

\begin{eqnarray}
\log P(X_N \ge a) &\le & N \log \left(1+p(e^{s}-1)\right)-sa
\end{eqnarray}

Taking the derivative with respect to $s$, setting things equal to zero, and approximating for small $p$, we get:

\begin{eqnarray}
s&=&\log \left(\frac{a}{Np} \right)
\end{eqnarray}

Which, for arbitrary mean, is the same thing as saying

\begin{eqnarray}
s&=&\log \left(\frac{a}{\mu} \right)
\end{eqnarray}

So now let's choose $a=\mu(1+\theta)$, so that $\theta$ is our fractional accuracy relative to the mean and $s=\log(1+\theta)$. Putting this into our equation gives:

\begin{eqnarray}
\log P(X \ge \mu(1+\theta))  &\le & \mu\left(\theta- (1+\theta)\log (1+\theta) \right) \\
P(X \ge \mu(1+\theta)) &\le & \left(\frac{e^{\theta}}{(1+\theta)^{1+\theta}} \right)^\mu
\end{eqnarray}

But, we could have taylor expanded the logarithm above, to get:

\begin{eqnarray*}
\log P(X \ge \mu(1+\theta))  &\le & \mu\left(\theta- (1+\theta)\log (1+\theta) \right)
\le \mu( - \theta^2/3 )\\
P(X \ge \mu(1+\theta)) &\le &e^{-\theta^2 \mu/3}
\end{eqnarray*}

So this is the upper bound. What about the lower? We just send $s \to -s$ to get

\begin{eqnarray}
P(X \le \mu(1-\theta)) & \le &  e^{-\theta^2 \mu/2}
\end{eqnarray}

So the lower bound, with the same Taylor expansion, does slightly better! We can now make a double sided bound:

\begin{eqnarray}
P( \vert X-\mu \vert \ge \mu \theta ) &\le & 2e^{-\theta^2 \mu/3}
\end{eqnarray}

If we specify, once again $1-\delta =1-P( \vert X-\mu \vert \ge \mu \theta ) $ as our survey confidence, we can write
 \begin{eqnarray*}
 P( \vert X_N-pN \vert \ge Np \theta ) &\le & 2e^{-\theta^2 Np/3}
 \end{eqnarray*}

 Let $\theta=\frac{\epsilon}{p}$

 \begin{eqnarray}
  P( \vert X_N-pN \vert \ge N\epsilon )= P( \vert \frac{X_N}{N}-p \vert \ge \epsilon )&\le& 2e^{-\frac{N\epsilon^2}{3p}}
  \end{eqnarray}

  Now for a few tricks. We want the probability of a large error to be at most $\delta$, and since $p \le 1$ we can work with the weaker, $p$-free bound:

  \begin{eqnarray}
  P( \vert \frac{X_N}{N}-p \vert \ge \epsilon ) &\le & 2e^{-\frac{N\epsilon^2}{3p}}\le 2e^{-\frac{N\epsilon^2}{3}} \le \delta \\
e^{N\epsilon^2/3}&\ge & \frac{2}{\delta} \\
N &\ge &\frac{3}{\epsilon^2} \log (\frac{2}{\delta})
 \end{eqnarray}

 This result is known as the sampling theorem. In order to constrain our underlying ``base rate'' to $p = X_N/N \pm \epsilon$ with confidence $1-\delta$, we need to poll AT LEAST $N$ people. This is an extremely useful result, and has much better scaling properties than the Chebyshev bound, which would suggest:

 \begin{eqnarray}
 N & \ge & \frac{1}{\delta}\frac{\sigma^2}{\epsilon^2}
 \end{eqnarray}

 We see that we pay just as heavily for accuracy $\epsilon$ but much more heavily for confidence $\delta$. We would like to make $\delta$ as small as possible, and so, writing $R=1/\delta$ -- large $R$ means high confidence -- we find:

 \begin{eqnarray}
\mathrm{Chebyshev:}\ \ N &\sim& \sigma^2 R \frac{1}{\epsilon^2 } \\
\mathrm{Chernoff:}\ \ N &\sim & 3\log(2R) \frac{1}{\epsilon^2 }
 \end{eqnarray}

The latter, Chernoff, is superior -- logarithmic scaling in confidence rather than linear!
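To put numbers on that, here is a small comparison of the two prescriptions for a poll that should pin the base rate down to $\pm \epsilon$ with confidence $1-\delta$; the Chebyshev line uses the worst-case Bernoulli variance $\sigma^2 \le 1/4$, and the chosen $\epsilon$ is arbitrary:

\begin{verbatim}
import numpy as np

def n_chebyshev(eps, delta, sigma2=0.25):
    return int(np.ceil(sigma2 / (delta * eps ** 2)))

def n_chernoff(eps, delta):
    return int(np.ceil(3 * np.log(2 / delta) / eps ** 2))

for delta in (0.05, 0.01, 0.001):
    print(f"delta = {delta}: Chebyshev N >= {n_chebyshev(0.03, delta)}, "
          f"Chernoff N >= {n_chernoff(0.03, delta)}")
\end{verbatim}

For loose confidence the two are comparable, but as $\delta$ shrinks the Chebyshev prescription blows up linearly in $1/\delta$ while the Chernoff one barely moves.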

Such considerations are important, because often we are interested in constraining the maximum error of many different random variables -- say, $K$ class base rates, $P(c)=p_c$ -- with the same accuracy $\epsilon$. This can be accomplished by the union bound:

\begin{eqnarray}
P( X_1\cup X_2 \cup X_3 \cdots X_N) & \le & \sum_{i=1}^N P(X_i)
\end{eqnarray}

So that, if we want to write the probability that any of our mean base rate estimates, for some ``poll'' with $K$ possibilities is incorrect by a factor greater than $\epsilon$, we can write:

\begin{eqnarray*}
P( \vert \hat{\mu}_{N_1}-p_1 \vert \ge \epsilon \cup \vert \hat{\mu}_{N_2}-p_2 \vert \ge \epsilon \cup \dots \vert \hat{\mu}_{N_K}-p_K \vert \ge \epsilon ) & \le & \sum_{c=1}^K P(\vert \hat{\mu}_{N_c}-p_c \vert \ge \epsilon)\\
\delta &\le &  \sum_{c=1}^K \delta_c
\end{eqnarray*}

Assuming the same confidence for every independent class bound, we get:

\begin{eqnarray}
\delta &\le &  \sum_{c=1}^K \delta_c \le 2Ke^{-N\epsilon^2/3}
\end{eqnarray}

which implies:

\begin{eqnarray}
N & \ge & \frac{3}{\epsilon^2}\log(\frac{2K}{\delta})
\end{eqnarray}

So to bound the maximum of $K$ different statistics, or base rates, with the same accuracy $\epsilon$ and some overall confidence $\delta$, we essentially replace $\delta$ with $\delta / K$ in the single binomial test; and, since the Chernoff bound scales logarithmically, this is not such a big deal!

Tuesday, October 13, 2015

Streaky Shooter, Stephen Curry

So, recently I was given the following problem: let's say some basketball player has an a priori probability $P(y_1=1)=0.35$ of making a three point shot. This implies the probability of missing the three-pointer is $P(y_1=0)=0.65$. (Note that $y=1$ denotes ``make'' and $y=0$ ``miss''.) Pretty good odds, for an NBA player -- I think.

But let's say that our basketball player is streaky, in the sense that $P(y_2=1 \vert y_1=1)=0.6$, or the probability of making the next shot, given that he made the previous, is sixty percent. (This implies $P(y_2=0 \vert y_1=1)=0.4$.) And conversely, the probability of making the next shot, given that he missed the previous, is a bit lower than the prior, $P(y_2=1 \vert y_1=0)=0.3$, which implies $P(y_2=0 \vert y_1=0)=0.7$. This defines something called a Markov process, in the sense that the $n^\mathrm{th}$ shot only depends on the one before it.

\begin{eqnarray}
P(y_n \vert y_{n-1},y_{n-2},\dots y_2,y_1) &=& P(y_n \vert y_{n-1})
\end{eqnarray}

To repeat an old aphorism: ``the future depends on the past only through the present''. We can represent this conditional probability density with a matrix,

\begin{eqnarray}
P(y_n \vert y_{n-1}) &=& \mathbf{W}= \left(\begin{array}{cc}
.6 & .3 \\
.4 & .7 \\
\end{array}\right)
\end{eqnarray}

and the a priori probability density with a vector:

\begin{eqnarray}
P(y_1) &=& \left(\begin{array}{c}
0.35 \\
0.65
\end{array}\right)
\end{eqnarray}

If we want to write down the joint probability density on a few shots, we can use Bayes' chain rule:

\begin{eqnarray}
P(y_3, y_2,y_1) &=& P(y_3 \vert y_2, y_1) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

which by markov property simplifies to

\begin{eqnarray}
P(y_3, y_2,y_1) &=& P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

Marginalizing over $y_2$ we see that we have a matrix multiplication:

\begin{eqnarray}
\sum_{y_2} P(y_3, y_2,y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)\\
P(y_3,y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

and, dividing through by $P(y_1)$ is the same as conditioning on $y_1$, so we have a new conditional probability, which connects the third shot to the first:

\begin{eqnarray}
P(y_3 \vert y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1)
\end{eqnarray}

This is often called the Chapman-Kolmogorov equation, and is the basis for much analysis in physics of the rate of change of probability distributions through the master equation and Fokker-Planck equation.

It's pretty easy to see that any conditional distribution can now be written as $n-1$ matrix contractions:

\begin{eqnarray}
P(y_n \vert y_1) &=& (\mathbf{W})^{n-1}
\end{eqnarray}

If we multiply by the vector $P(y_1)$, we get the distribution on $y_n$ alone, since

\begin{eqnarray}
P(y_2) &=& \sum_{y_1} P(y_2 \vert y_1)P(y_1)\\
&=& \mathbf{W}\cdot \vec{\mathbf{P}}_1 \\
P(y_n) &=& (\mathbf{W})^{n-1} \cdot \vec{\mathbf{P}}_1
\end{eqnarray}

But Markov matrices, since their columns add to unity and have strictly non-negative entries, have some very useful properties (the Perron-Frobenius theorem). Their spectra -- or eigenvalues -- all lie within or on the unit circle in the complex plane, and there is guaranteed to be some eigenvector $\mathbf{\pi}$ with an eigenvalue $\lambda=1$. This eigenvector is called the stationary distribution, and after $n \to \infty$ time steps, we expect our distribution to converge to $\mathbf{\pi}$. For the matrix given above, this vector is

\begin{eqnarray}
\mathbf{\pi} &=& \left(\begin{array}{c}
0.428 \\
0.572
\end{array}\right)
\end{eqnarray}

So after sufficient shots, our shooter will converge to a 42.8 percent success rate, as compared to the a priori 35 percent -- or whatever you specified. This discrepancy is important, as the expected number of buckets given $m$ shots does not go like a binomial distribution

\begin{eqnarray}
\langle \sum_i y_i \rangle &=& m p_0 \\
p_0 &=& 0.35
\end{eqnarray}

but something slightly different. More like a binomial process with changing parameter $p$. As $m$ gets large we expect

\begin{eqnarray}
\langle \sum_i y_i \rangle &\approx & m \pi_0 \\
\pi_0 &=& 0.428 = \frac{3}{7}
\end{eqnarray}
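A small numpy check of these numbers (the simulation length is arbitrary):

\begin{verbatim}
import numpy as np

W = np.array([[0.6, 0.3],            # columns: previous shot made / missed
              [0.4, 0.7]])

# stationary distribution: the eigenvector of W with eigenvalue 1
vals, vecs = np.linalg.eig(W)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print(pi)                            # ~ [0.4286, 0.5714], i.e. [3/7, 4/7]

# simulate a long streaky sequence and count the makes
rng = np.random.default_rng(0)
m, made, state = 100000, 0, rng.random() < 0.35   # first shot from the prior
for _ in range(m):
    state = rng.random() < (0.6 if state else 0.3)
    made += state
print(made / m)                      # converges towards 3/7, not 0.35
\end{verbatim}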

Thursday, October 8, 2015

Brief Note on Information Gain Definition

After being confused for quite some time, I've realized that information gain and mutual information are essentially the same thing, even though the information gain formula doesn't look symmetric at first glance. Mutual information as an intuitive measure of correlation can be displayed by noting that for a bivariate normal distribution, the mutual information between the two variables is:

\begin{eqnarray}
I(x,y) &=& IG(x \vert y) \\
&=& H(x)-H(x \vert y) \\
&\sim &  -\frac{1}{2}\log(1-\rho^2)
\end{eqnarray}



Where $\rho$ is the Pearson correlation, whose magnitude varies between zero and one. So, when choosing to split on variables in a decision tree, or designing an experiment that probes some underlying model space, we obviously want to choose pairs $x,y$ that have $\vert \rho \vert \to 1$, in order to yield as much information as possible.
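A rough numerical sanity check of $I(x,y) = -\frac{1}{2}\log(1-\rho^2)$, using a simple two-dimensional histogram (``plug-in'') estimate of the mutual information -- the bin count and sample size are arbitrary choices:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
for rho in (0.3, 0.6, 0.9):
    xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200000)
    pxy, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=60)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                  # avoid log(0) on empty bins
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    print(f"rho = {rho}: histogram MI = {mi:.3f}, "
          f"-0.5*log(1-rho^2) = {-0.5 * np.log(1 - rho ** 2):.3f}")
\end{verbatim}

The histogram estimate carries a small positive bias from binning, but tracks the analytic formula closely.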

Note On "Information Gain" and the Fisher Information

From last post, http://rspeare.blogspot.com/2015/08/the-fisher-matrix-and-volume-collapse_31.html, I was talking about "volume" collapse in parameter space due to some data, $\vec{x}$. I'd like to relate this to information gain, which can be defined pretty simply as:

\begin{eqnarray}
H[p(\vec{\theta})] - H[p(\vec{\theta} \vert \vec{x})] &=& IG(\vec{\theta} \vert \vec{x})
\end{eqnarray}

Now, using Bayes' rule we can change what we've written in the second term above:

\begin{eqnarray}
H[p(\vec{\theta})] - H[\mathcal{L}(\vec{x} \vert \vec{\theta}) p(\vec{\theta})] &=& IG(\vec{\theta} \vert \vec{x})
\end{eqnarray}

And using the addition property of entropy, we can write:

\begin{eqnarray}
IG(\vec{\theta} \vert \vec{x}) &=& - H[\mathcal{L}(\vec{x} \vert \vec{\theta})]
\end{eqnarray}

But, with the fisher information matrix,

\begin{eqnarray}
\mathbf{F}_{ij} &=&\langle \frac{-\partial^2 \log \mathcal{L}(\mathbf{x} \vert \mathbf{\theta})}{\partial \theta_i \partial \theta_j} \rangle
\end{eqnarray}

 we can estimate the covariance of the likelihood function, and therefore its entropy -- if we use the Laplace approximation and treat the likelihood as a Gaussian in parameter space:

\begin{eqnarray}
H[\mathcal{L}] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert \mathbf{F}_{ij} \vert^{-1} \right)
\end{eqnarray}

This means that our information gain on $\vec{\theta}$ given an experiment $\vec{x}$ is proportional to the logarithm of the determinant of the Fisher matrix.

\begin{eqnarray}
IG(\vec{\theta} \vert \vec{x} ) &\sim &\log \left( \vert \mathbf{F}_{ij}\vert \right)
\end{eqnarray}

And so, we now see intuitively why this is \textbf{called} the Fisher information. Our ``volume'' collapse on the variables of interest $\vec{\theta}$ given our experiment, is:

\begin{eqnarray}
e^{IG(\mathrm{\theta} \vert \mathrm{x})} & \sim & \sqrt{\mathrm{det}\vert \mathbf{F}_{ij} \vert}
\end{eqnarray}




Wednesday, October 7, 2015

Defining Stochastic Processes

 Van Kampen says that given a single stochastic variable $X$, we can define an infinite number of other stochastic variables by transformation. For example, let $X$ be uniformly distributed over the interval $[-1,1]$. We can then define a normal variable by the following transformation:

 \begin{eqnarray}
z = f(X) &=& \sqrt{2}\sigma \mathrm{erf}^{-1}\left( X \right) + \mu \\
z &\sim &  \mathcal{N}(\mu, \sigma)
 \end{eqnarray}

 Where $z$ is now drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. We can also do this for a multivariate Gaussian distribution. Say we want to construct:

 \begin{eqnarray}
\mathbf{x} \sim \mathcal{N}(\vec{\mu}, \Sigma) &=& \frac{1}{(2\pi)^{D/2} \mathrm{det}\vert \Sigma \vert^{1/2}}e^{-(\mathbf{x}-\mathbf{\mu} )\cdot \Sigma^{-1}\cdot (\mathbf{x}-\mathbf{\mu}) /2 }
 \end{eqnarray}

 We can create this random vector by taking the following steps. First, decompose the covariance matrix $\Sigma$ as:

 \begin{eqnarray}
 \Sigma^{-1} &=& U \Lambda^{-1} U^T
 \end{eqnarray}

 Where $\Lambda$ is diagonal and $U$ is orthonormal. One sees that this is simply the eigendecomposition of our covariance matrix, which surely exists and has $\Lambda_{ii} \ge 0 \ \ \forall i$ since $\Sigma$ is semi-positive definite. Then, construct the following vector:

 \begin{eqnarray}
 \vec{\mathbf{y}}_i &=& \vec{f}_i(X) =\sqrt{2 \Lambda_{ii}}\,  \mathrm{erf}^{-1}\left(X_i \right)
 \end{eqnarray}

I give the uniformly distributed random variable $X$ a subscript $i$ to denote independent draws. Then take, linear combinations of the vector $\vec{y}$ and add the mean:

 \begin{eqnarray}
 \vec{x} &=& U \cdot \vec{y}+\vec{\mu}
 \end{eqnarray}

 So we have simply:

 \begin{eqnarray}
\vec{x}_i &=&  U_{ij} \sqrt{2 \Lambda_{jj}}\, \mathrm{erf}^{-1}\left(X_j \right)+\vec{\mu}_i \\
&=& U_{ij} f(X,j) + \vec{\mu}_i \\
&=& g(X,i)
 \end{eqnarray}

 So we see that we created our vector out of a linear combination of functions -- given by $U$ -- which is just another indexed function, $g(X,i)$.
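A numerical version of this construction (assuming scipy's erfinv; the particular $\Sigma$ and $\vec{\mu}$ are arbitrary):

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
Lam, U = np.linalg.eigh(Sigma)               # Sigma = U diag(Lam) U^T

X = rng.uniform(-1, 1, size=(100000, 2))     # independent uniform draws
y = np.sqrt(2 * Lam) * erfinv(X)             # y_i ~ N(0, Lambda_ii), independent
x = y @ U.T + mu                             # x = U y + mu

print(np.cov(x.T))                           # ~ Sigma
print(x.mean(axis=0))                        # ~ mu
\end{verbatim}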

 We can promote this discrete vector index to a continuous one, that of time $t$, and thus define a stochastic process, or a random function:

 \begin{eqnarray}
 Y_X(t) &=& f(X,t)
 \end{eqnarray}

 For every value of $t$ we have a random variable, and for every value $X=x$ we have a \textbf{realization} of the process, a function of time. Let's say we want a Gaussian random function with zero mean and covariance kernel $K$. The probability distribution for such a function is:

 \begin{eqnarray}
 P\left[\eta(t)\right] &=& \frac{1}{Z} e^{-\int dt_1 \int dt_2 \eta(t_1) K^{-1}(t_1,t_2) \eta(t_2)/2}
 \end{eqnarray}

Don't worry about the function normalization constant $Z$ for now. So far we have made no assumption on our covariance across time, encapsulated by the kernel $K(t,t^\prime)$. Let's assume ``stationarity'', in the sense that correlation only depends upon time difference $K(t,t^\prime)=K(t-t^\prime)$. Let's also make the kernel diagonal in that $\eta(t)$ is only correlated with itself at time $t$:

\begin{eqnarray}
K &=& \delta^D(t-t^\prime) \\
 P\left[\eta(t)\right] &=& \frac{1}{Z} e^{-\int dt \eta(t)^2/2}
\end{eqnarray}

The covariance of this random function $\eta(t)$ is now:

\begin{eqnarray}
\langle \eta(t) \eta(t^\prime) \rangle &=& \delta^D(t-t^\prime)
\end{eqnarray}

This is essentially the functional ``unit'' normal. Can we create such a random function $\eta(t)$ -- a stochastic process -- using only $X$? The answer is yes, and we will need Mercer's theorem to do it. Let's say we want to create:

\begin{eqnarray}
\eta(t) \sim \mathcal{N}\left( \mu(t), K(t,t^\prime) \right)
\end{eqnarray}

Then we need to decompose our kernel into eigenfunctions using Mercer's theorem:

\begin{eqnarray}
K(t,t^\prime) &=& \sum_{n=1}^\infty \lambda_n \phi_n(t)\phi_n(t^\prime)
\end{eqnarray}

where

\begin{eqnarray}
\lambda_n \phi_n(t) &=& \int dt^\prime K(t,t^\prime)\phi_n(t^\prime) \\
\int \phi_n(t) \phi_m(t) dt &=& \delta^K_{mn}
\end{eqnarray}

The $\phi_m(t)$ are a possibly infinite set of orthonormal eigenfunctions of the kernel $K$ and $\lambda_m$ their eigenvalues. We now do a very similar trick to before, but promote vectors to functions. Let

\begin{eqnarray}
y_m=f(X,m) &=& \sqrt{2 \lambda_m}\mathrm{erf}^{-1}\left( X_m \right) \\
\eta(t) &=& \sum_{m=1}^\infty \phi_m(t) y_m + \mu(t) \\
\eta(t) &=& \sum_{m=1}^\infty \phi_m(t) \sqrt{2 \lambda_m} \mathrm{erf}^{-1}\left( X_m \right)  + \mu(t)
\end{eqnarray}
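
On a computer we can only work with a finite grid of times, so here is a sketch of the discrete analogue of this Mercer construction: eigendecompose the kernel matrix evaluated on the grid, then sum eigenvectors weighted by $\sqrt{2\lambda_m}\,\mathrm{erf}^{-1}(X_m)$. The squared-exponential kernel, its length scale, and the grid are all arbitrary choices of mine, and I take $\mu(t)=0$:

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
ell = 1.0
K = np.exp(-0.5 * (t[:, None] - t[None, :])**2 / ell**2)   # squared-exponential kernel

lam, phi = np.linalg.eigh(K)              # K = phi diag(lam) phi^T (discrete Mercer)
lam = np.clip(lam, 0.0, None)             # clip tiny negative round-off eigenvalues

X = rng.uniform(-1.0, 1.0, size=lam.size) # independent uniform draws X_m
y = np.sqrt(2.0 * lam) * erfinv(X)        # y_m ~ N(0, lambda_m)
eta = phi @ y                             # one realization eta(t) ~ N(0, K) on the grid
\end{verbatim}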

This decomposition of a kernel into eigenfunctions actually has an intimate connection with quantum field theory, since the normalization $Z$ for $P[\eta(t)]$ is built from the determinant of the kernel $K$, which is the product of its eigenvalues:

\begin{eqnarray}
Z &=& \sqrt{\prod_{m=1}^\infty 2\pi \lambda_m }
\end{eqnarray}

Such is the trick Sidney Coleman used to do quite a few path integral calculations -- or at least as I understand from a friend! 

Tuesday, September 8, 2015

Collaborative Filtering as AutoEncoding

After doing some reading it seems Collaborative Filtering, or the process of minimizing the distance $Y_{ij}-\mathbf{\Theta}_{jl}\mathbf{X}_{il}$ by alternately fitting for $\mathbf{\Theta}_{jl}$ and $\mathbf{X}_{il}$, is the same as sparse autoencoding, or using a restricted Boltzmann machine. I will have more to say about this, in particular regarding Lasso -- L1 regularization -- regression and non-negative matrix factorization.

Second Order Collaborative Filtering: Playing with latent feature dimension

So, after playing around with things, I find -- unsurprisingly, since the winner of the Netflix contest used a very high-dimensional representation of feature vectors $\vec{x} \in \mathcal{R}^D$, $D \approx 1000$ -- that increasing the dimension of the feature vectors improves training fit substantially. Even with a fairly high regularization parameter of $\lambda = 1$ from the last post, I get the following results for $D=200$:

As you can see, we get a much tighter regression fit on the given ratings matrix, $Y_{ij}$ at the cost of extra computation. Inverting the Hessian of the Cost function -- which turns out to be only $D \times D$, thank goodness, due to diagonality in other degrees of freedom -- takes a great deal of time for high dimension $D$, so we are left with a trade off between goodness of fit and computation time.

This algorithm has been a second order "batch" gradient descent, taking in all the data at once. It will be interesting to see how things can be made incremental, or "online", so that data is taken in bit by bit, and our matrices $\mathbf{X}_{il}$, $\mathbf{\theta}_{jl}$ are updated bit by bit.

Saturday, September 5, 2015

Recommender Systems -- Second Order Collaborative Filtering

So I've been working with recommender systems recently, which take in some incomplete set of product ratings, represented by a matrix $Y$

\begin{eqnarray}
Y_{ij} &=& \left(\begin{array}{c}
\leftarrow \mathrm{ratings\ for\ movie\ 1} \rightarrow \\
\leftarrow \mathrm{ratings\ for\ movie\ 2} \rightarrow \\
\vdots \\
\leftarrow \mathrm{ratings\ for\ movie}\ n_m \rightarrow \end{array}\right)
\end{eqnarray}

and come up with a model to ``recommend'' similar products -- in this case we'll call them movies -- to users. This is essentially a regression problem. But we don't have the vector inputs $\vec{x}$. What's very cool about collaborative filtering is that we find we can solve the problem by guessing the regression inputs $\vec{x}$ and weights $\vec{\theta}$ for each individual and each movie, minimizing the cost function, and still get a working model! Even more astoundingly, the ``feature vectors'' $\vec{x}$ that emerge might contain some important information as to categorizing products.

Let's say we have $n_m$ movies in our database and $n_u$ users: then this $Y$ matrix is $n_m \times n_u$, and for every entry for which the $j^\mathrm{th}$ user has not rated the $i^{\mathrm{th}}$ movie, we put a question mark (?) in our matrix. Or, correspondingly, we can make a mask matrix $R_{ij}$, which has entries of 1 wherever we have a rating and 0 wherever we don't. When modeling $Y_{ij}$, we normally use a linear model, giving each movie a feature vector $\vec{x}^{(i)}$ and each individual a regression coefficient vector $\vec{\theta}^{(j)}$, both of a dimension of our choosing. (Let's call it $D$.) We have

\begin{eqnarray}
\vec{x}^{(i)} & \in & \mathcal{R}^D \ \ \forall i \\
\vec{\theta}^{(j)} & \in & \mathcal{R}^D \ \ \forall j \\
Y_{ij} &=& \vec{x}^{(i)}  \cdot \vec{\theta}^{(j)} + \epsilon_{ij}
\end{eqnarray}


And our error matrix, $\epsilon_{ij}$ is drawn from some probability distribution. This model can be more concisely -- or perhaps more clearly -- written as a matrix multiplication:

\begin{eqnarray}
Y_{ij} &=& \mathbf{X}_{il}  \mathbf{\Theta}_{jl} + \epsilon_{ij}
\end{eqnarray}

where we use Einstein summation convention above. If we assume Gaussian independent errors, our cost function -- or, the log likelihood -- becomes,

\begin{eqnarray}
2 J\left( \mathbf{X}_{il},  \mathbf{\Theta}_{jl} \right) &=& \sum_{i,j \in R_{ij}} \left[ Y_{ij} - \mathbf{X}_{il}  \mathbf{\Theta}_{jl} \right]^2
\end{eqnarray}

And our sum above only counts the errors for movies for which we have actually received ratings. If we want to regularize our model, we can put simple priors on our movie feature vectors $\vec{x}$ and individual regression vectors $\vec{\theta}$ to get:

\begin{eqnarray}
2J\left( \mathbf{X}_{il},  \mathbf{\Theta}_{jl} \right) &=& \sum_{i,j \in R_{ij}} \left[ Y_{ij} - \mathbf{X}_{il}  \mathbf{\Theta}_{jl} \right]^2+\lambda \left(\mathbf{X}_{il} \mathbf{X}_{il}  + \mathbf{\Theta}_{jl} \mathbf{\Theta}_{jl}  \right)
\end{eqnarray}

The regularization term is just the sum of two traces. Now, what's interesting about collaborative filtering is that a priori, we don't know our movie feature vectors, and we don't know what the regression weights are either -- we only choose their dimension and hope things work out OK! Since this cost function is quadratic, the Newton-Raphson method should optimize it in a ``single'' step. We can compute the derivatives of this cost function with respect to matrices in the following way:

\begin{eqnarray}
\frac{\partial J}{\partial \mathbf{X}_{i^\prime l^\prime}} &=& -\sum_{j \in R_{i^\prime j}} \left[ Y_{i^\prime j} - \mathbf{X}_{i^\prime l}  \mathbf{\Theta}_{jl} \right]  \mathbf{\Theta}_{jl^\prime}+ \lambda \mathbf{X}_{i^\prime l^\prime} \\
\frac{\partial^2 J}{\partial \mathbf{X}_{i^\prime l^\prime} \partial \mathbf{X}_{i l}} &=& \sum_{j \in R_{i^\prime j}} \mathbf{\Theta}_{jl} \mathbf{\Theta}_{jl^\prime}\delta^K_{i,i^\prime}+ \lambda  \delta^K_{l,l^\prime}\delta^K_{i,i^\prime}
\end{eqnarray}

Where $\delta^K$ is the Kronecker delta function. A similar expression pops out for partial derivatives with respect to $\mathbf{\Theta}$, but the important point to note with Newton-Raphson is that we seek to find the root of the gradient. Normally, when setting $f(x)$ to zero one writes:

\begin{eqnarray}
f(x+\delta x) &=& f(x) + \delta x \frac{\partial f}{\partial x}+O(\delta x^2)
\end{eqnarray}

and so, to drive $f$ to zero, we update according to

\begin{eqnarray}
\frac{-f(x)}{\left( \frac{\partial f}{\partial x} \right)} &=& \delta x
\end{eqnarray}

For the cost function $J$ we are doing the same thing, except with matrices:

\begin{eqnarray}
f \to f_{il} &=& \frac{\partial J}{\partial \mathbf{X}_{il}}
\end{eqnarray}

So our update takes the form of a double matrix multiplication (and inversion):

\begin{eqnarray}
\delta \mathbf{X}_{i l} &=& - \frac{\partial J}{\partial \mathbf{X}_{i^\prime l^\prime}} \left( \frac{\partial^2 J}{\partial \mathbf{X}_{i^\prime l^\prime} \partial \mathbf{X}_{i l}} \right)^{-1}
\end{eqnarray}

A similarly ugly expression for $\mathbf{\Theta}_{jl} $ applies.

Now I find that updating simultaneously the matrices $X$ and $\Theta$ according to the above second order rule causes the model to diverge, not converge, which I suppose makes sense because the gradient of $J$ with respect to $\mathbf{X}_{i l}$ changes with $\mathbf{\Theta}_{j l}$. A more clever set up to solve this problem would be to apply second order updates, alternately, to $X$ and $\Theta$ -- i.e. something that is truly collaborative. This would cause the gradients $\frac{\partial J}{\partial \mathbf{X}_{il}}$, $\frac{\partial J}{\partial \mathbf{\Theta}_{jl}}$, to alternately be set to zero, and over time, hopefully cause the model to converge to a $J$-cost minimum.
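
Here is a minimal sketch of such an alternating second-order scheme on synthetic data. Because the cost is exactly quadratic in $\mathbf{X}$ for fixed $\mathbf{\Theta}$ (and vice versa), a single Newton step is just the exact solve of the zero-gradient condition, row by row; the shapes, the mask, and $\lambda$ below are placeholders of mine, not the data from the actual experiment:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, D, lam = 40, 30, 5, 0.1

Y_true = rng.normal(size=(n_m, D)) @ rng.normal(size=(D, n_u))
R = rng.random((n_m, n_u)) < 0.5            # mask: True where a rating is observed
Y = np.where(R, Y_true, 0.0)

X = rng.normal(size=(n_m, D))               # movie feature vectors
Theta = rng.normal(size=(n_u, D))           # user regression vectors

def newton_step_X(X, Theta):
    # For each movie i, solve (Theta_i^T Theta_i + lam I) x_i = Theta_i^T y_i,
    # the zero-gradient condition for that row of X.
    for i in range(n_m):
        idx = R[i]
        A = Theta[idx].T @ Theta[idx] + lam * np.eye(D)
        X[i] = np.linalg.solve(A, Theta[idx].T @ Y[i, idx])

def newton_step_Theta(X, Theta):
    for j in range(n_u):
        idx = R[:, j]
        A = X[idx].T @ X[idx] + lam * np.eye(D)
        Theta[j] = np.linalg.solve(A, X[idx].T @ Y[idx, j])

for it in range(20):                        # alternate exact Newton steps
    newton_step_X(X, Theta)
    newton_step_Theta(X, Theta)
    J = 0.5 * np.sum(R * (Y - X @ Theta.T)**2) \
        + 0.5 * lam * (np.sum(X**2) + np.sum(Theta**2))
    print(it, J)                            # J should decrease monotonically
\end{verbatim}

This is, of course, just alternating least squares in disguise, which is why each sweep is so cheap.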

One could also use a first order method -- which is mathematically much easier -- where one must choose some arbitrary step size $\alpha$ -- or ``learning rate'' -- and hope that things work with a parameter update of the form

\begin{eqnarray}
\delta \mathbf{X}_{i l} &=& - \alpha \frac{\partial J}{\partial \mathbf{X}_{i^\prime l^\prime}}
\end{eqnarray}
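
For comparison, a ``simultaneous'' first-order loop, continuing from the synthetic $Y$, $R$, $X$, $\mathbf{\Theta}$ and $\lambda$ of the sketch above; the step size $\alpha$ is an arbitrary choice:

\begin{verbatim}
alpha = 0.005
for it in range(5000):
    E = R * (X @ Theta.T - Y)               # masked residuals
    grad_X = E @ Theta + lam * X
    grad_Theta = E.T @ X + lam * Theta
    X -= alpha * grad_X                     # simultaneous first-order updates
    Theta -= alpha * grad_Theta
\end{verbatim}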

I find that it takes thousands of iterations for a first order method, choosing some step size $\alpha$ -- and hopefully getting lucky -- to achieve the same performance of a second order method with 20 iterations. Here are the learning curves for a collaborative second order method, after 20 iterations:


And for a ``simultaneous'' first order method, after 100 iterations:

The data for these plots came from Andrew Ng's course on Machine learning, using a regularization parameter $\lambda \approx .01$ and a step size $\alpha \approx .005$. With this code fully implemented and vectorized in python, I find that it takes only a few seconds to achieve an almost perfect fit! (And by that I mean the improvement in the cost function $J$ per iteration is less than one part in one thousand). 

This example shows the power of convergence for second order methods when the cost function is truly second order. The one bottleneck that could appear in this example is the matrix inversion of what are essentially $n_m$ and $n_u$ $D \times D$ matrices, which goes like $O\left((n_m+n_u)D^3 \right)$. If one chooses the feature space to be very high-dimensional, with a great number of users and movies, the vectorized second order algorithm may slow down quite a bit. (But there is no replacing being able to set a matrix of derivatives to zero in a single step!) For separate linear regression problems of this type, it can take a great deal of time for things to converge, and in general that step size should not be the same for all regression coefficients, $\mathbf{\Theta}_{jl}$. 

I wonder how this approach could be parallelized, or distributed, choosing the same dimension $D$ for many clusters. After all, the number of movies on Netflix is prohibitively large for such a model, even if it converges on my laptop in a few seconds for $n_m \sim 2000$ and $n_u \sim 1000$. 


Tuesday, September 1, 2015

Equivalence of Shannon Entropy and Logarithm of Posterior Partition function (under Laplace Approximation)

Now, the integral of the un-normalized posterior -- the likelihood times the prior -- over parameter space is called the evidence:

\begin{eqnarray}
P(D) = Z &=& \int d\vec{\theta} \mathcal{L}(D \vert \vec{\theta})P(\vec{\theta})
\end{eqnarray}

If we approximate the un-normalized posterior as a Gaussian -- which I've recently learned from Christopher Bishop is called the Laplace approximation -- by taking logarithmic derivatives, then we get:

\begin{eqnarray}
P(D \vert \vec{\theta})P(\vec{\theta}) &\approx & P(D \vert \vec{\theta}_{\mathrm{MAP}})P(\vec{\theta}_{\mathrm{MAP}})\mathrm{exp}\left[-(\vec{\theta}-\vec{\theta}_{\mathrm{MAP}})_i\left(F_{ij}+\Sigma_{ij}^{-1}\right)(\vec{\theta}-\vec{\theta}_{\mathrm{MAP}})_j /2\right]
\end{eqnarray}

We already know what the integral of this Gaussian factor over parameter space will be (dropping the constant prefactor $P(D \vert \vec{\theta}_{\mathrm{MAP}})P(\vec{\theta}_{\mathrm{MAP}})$):

\begin{eqnarray}
\log Z &=& \frac{D}{2}\log(2\pi)-\frac{1}{2}\log \left( \vert F_{ij}+\Sigma_{ij}^{-1} \vert\right)
\end{eqnarray}

Comparing this with our measured entropy of the posterior, we see we're off by just a constant:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert D) \right] &=&  \frac{D}{2}\log(2\pi)-\frac{1}{2}\log \left( \vert F_{ij}+\Sigma_{ij}^{-1} \vert\right) + \frac{D}{2}
\end{eqnarray}

This is another example of why a statistical mechanics interpretation of $Z$, our normalization of the posterior, is right on point. Its logarithm -- up to an additive constant, which can be thrown away -- is equal to the entropy of our distribution, which is a common point of wisdom in statistical mechanics. So in conclusion, under the Laplace approximation, writing our posterior as a Gaussian by expanding in the exponential, collecting the first cumulant $\vec{\theta}_{\mathrm{MAP}}$ and second cumulant $F_{ij}+\Sigma_{ij}^{-1}$, we get:

\begin{eqnarray}
\log Z &=& H\left[P(\vec{\theta} \vert D )\right] + \mathrm{const}
\end{eqnarray}
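
A quick numerical sanity check of this identity, with an arbitrary positive definite matrix standing in for $F_{ij}+\Sigma_{ij}^{-1}$ (a sketch, ignoring the constant prefactor as above):

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
D = 4
B = rng.normal(size=(D, D))
A = B @ B.T + D * np.eye(D)                 # stands in for F + Sigma^{-1}

log_Z = 0.5 * D * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(A)[1]
H = multivariate_normal(mean=np.zeros(D), cov=np.linalg.inv(A)).entropy()
print(H - log_Z)                            # prints D/2 = 2.0, the additive constant
\end{verbatim}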

Monday, August 31, 2015

The Fisher Matrix and Volume Collapse in Parameter Space

The starting point for all inference is the likelihood function:

\begin{eqnarray}
\mathcal{L} \left(X \vert \vec{\theta}\right)
\end{eqnarray}

where $\vec{\theta}$ is some set of parameters in our model and $X$ is the data matrix, the aggregate of all our experimental data. We can write the posterior of our experiment as:

\begin{eqnarray}
P(\vec{\theta} \vert X) &=& \frac{\mathcal{L}(X \vert \vec{\theta})P(\vec{\theta})}{P(X)}
\end{eqnarray}

Where the term in the denominator is called the evidence, and the $P(\vec{\theta})$ is our prior, filled with hyperparameters that give our initial beliefs on the model. David Mackay likes to talk about something called the ``Occam Factor'' which he gets by Taylor expanding the posterior in $\theta$ while keeping the Prior constant:

\begin{eqnarray}
P(\vec{\theta} \vert X) &\approx & \frac{\mathcal{L}(X \vert \hat{\theta})P(\hat{\theta})}{P(X)}e^{-(\vec{\theta}-\hat{\theta})_i F_{ij}(\vec{\theta}-\hat{\theta})_j/2}
\end{eqnarray}

Notice that the gradient of the log likelihood vanishes at its maximum -- or, for the full posterior, at the Maximum a Posteriori (MAP) point -- but the Hessian does not. We can take the Hessian of the negative log likelihood, whose expectation value is the Fisher information matrix:

\begin{eqnarray}
F_{ij} &=& \langle \frac{-\partial^2 \log \left(\mathcal{L}(\mathbf{x}\vert \mathbf{\theta})\right)}{\partial \theta_i \partial \theta_j}\rangle
\end{eqnarray}
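
For a concrete example, here is a Monte Carlo sketch of the Fisher matrix for a single Gaussian observation $x \sim \mathcal{N}(\mu, \sigma^2)$ with parameters $(\mu,\sigma)$, using the equivalent form of the Fisher information as the covariance of the score (the gradient of the log likelihood), which coincides in expectation with the negative Hessian above; the particular values of $\mu$ and $\sigma$ are arbitrary:

\begin{verbatim}
import numpy as np

mu, sigma = 1.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=200_000)

score = np.stack([(x - mu) / sigma**2,                    # d log L / d mu
                  (x - mu)**2 / sigma**3 - 1.0 / sigma])  # d log L / d sigma
F_mc = score @ score.T / x.size                           # Monte Carlo Fisher matrix

print(F_mc)                                   # approx diag(1/sigma^2, 2/sigma^2)
print(np.diag([1.0 / sigma**2, 2.0 / sigma**2]))
\end{verbatim}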

One might ask, from an information theory point of view, how much ``help'' our experiment -- given the model and the data set $X$ -- has really been. How much have we learned, relative to our prior? The answer lies in the information content of our Likelihood function, defined for any probability distribution to be:

\begin{eqnarray}
H\left[ p(x) \right] &=& -\int dx p(x) \log p(x)
\end{eqnarray}

For a multidimensional Gaussian, $x \in \mathcal{R}^d \sim \mathcal{N}(0,A_{ij})$ this is:

\begin{eqnarray}
H \left[p(x) \right] &=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}x_i x_j A^{-1}_{ij}+\frac{1}{2}\log \vert A_{ij} \vert \right)
\end{eqnarray}

The first and third terms are just constants, and the second is the covariance of our Gaussian distribution: $\langle x_i x_j \rangle = A_{ij}$, so we get the Trace of a $d \times d$ identity matrix:


\begin{eqnarray}
H \left[p(x) \right] &=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}A_{ij} A^{-1}_{ij}+\frac{1}{2}\log \vert A_{ij} \vert \right)\\
&=& \int d\vec{x} \frac{1}{(2\pi)^{d/2} \vert A_{ij} \vert^{1/2} }e^{\frac{-\vec{x}_i A^{-1}_{ij}\vec{x}_j}{2}} \left(\frac{d}{2}\log(2\pi) + \frac{1}{2}\mathrm{Tr}(I_d)+\frac{1}{2}\log \vert A_{ij} \vert \right)\\
&=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \vert A_{ij} \vert
\end{eqnarray}
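
A quick check of this closed form against scipy, with an arbitrary positive definite $A_{ij}$:

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
B = rng.normal(size=(d, d))
A = B @ B.T + d * np.eye(d)                 # an arbitrary positive definite covariance

closed_form = 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(A)[1]
print(closed_form, multivariate_normal(mean=np.zeros(d), cov=A).entropy())   # agree
\end{verbatim}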

The formula above is the information content for any Gaussian with arbitrary mean $\mu_i$ and covariance matrix $A_{ij}$. If we assume a Gaussian prior $P(\vec{\theta})=\mathcal{N}(0,\Sigma_{ij})$, then we can write the information content of our Likelihood function and prior as:

\begin{eqnarray}
H\left[\mathcal{L} \right] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert F_{ij}^{-1} \vert \right) \\
H\left[P(\vec{\theta})\right] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert \Sigma_{ij} \vert \right)
\end{eqnarray}

This is quite interesting, because in statistical mechanics  entropy goes like the Logarithm of accessible volume in phase space. What we have here, this logarithm of a determinant, is precisely the same thing, because:

\begin{eqnarray}
\sqrt{\vert A_{ij} \vert} & \sim & \int d\vec{x} \ \ \mathrm{s.t.} \ x_i A^{-1}_{ij}x_j  \le 1
\end{eqnarray}

Or the approximate volume of an ``ellipsoid'' in phase space (the region where the quadratic form in the exponent stays below unity). If we compare information content between the prior and the Likelihood we get:

\begin{eqnarray}
H \left[\mathcal{L} \right]-H\left[P(\vec{\theta})\right] &=& \frac{1}{2} \log  \frac{\vert F_{ij}^{-1}\vert}{\vert \Sigma_{ij} \vert }
\end{eqnarray}

Exponentiating the difference in Entropies, we get a ratio of the Volumes in parameter space:

\begin{eqnarray}
e^{H \left[\mathcal{L} \right]-H\left[P(\vec{\theta})\right]} &=& \frac{\sqrt{\vert F_{ij}^{-1}\vert}}{\sqrt{\vert \Sigma_{ij}\vert}} = \frac{\mathrm{Vol}(\mathcal{L})}{\mathrm{Vol}(P)}
\end{eqnarray}

This ratio of Volumes is the ``true'' Occam factor, in my opinion, as it represents the ``collapse'' in parameter space of our beliefs. But we have not taken the full posterior into account, only our learning from the Likelihood function, separate from the Prior. It would perhaps be more consistent to deal with the full posterior, whose negative log has the expected Hessian:

\begin{eqnarray}
P(\vec{\theta} \vert X) &=& \frac{\mathcal{L}(X \vert \vec{\theta})P(\vec{\theta})}{P(X)}\\
P(\vec{\theta}) &=& \mathcal{N}(0, \Sigma_{ij})\\
\mathcal{L}(X \vert \vec{\theta}) & \approx & \mathcal{N}(\hat{\theta}, F_{ij}^{-1}) \\
\langle -\frac{\partial^2 \log P(\vec{\theta}\vert X)}{\partial \theta_i \partial \theta_j} \rangle &=& F_{ij}+\Sigma_{ij}^{-1}
\end{eqnarray}

So the information content of the full posterior goes like:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right] &\sim & \frac{1}{2}\log \left( \mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})^{-1} \vert \right)\\
&=& \frac{1}{2}\log \left( \frac{1}{\mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert }\right)
\end{eqnarray}

and if we ask for the information difference, we get:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &=& \frac{1}{2}\log \left( \frac{1}{\mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert  \mathrm{det} \vert\Sigma_{ij} \vert}\right) \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &=& \frac{-1}{2}\log \left( \mathrm{det} \vert (F_{ij}+\Sigma_{ij}^{-1})\vert \right) - \frac{1}{2}\log  \mathrm{det} \vert\Sigma_{ij} \vert
\end{eqnarray}

and since $\mathrm{det}\vert A+B \vert \ge \mathrm{det}\vert A \vert+\mathrm{det}\vert B \vert$ for positive semi-definite matrices, we can write the inequality:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\ge & \frac{-1}{2}\log \left( \mathrm{det} \vert F_{ij}\vert+ \mathrm{det} \vert\Sigma_{ij}^{-1}\vert \right) - \frac{1}{2}\log  \mathrm{det} \vert\Sigma_{ij} \vert \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\ge & \frac{-1}{2}\log \left(1+ \mathrm{det} \vert F_{ij}\vert \mathrm{det} \vert\Sigma_{ij}\vert \right) \\
H\left[ P(\vec{\theta} \vert X) \right]-H\left[ P(\vec{\theta}) \right] &\ge & \frac{-1}{2}\log \left(1+ \mathrm{det} \vert F_{ij}\Sigma_{jl}\vert \right) \\
\mathrm{entropy} \ \mathrm{``destroyed''}& \sim & \frac{\mathrm{det} \vert\Sigma \vert }{ \mathrm{det}\vert F^{-1} \vert}
\end{eqnarray}

Where in the last line I've Taylor expanded the log. This negative quantity is approximately how much entropy we are ``killing'' by doing our experiment $X$. We are guaranteed to learn something by observing $X$ and putting it into our model, but the important consideration is the determinant of the Fisher matrix. If it is very large, our parameter space will ``collapse'' substantially, as we saw before.

Monday, July 27, 2015

Multi-Class Logistic Regression as Maximum (conditional) Entropy

Recently, I've been thinking about Logistic regression -- i.e. Machine Learning classification and the activation of neurons in neural networks -- and how it can be framed in terms of Statistical Mechanics. If you look at the sigmoid function, you'll notice something a bit suspect -- or at least I did: it looks peculiarly like a Boltzmann distribution, or some properly normalized member of the exponential family:

\begin{eqnarray}
\mathrm{sigm}(x)&=& \frac{1}{1+e^{-x}} \\
\mathrm{sigm}(\beta x)&=& \frac{1}{1+e^{-\beta x}}
\end{eqnarray}

If we scale the argument of this function by $\beta$, it looks a lot like a Boltzmann distribution. And, if we use the standard framework for logistic regression, where our exponential argument is the dot product between a vector of parameters $\vec{\theta}$ and a vector of input data, $\vec{x}$, we have:

\begin{eqnarray}
P(y=1 \vert \vec{x},\vec{\theta}) &=&  \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}

That is, the probability that our feature vector $\vec{x}$ belongs to class $y=1$ is given by the sigmoid function, with a ``linear regression'' as its argument. Could we derive this sigmoid function by different means? Where does it come from?

Recall that the Boltzmann distribution can be derived from maximum entropy. Shannon showed that the entropy of a probability distribution is equal to the following strange quantity:

\begin{eqnarray}
S[p] &=& \sum_i -p(x_i) \log \left(p(x_i)\right)
\end{eqnarray}

or, in the continuous PDF case:

\begin{eqnarray}
S[p] &=& \int -p(x) \log \left( p(x) \right) dx
\end{eqnarray}

another way to write this is

\begin{eqnarray}
S &=& \langle -\log\left(p(x) \right) \rangle
\end{eqnarray}

Maximizing this entropy, subject only to normalization, involves setting the gradient of the following function to zero:

\begin{eqnarray}
 G&=& S[p] + \lambda \left( \sum_i p(x_i) -1 \right)\\
 \frac{\partial G}{\partial p(x_i)} &=& 0 \ \forall i
\end{eqnarray}

If we have a discrete distribution over $N$ total points $\lbrace x_i \rbrace_{i=1}^N$, we get the \textbf{uniform distribution}.

\begin{eqnarray}
p(x_i) &=& \frac{1}{N}
\end{eqnarray}

But, once we start specifying certain properties of this distribution, like, say, its moments, we will change things substantially. Constraining the mean gives us the exponential distribution:

\begin{eqnarray}
G[p]&=& S[p] + \lambda \left( \int p(x)dx -1 \right)+ \beta \left( \int p(x) x \, dx - \tau \right)\\
 \frac{\delta G}{\delta p(x)} &=& 0 \ \forall x \\
 \Rightarrow p(x) &=& \frac{1}{\tau} e^{-x/\tau }
\end{eqnarray}

Where now instead of a gradient we've taken a \textbf{functional derivative}, of the functional $G$. Our result is the exponential distribution, which is the \textbf{maximally entropic} probability distribution with its first moment constrained. (You may not be surprised to learn that the Gaussian is the maximally entropic PDF with its first and second moment constrained!)
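
We can see the Lagrange-multiplier machinery at work numerically: put $p(x) \propto e^{-\beta x}$ on a grid and tune $\beta$ so the mean constraint is satisfied. The grid and the target mean $\tau$ below are arbitrary choices of mine:

\begin{verbatim}
import numpy as np
from scipy.optimize import brentq

x = np.linspace(0.0, 50.0, 5000)
tau = 2.0

def mean_of_p(beta):
    w = np.exp(-beta * x)                   # un-normalized maxent ansatz
    return (w @ x) / w.sum()

beta = brentq(lambda b: mean_of_p(b) - tau, 1e-6, 10.0)
print(beta)            # close to 1/tau = 0.5, since the grid extends well past tau
\end{verbatim}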

So maximizing entropy subject to certain constraints places heavy, deterministic restrictions on the form of our PDF's. We are essentially designing things to be as ``uninformative'' or least biased as possible. As Jaynes once said, `our thinking Robot is non-ideological: it takes in only the information it is given and nothing more'.

When we do classification problems, we are really maximizing the entropy of a probability distribution on class $y$, not on the feature vector $\vec{x}$, and this conditional entropy has a specific form:

\begin{eqnarray}
S[p(y \vert \vec{x})] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right)
\end{eqnarray}

Ok. So now what are our constraints? We would like to define some function of the feature vector $\vec{x}$ that is useful, and require our probability density on class $y$, $p(y \vert \vec{x})$, to exhibit the \textbf{same ensemble averages}:

\begin{eqnarray}
\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) f_y(\vec{x}) &=& \langle f_y(\vec{x}) \rangle
\end{eqnarray}

Where on the left we are summing over only the $\vec{x}$ that are in the training set. (This is the role of $p(\vec{x})$.) I have seen this equation called a ``balance'' equation, but really it is just a constraint on the ensemble average -- over class $y$ and feature vector $\vec{x}$ -- of some function $f_y(\vec{x})$: just like how we constrain energy with the Boltzmann distribution:

\begin{eqnarray}
\sum_{\vec{x}} P(\vec{x}) \mathcal{H}(\mathbf{x}) &=& \langle E \rangle
\end{eqnarray}

Putting things into our big expression for entropy, we have:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \beta \left(\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) f_y(\vec{x}) - \langle f_y(\vec{x}) \rangle \right)
\end{eqnarray*}

So $\lambda$ is the Lagrange multiplier that asserts proper normalization on the PDF over class, not feature vector $\vec{x}$, and $\beta$ is the Lagrange multiplier that asserts the proper expectation value of the function $f_y(\mathbf{x})$ over both class and the training set. We could extend this and have \textbf{multiple} feature functions to be constrained on, with multiple Lagrange multipliers, by just making them both vector valued:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \vec{\beta} \cdot \left(\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) \vec{f}_y(\vec{x}) - \langle \vec{f}_y(\vec{x}) \rangle \right)
\end{eqnarray*}

Setting the gradient -- or in the continuous case, functional derivative -- of this expression to zero gives:

\begin{eqnarray}
P(y \vert \vec{x}) &=& Z^{-1} e^{\vec{\beta} \cdot \vec{f_y}(\vec{x})}\\
Z &=& \sum_{y^\prime} e^{\vec{\beta} \cdot \vec{f}_{y^\prime}(\vec{x})}
\end{eqnarray}

This expression already has multi-class classification built in, but let's see what it looks like for simpler logistic regression, when $y=0,1$. What is a good choice for the feature function $\vec{f_y}(\vec{x})$ to get our sigmoid function back? Since logistic regression looks a lot like a ``sigmoid-wrapped'' linear regression in textbooks, it might be a good idea to define:

\begin{eqnarray}
\vec{f}_{y^\prime}(\vec{x}) &=& \vec{x} \ \ \mathrm{if} \ \ y^\prime=y, \ \ 0 \ \ \mathrm{otherwise}
\end{eqnarray}

Then we have:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ e^{\vec{\beta_1} \cdot \vec{x}}}{e^{\vec{\beta_0} \cdot \vec{x}}+e^{\vec{\beta_1} \cdot \vec{x}}}
\end{eqnarray}

Simplifying things we get:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-(\vec{\beta_1}-\vec{\beta_0}) \cdot \vec{x}}}
\end{eqnarray}

The expression above is precisely logistic regression, with the parameters $\vec{\beta_0}$ the ``anti'' linear-coefficients and $\vec{\beta_1}$ the non-anti linear coefficients. We combine them both into the usual $\vec{\theta}=\vec{\beta_1}-\vec{\beta_0}$ and write

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}

So what does this mean? It means that $\beta_1$ is constraining the ensemble average of the $i^\mathrm{th}$ component of the feature vector $\vec{x}$, given that class is $y=1$;  And conversely for $\beta_0$, we are constraining the $i^\mathrm{th}$ component of the feature vector when the class is $y=0$. These feature functions can be called marginal counts, or marginal probabilities on $\vec{x}$.
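
A tiny numerical check that the two-class maximum entropy (softmax) form collapses to the sigmoid with $\vec{\theta}=\vec{\beta_1}-\vec{\beta_0}$; the vectors here are arbitrary stand-ins:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

softmax_p1 = np.exp(beta1 @ x) / (np.exp(beta0 @ x) + np.exp(beta1 @ x))
sigmoid_p1 = 1.0 / (1.0 + np.exp(-(beta1 - beta0) @ x))
print(softmax_p1, sigmoid_p1)               # identical up to floating point
\end{verbatim}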

Now in the last post I talked about how the vector $\theta$ gives the direction of \textbf{fastest increase} for the logistic function, or the direction of ``fastest decision'' -- both for yes and no -- from the decision boundary. We can frame this in an even more physical way using Lagrange multipliers, because what $\vec{\beta}$ really represents in feature space is a \textbf{constraint force}, as seen in Lagrangian mechanics. Just like when we are minimizing the action -- a functional -- subject to some constraint:

\begin{eqnarray}
S &=& \int \left[ \mathcal{L}(q,\dot{q})+ \lambda \left(q_x^2+q_y^2 - R^2\right)  \right] dt
\end{eqnarray}

$\lambda$ above gives the ``tension'' force $\lambda=T$ for a massive particle, constrained to move on a circle of radius $R$. We are using the same tactic with $\beta$ to ``push'' our PDF into the proper shape. Stronger components -- i.e. larger scalar values within $\vec{\beta}$ -- correspond to stronger entropic forces on the PDF, and stronger correlation.

It is easy to extend the expressions above to multiple classes $y=0,1,2 \dots$, and important to note that nearly all neuron activations in neural networks use this logistic form: it's all maximum entropy learning. E.T. Jaynes would be very proud.