Tuesday, November 24, 2015

A Brief Note on the Bias-Variance Decomposition


When you train a model that is highly flexible -- highly ``articulate'' -- on a training set, you often get a great training score, be it AUC, accuracy, MSE, etc. But such models -- as I've talked about before -- often have trouble generalizing to ``test'' sets, or the real world. One of the easiest ways to see this is by a simple FOIL operation on the following quantity:

\begin{eqnarray}
Y&=& f(X)+\epsilon
\end{eqnarray}

Here $X$ is a random variable -- the independent inputs -- $f(X)$ is the generative process that creates our object of interest $Y$ -- be it cardinal or $\in \mathbb{R}$ -- and $\epsilon$ is drawn from some noise distribution, say a Gaussian process with zero mean, $\mathcal{N}(0,K(X,X^\prime))$. Let $g(X)$ be our chosen model for $Y$. (Normally people write $\hat{f}(X)=g(X)$, but I chose $g$ to avoid confusion.) If we take a look at the mean squared error, we get

\begin{eqnarray}
MSE &=& \langle \vert f(X)+\epsilon-g(X) \vert^2 \rangle\\
&=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle + 2 \langle \epsilon \rangle \langle f(X) \rangle - 2 \langle \epsilon \rangle \langle g(X) \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Here I've assumed the noise is independent of both $f$ and $g$, and that the average over $g$ is taken over independent training draws with the true $f$ held fixed at the point of evaluation, so that the cross terms factor. Since $\epsilon$ has zero mean, the terms linear in $\epsilon$ fall away and we can write:


\begin{eqnarray}
MSE &=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Adding and subtracting the square of our model's mean, $\langle g(X) \rangle^2$, we get:

\begin{eqnarray}
MSE &=& \langle \left(f(X)-\langle g(X)\rangle \right)^2 \rangle +\mathrm{Var}\left(g(X) \right)+\mathrm{Var}\left(\epsilon \right)
\end{eqnarray}

So now, term by term, we have: the squared difference between the true function and our average model -- the (squared) bias, which quantifies how much our model is ``off'' (a quantity that will be quite low for complex models and high for simple ones); the variance of the model itself, which quantifies how much our $g(X)$ changes given different training inputs (a quantity that will be high for complex models and low for simple ones); and the variance of the noise $\epsilon$, which is an ineluctable contribution to our error.
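
To make the decomposition concrete, here's a minimal simulation sketch in Python (my own toy setup, not anything from a real dataset): I assume a sine-wave $f(X)$ with Gaussian noise, fit polynomials of a few different degrees to many independent training draws, and estimate the squared bias and the variance of $g(X)$ directly.

import numpy as np

rng = np.random.RandomState(0)
f = lambda x: np.sin(2 * np.pi * x)          # the "true" generative process f(X)
sigma = 0.3                                  # standard deviation of the noise epsilon
x_test = np.linspace(0, 1, 200)

def bias_variance(degree, n_train=30, n_trials=500):
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)                  # Y = f(X) + epsilon
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)   # our model g(X)
    bias_sq = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)      # <(f - <g>)^2>
    variance = np.mean(preds.var(axis=0))                         # Var(g)
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, v = bias_variance(degree)
    print("degree %d: bias^2 = %.4f, variance = %.4f, noise = %.4f" % (degree, b2, v, sigma**2))

The low-degree fit shows high bias and low variance, the high-degree fit the reverse, and the noise floor $\sigma^2$ is untouched either way.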

Recall that in the last post we had a very similar bias-variance decomposition:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

The bias is the first term and the variance the second, which goes like the square root of the logarithm of the size $K$ of the hypothesis class. This decomposition illustrates the balancing act we have to perform, as model builders, between simplicity and goodness of fit. Refer to these when explaining why your highly un-regularized gradient boosted decision tree -- or boosted anything -- did a poor job of generalizing! (This doesn't always happen, of course; just a word of caution.)


Tuesday, November 17, 2015

Learning Theory and an XGBoost Failure on Kaggle

After a somewhat embarrassing moment on Kaggle.com, where I blatantly overfit a training set $X$ with a boosted decision tree -- XGBoost -- and then got a horrible, horrible generalization result, I thought it was time to consider model complexity and learning theory, and in general the debate that has been bubbling within the Machine Learning and Statistics communities.

Recently, there has been a lot of success using highly flexible models, such as boosted regressors, classifiers, decision trees, random forests, etc., to model very complex data sets. (One should lump neural networks in here too, since they perform highly non-trivial, non-linear transformations of their inputs.) Leo Breiman wrote a very interesting paper in 2001 called "Statistical Modeling: The Two Cultures", which highlights -- and insults -- the attachment of old-school statisticians to generative models, where the observed data is characterized/explained as being generated by some "process". For instance, we might say that obesity in the United States is generated by a linear superposition of many factors, say income, age, sex, etc.:

\begin{eqnarray}
\mathrm{obesity} = f(\mathbf{x}) &=& \sum_n \theta_n x_n
\end{eqnarray}

This is of course a very stupid assumption, since we know there are non-linear interactions between these features of a person that create complex outcomes. More flexible models can capture this by multiplying the different features together in some polynomial. Let us, for instance, include all quadratic -- second order -- terms in our ``generative process'':

\begin{eqnarray}
f(x) &=& \theta \cdot \mathbf{x} + \mathbf{x} \cdot A \cdot \mathbf{x}
\end{eqnarray}

Where $A$ is some feature-mixing matrix. There are many more parameters in our model now, and some will prove to be important, while others will not. These days, even the most old-school statisticians -- it seems to me -- expand features into polynomials like the one above, and then use tried-and-true methods to cut down on the model complexity, getting rid of interaction and/or linear terms that are not predictive of the outcome. This is all well and good, and one can use the precise framework of $p$-values, hypothesis tests, etc. to characterize the fit; but Leo Breiman says that by sticking to such methods -- and I don't pretend to have exhausted all the models he pointed a finger at, or properly explained them -- we really limit ourselves. Essentially, the gist is that if we get away from this generative framework and look more towards flexible algorithms, such as random forests and boosted decision trees, we can have greater predictive power in the real world.
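
As a hypothetical sketch of that quadratic expansion (my own toy example, with made-up coefficients): expand the raw features to all second-order terms with scikit-learn and fit an ordinary linear model on top, so the entries of $A$ become readable regression coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
X = rng.normal(size=(500, 3))                 # three standardized features, e.g. income, age, ...
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 500)   # a non-linear interaction

quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(X)                # columns: x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, ...

model = LinearRegression().fit(X_quad, y)
print(model.coef_)                            # the x2*x3 cross term should carry weight close to -2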

If Kaggle competitions are treated as valid evidence, this is certainly true, since virtually every winner of these competitions -- outside computer vision -- has used gradient boosted decision trees or random forests, either explicitly for their solution or as part of a larger ensemble of models. The trouble is, how do we tune these models so they don't overfit? How do we interpret and regularize them? Constraining a linear or logistic regression to have small weights with a quadratic penalty term is essentially the same thing as imposing some simplicity measure on the model (which turns out to be a Gaussian prior); but for trees and forests, how does one characterize the degrees of freedom? This is the trouble I had over the weekend, trying to fit a very complex algorithm that I didn't understand to a dataset, and then getting burned for it.

The nice thing about logistic regression is that those linear weights $\theta$ can be read as importance values: their absolute values tell you how much each feature -- which, hopefully, has been properly scaled -- contributes to a decision, or to pushing towards a decision in feature space $\mathbf{x}$. (I've talked about this before.) Trees are much harder, and although the choice of split in classification problems -- based on the Gini index or information gain -- is useful, it is only a piece of the story. The first split of a decision tree can be viewed as the most important feature, of course, but one clever comment I read in an AI book is that pruning trees is much better than limiting their depth. What often happens, for instance with an XOR classification problem, is that you split on $x_1$, getting not-so-great results, but then the following split on $x_2$ is highly deterministic, implying that the operator $x_1 \ \mathrm{AND} \ x_2$ is a very important feature -- which would have been missed had the tree not been able to ``grow''. Clever as this comment is for training and making practical decisions with decision trees, it also highlights the fact that feature importance in a decision tree can be LATENT. Sometimes the logical conjunction of features or splits is what's important, and that will not show up so easily to the eye when looking at a decision tree.
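
Here's a small sketch of that XOR point (a toy example of my own, not from the book in question): a tree limited to a single split can do no better than chance, but allowing one more level lets the conjunction of $x_1$ and $x_2$ emerge.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(2)
X = rng.randint(0, 2, size=(2000, 2))         # two binary features
y = X[:, 0] ^ X[:, 1]                         # class = x1 XOR x2

for depth in (1, 2):
    clf = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    print("max_depth=%d: training accuracy = %.2f" % (depth, clf.score(X, y)))
# max_depth=1 hovers around 0.5 -- no single split helps -- while max_depth=2 gets it exactly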

So, just for fun and practice with the Hoeffding and Chernoff inequalities and the union bound, let's take a look at some elementary learning theory. (I'm lifting this directly from Andrew Ng's notes at Stanford.) Say we are in a binary classification problem, where our data is a sequence of features $\mathbf{x}$ and corresponding classes $y$:

\begin{eqnarray}
D &=& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^N
\end{eqnarray}

We would like to estimate our generalization error -- which is just a specific type of loss function that we're going to treat as a conditional random variable:

\begin{eqnarray}
\epsilon(h(x)) &= & 1_{h(x) \neq y}
\end{eqnarray}

Where $1$ is the indicator function. The expectation value of the generalization error is characterized by a probability distribution, which we know nothing about:

\begin{eqnarray}
\epsilon(h(x)) &\sim & P(\epsilon(h(x)) \vert D)
\end{eqnarray}

But we can estimate the mean of this distribution, given the samples in $D$, by using the Hoeffding bound. The greatest $\epsilon(h(x))$ can be is unity for a given $x$, and the lowest it can be is zero. This means we can write:

\begin{eqnarray}
P( \vert \epsilon(h(x)) - \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N} \vert \ge \gamma ) & \le & 2 e^{-2N\gamma^2}
\end{eqnarray}

where the nasty term in the argument of our probability distribution is just the mean of our errors, what's called the Empirical risk:

\begin{eqnarray}
\hat{\epsilon}(h(x)) &=& \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N}
\end{eqnarray}

Writing this a bit more simply, then, this is really just the standard problem of estimating the mean of a binomial distribution -- except we haven't specified the PDF, only that all of the samples $x_n, y_n$ are i.i.d.!

\begin{equation}
P( \vert \epsilon(h(x)) -\hat{\epsilon}(h(x)) \vert \ge \gamma )  \le  2 e^{-2N\gamma^2}
\end{equation}
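
A tiny Monte Carlo check of this bound (my own sketch, treating each prediction error as a Bernoulli draw with rate $p$): estimate the empirical risk from $N$ samples many times over, and compare how often it deviates from $p$ by more than $\gamma$ against $2e^{-2N\gamma^2}$.

import numpy as np

rng = np.random.RandomState(5)
p, N, gamma, trials = 0.3, 500, 0.05, 100000
eps_hat = rng.binomial(N, p, size=trials) / float(N)    # empirical risks from many datasets
freq = np.mean(np.abs(eps_hat - p) >= gamma)            # how often the deviation exceeds gamma
print(freq, 2 * np.exp(-2 * N * gamma**2))              # observed frequency vs. the Hoeffding bound

The observed frequency sits comfortably below the bound, as it should -- Hoeffding is loose but distribution-free.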

We can see that as $N \to \infty$ -- an infinite training set -- we are going to be spot on: our training error $\hat{\epsilon}$ will be precisely the same as our generalization error $\epsilon$. But recall that when fitting models, $h(x)$ is actually parametrized by some $\theta$. So what we are really worried about is the probability that ANY of our models' training-set errors, $\hat{\epsilon}(h_\theta(x))$, deviates far from its generalization error. For such a thing we use the union bound:

\begin{eqnarray}
P(A \cup B \cup C \dots ) &\le & P(A)+P(B)+P(C)+\dots
\end{eqnarray}

Now, to make life simpler, let's say our model space is discrete, and we really only have a choice of $K$ $\theta$s. Then we've got:

\begin{eqnarray}
P( \cup_{k=1}^K \vert \epsilon(h_k(x))-\hat{\epsilon}(h_k(x))\vert \ge \gamma) &\le & 2K e^{-2N \gamma^2}
\end{eqnarray}

So, as we've seen before, things aren't that much worse. Let's write the probability that any training error is more than $\gamma$ away from its generalization error as $\delta = 2Ke^{-2N\gamma^2}$; solving for $\gamma$ and for $N$, we can state:

\begin{eqnarray}
\gamma &\le & \sqrt{\frac{\log(2K/\delta)}{2N}} \\
N & \ge & \frac{1}{2\gamma^2}\log \left(\frac{2K}{\delta}\right)
\end{eqnarray}

What do these two statements mean? The first says that our estimate of the generalization error gets tighter as $N$ gets large -- more specifically, it scales at worst like $\frac{1}{\sqrt{N}}$, just like Monte Carlo estimates. We also see that accuracy goes logarithmically with confidence, which is a remnant of the usual Hoeffding/Chernoff bound statements. In the second line, we see that the requisite number of training examples, for a given confidence $\delta$ and accuracy on the empirical risk $\gamma$, goes logarithmically with the number of models $K$. A priori this seems very nice, but let's develop things further.
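
As a quick sanity check of these two statements, here's a small helper (my own sketch) that evaluates the accuracy bound and the requisite sample size for a finite hypothesis class of size $K$:

import numpy as np

def gamma_bound(N, K, delta):
    """Accuracy gamma such that |eps - eps_hat| <= gamma w.p. at least 1 - delta."""
    return np.sqrt(np.log(2 * K / delta) / (2 * N))

def required_N(gamma, K, delta):
    """Training-set size needed for accuracy gamma at confidence 1 - delta."""
    return int(np.ceil(np.log(2 * K / delta) / (2 * gamma**2)))

print(gamma_bound(N=10000, K=1000, delta=0.05))    # roughly 0.023
print(required_N(gamma=0.01, K=1000, delta=0.05))  # roughly 53,000 examples

Both quantities depend on $K$ and $\delta$ only through the logarithm, which is the whole point.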

When we're given a training set, we're not guaranteed to fit the optimal model, but we do know that our fitted model's generalization error is at most $2\gamma$ worse than that of the OPTIMAL model in the class. Let's show this. First, call our fitted model $h_{\hat{\theta}}$, and the optimal one $h_{\theta^*}$. We have, from the above Hoeffding/Chernoff inequality, for a single fitted model:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\hat{\theta}})+\gamma
\end{eqnarray}

This is just comparing generalization risk to training risk. Now let's compare to the optimal model, noting that since $\hat{\theta}$ was chosen to minimize the empirical risk, $\hat{\epsilon}(h_{\hat{\theta}}) \le \hat{\epsilon}(h_{\theta^*})$ -- otherwise we would have chosen something else! Then we can write:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\theta^*})+\gamma
\end{eqnarray}

and now, applying the Hoeffding bound once more to convert the optimal model's training risk back to its generalization risk, we get:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) & \le & \epsilon(h_{\theta^*})+2\gamma
\end{eqnarray}

This is a very nice inequality, because what it's basically saying is that the generalization error of our fitted model is at most that of the best model in the class, plus an accuracy term:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

With highly complex models, the first term on the right-hand side will be quite small -- low bias -- since we can fit our data almost exactly. But conversely, in such a case $K$ becomes large and we have no good handle on how our model will extend into the real world -- high variance. How badly does this complexity scale? A rough argument from Andrew Ng's notes is that for 64-bit floating point numbers and $D$ parameters, we have approximately:

\begin{eqnarray}
K &=& 2^{64D}
\end{eqnarray}

and so our inequality goes like:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2D\log \left( \frac{2^{65}}{\delta} \right)}{N}}
\end{eqnarray}

and so the requisite number of training examples -- for a given accuracy $\gamma$ and confidence $\delta$ -- goes linearly with the number of parameters $D$.

But this is not entirely correct. One can use something called the Vapnik-Chervonenkis dimension, which I won't get into here -- maybe in a later post -- to characterize the complexity of a model. This is an active area of research, and it is obviously quite difficult to get a handle on how well an algorithmic solution will generalize once trained on a training set.

For instance, what is $K$ for a gradient boosted decision tree? How does one characterize its Vapnik-Chervonenkis dimension? This question is very important if you want to fit such models to data, rather than being a blockhead and just throwing randomized search over hyper-parameters at the problem in Python. (WHICH is exactly what I did over the weekend...)
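
For the record, the blockhead workflow looks something like the sketch below (a generic example using scikit-learn's gradient boosting, not the actual weekend's code): randomized search over hyper-parameters, scored by cross-validation -- which guards against the grossest overfitting, but tells you nothing about the effective complexity of the model class you're searching over.

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.2),
    "subsample": uniform(0.5, 0.5),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=10, cv=5, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)     # a cross-validated score, not a generalization guarantee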

Wednesday, November 11, 2015

Cumulants of the Sample Mean

Let a random variable $X$ be drawn from some probability distribution, $X \sim P(x)$. Then the sum of $N$ i.i.d. samples, or realizations, is drawn from the $N$-fold convolution:

\begin{eqnarray}
P(s_N=\sum_{n=1}^N x_n) &=& (P*P*\cdots *P)(s_N)\\
&=& \int dk e^{iks_N}\phi(k)^N = \int dk e^{iks_N+N\log \phi(k)}\\
&=& \int dk e^{iks_N+N\psi(k)}
\end{eqnarray}

where $\phi(k)$ is the characteristic function of $P$ and $\psi(k)=\log \phi(k)$ is the cumulant generating function. This means that the equivalent cumulants for the sample mean -- which is simply $\frac{s_N}{N}$ -- are:

\begin{eqnarray}
\bar{x} &=& \sum_n^N \frac{x_n}{N} \\
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^k}\langle \left(\sum_n x_n \right)^k \rangle_c
\end{eqnarray}
Expanding out the term inside the brackets using the multinomial theorem, we get:

\begin{eqnarray}
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^k}\langle \sum_{\vec{k}} \frac{k!}{k_1!k_2! \cdots k_N!} \prod_{n=1}^N x_n^{k_n}\rangle_c \\
&=& \frac{1}{N^k} \sum_{\vec{k}} \left(\begin{array}{c}
k \\
k_1 \cdots k_N
\end{array}\right) \langle  \prod_{n=1}^N x_n^{k_n} \rangle_c
\end{eqnarray}

Any dependence between the samples $x_n$ would make this expression nontrivial, but since the samples are i.i.d. -- so all the mixed cumulants vanish -- we just get:

\begin{eqnarray}
\langle \bar{x}^k \rangle_c &=& \frac{1}{N^{k-1}} \langle x^k \rangle_c
\end{eqnarray}

where the term in the brackets on the RHS refers to a single realization of our random variable $x$. Why is this important? Because it tells us that the variance, which is the $k=2$ cumulant, scales as $\frac{1}{N}$, and the skewness, which is defined as:

\begin{eqnarray}
\mathrm{skew}&=&\frac{\langle \bar{x}^3 \rangle_c}{\left(\langle \bar{x}^2 \rangle_c \right)^{3/2}} \sim \frac{1}{\sqrt{N}}
\end{eqnarray}

scales as one over the square root of $N$. This means that the distribution of the sample mean collapses to a Gaussian AT LEAST as fast as $\frac{1}{\sqrt{N}}$, given some initial asymmetry in the distribution of $x$. (If we have symmetry, things are even better and we need only worry about the kurtosis, which goes like $\frac{1}{N}$.)

Such considerations are important when you ask yourself: at what point can I consider the estimator of the mean to be drawn from a Gaussian? How does my approximation scale with sample size? These are very important questions in the real world, and it is nice to have a sense of what's holding you back from being exact -- namely, the $\frac{1}{\sqrt{N}}$ for the skewness, which goes along with one Hermite polynomial correction to the Gaussian, and the $\frac{1}{N}$ for the kurtosis, which goes with another.
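
A quick numerical sketch of that scaling (my own toy check, using an exponential distribution, whose skewness is exactly 2): the skewness of the sample mean should fall off like $2/\sqrt{N}$.

import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(3)
trials = 10000                                    # number of independent sample means per N
for N in (10, 100, 1000):
    means = rng.exponential(scale=1.0, size=(trials, N)).mean(axis=1)
    print("N=%4d: skewness of the sample mean = %.3f, 2/sqrt(N) = %.3f"
          % (N, skew(means), 2 / np.sqrt(N)))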

Monday, November 2, 2015

Differential Regularizers as Priors over Functions $f(x)$: Gaussian Process Kernels as Green's Functions

Often, when we are solving a regression problem, we are given the following functional:

\begin{eqnarray}
J[f] &=& l\left[f,y\right]+\Omega \left[f \right]
\end{eqnarray}

Where $l$ is some loss functional and $\Omega$ is some regularizer. Given some finite dataset, this might look like the following, in the case of squared loss:

\begin{eqnarray}
X &=& \lbrace \mathbf{x}_n \rbrace_{n=1}^N \\
J &=& \sum_{n=1}^N \left( f(\mathbf{x}_n)-y_n \right)^2 + \Omega(f)
\end{eqnarray}

For a linear basis expansion model, we have the following:

\begin{eqnarray}
f(\mathbf{x}) &=& \mathbf{w}\cdot \phi(x) \\
J(\mathbf{w}) &=& \sum_{n=1}^N \left(  \mathbf{w}\cdot \phi(x_n) -y_n \right)^2 + \lambda \vert \mathbf{w} \vert^2
\end{eqnarray}

where $\lambda \vert \mathbf{w} \vert^2$ plays the role of a prior over functions. The cost function in this example is proportional to the negative log posterior; what we have is essentially:

\begin{eqnarray}
-\log P(\mathbf{w} \vert X, \mathbf{y}, \phi) & \sim & J(\mathbf{w}) \\
-\log \mathcal{L}(X, \mathbf{y} \vert \mathbf{w}, \phi) & \sim & \sum_{n=1}^N \left(  \mathbf{w}\cdot \phi(x_n) -y_n \right)^2 \\
-\log P(\mathbf{w} \vert \phi) & \sim &  \Omega(f) = \lambda \vert \mathbf{w} \vert^2
\end{eqnarray}

(So that's a Gaussian likelihood and a Gaussian prior.) Minimizing the cost with respect to $\mathbf{w}$ is the same thing as finding the mode of the posterior, or the Maximum a Posteriori (MAP) estimate. We've already talked about how, for such a regularized regression problem, we can write the solution as a linear combination of kernels, centered at the data:

\begin{eqnarray}
f(\mathbf{x}) &=& \sum_{n=1}^N \alpha_n K(\mathbf{x},\mathbf{x}_n )
\end{eqnarray}
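
In code, this expansion is just kernel ridge regression; here's a minimal sketch (with an RBF kernel and ridge parameter chosen arbitrarily -- nothing here is prescribed by the discussion above): for squared loss the weights are $\alpha = (K+\lambda I)^{-1}\mathbf{y}$, and the fit is a sum of kernels sitting on the training points.

import numpy as np

rng = np.random.RandomState(4)
x_train = rng.uniform(-3, 3, 25)
y_train = np.sin(x_train) + rng.normal(0, 0.1, 25)

def rbf(a, b, ell=0.7):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

lam = 0.1
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)   # alpha = (K + lam I)^{-1} y

x_new = np.linspace(-3, 3, 200)
f_new = rbf(x_new, x_train).dot(alpha)        # f(x) = sum_n alpha_n K(x, x_n)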

This expansion is a manifestation of the representer theorem. But one important question to ask, given some regularization functional, is: what's the ``best'' kernel? Let's take for example the regularizer:

\begin{eqnarray}
J &=& \sum_{n=1}^N \left(f(x_n)-y_n \right)^2 + \lambda \int dx \left\vert\frac{\partial^2 f(x)}{\partial x^2}\right\vert^2
\end{eqnarray}

This $\Omega$ functional penalizes curvature in our fitting function $f(x)$, and we can note that such a regularizer is really a prior over functions, since:

\begin{eqnarray}
P(f \vert X, \mathbf{y} ) &=&\frac{P(X,\mathbf{y} \vert f) P(f)}{P(X, \mathbf{y})} \\
&=& \frac{\mathrm{exp}\left[-\sum_{n=1}^N \left(f(x_n)-y_n \right)^2 - \lambda \int dx \left\vert\frac{\partial^2 f(x)}{\partial x^2}\right\vert^2 \right] }{P(X, \mathbf{y})}
\end{eqnarray}

We see that the prior on our function is:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\lambda \int \vert f^{\prime \prime}(x)\vert^2 dx \right)
\end{eqnarray}

where $\lambda$ controls the "strength" of our penalty on curvature. To be more general, we could have written the prior over our function as a superposition of squared-derivative penalties:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\int \sum_{m} a_m \left\vert \frac{\partial^m f(x)}{\partial x^m}\right\vert^2 dx \right)
\end{eqnarray}

If we integrate by parts -- assuming the boundary terms vanish -- we note that this prior functional can be put into the form:

\begin{eqnarray}
P[f] &\sim & \mathrm{exp}\left(-\int dx dx^\prime f(x) K(x,x^\prime)^{-1} f(x^\prime) \right)
\end{eqnarray}

Which of course gives us the familiar prior assumption that $f(x)$ is a zero-mean Gaussian process with covariance $K$:

\begin{eqnarray}
\langle f(x) \rangle &=& 0 \\
\mathrm{Cov}\left(f(x), f(x^\prime) \right) &=& K(x,x^\prime)
\end{eqnarray}
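
To see what this prior over functions means in practice, here's a small sketch (my own, with an arbitrarily chosen kernel): draw a few sample functions from the zero-mean Gaussian process $f \sim \mathcal{N}(0, K)$ on a grid.

import numpy as np

rng = np.random.RandomState(6)
x = np.linspace(0, 5, 200)
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.5)        # an exponential (OU) kernel, chosen arbitrarily
K += 1e-8 * np.eye(len(x))                                # jitter for numerical positive-definiteness
samples = rng.multivariate_normal(np.zeros_like(x), K, size=3)   # three draws of f(x) from the prior
print(samples.shape)                                      # (3, 200): each row is one sample function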

But, for a given differential regularizer, how do we find the associated kernel? The answer is simple: it's the Green's function of the differential operator we get after integrating by parts:

\begin{eqnarray}
\hat{L} &=& \sum_m (-1)^m a_m \frac{\partial^{2m}}{\partial x^{2m}}\\
\hat{L}K &=& \sum_m (-1)^m a_m \frac{\partial^{2m}}{\partial x^{2m}} K(x,x^\prime) = \delta(x-x^\prime)
\end{eqnarray}

An easy way to get the Green's function -- or, in this case, the kernel -- is to Fourier transform. We can re-write our prior in $s$ space:

\begin{eqnarray}
f(s) &=& \int dx e^{-isx}f(x) \\
P[f] &\sim & \mathrm{exp}\left(-\int ds \sum_{m} a_m \vert\mathbf{s}\cdot \mathbf{s}\vert^m\vert f(s)\vert^2 \right)
\end{eqnarray}

We see now that the Fourier transform of our inverse kernel is:

\begin{eqnarray}
\frac{1}{K(s,s^\prime)} &=& \sum_m a_m \vert \mathbf{s}\cdot \mathbf{s} \vert^m \delta^D(\mathbf{s}+\mathbf{s}^\prime)
\end{eqnarray}

We see that the inverse kernel is diagonal in $s$ space and positive semi-definite, which means that we are translationally invariant in $x$ space. We have:

\begin{eqnarray}
K(x,x^\prime) &=& \int ds ds^\prime e^{isx+is^\prime x^\prime} \frac{1}{\sum_m a_m \vert \mathbf{s}\cdot \mathbf{s}^\prime \vert^m}\delta^D(\mathbf{s}+\mathbf{s}^\prime)\\
K(x-x^\prime) &=& \int ds  e^{is(x-x^\prime)} \frac{1}{\sum_m a_m \vert \mathbf{s}\cdot \mathbf{s} \vert^m}
\end{eqnarray}

We see that, indeed, our kernel will be translationally invariant, and that the prior over functions is essentially what a physicist would call a Lagrangian:

\begin{eqnarray}
\Omega[f] \sim L &=& \int dx f(x) \left(\sum_m (-1)^m a_m \frac{\partial^{2m}}{\partial x^{2m}}\right) f(x)
\end{eqnarray}

We find that the kernel is just the correlation function -- or the propagator -- of the resulting stochastic process. One example is the massive free particle in higher dimensions -- or the free process in a higher-dimensional feature space -- for which we get the Yukawa potential:

\begin{eqnarray}
\Omega[f] \sim L &=& \int d^3x \left[\frac{m^2}{2}f(x)^2+\frac{1}{2}\vert \nabla f(x)\vert^2\right] \\
K(x-x^\prime) &\sim & \frac{1}{\vert x - x^\prime \vert }e^{-m\vert x-x^\prime \vert }
\end{eqnarray}

So, the kernel we use when interpolating some field -- or, equivalently, when specifying the prior of our Gaussian process -- can be viewed as the Green's function of our penalizing "regularizer". If we want smooth functions up to order $m=M$, we know precisely how to formulate the associated $K(x-x^\prime)$. Note that all differential regularizers with constant coefficients, such as the ones discussed here, will lead to stationary kernels and thus stationary processes.
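
As a closing numerical sketch (my own check, with arbitrarily chosen coefficients): take the simple one-dimensional regularizer $\Omega[f]=\int dx\, \left[ m^2 f^2 + \vert f^\prime \vert^2 \right]$, so the inverse kernel in Fourier space is $m^2+s^2$; the integral above should then return the exponential (Ornstein-Uhlenbeck) kernel $\frac{\pi}{m}e^{-m\vert x - x^\prime\vert}$, which is indeed stationary.

import numpy as np

m = 2.0                                        # regularizer coefficients: a_0 = m^2, a_1 = 1
s = np.linspace(-200.0, 200.0, 400001)         # Fourier variable on a fine grid
ds = s[1] - s[0]
r = np.linspace(0.0, 3.0, 7)                   # separations x - x'

# K(r) = int ds exp(i s r) / (m^2 + s^2); the imaginary part integrates to zero
K_numeric = np.array([np.sum(np.cos(s * ri) / (m**2 + s**2)) * ds for ri in r])
K_exact = (np.pi / m) * np.exp(-m * r)         # the analytic value of the same integral

for ri, num, ex in zip(r, K_numeric, K_exact):
    print("r=%.1f: numeric=%.4f, (pi/m) exp(-m r)=%.4f" % (ri, num, ex))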