Tuesday, November 17, 2015

Learning Theory and an XGBoost Failure on Kaggle

After a somewhat embarrassing moment on Kaggle.com, where I blatantly overfit a training set $X$ with a boosted decision tree -- XGBoost -- and then got a horrible, horrible generalization result, I thought it was time to consider model complexity and learning theory, and more generally the debate that has been bubbling beneath the Machine Learning and Statistics community.

Recently, there has been a lot of success using highly flexible models, such as boosted regressors and classifiers, decision trees, random forests, etc., to model very complex data sets. (One should lump neural networks in here, too, since they perform highly non-trivial, non-linear transformations of their inputs.) Leo Breiman wrote a very interesting paper in 2001 called "Statistical Modeling: The Two Cultures", which highlights -- and insults -- the attachment of old school statisticians to generative models, where the observed data is characterized/explained as being generated by some "process". For instance, we might say that obesity in the United States is generated by a linear superposition of many factors, say income, age, sex, etc.:

\begin{eqnarray}
\mathrm{obesity} = f(\mathbf{x}) &=& \sum_n \theta_n x_n
\end{eqnarray}

Which is of course a very stupid assumption, since we know there are non-linear interactions between these features of a person that create complex outcomes. More flexible models can capture this by multiplying the different features together in some polynomial. Let us, for instance, include all quadratic -- second order -- terms in our ``generative process'':

\begin{eqnarray}
f(\mathbf{x}) &=& \mathbf{\theta} \cdot \mathbf{x} + \mathbf{x} \cdot A \cdot \mathbf{x}
\end{eqnarray}

Where $A$ is some feature-mixing matrix. There are many more parameters in our model now; some will prove to be important, while others will not. These days, even the most old school statisticians -- it seems to me -- expand features as polynomials, like above, and then use tried and true methods to cut down on the model complexity, getting rid of combinations and/or linear features that are not predictive of the outcome. This is all well and good: one can use the precise framework of $p$-values, hypothesis tests, etc. to characterize the fit. But Leo Breiman argues that by sticking to such methods -- and I don't pretend to have exhausted all the models he pointed a finger at, or properly explained them -- we really limit ourselves. Essentially, the gist is that if we get away from this generative framework and look more towards flexible algorithms, such as random forests and boosted decision trees, we can have greater predictive power in the real world.
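
To make that concrete, here is a minimal sketch of the ``expand, then shrink'' workflow using scikit-learn's PolynomialFeatures together with an L1-penalized linear model, which zeroes out the interaction terms that don't help. The feature names and synthetic data are made up for illustration, and a reasonably recent scikit-learn is assumed:

```python
# A minimal sketch of "expand the features, then shrink the model".
# Feature names and data are invented for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(500, 3)  # pretend columns: income, age, sex (already standardized)
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.randn(500)

# Degree-2 expansion adds all quadratic and interaction terms (the A matrix above);
# the L1 penalty then drives the useless combinations to exactly zero.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.05),
)
model.fit(X, y)

names = model.named_steps["polynomialfeatures"].get_feature_names_out(
    ["income", "age", "sex"]
)
for name, coef in zip(names, model.named_steps["lasso"].coef_):
    print(f"{name:>12s}: {coef:+.3f}")
```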

If Kaggle competitions are treated as valid evidence, this is certainly true, since virtually every winner of these competitions -- outside computer vision -- has used gradient boosted decision trees or random forests, either explicitly for their solution or as part of a larger ensemble of models. The trouble is, how do we tune these models so they don't overfit? How do we interpret and regularize them? Constraining a linear or logistic regression to have small weights with a quadratic penalty term is essentially imposing a simplicity measure on the model (which turns out to correspond to a Gaussian prior on the weights); but for trees and forests, how does one characterize the degrees of freedom? This is the trouble I had over the weekend, trying to fit a very complex algorithm that I didn't understand to a dataset, and then getting burned for it.

The nice thing about logistic regression is that those linear weights $\theta$ can be read as importance values: taking the absolute value, they tell you how much each feature -- which, hopefully, has been properly scaled -- contributes to a decision, or pushes towards a decision in feature space $\mathbf{x}$. (I've talked about this before.) Trees are much harder. Although the choice of split in classification problems, based on the Gini index or information gain, is useful, it is only a piece of the story. The first split of a decision tree can be viewed as the most important feature, of course, but one clever comment I read in an AI book is that pruning trees is much better than limiting their depth. What often happens, for instance with an XOR classification problem, is that you split on $x_1$, getting not so great results, but then the following split on $x_2$ is highly deterministic, implying that the operator $x_1 \ \mathrm{AND} \ x_2$ is a very important feature -- which would have been missed had the tree not been able to ``grow''. Clever as this comment is when it comes to training and making practical decisions with decision trees, it also highlights the fact that feature importance in a decision tree can be LATENT. Sometimes the logical conjunction of features or splits is what's important, and that will not show up so easily to the eye when looking at a decision tree.
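
A tiny illustration of that latent-importance point, assuming scikit-learn's DecisionTreeClassifier and synthetic XOR data (everything here is invented for the example): a single split looks useless, but allowing one more level of depth lets the interaction between $x_1$ and $x_2$ show up.

```python
# XOR: no single split helps, but the conjunction of two splits is decisive.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 2))
y = X[:, 0] ^ X[:, 1]                      # XOR of the two binary features

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
deep = DecisionTreeClassifier(max_depth=2).fit(X, y)

print("depth-1 accuracy:", stump.score(X, y))   # ~0.5: the first split looks useless
print("depth-2 accuracy:", deep.score(X, y))    # ~1.0: the conjunction is what matters
print("depth-2 feature importances:", deep.feature_importances_)
```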

So, just for fun and for practice with the Hoeffding and Chernoff inequalities and the union bound, let's take a look at some elementary learning theory (I'm lifting this directly from Andrew Ng's notes at Stanford). Say we are working on a binary classification problem, where our data is a sequence of features $\mathbf{x}$ and corresponding classes $y$:

\begin{eqnarray}
D &=& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^N
\end{eqnarray}

We would like to estimate our generalization error -- which is just a specific type of loss function that we're going to treat as a conditional random variable:

\begin{eqnarray}
\epsilon(h(x)) &= & 1_{h(x) \neq y}
\end{eqnarray}

Where $1$ is the indicator function. The expectation value of the generalization error is characterized by a probability distribution, about which we know nothing:

\begin{eqnarray}
\epsilon(h(x)) &\sim & P(\epsilon(h(x)) \vert D)
\end{eqnarray}

But we can estimate the mean of this distribution, given some samples in $D$, by using the Hoeffding bound. The greatest $\epsilon(h(x))$ can be is unity for a given $x$, and the lowest it can be is zero. This means we can write:

\begin{eqnarray}
P( \vert \epsilon(h(x)) - \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N} \vert \ge \gamma ) & \le & 2 e^{-2N\gamma^2}
\end{eqnarray}

where the nasty term in the argument of our probability distribution is just the mean of our errors, what's called the Empirical risk:

\begin{eqnarray}
\hat{\epsilon}(h(x)) &=& \sum_{n=1}^N \frac{\epsilon(h(x_n))}{N}
\end{eqnarray}
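
In code, the empirical risk is nothing more than the average of those indicator losses over the training sample. A two-line sketch, where h is assumed to be any callable that returns predicted labels:

```python
import numpy as np

def empirical_risk(h, X, y):
    """hat{epsilon}(h): the mean 0/1 misclassification loss over the sample (X, y)."""
    return np.mean(h(X) != y)
```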

Writing this a bit more simply, then, this is really just the standard problem of estimating the mean of a binomial distribution -- but we haven't specified the PDF, only that all of the samples $(x_n, y_n)$ are i.i.d.!

\begin{equation}
P( \vert \epsilon(h(x)) -\hat{\epsilon}(h(x)) \vert \ge \gamma )  \le  2 e^{-2N\gamma^2}
\end{equation}
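
Since we haven't specified the distribution, a quick Monte Carlo check is a nice sanity test of how loose this bound is in practice. The sketch below treats each indicator loss as a Bernoulli draw with a made-up true error rate and compares the observed deviation frequency to $2e^{-2N\gamma^2}$; all the numbers are arbitrary choices for illustration.

```python
# Monte Carlo sanity check of the Hoeffding bound on the empirical risk.
import numpy as np

rng = np.random.RandomState(0)
eps, N, gamma, trials = 0.3, 200, 0.05, 20000   # made-up true error, sample size, accuracy

samples = rng.binomial(1, eps, size=(trials, N))   # trials repetitions of N indicator losses
emp_risk = samples.mean(axis=1)                    # hat{epsilon} for each repetition
deviation_freq = np.mean(np.abs(emp_risk - eps) >= gamma)

print("observed P(|eps_hat - eps| >= gamma):", deviation_freq)
print("Hoeffding bound 2*exp(-2*N*gamma^2): ", 2 * np.exp(-2 * N * gamma**2))
```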

We can see that as $N \to \infty$ -- an infinite training set -- we are going to be spot on: our training error $\hat{\epsilon}$ will be precisely the same as our generalization error $\epsilon$. But recall that when fitting models, $h(x)$ is actually parametrized by some $\theta$. So what we are really worried about is the probability that ANY of our models' training set errors, $\hat{\epsilon}(h_\theta(x))$, deviates far from its generalization error. For such a thing we use the union bound:

\begin{eqnarray}
P(A \cup B \cup C \dots ) &\le & P(A)+P(B)+P(C)+\dots
\end{eqnarray}

Now, to make life simpler, let's say our model space is discrete, and we really only have a choice of $K$ values of $\theta$. Then we've got:

\begin{eqnarray}
P( \cup_{k=1}^K \vert \epsilon(h_k(x))-\hat{\epsilon}(h_k(x))\vert \ge \gamma) &\le & 2K e^{-2N \gamma^2}
\end{eqnarray}
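
The same simulation idea extends to the union bound: draw empirical risks for $K$ hypotheses at once and ask how often any of them drifts by more than $\gamma$, versus the bound $2Ke^{-2N\gamma^2}$. The per-hypothesis true error rates below are invented purely for illustration.

```python
# Monte Carlo check of the uniform (union-bound) version of the Hoeffding bound.
import numpy as np

rng = np.random.RandomState(1)
K, N, gamma, trials = 20, 300, 0.1, 20000
true_eps = rng.uniform(0.1, 0.4, size=K)   # one made-up generalization error per hypothesis

violations = 0
for _ in range(trials):
    # Empirical risks of all K hypotheses from N samples each. The union bound holds
    # regardless of dependence between hypotheses; independence here is just a
    # simplification for the simulation.
    emp = rng.binomial(N, true_eps) / N
    violations += np.any(np.abs(emp - true_eps) >= gamma)

print("observed P(any hypothesis off by >= gamma):", violations / trials)
print("union bound 2K*exp(-2*N*gamma^2):          ", 2 * K * np.exp(-2 * N * gamma**2))
```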

So, as we've seen before, things aren't that much worse. Let's call the probability that some training error is more than $\gamma$ away from its generalization error $\delta$; setting $\delta = 2K e^{-2N\gamma^2}$ and rearranging, we can state:

\begin{eqnarray}
\gamma &\le & \sqrt{\frac{\log(2K/\delta)}{2N}} \\
N & \ge & \frac{1}{2\gamma^2}\log \left(\frac{2K}{\delta}\right)
\end{eqnarray}

What do these two statements mean? The first says that our estimate of the generalization error gets tighter as $N$ gets large. More specifically, it scales at worst like $\frac{1}{\sqrt{N}}$, just like Monte Carlo estimates. We also see that the accuracy depends only logarithmically on the confidence, which is just a remnant of the usual Hoeffding/Chernoff bound statements. In the second line, we see that the requisite number of training examples, for a given confidence $\delta$ and accuracy on the empirical risk $\gamma$, grows logarithmically with the number of models $K$ -- the degrees of freedom. A priori, this seems very nice, but let's develop things further.
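
These two rearrangements are easy to turn into a little calculator -- nothing model-specific, just the arithmetic of the bound, with the example numbers chosen arbitrarily:

```python
import numpy as np

def gamma_bound(K, delta, N):
    """Accuracy guarantee: gamma <= sqrt(log(2K/delta) / (2N))."""
    return np.sqrt(np.log(2 * K / delta) / (2 * N))

def samples_needed(K, delta, gamma):
    """Sample complexity: N >= log(2K/delta) / (2 gamma^2)."""
    return int(np.ceil(np.log(2 * K / delta) / (2 * gamma**2)))

print(gamma_bound(K=1000, delta=0.05, N=10000))        # ~0.023
print(samples_needed(K=1000, delta=0.05, gamma=0.02))  # ~13,246
```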

When we're given a training set, we're not guaranteed to fit the optimal model, but we know the generalization error of the model we do pick is at most $2\gamma$ worse than that of the OPTIMAL model in our class. Let's show this. First, call our fitted model $h_{\hat{\theta}}$, and the optimal one $h_{\theta^*}$. We have, from the above Hoeffding/Chernoff inequality, for a single fitted model:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\hat{\theta}})+\gamma
\end{eqnarray}

This is just comparing generalization risk to training risk. Now let's compare to the optimal model, by noting that $\hat{\epsilon}(h_{\hat{\theta}}) \le \hat{\epsilon}(h_{\theta^*})$ -- since $\hat{\theta}$ minimizes the empirical risk, otherwise we would have chosen something else! Then we can write:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \hat{\epsilon}(h_{\theta^*})+\gamma
\end{eqnarray}

and now, converting the training risk of the optimal model to its generalization risk -- applying the Hoeffding bound once more, $\hat{\epsilon}(h_{\theta^*}) \le \epsilon(h_{\theta^*}) + \gamma$ -- we get:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) & \le & \epsilon(h_{\theta^*})+2\gamma
\end{eqnarray}

This is a very nice inequality, because what it's basically saying is that our generalization risk is at most the best achievable generalization risk in our model class, plus some accuracy term:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

With highly complex models, the first term on the right hand side will be quite small -- low bias -- since we can fit our data almost exactly. But conversely, in such a case $K$ becomes large and we have no good hold on how our model will extend into the real world -- high variance. How badly does this complexity scaling go? A rough argument from Andrew Ng's notes is that for 64-bit floating point numbers and $D$ parameters, we have approximately:

\begin{eqnarray}
K &=& 2^{64D}
\end{eqnarray}

and so our inequality goes like:

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}(X)) \right)+\sqrt{\frac{2D\log \left( \frac{2^{65}}{\delta} \right)}{N}}
\end{eqnarray}

and so the requisite number of training examples -- for a given accuracy $\gamma$ and confidence $\delta$ -- grows linearly with the parameter dimension $D$.
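
Plugging the rough counting argument $K = 2^{64D}$ into the sample-complexity expression makes the linear scaling in $D$ visible. Working in log space avoids actually forming $2^{64D}$, which would overflow; the values of $\gamma$ and $\delta$ below are arbitrary:

```python
import numpy as np

def samples_needed_float_params(D, delta=0.05, gamma=0.05):
    # N >= log(2 * 2^(64 D) / delta) / (2 gamma^2), computed in log space
    log_2K_over_delta = np.log(2.0) * (64 * D + 1) - np.log(delta)
    return int(np.ceil(log_2K_over_delta / (2 * gamma**2)))

for D in (1, 10, 100):
    print(D, samples_needed_float_params(D))   # roughly linear growth in D
```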

But this is not entirely correct, since our model spaces are not really discrete. One can use something called the Vapnik-Chervonenkis dimension, which I won't get into here -- maybe in a later post -- to characterize the complexity of a continuously parametrized model. This is all a very active area of research, and it is obviously quite difficult to get a handle on how well an algorithmic solution will generalize once trained on a training set.

For instance, what is $K$ for a gradient boosted decision tree? How does one characterize its Vapnik-Chervonenkis dimension? This question is very important if you want to fit such models to data thoughtfully, rather than being a blockhead and just using randomized grid search for hyper-parameters in Python. (WHICH is exactly what I did over the weekend...)
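
For what it's worth, here is roughly what the blockhead approach looks like when it's at least scored with cross-validation, so the search evaluates each candidate on folds it didn't train on. This is a sketch, not what I actually ran: it assumes xgboost's scikit-learn wrapper (XGBClassifier) and a recent scikit-learn's RandomizedSearchCV, the parameter ranges are illustrative guesses, and X_train / y_train stand in for your Kaggle training data.

```python
# Randomized hyper-parameter search for XGBoost, scored with cross-validation.
import scipy.stats as st
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "max_depth":        st.randint(2, 8),      # shallower trees = fewer effective degrees of freedom
    "learning_rate":    st.uniform(0.01, 0.2),
    "n_estimators":     st.randint(100, 1000),
    "subsample":        st.uniform(0.5, 0.5),  # row subsampling, drawn from [0.5, 1.0]
    "colsample_bytree": st.uniform(0.5, 0.5),  # feature subsampling, drawn from [0.5, 1.0]
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions,
    n_iter=50,
    cv=5,                    # 5-fold CV: score on data the candidate model didn't see
    scoring="neg_log_loss",
    random_state=0,
)
# search.fit(X_train, y_train)   # X_train, y_train: your training set (not defined here)
# print(search.best_params_, search.best_score_)
```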
