Tuesday, November 24, 2015

A Brief Note on the Bias-Variance Decomposition


When you train a highly flexible, highly ``articulate'' model on a training set, you often get a great training score -- be it AUC, accuracy, MSE, etc. But such models -- as I've talked about before -- often have trouble generalizing to ``test'' sets, or the real world. One of the easiest ways to see why is a simple FOIL expansion of the mean squared error, starting from the following quantity:

\begin{eqnarray}
Y&=& f(X)+\epsilon
\end{eqnarray}

Here $X$ is a random variable -- the independent inputs -- $f(X)$ is the generative process that creates our object of interest $Y$ -- be it categorical or real-valued, $\in \mathbb{R}$ -- and $\epsilon$ is drawn from some zero-mean noise distribution, say a Gaussian process $\mathcal{N}(0,K(X,X^\prime))$. Let $g(X)$ be our chosen model for $Y$. (Normally people write $\hat{f}(X)=g(X)$, but I chose $g$ to avoid confusion.) If we take a look at the mean squared error, we get

\begin{eqnarray}
MSE &=& \langle \vert f(X)+\epsilon-g(X) \vert^2 \rangle\\
&=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle + 2 \langle \epsilon \rangle \langle f(X) \rangle - 2 \langle \epsilon \rangle \langle g(X) \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Here I've assumed the noise is uncorrelated with both $f$ and $g$, and that the remaining cross term factors, $\langle f(X) g(X) \rangle = \langle f(X) \rangle \langle g(X) \rangle$ (think of the average as being taken over noise draws and training sets, with $f$ fixed). Since $\epsilon$ has zero mean, the terms linear in $\epsilon$ fall away and we can write:


\begin{eqnarray}
MSE &=& \langle f(X)^2 \rangle + \langle \epsilon^2 \rangle +\langle g(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle
\end{eqnarray}

Adding and subtracting the square of our model's mean, $\langle g(X) \rangle^2$, we get:

\begin{eqnarray}
MSE &=& \langle \left(f(X)-\langle g(X)\rangle \right)^2 \rangle +\mathrm{Var}\left(g(X) \right)+\mathrm{Var}\left(\epsilon \right)
\end{eqnarray}
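
Writing that step out explicitly, and using $\langle \epsilon \rangle = 0$ so that $\langle \epsilon^2 \rangle = \mathrm{Var}(\epsilon)$, the terms group as:

\begin{eqnarray}
MSE &=& \underbrace{\langle f(X)^2 \rangle - 2 \langle f(X) \rangle \langle g(X) \rangle + \langle g(X) \rangle^2}_{\langle \left( f(X)-\langle g(X)\rangle \right)^2 \rangle} + \underbrace{\langle g(X)^2 \rangle - \langle g(X) \rangle^2}_{\mathrm{Var}\left(g(X)\right)} + \underbrace{\langle \epsilon^2 \rangle}_{\mathrm{Var}\left(\epsilon\right)}
\end{eqnarray}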

So now, term by term, we have: the squared difference between the true $f(X)$ and our average model -- the (squared) Bias, which quantifies how much our model is systematically ``off'' (a quantity that will be quite low for complex models and high for simple ones); the variance of the model itself, which quantifies how much our $g(X)$ changes when trained on different training sets (a quantity that will be high for complex models and low for simple ones); and the variance of the noise variable $\epsilon$, which is an ineluctable contribution to our error.
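
If you want to see these three terms with actual numbers, here is a minimal simulation sketch in Python. Everything concrete in it -- the toy target $f(x)=\sin(2\pi x)$, the noise scale $\sigma=0.3$, and polynomial fits standing in for $g(X)$ -- is my own choice for illustration, not something fixed by the derivation above: fit $g$ on many freshly drawn training sets, then estimate the squared bias and the variance of the predictions on a held-out grid.

import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                                   # assumed noise scale

def f(x):
    # assumed toy generative process (my choice, not from the post)
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.0, 1.0, 200)           # fixed evaluation grid
n_train, n_trials = 30, 500

for degree in (1, 9):                         # a simple vs. a flexible g(X)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        X = rng.uniform(0.0, 1.0, n_train)               # fresh training inputs
        Y = f(X) + rng.normal(0.0, sigma, n_train)       # noisy targets
        preds[t] = np.polyval(np.polyfit(X, Y, degree), x_test)

    bias_sq = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)   # squared Bias, averaged over the grid
    variance = np.mean(preds.var(axis=0))                      # Var(g), averaged over the grid
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}, "
          f"noise = {sigma ** 2:.3f}")

Typically the degree-1 fit shows a large bias term and a small variance term, while the degree-9 fit shows the reverse; the noise term $\sigma^2$ is the same for both.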

Recall that in the last post we had a very similar bias-variance decomposition (here $\epsilon(h)$ denotes generalization error, not the noise above):

\begin{eqnarray}
\epsilon(h_{\hat{\theta}}) &\le & \left( \min_{h_\theta} \epsilon (h_{\theta}) \right)+\sqrt{\frac{2\log \left( \frac{2K}{\delta} \right)}{N}}
\end{eqnarray}

The bias is the first term and the variance the second, which grows like the square root of the logarithm of the number of hypotheses $K$ -- our proxy for degrees of freedom. This decomposition illustrates the balancing act we have to perform, as model builders, between simplicity and goodness of fit. Refer to these when explaining why your highly unregularized gradient boosted decision tree -- or boosted anything -- did a poor job of generalizing! (This doesn't always happen, of course -- just a word of caution.)
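
To make that caution concrete, here is a small, hedged illustration using scikit-learn's GradientBoostingRegressor on the same toy problem as above; all of the hyperparameter settings are my own picks, chosen to exaggerate the contrast, not anything prescribed in this post.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
sigma = 0.3                                   # assumed noise scale

def f(x):
    # assumed toy generative process (my choice, not from the post)
    return np.sin(2 * np.pi * x)

def make_data(n):
    X = rng.uniform(0.0, 1.0, (n, 1))
    return X, f(X.ravel()) + rng.normal(0.0, sigma, n)

X_train, y_train = make_data(100)
X_test, y_test = make_data(1000)

models = {
    "barely regularized": GradientBoostingRegressor(
        n_estimators=2000, max_depth=6, learning_rate=0.5),
    "more regularized": GradientBoostingRegressor(
        n_estimators=100, max_depth=2, learning_rate=0.1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

In runs like this the barely regularized model typically drives its training MSE toward zero while its test MSE sits well above the noise floor of $\sigma^2 \approx 0.09$, whereas the constrained model gives up some training fit and generalizes better.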

