Tuesday, February 7, 2017

Co-linearity (Part 2)

Last post was concerned with co-linearity in regression problems, and how one chooses to deal with it. The Normal equation was mentioned before:

\begin{eqnarray}
\mathbf{\beta}_d &=& \left(X_{nd}X_{nl} \right)^{-1} X_{ml} Y_m
\end{eqnarray}

and we can also introduce the ``hat'' matrix:

\begin{eqnarray}
\hat{Y}_n &=& X_{nd} \mathbf{\beta}_d \\
&=& X_{nd} \left(X_{n^\prime d}X_{n^\prime l} \right)^{-1} X_{ml} Y_m \\
\mathbf{H}_{nm} &=& X_{nd} \left(X_{n^\prime d}X_{n^\prime l} \right)^{-1} X_{ml} \\
\hat{Y}_n &=& \mathbf{H}_{nm}Y_m
\end{eqnarray}
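
As a quick numerical sketch (the data, sizes, and function names here are made up for illustration), we can build $\mathbf{H}$ in NumPy and peek at what happens to $X^{T}X$ once there are more features than observations:

```python
import numpy as np

rng = np.random.default_rng(0)

def hat_matrix(X):
    """H = X (X^T X)^{-1} X^T, the smoothing/'hat' matrix."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

# Well-posed case: N = 100 observations, D = 5 features.
X_tall = rng.normal(size=(100, 5))
H = hat_matrix(X_tall)
print(np.trace(H))                  # ~5.0: the trace of H equals D here

# Ill-posed case: N = 20 observations, D = 50 features.
X_wide = rng.normal(size=(20, 50))
XtX = X_wide.T @ X_wide             # 50 x 50, but rank is at most 20
print(np.linalg.matrix_rank(XtX))   # 20, so XtX is singular
print(np.linalg.cond(XtX))          # enormous: inversion is hopeless
```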

The hat matrix, as you can see, puts the ``hat'' on our initial response observations, $Y_m$. This smoothing matrix depends on an inversion, and, as mentioned before, most solvers will fail if the data matrix has too many features and not enough data points. One way around this is through Bayesian methods. We'll start by noting that the likelihood from last post was the likelihood of the data, given the model:

\begin{eqnarray}
P(X,Y \vert \mathbf{\beta})
\end{eqnarray}

But what if we'd like to write down -- for some, a more intuitive and natural quantity -- the probability of the model, given the data? Using Bayes' theorem, this can be written as:

\begin{eqnarray}
P(\mathbf{\beta} \vert X,Y) &=& \frac{P(X,Y \vert \mathbf{\beta}) P(\mathbf{\beta})}{P(X,Y)}
\end{eqnarray}

The second term in the numerator is something called a prior, and it encodes our a priori beliefs about the values of $\mathbf{\beta}$ in our model. If we specify an isotropic Normal prior on the coefficients, with some variance $s^2$, we get:

\begin{eqnarray}
P(\mathbf{\beta}) &=& \prod_d \frac{1}{\sqrt{2\pi s^2}} e^{-\beta_d^2/2s^2} \;=\; \left(2\pi s^2 \right)^{-D/2} e^{-\beta_d \beta_d/2s^2}
\end{eqnarray}
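
As a tiny sanity check (the numbers for $\beta$ and $s$ are arbitrary), the negative log of this prior is just a quadratic penalty on the coefficients, $\beta_d \beta_d / 2s^2$, plus a constant that doesn't involve $\beta$:

```python
import numpy as np

s = 2.0                                # prior standard deviation
beta = np.array([0.5, -1.0, 2.0])      # D = 3 illustrative coefficients

# Sum of the per-component log densities...
log_prior = np.sum(-0.5 * np.log(2 * np.pi * s**2) - beta**2 / (2 * s**2))
# ...equals the -D/2 log(2 pi s^2) constant minus the quadratic penalty.
penalty_form = -0.5 * len(beta) * np.log(2 * np.pi * s**2) - beta @ beta / (2 * s**2)
print(np.isclose(log_prior, penalty_form))   # True
```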

The term $P(\mathbf{\beta} \vert X,Y)$ is called the posterior, and represents our ``new'' beliefs about the model after accounting for the data that we have seen. Now, taking the log of the posterior instead of the log of the likelihood -- and ignoring the term in the denominator, since it contains no dependence on $\beta$ -- we get:

\begin{eqnarray}
\mathcal{L}(\mathbf{\beta} \vert X,Y) & = & -\frac{N}{2} \log(2\pi \sigma^2) - \frac{(\beta_d X_{nd}-Y_n)^2}{ 2\sigma^2 } - \frac{D}{2} \log(2 \pi s^2) - \frac{\beta_d \beta_d}{2s^2} + \mathcal{O}(X,Y)
\end{eqnarray}

Taking the gradient with respect to $\beta$ and setting it to zero, we get an extra term in our equations:

\begin{eqnarray}
\frac{\partial \mathcal{L}(\mathbf{\beta} \vert X,Y)}{\partial \beta_d} &=& -\frac{(\beta_l X_{nl}-Y_n)}{ \sigma^2 } X_{nd} - \frac{\beta_d}{s^2} \;=\; 0 \\
\frac{X_{nd} Y_n}{\sigma^2} &=& \beta_l \left( \frac{X_{nl}X_{nd}}{\sigma^2} + \frac{\delta^K_{ld}}{s^2} \right) \\
X_{nd} Y_n&=& \beta_l \left( X_{nl}X_{nd} + \delta^K_{ld} \frac{\sigma^2}{s^2} \right) \\
\beta_l &=& \left( X_{nl}X_{nd} +  \delta^K_{ld}\,(\sigma/s)^2 \right)^{-1} X_{nd} Y_n
\end{eqnarray}

We see that we've just got an extra term in the inverted matrix -- namely the ratio of the noise variance to the prior variance, $\sigma^2/s^2$ -- which adds to the diagonal of the feature $l,d$ covariance estimate. What this does, practically speaking, is make the inverted matrix much more likely to be non-singular, and therefore resilient to having more features than data points, $D > N$.
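
Here is a minimal NumPy sketch of that closed-form estimate (the data and the ratio $\lambda = (\sigma/s)^2$ are invented for illustration), showing that the solve now goes through even when $D > N$:

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_beta(X, Y, lam):
    """Solve (X^T X + lam * I) beta = X^T Y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# More features than observations: N = 20, D = 50.
X = rng.normal(size=(20, 50))
Y = rng.normal(size=20)

beta = ridge_beta(X, Y, lam=1.0)   # the diagonal term makes X^T X + lam*I
print(beta.shape)                  # non-singular, so this succeeds: (50,)
```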

As $s$ gets smaller, what we essentially do is put isotropic, downward pressure on the $\beta$ coefficients, pushing them towards zero. This $L_2$ regularization (a penalty on the squared norm of $\beta$) has lots of nice properties, and depending upon the strength of our prior, we can use it to protect against very ``ill-posed'' problems, where $D \gg N$.

The standard name for this method is ridge regression, and many people remain unaware of its benefits, such as protecting against co-linearity and showing how the regression coefficients behave over varying strengths of regularization -- called a regularization ``path''.
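
Here is a sketch of such a path (synthetic data, arbitrary grid of penalties): we re-solve the ridge system as $\lambda = (\sigma/s)^2$ grows, i.e. as the prior tightens, and watch the coefficients shrink towards zero:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 10
X = rng.normal(size=(N, D))
true_beta = rng.normal(size=D)
Y = X @ true_beta + 0.1 * rng.normal(size=N)

# One ridge solution per penalty strength.
lams = np.logspace(-3, 3, 13)
path = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)
    for lam in lams
])

# Each row of `path` is a coefficient vector; its norm shrinks
# monotonically towards zero as lam increases.
for lam, beta in zip(lams, path):
    print(f"lam = {lam:8.3f}   ||beta|| = {np.linalg.norm(beta):.3f}")
```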
