Friday, February 10, 2017

Co-linearity (Part 3)

As you can imagine, increasing the strength of the prior mentioned above (i.e. shrinking $s$) reduces the magnitude of the regression coefficients $\beta$. Instead of using an $L2$ norm in the prior, one can also use an $L1$ norm (equivalently, a Laplace prior with scale $b$), and then the negative log posterior becomes:

\begin{eqnarray}
- \mathcal{L}(\beta \vert X,Y) &=& \frac{(X_{ni} \beta_i - Y_n)^2}{2\sigma^2} +  \frac{1}{b}\left(\sum_i \vert \beta_i \vert \right)
\end{eqnarray}
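
(For the record, that absolute-value penalty is just the negative log of a Laplace prior on the coefficients, the same way the $L2$ penalty came from a Normal prior:

\begin{eqnarray}
P(\mathbf{\beta}) &=& \prod_i \frac{1}{2b} e^{-\vert \beta_i \vert / b} \\
-\log P(\mathbf{\beta}) &=& \frac{1}{b} \sum_i \vert \beta_i \vert + D\log(2b)
\end{eqnarray}

so the scale $b$ plays the same role here that the prior variance $s^2$ played for the $L2$ penalty.)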

Which can be solved as a quadratic programming problem, as long as one puts an inequality constraint on the sum of the absolute values of the coefficients: $T(\beta) = \sum_i \vert \beta_i \vert < t$. This type of regularization of the regression coefficients is called the LASSO, and by the nature of its $\beta$ penalization, it chooses solutions that are sparse in the regressors -- i.e. it kills off coefficients and features that seem not to matter. With a decreasing value of $t$ come fewer and fewer surviving features, as you can imagine, and just like our parameter $s$ above, we have explicit control over the ``filtering'' pressure on our regression coefficients $\beta$.

L2 and L1 regularization of regression coefficients lead to slightly different solutions. L2 tends to spread coefficient magnitude across clusters of variables that are all correlated with the target, while L1 aggressively prunes coefficient magnitude down to the ``winners'' of the feature set. Making a plot of $\beta_i(t)$ and $\beta_i(s)$ -- the regularization paths -- for L1 and L2 regularization reveals this.
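
Here is a minimal sketch of that kind of path plot using scikit-learn; the synthetic data and the penalty grids are placeholder assumptions, not anything from a real problem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, lasso_path

# Synthetic problem: a handful of informative features buried among noise features
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# L1 path: coefficients as a function of the penalty strength
alphas_l1, coefs_l1, _ = lasso_path(X, y)

# L2 path: refit ridge regression over a grid of penalty strengths
alphas_l2 = np.logspace(-2, 4, 50)
coefs_l2 = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas_l2])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(np.log10(alphas_l1), coefs_l1.T)
ax1.set_title("LASSO (L1) path: coefficients hit exactly zero")
ax2.plot(np.log10(alphas_l2), coefs_l2)
ax2.set_title("Ridge (L2) path: coefficients shrink smoothly")
for ax in (ax1, ax2):
    ax.set_xlabel("log10(penalty strength)")
    ax.set_ylabel("beta_i")
plt.tight_layout()
plt.show()

The L1 panel shows features being killed off one by one as the penalty grows, while the L2 panel shows every coefficient being squeezed smoothly toward zero.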

The lasso can be very useful when trying to isolate ``what matters'' in a regression problem, and just like ridge regression, helps control linear dependence and colinearity within the data matrix, but one can also use simple clustering techniques to choose the ``best'' set of features. For example, take the normalized correlation matrix:

\begin{eqnarray}
\tilde{X}_{ni} &=& \frac{X_{ni}-\mu_i}{\sigma_i} \\
\rho_{ij} &=& \mathrm{corr}(x_i,x_j) = \frac{1}{N} \tilde{X}_{ni}\tilde{X}_{nj}
\end{eqnarray}

The upper triangular portion of $\rho_{ij}$ represents a graph, where the nodes are the features $i$ and the edges are weighted by the matrix entries -- each between $-1$ and $1$ (or between 0 and 1, if we take absolute values). We can ``cluster'' our set of features very easily, by simply thresholding $\vert \rho_{ij} \vert > \epsilon$ and then picking out connected components from the matrix. The connected components -- however many of them there are -- each represent a ``feature group'', from which one can choose the feature most highly correlated with the target:

\begin{eqnarray}
\lbrace C_n \rbrace &=& \mathrm{conn}(\vert \rho_{ij} \vert > \epsilon) \\
x_\mathrm{c} &=& x_{i^\ast}, \quad i^\ast = \mathrm{argmax}_{i \in c} \left\vert \mathrm{corr}(x_i, y) \right\vert  \ \ \forall c \in \lbrace C_n \rbrace
\end{eqnarray}

Obviously, this filtered set of features -- and its multiplicity -- will be a function of $\epsilon$. As $\epsilon \to 1$ every feature becomes its own component and they all come out, $x_c = x_i$, while as $\epsilon \to 0$ everything merges into one component and we are left with the single most highly-correlated feature, $x_c = x_{\mathrm{argmax}_i \vert \mathrm{corr}(x_i,y) \vert}$.

The whole point of doing this, of course, is to find a set of features $x_c$ that are statistically de-coupled from one another, and it really reduces to a supervised down-selection of the initial set of features.
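
A minimal sketch of that filter in Python -- the helper name correlation_filter, the threshold, and the toy data are all illustrative assumptions; connected_components comes from scipy:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def correlation_filter(X, y, eps=0.7):
    """Group features whose |correlation| exceeds eps, then keep, from each
    group, the feature most correlated with the target y."""
    rho = np.corrcoef(X, rowvar=False)          # feature-feature correlation matrix
    adj = (np.abs(rho) > eps).astype(int)       # adjacency: edge wherever |rho_ij| > eps
    np.fill_diagonal(adj, 0)
    n_groups, labels = connected_components(csr_matrix(adj), directed=False)

    # Within each connected component, pick the feature most correlated with y
    target_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = [np.flatnonzero(labels == g)[np.argmax(target_corr[labels == g])]
            for g in range(n_groups)]
    return sorted(keep)

# Toy example: three noisy copies of one signal, plus one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 1))])
y = base[:, 0] + 0.1 * rng.normal(size=200)
print(correlation_filter(X, y))   # one survivor from the correlated trio, plus feature 3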

Because regularization paths are so popular, especially from a diagnostic point of view, it's worth mentioning that one of my favorite algorithms for sequentially adding features to a regression problem is LARS -- or Least Angle Regression. The basic idea comes from boosting, but I'll get to it in the next post.

Tuesday, February 7, 2017

Co-linearity (Part 2)

Last post was concerned with co-linearity in regression problems, and how one chooses to deal with it. The Normal equation was mentioned before:

\begin{eqnarray}
\mathbf{\beta}_d &=& \left(X_{nd}X_{nl} \right)^{-1} X_{ml} Y_m
\end{eqnarray}

and, we can also introduce the ``hat'' matrix:

\begin{eqnarray}
\hat{Y}_n &=& X_{nd} \mathbf{\beta}_d \\
&=& X_{nd} \left(X_{n^\prime d}X_{n^\prime l} \right)^{-1} X_{ml} Y_m \\
\mathbf{H}_{nm} &=& X_{nd} \left(X_{n^\prime d}X_{n^\prime l} \right)^{-1} X_{ml} \\
\hat{Y}_n &=& \mathbf{H}_{nm}Y_m
\end{eqnarray}
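
As a quick aside, here is a minimal numpy check of this hat matrix on made-up data (the sizes and the data itself are just placeholder assumptions):

import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 5
X = rng.normal(size=(N, D))
Y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # the hat / smoothing matrix
Y_hat = H @ Y                           # puts the "hat" on Y

# Sanity checks: H is a projection (idempotent), and its trace equals D
print(np.allclose(H @ H, H), np.isclose(np.trace(H), D))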

This matrix, as you can see, puts the ``hat'' on our initial response observations, $Y_m$. The smoothing matrix depends on an inversion, and as mentioned before most solvers will fail if the data matrix has too many features and not enough data points. One way around this is through Bayesian methods. We'll start by noting that the likelihood from last post was the likelihood of the data, given the model:

\begin{eqnarray}
P(X,Y \vert \mathbf{\beta})
\end{eqnarray}

But what if we'd like to write down the likelihood of the model, given the data -- for some, a more intuitive and accurate quantity? By Bayes' rule, this can be written as:

\begin{eqnarray}
P(\beta \vert X,Y) &=& \frac{P(X,Y \vert \mathbf{\beta}) P(\mathbf{\beta})}{P(X,Y)}
\end{eqnarray}

The second term in the numerator is called a prior, and encodes our a priori beliefs about the values of $\mathbf{\beta}$ in our model. If we specify a Normal prior with some variance $s^2$ on each coefficient, we get:

\begin{eqnarray}
P(\mathbf{\beta}) &=& \prod_d \frac{1}{\sqrt{2\pi s^2}} e^{-\beta_d^2/2s^2}
\end{eqnarray}

The term $P(\mathbf{\beta} \vert X,Y)$ is called the posterior, and represents our ``new'' beliefs about the model after accounting for the data that we have seen. Now, taking the log of the posterior instead of the log of the likelihood -- and ignoring the term in the denominator, since it contains no dependence on $\beta$ -- we get:

\begin{eqnarray}
\mathcal{L}(\mathbf{\beta} \vert X,Y) & = & -\frac{N}{2} \log(2\pi \sigma^2) - \frac{(\beta_d X_{nd}-Y_n)^2}{ 2\sigma^2 } - \frac{D}{2} \log(2 \pi s^2) - \frac{\beta_d \beta_d}{2s^2} + \mathcal{O}(X,Y)
\end{eqnarray}

Taking the gradient with respect to $\beta$ now, we get an extra term in our equations:

\begin{eqnarray}
\frac{\partial \mathcal{L}(\mathbf{\beta} \vert X,Y)}{\partial \beta_d} &=& -\frac{(\beta_l X_{nl}-Y_n)}{ \sigma^2 } X_{nd} - \frac{\beta_d}{s^2} = 0 \\
\frac{X_{nd} Y_n}{\sigma^2} &=& \beta_l \left( \frac{X_{nl}X_{nd}}{\sigma^2} + \frac{\delta^K_{ld}}{s^2} \right) \\
X_{nd} Y_n&=& \beta_l \left( X_{nl}X_{nd} + \delta^K_{ld} \frac{\sigma^2}{s^2} \right) \\
\beta_l &=& \left( X_{nl}X_{nd} +  \delta^K_{ld}\,(\sigma/s)^2 \right)^{-1} X_{nd} Y_n
\end{eqnarray}

We see that we've just got an extra term in the inverted matrix -- namely the ratio of sample variance to prior variance -- which adds to the diagonal of the feature $l,d$ covariance estimate. What this does, practically speaking, is make the inverted matrix much more likely to be non-singular, and therefore resilient to having more features than data points, $D > N$.

As $s$ gets smaller, what we essentially do is put isotropic, downward pressure on the $\beta$ coefficients, pushing them towards zero. This $L2$ norm, or regularization, on our model has lots of nice properties, and depending upon the strength of our prior, we can use it to protect against very ``ill-defined'' problems, where $D \gg N$.
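
Here's a minimal numpy sketch of that closed form on a deliberately ill-posed problem; the sizes, the data, and the ratio $\sigma/s$ are placeholder assumptions:

import numpy as np

rng = np.random.default_rng(2)
N, D = 15, 45                                # more features than data points
X = rng.normal(size=(N, D))
beta_true = np.zeros(D)
beta_true[:3] = [2.0, -1.0, 0.5]             # only three features actually matter
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Ridge closed form derived above: beta = (X^T X + (sigma/s)^2 I)^{-1} X^T Y
lam = (0.1 / 1.0) ** 2                       # (sigma / s)^2, the noise-to-prior variance ratio
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Plain OLS would have to invert a rank-15, 45x45 matrix -- singular --
# while the regularized system solves without complaint.
print(np.linalg.matrix_rank(X.T @ X))                  # 15, nowhere near full rank (45)
print(np.round(np.abs(X @ beta_ridge - Y).max(), 3))   # tiny in-sample residual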

The standard name for this method is ridge regression, and people continue to be unaware of its benefits, such as protecting against co-linearity, and getting a sense of what the regression coefficients do over varying strengths of regularization -- called a regularization ``path''.

Monday, February 6, 2017

Colinearity (Part 1)

I've heard a lot of comments -- some cautious, some cowardly -- about "co-linearity" recently, from both colleagues at work and friends using statistics in their jobs. And, well, GUESS WHAT? Co-linearity is not as scary as it used to be! Many people don't realize that there are a variety of ways to avoid or control "co-linearity" in data when performing basic regressions, and I want to take some time to outline them.

Let's begin by saying we've got a regression problem on our hands: one where we have $N$ examples of some feature vector $\mathbf{x}$, of dimension $D$, contained in an $N \times D$ data matrix:

\begin{eqnarray}
X &=& \left( \begin{array}{c}
\leftarrow \vec{x_1} \rightarrow \\
\leftarrow \vec{x_2} \rightarrow \\
\vdots \\
\leftarrow \vec{x_N} \rightarrow
\end{array}\right)
\end{eqnarray}

and, $N$ real-valued response variable examples, $Y$:

\begin{eqnarray}
Y &=& \left( \begin{array}{c}
y_1\\
y_2 \\
\vdots \\
y_N
\end{array}\right)
\end{eqnarray}

Our goal is to write a linear model of some sort:

\begin{eqnarray}
y_n &\approx & \mathbf{\beta} \cdot \mathbf{x_n} + \epsilon_n
\end{eqnarray}

Where we assume the errors are drawn from some probability distribution. In standard regression problems that distribution is Normal, so we'll keep it that way:

\begin{eqnarray}
\epsilon & \sim & \mathcal{N}(0,\sigma^2)
\end{eqnarray}

Now, as we've covered before in this blog, the likelihood of the data -- or, the co-occurrence of features and labels we see in the world, $(X,Y)$ -- given some model, specified by $\mathbf{\beta}$, is equal to:

\begin{eqnarray}
P(X,Y \vert \mathbf{\beta}) &=& \prod_{n=1}^N \frac{1}{\sqrt{2\pi \sigma^2}}\mathrm{exp}\left(-(y_n - \beta \cdot x_n)^2/ 2\sigma^2 \right)
\end{eqnarray}

Where, by taking a product above, we assume each data instance $n = 1, \dots N$ is independent. Taking the log of this likelihood we get:

\begin{eqnarray}
\mathcal{L}(X,Y \vert \beta) &=& -\frac{N}{2} \log(2\pi \sigma^2) - \sum_{n=1}^N \frac{(y_n - \beta \cdot x_n)^2}{ 2\sigma^2 }
\end{eqnarray}

This log likelihood is a concave function of $\beta$ (its negative is convex), meaning that if we set the gradient equal to zero we are guaranteed to find the global maximum (very good), and so our MLE, or maximum likelihood estimate, of the model $\beta$ becomes, if we write things now in matrix notation:

\begin{eqnarray}
\mathcal{L}(X,Y \vert \beta) &=& -\frac{N}{2} \log(2\pi \sigma^2) - \frac{(\beta_d X_{nd}-Y_n)^2}{ 2\sigma^2 }
\end{eqnarray}

Taking the derivative with respect to $\beta_d$ (a gradient) we get:
\begin{eqnarray}
\frac{\partial \mathcal{L}(X,Y \vert \beta)}{\partial \beta_d}\vert_{\beta = \beta^\prime} &=&  -\frac{(\beta_l X_{nl}-Y_n)}{\sigma^2} X_{nd}  = 0 \\
\beta_l X_{nl}X_{nd} &=& X_{nd} Y_n\\
\beta_l  &=& (X_{nl}X_{nd})^{-1} X_{nd} Y_n
\end{eqnarray}

This is called the Normal Equation, and it can actually be understood pretty well by noting:

\begin{eqnarray}
X_{nl}X_{nd} & \approx & N \mathrm{Cov}(x_l, x_d) \\
X_{nl}Y_{n} & \approx & N \mathrm{Cov}(x_l, y)
\end{eqnarray}

which, in words, says the first object is (for mean-centered data) the covariance between the $l^{\mathrm{th}}$ and $d^{\mathrm{th}}$ components of the feature vector, and the second is the covariance between the $l^{\mathrm{th}}$ feature and the target, $y$.

There's a very important thing to notice here, straight off the bat, which is that the Normal Equation -- the standard way of solving regression, or OLS (ordinary least squares) problems -- accounts for interactions between the features: we can see it in the "discounting" factor of the inverted matrix, above. Highly correlated features will dampen each other's effect, which is very, very cool. Regression coefficients $\beta_d$ represent the ``net'' effect of the $d^{\mathrm{th}}$ feature, not the ``gross'' effect one would get by doing a single, univariate regression of $y$ on $x_d$. This is important to keep in mind, but we're not out of the -- or even into the -- co-linearity woods yet.
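
To make that ``net'' vs. ``gross'' point concrete, here is a minimal numpy sketch with two nearly-duplicate features (the data are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(3)
N = 500
x1 = rng.normal(size=N)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=N)    # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 0.0 * x2 + 0.1 * rng.normal(size=N)

# Normal equation: the "net" effects, with correlated features discounting each other
beta_net = np.linalg.solve(X.T @ X, X.T @ y)

# One-at-a-time univariate slopes: the "gross" effects
beta_gross = [np.polyfit(X[:, d], y, 1)[0] for d in range(2)]

print(beta_net.round(2))        # close to [1, 0]: x2 gets discounted away
print(np.round(beta_gross, 2))  # both near 1: each feature looks predictive on its own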


----------------------------------------

People get upset or concerned about colinearity when they want:


  • Interpretable Models
  • Stable Regression Coefficients $\mathbf{\beta}$ in the face of changing data.
  • A bone to pick with a model or feature set that they don't trust or understand.


Now, most of the "spookiness" of co-linearity comes from linear algebra, and from the near-total absence of Bayesian statistics in traditional statistical circles, where it went unappreciated that putting priors on regression coefficients is equivalent to regularizing -- and therefore controlling and containing -- co-linearity.

Take for example a data matrix where we have $N=15$ data points in our set, but $D=45$ features. The old-time statisticians might tell you that the problem is ill-specified or ill-defined, because if we create a regression model with 45 degrees of freedom, there simply aren't enough data points to "figure out" what's going on. And that's true, but it really comes from the fact that when inverting a matrix -- as we're doing above -- with linearly dependent columns, we could run into a lot of numerical trouble.

This has to be the situation in the case I just mentioned above, as it is impossible for the square matrix $X_{nl}X_{nd}$ -- which is $D \times D$ -- to be properly invertible. And this is simply because the column space of $X$ is at most of dimension $N=15$. I won't get into the dirty details, but suffice it to say that when you tried to regress in such a situation, solving using the old methods, you were hosed.
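
Here's a quick numpy illustration of the trouble, using the $N=15$, $D=45$ numbers from the example above (the random data itself is just a placeholder):

import numpy as np

rng = np.random.default_rng(4)
N, D = 15, 45
X = rng.normal(size=(N, D))

XtX = X.T @ X                          # the D x D matrix the normal equation needs to invert
print(np.linalg.matrix_rank(XtX))      # 15 -- nowhere near full rank (45)
print(np.linalg.cond(XtX))             # astronomical condition number: effectively singular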

And so how did you wiggle out of it, in the "old days"?


  1. Sub-select the features, such that $D < N$, and move on with the analysis. 
  2. Use PCA to "represent" the matrix $X_{nl}X_{nd}$ in terms of its most prominent eigenvectors, and once again make sure that $D<N$.
  3. Add some other method I'm not currently remembering here.

But, in modern times, this is a brutal way to do things, especially with genetic data, where the feature space has far more dimensions than the dataset has examples (often by a factor of 10, 100, or even 1,000). We can do better than this.