Saturday, July 11, 2015

Basic Logistic Regression and $\vec{\theta}$ as the gradient of the Decision Boundary

A few months ago I was asked about a classification problem in a phone interview, and, not knowing much about machine learning or basic regression at that time, I didn't realize what was really being asked. I fumbled about in the dark only to suggest basic use of histograms and correlations, when the answer was really quite simple: Logistic regression.

Linear regression takes the following form: you have some input vector $\vec{x}$, which lives in a ``feature space,'' and you'd like to map that $\vec{x}$ onto a corresponding scalar value $y$. If you have a set, or a ``census'' of sorts, of such pairs $\lbrace \vec{x}_i, y_i \rbrace_{i=1}^N$, you'd like to learn from this data and be able to derive an analytical relationship between $\vec{x}$ and $y$.

It turns out this problem is analytically solvable with some nasty linear algebra -- I did it once years ago, following David Hogg and Dustin Lang's paper on Bayesian inference -- if you model the errors, or deviations from your model, as Gaussian. Then the cost function is essentially a $\chi^2$ statistic, and maximizing the likelihood gives you something called the normal equation:

\begin{eqnarray}
y_i &=& \vec{\theta} \cdot \vec{x}_i \\
- \log \left(\mathcal{L}(\vec{x},y \vert \vec{\theta})\right) &\propto& \sum_{i=1}^N
\left(\vec{\theta}\cdot \vec{x}_i-y_i\right)^2\\
\vec{\theta} &=& \left(X^T X\right)^{-1} X^T \vec{y}
\end{eqnarray}

Where $X$ is something called the data matrix, whose rows are just the input vectors $\vec{x}_i$. The point is that for linear regression we get a nice hyperplane in feature space that maps our inputs $\vec{x}$ onto scalar values $y$; a minimal numerical check of the normal equation is sketched just below.
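Here is that check: a short numpy sketch of the normal equation on made-up data (the dimensions, the true $\vec{\theta}$, and the noise level are all invented for illustration).

```python
import numpy as np

# A made-up linear problem: y_i = theta_true . x_i + Gaussian noise.
rng = np.random.default_rng(0)
N, D = 200, 3
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))                  # data matrix: one row per x_i
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Normal equation: theta = (X^T X)^{-1} X^T y,
# solved as a linear system rather than by forming the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                             # should land close to theta_true
```

Logistic regression works the same way, but now $y$ can only take on two values, $0$ or $1$, and we wrap our linear model in a sigmoid function: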

\begin{eqnarray}
y_i &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}_i}}
\end{eqnarray}

And what this model really represents is the probability of our output $y$ being ``yes'' (1) rather than ``no'' (0):

\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}
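In code, the hypothesis is nothing more than the sigmoid of $\vec{\theta} \cdot \vec{x}$. A minimal numpy sketch with made-up numbers, which also previews the decision rule used below ($P > \frac{1}{2}$ exactly when $\vec{\theta} \cdot \vec{x} > 0$):

```python
import numpy as np

def p_yes(theta, x):
    """P(y = 1 | theta, x): the sigmoid of theta . x."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

theta = np.array([1.5, -2.0])   # made-up coefficients
x = np.array([0.8, 0.3])        # one made-up feature vector

p = p_yes(theta, x)
# P exceeds 1/2 precisely when theta . x is positive.
print(p, p > 0.5, np.dot(theta, x) > 0)
```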
Now, the likelihood can no longer be maximized analytically, at least by any method I know of; it has to be maximized numerically, with something like stochastic gradient descent. But once you do, you will find that the same $\vec{\theta}$ that defined a hyperplane in the linear model is now the normal vector to the decision boundary in $\vec{x}$ space. By the properties of the sigmoid function, $P(y=1)$ is greater than $\frac{1}{2}$ exactly when $\vec{\theta} \cdot \vec{x}$ is greater than zero, so we classify $y$ with a ``yes'', or a 1, when $\vec{\theta}\cdot \vec{x} > 0$, and with a ``no'', or a 0, when $\vec{\theta}\cdot \vec{x} < 0$. Let's look at when we are right on the ``cusp'' of a decision:

\begin{eqnarray}
 \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}& \approx& \frac{1}{2}\\
e^{-\vec{\theta} \cdot \vec{x}} & \approx & 1\\
\vec{\theta} \cdot \vec{x} & \approx & 0
\end{eqnarray}

So we can Taylor expand our hypothesis $P(y=1 \vert \vec{\theta}, \vec{x})$ about $\vec{\theta} \cdot \vec{x} = 0$ and get

\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=&\frac{1}{2}+\frac{\vec{\theta} \cdot \vec{x}}{4}+O\left(\left(\vec{\theta} \cdot \vec{x}\right)^3\right)
\end{eqnarray}
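The factor of one quarter comes from the slope of the sigmoid at zero: writing $z = \vec{\theta} \cdot \vec{x}$ for short,

\begin{eqnarray}
\sigma(z) &=& \frac{1}{1+e^{-z}}, \qquad \sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)\\
\sigma(0) &=& \frac{1}{2}, \qquad \sigma'(0) = \frac{1}{4}, \qquad \sigma''(0) = 0\\
\sigma(z) &=& \frac{1}{2} + \frac{z}{4} + O(z^3)
\end{eqnarray}

and since $\sigma''(0) = 0$, the first neglected term is cubic.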

If we take the gradient of this function with respect to the vector $\vec{x}$, we see that the normal vector to our decision boundary in feature space points precisely along $\vec{\theta}$. Mathematically, the gradient is the direction of fastest increase, and so relatively dominant components of our $\vec{\theta}$ vector correspond to relatively dominant features in the classification:

\begin{eqnarray}
\frac{\partial P(y=1 \vert \vec{\theta}, \vec{x})}{\partial \vec{x}} &=&\frac{\vec{\theta}}{4}+O\left(\left(\vec{\theta} \cdot \vec{x}\right)^2\right)
\end{eqnarray}

So, regression coefficients, at least near the decision boundary, play the same intuitive role they do in linear regression: if a feature carries a large coefficient, it plays a prominent part in the classification.
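A quick numerical check of that claim, with scikit-learn doing the numerical likelihood maximization mentioned earlier. The synthetic data, the separating direction w_true, and the noise level are all invented for illustration; the point is only that the fitted $\vec{\theta}$, once normalized, should line up with the normal vector of the plane that generated the labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two-class data labeled (with a little noise) by the sign of w_true . x,
# so the true decision boundary is the plane w_true . x = 0.
rng = np.random.default_rng(1)
w_true = np.array([3.0, -1.0, 0.2])
X = rng.normal(size=(500, 3))
y = (X @ w_true + 0.3 * rng.normal(size=500) > 0).astype(int)

# Weak regularization (large C) so the fit is close to pure maximum likelihood.
clf = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
theta = clf.coef_.ravel()

# Compare directions: the overall scale of theta is not fixed by the boundary,
# but its direction should match w_true, and its largest component flags
# the dominant feature.
print(theta / np.linalg.norm(theta))
print(w_true / np.linalg.norm(w_true))
```

Up to noise and regularization, the two printed unit vectors should nearly coincide.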

------------------------------------------------------------------------

Of course, adding higher-order terms to the argument of our sigmoid function, say:

\begin{eqnarray}
 \frac{1}{1+e^{-\left(\sum_j \theta_{2j} x_j + \sum_j \theta_{2j+1} x_j^{2}\right)}}
\end{eqnarray}

allows for a curved decision boundary in feature space $\vec{x}$ -- not just a hyperplane. 
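A minimal scikit-learn sketch of that idea, on a made-up dataset whose classes are separated by a circle rather than a plane (the data, the degree, and the iteration cap are all invented for illustration): the plain linear features can do no better than the majority class, while the squared features let $\vec{\theta} \cdot \vec{x}$ trace out the circle.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Points in the plane, labeled 1 inside the unit circle and 0 outside.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

# A hyperplane boundary cannot separate a disk from its surroundings.
linear = LogisticRegression(max_iter=1000).fit(X, y)

# Augment the features with degree-2 terms (x1^2, x1*x2, x2^2); the model
# is still linear in theta, but the boundary in the original plane is curved.
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quadratic = LogisticRegression(max_iter=1000).fit(X2, y)

print(linear.score(X, y))      # roughly the majority-class rate, ~0.8
print(quadratic.score(X2, y))  # close to 1
```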
