Monday, July 27, 2015

Multi-Class Logistic Regression as Maximum (conditional) Entropy

Recently, I've been thinking about logistic regression -- i.e. machine learning classification and the activation of neurons in neural networks -- and how it can be framed in terms of statistical mechanics. If you look at the sigmoid function, you'll notice something a bit suspect -- or at least I did: it looks peculiarly like a Boltzmann distribution, or some properly normalized member of the exponential family:

\begin{eqnarray}
\mathrm{sigm}(x)&=& \frac{1}{1+e^{-x}} \\
\mathrm{sigm}(\beta x)&=& \frac{1}{1+e^{-\beta x}}
\end{eqnarray}

If we scale the argument by $\beta$, this looks a lot like a Boltzmann factor. And, if we use the standard framework for logistic regression, where our exponential argument is the dot product between a vector of parameters $\vec{\theta}$ and a vector of input data, $\vec{x}$, we have:

\begin{eqnarray}
P(y=1 \vert \vec{x},\vec{\theta}) &=&  \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}

That is, the probability that our feature vector $\vec{x}$ belongs to class $y=1$ is given by the sigmoid function, with a ``linear regression'' as its argument. Could we derive this sigmoid function by different means? Where does it come from?
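For concreteness, here is that model in a few lines of numpy -- a minimal sketch, with made-up weights and features:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical parameters and a single feature vector
theta = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 0.3, -0.7])  # the first component could be a bias feature

print(sigmoid(theta @ x))  # P(y=1 | x, theta)
```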

Recall that the Boltzmann distribution can be derived from maximum entropy. Shannon showed that the entropy of a probability distribution is given by the following strange quantity:

\begin{eqnarray}
S[p] &=& \sum_i -p(x_i) \log \left(p(x_i)\right)
\end{eqnarray}

or, in the continuous PDF case:

\begin{eqnarray}
S[p] &=& \int -p(x) \log \left( p(x) \right) dx
\end{eqnarray}

Another way to write this is as an expectation value:

\begin{eqnarray}
S &=& \langle -\log\left(p(x) \right) \rangle
\end{eqnarray}
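In code, this entropy is a one-liner (natural logs, to match the formulas above):

```python
import numpy as np

def shannon_entropy(p):
    """S[p] = -sum_i p_i log(p_i), skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4): maximal for 4 outcomes
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # much smaller
```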

Maximizing this entropy, subject only to normalization, involves setting the gradient of the following function to zero:

\begin{eqnarray}
 G&=& S[p] + \lambda \left( \sum_i p(x_i) -1 \right)\\
 \frac{\partial G}{\partial p(x_i)} &=& 0 \ \forall i
\end{eqnarray}
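Carrying the derivative through makes the result immediate:

\begin{eqnarray}
\frac{\partial G}{\partial p(x_i)} &=& -\log \left( p(x_i) \right) - 1 + \lambda \ = \ 0 \\
\Rightarrow p(x_i) &=& e^{\lambda - 1}
\end{eqnarray}

which is the same constant for every $x_i$.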

If we have a discrete distribution over $N$ total points $\lbrace x_i \rbrace_{i=1}^N$, we get the \textbf{uniform distribution}.

\begin{eqnarray}
p(x_i) &=& \frac{1}{N}
\end{eqnarray}

But, once we start specifying certain properties of this distribution -- like, say, its moments -- we will change things substantially. Constraining the mean gives us the exponential distribution:

\begin{eqnarray}
G[p]&=& S[p] + \lambda \left( \int p(x)\, dx -1 \right)+ \beta \left( \int p(x)\, x \, dx - \tau \right)\\
 \frac{\delta G}{\delta p(x)} &=& 0 \ \forall x \\
 \Rightarrow p(x) &=& \frac{1}{\tau} e^{-x/\tau}
\end{eqnarray}

Where now, instead of a gradient, we've taken a \textbf{functional derivative} of the functional $G$. Our result is the exponential distribution, which is the \textbf{maximally entropic} probability distribution with its first moment constrained. (You may not be surprised to learn that the Gaussian is the maximally entropic PDF with its first and second moments constrained!)
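This is easy to sanity-check numerically. Here is a sketch using scipy, maximizing the discrete entropy on a truncated grid (the grid, cutoff, and $\tau$ below are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0.0, 10.0, 80)  # discretized, truncated stand-in for [0, infinity)
tau = 2.0                       # the constrained first moment

def neg_entropy(p):
    q = np.clip(p, 1e-12, None)  # guard against log(0)
    return np.sum(q * np.log(q))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},  # normalization
    {"type": "eq", "fun": lambda p: p @ x - tau},    # fixed first moment
]
p0 = np.full(x.size, 1.0 / x.size)  # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0.0, 1.0)] * x.size, constraints=constraints)

# log(p) should come out very nearly linear in x, i.e. p is exponential
slope = np.polyfit(x, np.log(np.clip(res.x, 1e-12, None)), 1)[0]
print(slope)  # roughly -1/tau, up to truncation of the domain
```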

So maximizing entropy subject to certain constraints places heavy, deterministic restrictions on the form of our PDFs. We are essentially designing things to be as ``uninformative'' or least biased as possible. As Jaynes once said, `our thinking Robot is non-ideological: it takes in only the information it is given and nothing more'.

When we do classification problems, we are really maximizing the entropy of a probability distribution on class $y$, not on the feature vector $\vec{x}$, and this conditional entropy has a specific form:

\begin{eqnarray}
S[p(y \vert \vec{x})] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right)
\end{eqnarray}
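In code, given a joint table $p(\vec{x}, y)$ over the training set, this conditional entropy reads as follows (the toy joint distribution below is made up):

```python
import numpy as np

# hypothetical joint probabilities p(x, y): rows index x, columns index y
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.20, 0.10]])

p_x = p_xy.sum(axis=1, keepdims=True)  # p(x)
p_y_given_x = p_xy / p_x               # p(y | x)

# S = -sum_{x,y} p(x) p(y|x) log p(y|x)
S = -np.sum(p_x * p_y_given_x * np.log(p_y_given_x))
print(S)
```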

Ok. So now what are our constraints? We would like to define some function of the feature vector $\vec{x}$ that is useful, and require our probability density on class $y$, $p(y \vert \vec{x})$, to exhibit the \textbf{same ensemble averages}:

\begin{eqnarray}
\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) f_y(\vec{x}) &=& \langle f_y(\vec{x}) \rangle
\end{eqnarray}

Where on the left we are summing over only those $\vec{x}$ that are in the training set. (This is the role of $p(\vec{x})$.) I have seen this equation called a ``balance'' equation, but really it is just a constraint on the ensemble average -- over class $y$ and feature vector $\vec{x}$ -- of some function $f_y(\vec{x})$, just like how we constrain the energy with the Boltzmann distribution:

\begin{eqnarray}
\sum_{\vec{x}} P(\vec{x}) \mathcal{H}(\vec{x}) &=& \langle E \rangle
\end{eqnarray}

Putting things into our big expression for entropy, we have:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \beta \left(\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) f_y(\vec{x}) - \langle f_y(\vec{x}) \rangle \right)
\end{eqnarray*}

So $\lambda$ is the Lagrange multiplier that asserts proper normalization of the PDF over class -- not over the feature vector $\vec{x}$ -- and $\beta$ is the Lagrange multiplier that asserts the proper expectation value of the function $f_y(\vec{x})$ over both class and the training set. We could extend this and constrain on \textbf{multiple} feature functions, with multiple Lagrange multipliers, by just making them both vector valued:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \vec{\beta} \cdot \left(\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) \vec{f}_y(\vec{x}) - \langle \vec{f}_y(\vec{x}) \rangle \right)
\end{eqnarray*}

Setting the gradient -- or in the continuous case, functional derivative -- of this expression to zero gives:

\begin{eqnarray}
P(y \vert \vec{x}) &=& Z^{-1} e^{\vec{\beta} \cdot \vec{f}_y(\vec{x})}\\
Z &=& \sum_{y^\prime} e^{\vec{\beta} \cdot \vec{f}_{y^\prime}(\vec{x})}
\end{eqnarray}
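This solution is just the softmax function over the per-class scores $\vec{\beta} \cdot \vec{f}_y(\vec{x})$. A minimal numerical sketch, with made-up scores:

```python
import numpy as np

def softmax(scores):
    """exp(score_y) / sum over classes, computed stably."""
    z = scores - scores.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical scores beta . f_y(x) for three classes
print(softmax(np.array([1.2, -0.4, 0.3])))  # a proper distribution over classes
```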

This expression already has multi-class classification built in, but let's see what it looks like for ordinary logistic regression, when $y=0,1$. What is a good choice for the feature functions $\vec{f}_y(\vec{x})$ to get our sigmoid function back? Since logistic regression looks a lot like a ``sigmoid-wrapped'' linear regression in textbooks, it might be a good idea to define, for each class $y^\prime$:

\begin{eqnarray}
\vec{f}_{y^\prime}(\vec{x},y) &=& \vec{x} \ \ \mathrm{if} \ \ y=y^\prime, \ \ \mathrm{else} \ \ \vec{0}
\end{eqnarray}

each with its own multiplier vector $\vec{\beta}_{y^\prime}$. Then we have:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ e^{\vec{\beta_1} \cdot \vec{x}}}{e^{\vec{\beta_0} \cdot \vec{x}}+e^{\vec{\beta_1} \cdot \vec{x}}}
\end{eqnarray}

Simplifying things we get:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-(\vec{\beta_1}-\vec{\beta_0}) \cdot \vec{x}}}
\end{eqnarray}

The expression above is precisely logistic regression, with $\vec{\beta}_0$ the linear coefficients for class $y=0$ and $\vec{\beta}_1$ those for class $y=1$. We combine them both into the usual $\vec{\theta}=\vec{\beta}_1-\vec{\beta}_0$ and write

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}
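You can verify this reduction numerically: the two-class softmax and the sigmoid of the coefficient difference agree exactly (the weights and features below are arbitrary):

```python
import numpy as np

beta0 = np.array([0.5, -1.0])
beta1 = np.array([1.5, 0.2])
x = np.array([0.7, -0.3])

scores = np.array([beta0 @ x, beta1 @ x])
p_softmax = np.exp(scores[1]) / np.exp(scores).sum()
p_sigmoid = 1.0 / (1.0 + np.exp(-(beta1 - beta0) @ x))
print(np.isclose(p_softmax, p_sigmoid))  # True
```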

So what does this mean? It means that the $i^\mathrm{th}$ component of $\vec{\beta}_1$ constrains the ensemble average of the $i^\mathrm{th}$ component of the feature vector $\vec{x}$, given that the class is $y=1$; and conversely for $\vec{\beta}_0$, when the class is $y=0$. These feature functions can be called marginal counts, or marginal probabilities on $\vec{x}$.

Now, in the last post I talked about how the vector $\vec{\theta}$ gives the direction of \textbf{fastest increase} for the logistic function, or the direction of ``fastest decision'' -- both for yes and no -- from the decision boundary. We can frame this in an even more physical way using Lagrange multipliers, because what $\vec{\beta}$ really represents in feature space is a \textbf{constraint force}, as seen in Lagrangian mechanics. Just like when we are minimizing the action -- a functional -- subject to some constraint:

\begin{eqnarray}
S &=& \int \left[ \mathcal{L}(q,\dot{q})+ \lambda \left(q_x^2+q_y^2 - R^2\right)  \right] dt
\end{eqnarray}

$\lambda$ above gives the ``tension'' force $\lambda=T$ for a massive particle constrained to move on a circle of radius $R$. We are using the same tactic with $\vec{\beta}$ to ``push'' our PDF into the proper shape. Stronger components -- i.e. larger scalar values within $\vec{\beta}$ -- correspond to stronger entropic forces on the PDF, and stronger correlation.

It is easy to extend the expressions above to multiple classes $y=0,1,2,\dots$, and it is important to note that the activation of nearly every neuron in a neural network uses logistic regression: it's all maximum entropy learning. E.T. Jaynes would be very proud.

Saturday, July 11, 2015

Basic Logistic Regression and $\vec{\theta}$ as the gradient of the Decision Boundary

A few months ago I was asked about a classification problem in a phone interview, and, not knowing much about machine learning or basic regression at that time, I didn't realize what was really being asked. I fumbled about in the dark only to suggest basic use of histograms and correlations, when the answer was really quite simple: Logistic regression.

Linear regression takes the following form: you have some input vector $\vec{x}$, which lives in a ``feature space'', and you'd like to map that $\vec{x}$ onto a corresponding $y$, which has some scalar value. If you have a set, or a ``census'' of sorts, of such pairs $\lbrace \vec{x}_i, y_i \rbrace_{i=1}^N$, you'd like to learn from this data and be able to derive an analytical relationship between $\vec{x}$ and $y$.

It turns out this problem is analytically solvable by some nasty linear algebra -- I did it once years ago, following David Hogg and Dustin Lang's paper on Bayesian inference -- if you model the errors, or deviations from your model, as Gaussian. Then the cost function is essentially a $\chi^2$ statistic, and maximizing the likelihood gives you something called the normal equation:

\begin{eqnarray}
y_i &=& \vec{\theta} \cdot \vec{x}_i \\
- \log \left(\mathcal{L}(\vec{x},y \vert \vec{\theta})\right) &\propto& \sum_{i=1}^N
(\vec{\theta}\cdot \vec{x}_i-y_i)^2\\
\vec{\theta} &=& (X^TX)^{-1}X^T\vec{y}
\end{eqnarray}
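Here is a quick numerical check of the normal equation on synthetic data (everything below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))  # data matrix: rows are the feature vectors
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # linear model plus Gaussian noise

# normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```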

Here $X$ is the data matrix, whose rows are the feature vectors $\vec{x}_i$. The point is that for linear regression we get a nice plane in hyper-space that maps our inputs $\vec{x}$ onto scalar values $y$. And logistic regression works the same way, but now $y$ can only take on two values, $0$ or $1$, and we wrap our linear model in a sigmoid function:

\begin{eqnarray}
y_i &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}_i}}
\end{eqnarray}

And what this model really means is the probability of our output $y$ being ``yes'' or ``no'', 0 or 1:

\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=& \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}

Now, when you maximize the likelihood -- which can no longer be done analytically by any method I know of, but must be done numerically, with something like stochastic gradient descent -- you will find that the same $\vec{\theta}$ that defined the hyperplane in the linear model now describes the gradient of the decision boundary in $\vec{x}$ space. By the properties of the sigmoid function, we classify $y$ with a ``yes'', or a 1, when $\vec{\theta}\cdot \vec{x}$ is greater than zero, and a ``no'', or a 0, when $\vec{\theta}\cdot \vec{x}$ is less than zero. Let's look at when we are right on the ``cusp'' of a decision:

\begin{eqnarray}
 \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}& \approx& \frac{1}{2}\\
e^{-\vec{\theta} \cdot \vec{x}} & \approx & 1
\end{eqnarray}

So we can Taylor expand our hypothesis $P(y=1 \vert \vec{x},\vec{\theta})$ about $\vec{\theta} \cdot \vec{x}=0$ and get (the quadratic term vanishes by the sigmoid's symmetry about $1/2$):

\begin{eqnarray}
P(y=1 \vert \vec{\theta}, \vec{x}) &=&\frac{1}{2}+\frac{\vec{\theta} \cdot \vec{x}}{4}+O\left( (\vec{\theta} \cdot \vec{x})^3 \right)
\end{eqnarray}

If we take the derivative of this function with respect to the vector $\vec{x}$, we see that the normal vector to our decision boundary in feature space points precisely along $\vec{\theta}$. Mathematically, the gradient gives the direction of fastest increase, and so relatively dominant components of our $\vec{\theta}$ vector correspond to relatively dominant features in classification.

\begin{eqnarray}
\frac{\partial P(y=1 \vert \vec{\theta}, \vec{x})}{\partial \vec{x}} &=&\frac{\vec{\theta}}{4}+O\left( (\vec{\theta} \cdot \vec{x})^2 \right)
\end{eqnarray}

So, regression coefficients, at least near the decision boundary, play the same intuitive role. If we have a large coefficient, we know that the associated parameter plays a prominent role in classification.
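A small numerical experiment makes this concrete: fit a logistic regression with plain batch gradient ascent on the log-likelihood (synthetic blobs; the learning rate and iteration count are arbitrary choices), and look at the direction of $\vec{\theta}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# two Gaussian blobs, separated along the direction (1, 1)
X = np.vstack([rng.normal(loc=[-1.0, -1.0], size=(n, 2)),
               rng.normal(loc=[+1.0, +1.0], size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

theta = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # P(y=1 | x, theta)
    theta += 0.1 * X.T @ (y - p) / len(y)  # gradient of the log-likelihood

print(theta / np.linalg.norm(theta))  # roughly (1,1)/sqrt(2): normal to the boundary
```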

------------------------------------------------------------------------

Of course, adding higher-order terms to the argument of our sigmoid function, say:

\begin{eqnarray}
 \frac{1}{1+e^{-\left( \sum_j \theta_{2j} x_j + \sum_j \theta_{2j+1} x_j^{2} \right)}}
\end{eqnarray}

allows for a curved decision boundary in feature space $\vec{x}$ -- not just a hyperplane.
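For instance, a hypothetical quadratic feature expansion done by hand:

```python
import numpy as np

def quadratic_features(x):
    """[1, x_1, x_2, x_1^2, x_2^2]: bias, linear, and squared terms."""
    return np.concatenate([[1.0], x, x**2])

# made-up weights: the decision boundary is the circle x_1^2 + x_2^2 = 1
theta = np.array([1.0, 0.0, 0.0, -1.0, -1.0])
x = np.array([0.5, 0.5])
p = 1.0 / (1.0 + np.exp(-theta @ quadratic_features(x)))
print(p)  # > 1/2, since this point lies inside the circle
```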