Monday, July 27, 2015

Multi-Class Logistic Regression as Maximum (conditional) Entropy

Recently, I've been thinking about Logistic regression -- i.e. Machine Learning classification and the activation of neurons in neural networks -- and how it can be framed in terms of Statistical Mechanics. If you look at the sigmoid function, you'll notice something a bit suspect -- or at least I did: it looks peculiarly like a Boltzmann distribution, or some properly normalized member of the exponential family:

\begin{eqnarray}
\mathrm{sigm}(x)&=& \frac{1}{1+e^{-x}} \\
\mathrm{sigm}(\beta x)&=& \frac{1}{1+e^{-\beta x}}
\end{eqnarray}

If we scale the argument by $\beta$, this looks a lot like a Boltzmann factor. And, if we use the standard framework for logistic regression, where our exponential argument is the dot product between a vector of parameters $\vec{\theta}$ and a vector of input data, $\vec{x}$, we have:

\begin{eqnarray}
P(y=1 \vert \vec{x},\vec{\theta}) &=&  \frac{1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}

That is, the probability that our feature vector $\vec{x}$ belongs to class $y=1$ is given by the sigmoid function, with a ``linear regression'' as its argument. Could we derive this sigmoid function by different means? Where does it come from?
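Before going further, here is a quick concrete sketch in Python of that last equation -- the parameter vector $\vec{\theta}$ and feature vector $\vec{x}$ below are made-up numbers, just to show that the model is nothing more than a dot product pushed through a sigmoid:

import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and features, purely for illustration
theta = np.array([0.5, -1.2, 2.0])   # made-up parameter vector
x = np.array([1.0, 0.3, 0.7])        # made-up feature vector

p_y1 = sigmoid(np.dot(theta, x))     # P(y=1 | x, theta)
print(p_y1)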

Recall that the Boltzmann distribution can be derived from maximum entropy. Shannon showed that the entropy of a probability distribution is given by the following, strange-looking quantity:

\begin{eqnarray}
S[p] &=& \sum_i -p(x_i) \log \left(p(x_i)\right)
\end{eqnarray}

or, in the continuous PDF case:

\begin{eqnarray}
S[p] &=& \int -p(x) \log \left( p(x) \right) dx
\end{eqnarray}

another way to write this is

\begin{eqnarray}
S &=& \langle -\log\left(p(x) \right) \rangle
\end{eqnarray}
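In code, this entropy is essentially a one-liner. Here is a small Python sketch (the distributions passed in are arbitrary examples):

import numpy as np

def entropy(p):
    """Shannon entropy, -sum p log p, of a discrete distribution (in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # take 0 log 0 = 0
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 points: log(4) ~ 1.386
print(entropy([0.70, 0.10, 0.10, 0.10]))  # more peaked, lower entropy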

Maximizing this entropy, subject only to normalization, involves setting the gradient of the following function to zero:

\begin{eqnarray}
 G&=& S[p] + \lambda \left( \sum_i p(x_i) -1 \right)\\
 \frac{\partial G}{\partial p(x_i)} &=& 0 \ \forall i
\end{eqnarray}
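Writing out this stationarity condition explicitly (just filling in the algebra):

\begin{eqnarray}
\frac{\partial G}{\partial p(x_i)} &=& -\log\left(p(x_i)\right) - 1 + \lambda \ = \ 0 \\
\Rightarrow p(x_i) &=& e^{\lambda - 1} \ = \ \mathrm{constant}
\end{eqnarray}

so every $p(x_i)$ takes the same value, and the normalization constraint fixes what that value is.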

If we have a discrete distribution over $N$ total points $\lbrace x_i \rbrace_{i=1}^N$, we get the \textbf{uniform distribution}.

\begin{eqnarray}
p(x_i) &=& \frac{1}{N}
\end{eqnarray}

But, once we start specifying certain properties of this distribution -- say, its moments -- we change things substantially. Constraining the mean gives us the exponential distribution:

\begin{eqnarray}
G[p]&=& S[p] + \lambda \left( \int p(x)\, dx -1 \right)+ \beta \left( \int p(x)\, x \, dx - \tau \right)\\
 \frac{\delta G}{\delta p(x)} &=& 0 \ \forall x \\
 \Rightarrow p(x) &=& \frac{1}{\tau} e^{-x/\tau}
\end{eqnarray}

Where now, instead of a gradient, we've taken a \textbf{functional derivative} of the functional $G$. Our result is the exponential distribution (on $x \ge 0$), which is the \textbf{maximally entropic} probability distribution with its first moment constrained to $\tau$. (You may not be surprised to learn that the Gaussian is the maximally entropic PDF with its first and second moments constrained!)
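Here is a quick numerical sanity check of that claim -- a Python sketch using scipy, where the competitors are two arbitrary distributions on $x \ge 0$ cooked up to have the same mean $\tau$:

import numpy as np
from scipy import stats

tau = 2.0   # the constrained mean

# Three distributions on x >= 0, all with mean tau
expon = stats.expon(scale=tau)                            # exponential, mean tau
unif  = stats.uniform(loc=0.0, scale=2.0 * tau)           # uniform on [0, 2*tau], mean tau
halfn = stats.halfnorm(scale=tau * np.sqrt(np.pi / 2.0))  # half-normal with mean tau

# Differential entropies: the exponential should come out on top
print("exponential:", expon.entropy())   # 1 + log(tau) ~ 1.693
print("uniform:    ", unif.entropy())    # log(2 tau)   ~ 1.386
print("half-normal:", halfn.entropy())   #              ~ 1.645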

So maximizing entropy subject to certain constraints places heavy, deterministic restrictions on the form of our PDFs. We are essentially designing things to be as ``uninformative'' or least biased as possible. As Jaynes once said, `our thinking Robot is non-ideological: it takes in only the information it is given and nothing more'.

When we do classification problems, we are really maximizing the entropy of a probability distribution on class $y$, not on the feature vector $\vec{x}$, and this conditional entropy has a specific form:

\begin{eqnarray}
S[p(y \vert \vec{x})] &=& -\sum_{\vec{x},y} p(\vec{x})\, p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right)
\end{eqnarray}

Ok. So now what are our constraints? We would like to define some function of the feature vector $\vec{x}$ that is useful, and require our probability distribution on class $y$, $p(y \vert \vec{x})$, to exhibit the \textbf{same ensemble averages}:

\begin{eqnarray}
\sum_{\vec{x},y} p(\vec{x}) p(y \vert \vec{x}) f_y(\vec{x}) &=& \langle f_y(\vec{x}) \rangle
\end{eqnarray}

Where on the left we are summing only over the $\vec{x}$ that appear in the training set (this is the role of $p(\vec{x})$). I have seen this equation called a ``balance'' equation, but really it is just a constraint on the ensemble average -- over class $y$ and feature vector $\vec{x}$ -- of some function $f_y(\vec{x})$: just like how we constrain the energy with the Boltzmann distribution:

\begin{eqnarray}
\sum_{\vec{x}} P(\vec{x}) \mathcal{H}(\vec{x}) &=& \langle E \rangle
\end{eqnarray}
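To make the left-hand side of that constraint concrete, here is a Python sketch computing such an ensemble average over a made-up toy training set, taking $f_y(\vec{x})$ to be the feature vector itself whenever the label matches -- which is, in fact, the choice we will make below:

import numpy as np

# Made-up toy training set: 5 samples, 2 features, binary labels
X = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.1],
              [1.2, 0.3],
              [0.7, 2.2]])
y = np.array([1, 1, 0, 0, 1])

# Empirical ensemble average of f_c(x) = x * 1[label == c], with p(x) = 1/N over the training set
targets = {c: (X * (y == c)[:, None]).mean(axis=0) for c in (0, 1)}
print(targets)   # these <f_y(x)> values are what the maxent constraints must reproduce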

Putting things into our big expression for entropy, we have:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x})\, p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \beta \left(\sum_{\vec{x},y} p(\vec{x})\, p(y \vert \vec{x}) f_y(\vec{x}) - \langle f_y(\vec{x}) \rangle \right)
\end{eqnarray*}

So $\lambda$ is the Lagrange multiplier that asserts proper normalization on the PDF over class -- not over the feature vector $\vec{x}$ -- and $\beta$ is the Lagrange multiplier that asserts the proper expectation value of the function $f_y(\vec{x})$ over both class and the training set. We could extend this and constrain on \textbf{multiple} feature functions, with multiple Lagrange multipliers, just by making both vector-valued:

\begin{eqnarray*}
G[p] &=& -\sum_{\vec{x},y} p(\vec{x})\, p(y \vert \vec{x})\log \left( p(y \vert \vec{x}) \right) + \lambda \left( \sum_y p(y \vert \vec{x}) -1 \right)+ \vec{\beta} \cdot \left(\sum_{\vec{x},y} p(\vec{x})\, p(y \vert \vec{x}) \vec{f}_y(\vec{x}) - \langle \vec{f}_y(\vec{x}) \rangle \right)
\end{eqnarray*}

Setting the gradient -- or in the continuous case, functional derivative -- of this expression to zero gives:

\begin{eqnarray}
P(y \vert \vec{x}) &=& Z^{-1} e^{\vec{\beta} \cdot \vec{f}_y(\vec{x})}\\
Z &=& \sum_{y^\prime} e^{\vec{\beta} \cdot \vec{f}_{y^\prime}(\vec{x})}
\end{eqnarray}
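For completeness, here is a sketch of the stationarity condition that leads there -- differentiating $G$ with respect to $p(y \vert \vec{x})$ for a particular $\vec{x}$ and $y$, and absorbing the $y$-independent pieces into the normalization $Z$:

\begin{eqnarray}
\frac{\partial G}{\partial p(y \vert \vec{x})} &=& -p(\vec{x})\left[ \log\left(p(y \vert \vec{x})\right) + 1 \right] + \lambda + p(\vec{x})\, \vec{\beta} \cdot \vec{f}_y(\vec{x}) \ = \ 0 \\
\Rightarrow \log\left(p(y \vert \vec{x})\right) &=& \vec{\beta} \cdot \vec{f}_y(\vec{x}) + \frac{\lambda}{p(\vec{x})} - 1 \\
\Rightarrow p(y \vert \vec{x}) &\propto& e^{\vec{\beta} \cdot \vec{f}_y(\vec{x})}
\end{eqnarray}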

This expression already has multi-class classification built in, but let's see what it looks like for simpler logistic regression, when $y=0,1$. What is a good choice for the feature function $\vec{f_y}(\vec{x})$ to get our sigmoid function back? Since logistic regression looks a lot like a ``sigmoid-wrapped'' linear regression in textbooks, it might be a good idea to define:

\begin{eqnarray}
\vec{f}_{y^\prime}(\vec{x}) &=& \vec{x} \, \delta_{y^\prime y}
\end{eqnarray}

That is, the feature function for class $y^\prime$ returns $\vec{x}$ when $y^\prime$ equals the label $y$ and is zero otherwise, so that $\vec{\beta} \cdot \vec{f}_y(\vec{x}) = \vec{\beta}_y \cdot \vec{x}$, with a separate block of multipliers $\vec{\beta}_y$ for each class. Then we have:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ e^{\vec{\beta}_1 \cdot \vec{x}}}{e^{\vec{\beta}_0 \cdot \vec{x}}+e^{\vec{\beta}_1 \cdot \vec{x}}}
\end{eqnarray}

Simplifying things we get:

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-(\vec{\beta}_1-\vec{\beta}_0) \cdot \vec{x}}}
\end{eqnarray}

The expression above is precisely logistic regression, with $\vec{\beta}_0$ the ``anti'' linear coefficients (for class $y=0$) and $\vec{\beta}_1$ the non-anti linear coefficients (for class $y=1$). We combine them both into the usual $\vec{\theta} = \vec{\beta}_1 - \vec{\beta}_0$ and write

\begin{eqnarray}
P(y=1 \vert \vec{x}) &=& \frac{ 1}{1+e^{-\vec{\theta} \cdot \vec{x}}}
\end{eqnarray}
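As a quick numerical check that the two-class ``softmax'' really does collapse to a sigmoid, here is a Python sketch with made-up weights and features:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up per-class weights and a feature vector
beta0 = np.array([ 0.2, -0.5, 1.0])
beta1 = np.array([-0.3,  0.8, 0.4])
x     = np.array([ 1.0,  2.0, 0.5])

# Two-class softmax form...
p_softmax = np.exp(np.dot(beta1, x)) / (np.exp(np.dot(beta0, x)) + np.exp(np.dot(beta1, x)))
# ...equals a sigmoid of the difference theta = beta1 - beta0
p_sigmoid = sigmoid(np.dot(beta1 - beta0, x))

print(p_softmax, p_sigmoid)   # identical up to floating point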

So what does this mean? It means that each component of $\vec{\beta}_1$ is constraining the ensemble average of the corresponding component of the feature vector $\vec{x}$, given that the class is $y=1$; and conversely for $\vec{\beta}_0$, we are constraining the components of the feature vector when the class is $y=0$. These feature functions can be called marginal counts, or marginal probabilities on $\vec{x}$.

Now in the last post I talked about how the vector $\theta$ gives the direction of \textbf{fastest increase} for the logistic function, or the direction of ``fastest decision'' -- both for yes and no -- from the decision boundary. We can frame this in an even more physical way using Lagrange multipliers, because what $\vec{\beta}$ really represents in feature space is a \textbf{constraint force}, as seen in Lagrangian mechanics. Just like when we are minimizing the action -- a functional -- subject to some constraint:

\begin{eqnarray}
S &=& \int \left[ \mathcal{L}(q,\dot{q})+ \lambda \left(q_x^2+q_y^2 - R^2\right)  \right] dt
\end{eqnarray}

$\lambda$ above gives the ``tension'' force $\lambda=T$ for a massive particle constrained to move on a circle of radius $R$. We are using the same tactic with $\vec{\beta}$: it ``pushes'' our PDF into the proper shape. Stronger components -- i.e. larger scalar values within $\vec{\beta}$ -- correspond to stronger entropic forces on the PDF, and a stronger association between that feature and the class.

It is easy to extend the expressions above to multiple classes $y=0,1,2,\dots$, and it is important to note that the activations of most neurons in neural networks are sigmoid functions of a dot product, just like logistic regression: it's all maximum entropy learning. E.T. Jaynes would be very proud.
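And for the record, the multi-class version is just a softmax over per-class weight vectors. A final Python sketch, again with made-up numbers:

import numpy as np

def softmax_probs(B, x):
    """P(y = c | x) = exp(B[c] . x) / sum over c' of exp(B[c'] . x)."""
    logits = B.dot(x)
    logits = logits - logits.max()   # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Made-up weights for 3 classes over 4 features, and one feature vector
B = np.array([[ 0.1, -0.2,  0.3,  0.0],
              [ 0.5,  0.1, -0.4,  0.2],
              [-0.3,  0.6,  0.1, -0.1]])
x = np.array([1.0, 0.5, -1.0, 2.0])

print(softmax_probs(B, x))   # probabilities over the 3 classes, summing to 1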
