Tuesday, December 29, 2015

EM Algorithm and restricted K-means for the M-Traveling Salesman Problem

One very popular and very useful algorithm that it's taken me a while to get around to is the Expectation-Maximization algorithm, or EM. A lot of people treat this thing as a black box, but because I wanted to implement my own handwritten constraints in K-means and Gaussian mixture models, I had to sit down and read through how it actually works.

Turns out the EM algorithm is all about introducing auxiliary variables into your problem -- auxiliary data, to be exact. Physicists solve things all the time by ``enriching'' a problem and then integrating out, which is exactly what EM does. Let's say we have a sequence of observations $\vec{x}_i$ for $i=1\dots N$, and we would like to estimate their density. Problem is, they seem to form some awful distribution, something that would be impossible to model with a Gaussian or a Poisson or any ``sane'' thing. We could resort to non-parametrics, such as kernel density estimation, but another idea is to say ``Hey, I bet the first data point came from a Gaussian distribution, but one that was centered over THERE. Let's call it Gaussian A. And I bet the second was drawn from the same Gaussian A. But the third, which is really far away, was drawn from an entirely different Gaussian distribution, which we can call B\dots'' etc.

What we're doing here is introducing ``membership'', which of course has a close relation to clustering and space segmentation, but more on that later. We now have our auxiliary variables in the problem:

\begin{eqnarray}
\vec{x}_i, \vec{z}_i
\end{eqnarray}

For each $i$, $\vec{x}_i$ is the observable, living in, let's say, $D$ dimensions, since each data point has $D$ features, while $\vec{z}_i$ is the membership. For the Gaussian mixture case $\vec{z}_i$ lives in $K$ dimensions, where $K$ is the number of Gaussians -- or clusters -- we allow to build the space. Now it's pretty simple to write down the log likelihood, but we have to think about what parameters we are conditioning on. We've got the mean and covariance of EACH Gaussian, let's call them $\mathbf{\mu}_k, \mathbf{\Sigma}_k$, and we've also got the probability of membership $\vec{z}_i$, for every $i$ -- essentially its PDF. Let's call this $\vec{\phi}$, which, note, has $K$ components and is in general a multinomial for the whole data set, a multinoulli for a single draw. (The attentive reader will note that I'm taking notation directly from Andrew Ng and Kevin Murphy, who both have great notes on this.)
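
To make the setup concrete, the generative story for the Gaussian mixture is: draw a membership, then draw the point from that member's Gaussian, so the marginal density of a single point is a weighted sum of Gaussians:

\begin{eqnarray}
z_i &\sim & \mathrm{Multinoulli}(\vec{\phi})\\
\vec{x}_i \vert z_i = k &\sim & \mathcal{N}(\mathbf{\mu}_k, \mathbf{\Sigma}_k)\\
P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_{k=1}^{K} \phi_k \, \mathcal{N}(\vec{x}_i \vert \mathbf{\mu}_k, \mathbf{\Sigma}_k)
\end{eqnarray}

With that in hand, the log likelihood is: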

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \log \left(P(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

Let's take this a step at a time. If we assume that every data point $\vec{x}_i$ is independent, then we can write
\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left(P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

Now, if we want to include our auxiliary variables, we have to include them explicitly inside the log but marginalize over them:

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

And now, we can use a very interesting trick. My first intuition at this point was to expand the joint inside the log like so

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} P(\vec{x}_i \vert z_i, \mathbf{\mu}_k,\mathbf{\Sigma}_k)P(z_i \vert \phi) \right)
\end{eqnarray}

and keep working, but apparently there's a more useful way to do things. If we divide and multiply by some UNKNOWN PDF in $z_i$, call it $Q(z_i)$, we can use Jensen's inequality to write:

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_i \log \left( \sum_{z_i} Q(z_i) \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)}\right)\\
& \ge & \sum_i \sum_{z_i} Q(z_i)  \log \left( \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)}\right)
\end{eqnarray}

We can do this because the log is a concave function, and for any concave function $f$ we have:

\begin{eqnarray}
E\left[f(x) \right] & \le & f(E\left[x \right])
\end{eqnarray}

Our expectation is over $z_i$, and so now we see that our log likelihood is greater than or equal to this somewhat easier expression on the RHS. But it kind of looks familiar... it's (essentially) the negative KL divergence between the two distributions in $z_i$! If we want this lower bound to be as tight as possible -- the gap between it and the true log likelihood as small as possible -- we need to minimize that KL divergence, essentially setting:

\begin{eqnarray}
P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &= & \mathrm{const} Q(z_i) \ \ \forall \ z_i
\end{eqnarray}

So if we sum both sides in $z_i$ we get the marginalized distribution on the left and a constant on the right:

\begin{eqnarray}
\sum_{z_i}P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &= & \sum_{z_i} \mathrm{const} Q(z_i) \\
P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \mathrm{const}
\end{eqnarray}

i.e., the best candidate for $Q(z_i)$ is the conditional membership, based on the data point $\vec{x}_i$ itself!

\begin{eqnarray}
Q(z_i) &=& \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}\\
&=& P( z_i \vert \vec{x}_i, \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)
\end{eqnarray}
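
For the Gaussian mixture case this conditional is just the familiar ``responsibility'' -- the prior weight of cluster $k$ times its likelihood of $\vec{x}_i$, normalized over clusters:

\begin{eqnarray}
Q(z_i = k) &=& \frac{\phi_k \, \mathcal{N}(\vec{x}_i \vert \mathbf{\mu}_k, \mathbf{\Sigma}_k)}{\sum_{j=1}^{K} \phi_j \, \mathcal{N}(\vec{x}_i \vert \mathbf{\mu}_j, \mathbf{\Sigma}_j)}
\end{eqnarray}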

So now, with this $Q$ in hand, we have a tight lower bound on the log likelihood, which we can then maximize in $\vec{\phi}, \mathbf{\mu}_k, \mathbf{\Sigma}_k$:

\begin{eqnarray}
\mathcal{L}(X \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)&\ge & \sum_i \sum_{z_i} P(z_i \vert \vec{x}_i, \phi, \mathbf{\mu}_k ,\mathbf{\Sigma}_k )  \log \left( \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{P(z_i \vert \vec{x}_i, \phi, \mathbf{\mu}_k ,\mathbf{\Sigma}_k )} \right)
\end{eqnarray}
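
There's a cleaner way to see why this $Q$ is the right choice: for any $Q$, the gap between the true log likelihood of a point and the Jensen bound is exactly a KL divergence to the conditional membership,

\begin{eqnarray}
\log P(\vec{x}_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k) &=& \sum_{z_i} Q(z_i) \log \left( \frac{P(\vec{x}_i, z_i \vert \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)}{Q(z_i)}\right) + \mathrm{KL}\left(Q(z_i) \, \Vert \, P(z_i \vert \vec{x}_i, \phi, \mathbf{\mu}_k,\mathbf{\Sigma}_k)\right)
\end{eqnarray}

so the bound touches the log likelihood exactly when $Q$ equals the conditional, which is what the E-step computes.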

The EM algorithm simply consists of the following two steps:

(1) Given $\phi, \mu_k,\Sigma_k$ and the data $X$, calculate the memberships $P(z_i \vert \vec{x}_i, \phi, \mu_k ,\Sigma_k )$. (E-step)
(2) Given the above, maximize the lower bound w/r/t $\phi, \mu_k,\Sigma_k$. (M-step)
(*) repeat


So, we are essentially calculating data point memberships -- the E step -- and then optimizing the (lower bound on the) log likelihood, as we always would -- the M step. The difference here is that we don't set hard memberships in our model. We let there be fractional, or ``soft'', memberships in this likelihood expression.
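
To make the two steps concrete, here's a bare-bones numpy sketch of EM for a Gaussian mixture -- a toy version of the loop above, just to show where the E and M steps live (the function name and defaults are my own, nothing official):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    # X is (N, D); K is the number of Gaussians/clusters
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]       # seed means at random data points
    Sigma = np.array([np.eye(D) for _ in range(K)])    # start with identity covariances
    phi = np.full(K, 1.0 / K)                          # uniform membership probabilities

    for _ in range(n_iter):
        # E-step: soft memberships Q(z_i = k) = P(z_i = k | x_i, phi, mu_k, Sigma_k)
        resp = np.zeros((N, K))
        for k in range(K):
            resp[:, k] = phi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximize the lower bound w/r/t phi, mu_k, Sigma_k
        Nk = resp.sum(axis=0)                          # effective number of points per cluster
        phi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)

    return phi, mu, Sigma, resp

(In real life you'd monitor the log likelihood for convergence instead of running a fixed number of iterations, but this shows the skeleton.)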

In algorithms like K-means, which actually works on the same EM principle with some drastic assumptions, we have hard memberships, and identical, diagonal covariance matrices across all $K$ Gaussians -- or clusters. These assumptions make things super fast and scalable, but if the data has any spatial skewness or inhomogeneity, we could be in trouble. [Differing sizes of clusters are a problem too, but I haven't come to understand this yet.]
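
For comparison, the K-means version of the same loop collapses the soft memberships down to a hard argmin over distances -- again just a sketch:

import numpy as np

def kmeans_step(X, mu):
    # X is (N, D); mu is (K, D) current centroids
    # E-step (hard): each point belongs entirely to its nearest centroid
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    z = d2.argmin(axis=1)
    # M-step: each centroid moves to the mean of the points assigned to it
    new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(mu.shape[0])])
    return z, new_mu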

The beautiful thing about EM, and I don't have time to get into it now, is that it's guaranteed to monotonically increase the log likelihood iteration by iteration, and this is simply due to the Jensen/concavity properties we talked about earlier. It can get stuck in local maxima, but as long as things are smartly seeded we can expect to find a good fit.

----------------------------------------------------------------------

The reason I was using all this stuff was to segment deliveries in a traveling salesman problem. It's actually called the M-traveling salesman problem, where multiple men try to visit many different locations with the least amount of effort -- however that is defined. There are some restrictions on how much the men can carry, and perhaps how far they can go, and these constitute the restrictions in my K-means/GMM models.
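
Here's a rough sketch of the kind of restricted assignment step I mean -- a toy version with made-up names (weights, capacity), not the exact code I used: each delivery has a load, each man has a capacity, and a point gets its nearest centroid only if that cluster still has room.

import numpy as np

def constrained_assign(X, mu, weights, capacity):
    # X: (N, D) delivery locations; mu: (K, D) cluster centers
    # weights: (N,) load of each delivery; capacity: max total load per man
    N, K = X.shape[0], mu.shape[0]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    z = np.full(N, -1)          # -1 means "didn't fit anywhere" in this toy version
    load = np.zeros(K)
    # handle the points with the most to lose first (biggest gap between best and 2nd-best center)
    gap = np.partition(d2, 1, axis=1)[:, 1] - d2.min(axis=1)
    for i in np.argsort(-gap):
        for k in np.argsort(d2[i]):                            # nearest feasible center wins
            if load[k] + weights[i] <= capacity:
                z[i] = k
                load[k] += weights[i]
                break
    return z

The centroid update stays the same as vanilla K-means; only the membership step knows about the capacity. One way to carry the same idea over to the soft GMM version is to zero out responsibilities for clusters that are already full before renormalizing.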

Oh, and by the way, those men might be Santa making multiple trips to deliver gifts at Christmas.