Friday, January 29, 2016

Label Propagation and Semi-Supervised Learning: Gaussian Random Field Method

So, recently I've been reading up on label propagation in semi-supervised learning, which is the setting where you have a great deal of data, but most of it is unlabeled. To put some notation on things, let's say we have a set $L$:

\begin{eqnarray}
L &:& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^{\vert L \vert}
\end{eqnarray}

which is a set of pairs of input vectors $x$ and output labels $y$, be they scalar or categorical. And then we have a huge unlabeled set:

\begin{eqnarray}
U &:& \lbrace \mathbf{x}_n, \_\_  \rbrace_{n=1}^{\vert U \vert}
\end{eqnarray}

on which we would like to infer labels. Normally, this use case is motivated when the unlabeled set is much, much larger: $\vert L \vert \ll \vert U \vert$. If we are talking about classification, one way to view this problem is through clustering. If we assume that close vectors $\mathbf{x}$, under some metric, have close labels $y$, that might motivate a loss function of the form:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2
\end{eqnarray}

Here we're summing over all pairs of data points $i,j$ and weighting their difference in label with the matrix $W_{ij}$. For sanity's sake, $W_{ij}$ should be large when $x_i,x_j$ are close, so $W_{ij}$ should go like one over the distance between $i$ and $j$. For example:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2\\
W_{ij} &=& e^{-\vert x_i - x_j \vert^2/2\sigma^2}
\end{eqnarray}
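(Just to make this concrete, here's a quick NumPy sketch of this weight matrix and loss on some made-up toy data; the value of `sigma` and the cluster layout are arbitrary placeholders.)

```python
import numpy as np

def rbf_weights(X, sigma=1.0):
    """W_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)), with zero diagonal (j != i)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, y):
    """E = sum_{i, j != i} W_ij (y_i - y_j)^2."""
    return np.sum(W * (y[:, None] - y[None, :]) ** 2)

# toy data: two loose clusters, labeled 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 5.0])
y = np.array([0.0] * 5 + [1.0] * 5)
W = rbf_weights(X, sigma=1.0)
print(energy(W, y))  # small: points with different labels sit far apart
```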

This weighting matrix is simply a function of the Euclidean metric, and actually reminds one of an RBF kernel or covariance function... Is there a connection here?

Absolutely.

If we frame our clustering/labeling problem as trying to minimize this loss function, or energy $E$, it means we can write the likelihood of the labels as a Boltzmann distribution:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-\sum_{i,j \ne i} W_{ij} (y_i - y_j)^2}
\end{eqnarray}

(Where $Z$ is the partition function, summing over all configurations of labels.) This is extremely interesting, because if you do a little matrix algebra on our energy, you find that one can re-write the loss as:

\begin{eqnarray}
E = \sum_{i,j\ne i} W_{ij} (y_i - y_j)^2 &=& 2\sum_{i,j} y_i (D_{ij}-W_{ij}) y_j \\
&=& 2\, y_i L_{ij} y_j \\
D_{ii} &=& \sum_{j} W_{ij} \\
\mathbf{L} &=& \mathbf{D}-\mathbf{W}
\end{eqnarray}
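(A quick numerical sanity check of this identity -- again just a NumPy sketch, on random symmetric weights:)

```python
import numpy as np

# check E = sum_{i,j} W_ij (y_i - y_j)^2 = 2 y^T L y
rng = np.random.default_rng(1)
W = rng.random((6, 6))
W = 0.5 * (W + W.T)            # symmetric weights
np.fill_diagonal(W, 0.0)       # no self-edges
y = rng.random(6)

D = np.diag(W.sum(axis=1))     # D_ii = sum_j W_ij
L = D - W                      # the graph Laplacian
lhs = np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = 2.0 * y @ L @ y
print(np.isclose(lhs, rhs))    # True
```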

The matrix $L_{ij}$ here is the graph Laplacian, a close cousin of the Laplacian operator $\nabla^2$, but we have embedded things in a high-dimensional space because of the exponentiation. Notice that, absorbing the constant prefactor into $W$, our likelihood on the configuration of labels now looks exactly like a Gaussian random field:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-y_i L_{ij} y_j/2}
\end{eqnarray}

such that $\langle y_i \rangle =0$ and $\langle y_i y_j \rangle_c = L^{-1}_{ij}$. This discrete pdf on labels is precisely the same as if we had made everything continuous from the get-go:

\begin{eqnarray}
E[y(x)] &=& \frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime) \\
K(x,x^\prime) &=& e^{-\vert x-x^\prime \vert^2/2\sigma^2} \\
P(y(x)) &=& \frac{1}{Z}e^{-\frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime)}
\end{eqnarray}

which is a Gaussian random field on the labels $y(x)$, with an RBF kernel coupling the labels at different points $x$. If we integrate the Lagrangian by parts, we get the continuous equivalent of $L_{ij}$, which is essentially $\nabla^2$ in some new space.
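(If you want to play with this, here's a sketch that draws one configuration of labels from the discrete field on some hypothetical inputs; since $\mathbf{L}$ has a zero eigenvalue along the constant vector, I add a small ridge before inverting.)

```python
import numpy as np

# draw one configuration of labels from P ~ exp(-y L y / 2)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                          # hypothetical inputs
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
W = np.exp(-sq / 2.0)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# small ridge so the precision matrix is invertible (L has a zero mode)
cov = np.linalg.inv(L + 0.1 * np.eye(len(L)))
y_sample = rng.multivariate_normal(np.zeros(len(L)), cov)
# points that are close in x (large W_ij) come out with correlated labels
```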

So why do we care about all of this? Well, it turns out that the algorithms people use to propagate labels work exactly like a discretized heat equation. For instance, one of the easiest things you can do, given labeled examples $L$, is to propagate or ``flow'' the $y$'s to unlabeled points by the following procedure:

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1} \mathbf{W} y_t
\end{eqnarray}

which is a discretized (forward-Euler) version of the heat equation:

\begin{eqnarray}
\left(\frac{\partial}{\partial t}-\nabla^2 \right) y_t &=& 0 \\
y_{t+1}-y_t &=& \nabla^2 y_t \\
y_{t+1} &=& \left(1+\nabla^2 \right) y_t
\end{eqnarray}

and now note, if we replace $-\nabla^2$ with the normalized graph Laplacian $\mathbf{I}-\mathbf{D}^{-1}\mathbf{W}$ (the graph analogue of $-\nabla^2$), we get

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1}\mathbf{W}  y_t
\end{eqnarray}

This is PRECISELY the update scheme -- in discrete form -- of heat diffusion. Notice that when $\frac{\partial}{\partial t}y_t =0$, we're going to have a harmonic field, which is a fancy way of saying the Laplacian of the labels is zero. (This implies some other nice things: every label is the average of the labels around it, the maximum of the labels $y$ will always occur on the boundary, etc.) Zhu, Ghahramani and Lafferty explain this algorithm best when they say the labels are undergoing diffusion, with Dirichlet constraints -- fixed labels -- on the labeled points $\mathbf{x}_l$.
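(Here's a minimal sketch of that propagate-and-clamp loop in NumPy -- not the authors' code, just the idea on made-up two-cluster data.)

```python
import numpy as np

def propagate_labels(W, y_init, labeled_mask, n_iters=500):
    """Iterate y <- D^{-1} W y, clamping the labeled points (Dirichlet conditions)."""
    d_inv = 1.0 / W.sum(axis=1)
    y = y_init.astype(float).copy()
    y_clamped = y[labeled_mask].copy()
    for _ in range(n_iters):
        y = d_inv * (W @ y)            # diffusion step: each y_i becomes a weighted average
        y[labeled_mask] = y_clamped    # re-impose the known labels each iteration
    return y

# hypothetical toy data: two clusters, one labeled point in each
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(10, 2)), rng.normal(size=(10, 2)) + 4.0])
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
W = np.exp(-sq / 2.0)
np.fill_diagonal(W, 0.0)

y0 = np.zeros(20)
y0[0], y0[10] = 0.0, 1.0                       # the only two known labels
mask = np.zeros(20, dtype=bool)
mask[[0, 10]] = True
print(np.round(propagate_labels(W, y0, mask), 2))  # unlabeled y's flow toward 0 or 1 by cluster
```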

Note, though, that we are doing things not in Euclidean space but somewhere else, due to our choice of metric. We could have chosen $W_{ij}$ to be whatever we liked, as long as it decreases with distance. Let $g(x_i,x_j)$ be some metric; then, more generally:

\begin{eqnarray}
W_{ij} &=& f\left[g(x_i,x_j) \right]
\end{eqnarray}

and so $g$ defines the space in which we're taking derivatives. Things don't have to be Euclidean!
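(For instance, here's a sketch where the metric $g$ is swapped out for a Manhattan distance -- a purely hypothetical choice, just to show that only the weight construction changes.)

```python
import numpy as np

def weights_from_metric(X, g, f=lambda d: np.exp(-d)):
    """W_ij = f(g(x_i, x_j)) for any metric g and any decreasing f."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = f(g(X[i], X[j]))
    return W

# e.g. a Manhattan (L1) metric instead of the Euclidean one
manhattan = lambda a, b: np.sum(np.abs(a - b))
X = np.random.default_rng(4).normal(size=(10, 3))
W = weights_from_metric(X, manhattan)
```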
