Friday, January 29, 2016

Entity Resolution through Markov Random Fields and Logistic Regression

So, along with label propagation, I've also been thinking about entity resolution, which is basically the same thing if you frame the problem correctly.

Let's say you have a set of all labeled data:

\begin{eqnarray}
\lbrace \mathbf{x}_i, \mathbf{y}_i \rbrace_{i=1}^N
\end{eqnarray}

Where $y_i$ can be a class -- zero or one -- as we were talking about earlier, or a unique ID. What we would like to do is compare pair-wise our datapoints and see if the $y_i,y_j$'s are equal. This means that every pair is a probability instance, we 'd like to assign them a ``two-peas-in-a-pod'' probability. One way of doing this is with our similarity matrix, mentioned before:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \mathbf{W}_{ij} = e^{-\alpha_n g_n(x_i,x_j)}
\end{eqnarray}

Where in the exponent we have a linear combination of metrics. They can be Euclidean, Minkowski, cosine, what have you -- each with a weight $\alpha_n$. (This is particularly useful with string comparison, as some metrics are more informative of others). We can also use simple logistic regression:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \sigma\left(-\mathbf{W}_{ij}\right) = \frac{1}{1+e^{\alpha_n g_n(x_i,x_j)}}
\end{eqnarray}

(it turns out that this probability is flipped the wrong way if we kept the negative sign in the exponent, which can be seen by a simple plot). If we want to learn the optimal $\alpha_n$'s we can use gradient descent on some specified objective function. The graph based formulation is motivated by ``Hidden Markov Random Field'' which penalizes different labels between close points -- as specified by $g$.

\begin{eqnarray}
E &=& \sum_{i,j\neq i} W_{ij} (y_i-y_j)^2 = \sum_{i,j \neq i}  e^{-\alpha_n g_n(x_i,x_j)}(y_i-y_j)^2\\
&=& \sum_{i,j \neq i} P(y_i = y_j ) (y_i-y_j)^2\\
E&=& \mathcal{E}\left((y_i-y_j)^2 \right)
\end{eqnarray}

we see that this energy $E$ is just the expectation value of the pairwise label distance, a certain type of empirical Risk! ($E$ can also be treated as the log loss or negative log probability of the configuration $\lbrace y \rbrace$).

Similarly, for logistic regression we just have our log loss. Both objective functions are differentiable with respect to the metric weights $\alpha_n$, so if we want to LEARN what comparators between $x_i,x_j$ are important, we simply use gradient descent on our labeled examples! To extend labels/matches to unlabeled points, we use the pairwise probabilities specified above.

If one implemented this label matching, it would be wise to not compare every data instance -- which would be an $N^2$ blow up -- but only some restricted subset of "likely" pairs. Whether we are comparing for unique ID's in entity resolution or shared classes in label propagation, this algorithm/framework is be the same!

Label Propagation in SSL through a Markov Random Walk

Szummer and Jaakola (2002) used the exact same frame work written in the last post label-propagation-and-semi-supervised.html to propagate labels outwards from a training set via some distance measure. But, they used a Markov random walk, with transition matrix:

\begin{eqnarray}
p_{ij} &=& \frac{W_{ij}}{\sum_{ik} W_{ik}} \\
&=& \mathbf{D}^{-1}\mathbf{W}
\end{eqnarray}

Notice, this is exactly the same as our transition matrix from before. The best way to view $p_{ij}$ is a conditional probability that a particle lives at position $i$, given that it was at position $j$ the moment before:

\begin{eqnarray}
p_{ij} &=& P(x_{t+1}=\mathbf{x}_i \vert x_{t}=\mathbf{x}_j)
\end{eqnarray}

Now, this is only a single step. By markov property we can extend to any number of steps in time $t$:

\begin{eqnarray}
p^{(2)}_{ij} &=& P(x_{t+2}=\mathbf{x}_i \vert x_{t}=\mathbf{x}_j)\\
&=& \sum_{x_{t+1}} P(x_{t+2}=\mathbf{x}_i \vert x_{t+1}=\mathbf{x}_j)P(x_{t+1}=\mathbf{x}_i \vert x_{t}=\mathbf{x}_j)
\end{eqnarray}

or, in matrix form:

\begin{eqnarray}
p_{ij}^{(2)} &=& \sum_k p_{ik} p_{kj} = \mathbf{p}^2 \\
p_{ij}^{(t)} &=& \mathbf{p}^t
\end{eqnarray}

this type of framework allows for traversal of particles through our ``graph'', consisting of the labeled and unlabeled datapoints. It is precisely the same as the Helmholtz algorithm given before, but instead of soft labels $y(x)$ being propagated we have representative ``hard'' particles.

Label Propagation and Semi-Supervised Learning: Gaussian Random Field Method

So, recently I've been reading up on label propagation in semi-supervised learning, which is when you have a great deal of data, but most of it is unlabeled. To put some notation on things, lets say way have a set $L$:

\begin{eqnarray}
L &:& \lbrace \mathbf{x}, y \rbrace_{n=1}^{L}
\end{eqnarray}

which is a set of pairs of input vectors $x$ and output labels $y$, be they scalar or categorical. And then we have a huge unlabeled set:

\begin{eqnarray}
U &:& \lbrace \mathbf{x}, \_\_  \rbrace_{n=1}^{U}
\end{eqnarray}

which we would like to infer on. Normally, this use case is motivated when the unlabeled set is much, much larger, $\vert L \vert << \vert U \vert$. If we are talking about classification, one way to view this problem is through clustering. If we assume that close vectors $\mathbf{x}$, under some metric, have close labels $y$, we that might motivate a loss function of the form:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2
\end{eqnarray}

Where, we're summing over all pairs of data points $i,j$, and weighting their difference in label with the matrix $W_{ij}$. For sanity's sake, $W_{ij}$ should be large when $x_i,x_j$ are close. So $W_{ij}$ goes like one over distance between $i,j$. For example:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2\\
W_{ij} &=& e^{-\vert x_i - x_j \vert^2/2\sigma^2}
\end{eqnarray}

This weighting matrix is simply a function of the euclidean metric, and actually reminds one of an RBF kernel or covariance function... Is there a connection here?

Absolutely. \\

If we frame our clustering/labeling problem as trying to minimize this loss function, or energy $E$, it means we can actually frame the likelihood of the labels with a boltzman distribution:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-\sum_{i,j \ne i} W_{ij} (y_i - y_j)^2}
\end{eqnarray}

(Where $Z$ is the partition function, summing over all configurations of labels). This is extremely interesting, because if you do a little matrix algebra on our energy, we you find that one can re-write the loss as:

\begin{eqnarray}
E = \sum_{i,j\ne i} W_{ij} (y_i - y_j)^2 &=& \sum_{i,j\ne i} y_i (D_{ij}-W_{ij}) y_j \\
&=& \frac{1}{2} y_i L_{ij} y_j \\
D_{ii} &=& \sum_{j^\prime} W_{ij} \\
L_{ij} &=& \mathbf{D}-\mathbf{W}
\end{eqnarray}

The matrix $L_{ij}$, above is actually a close cousin of the laplacian operator $\nabla^2$, but we have embedded things in a high-dimensional space because of exponentiation. Notice that our likelihood on the configuration of labels now looks exactly like a Gaussian random field:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-y_i L_{ij} y_j/2}
\end{eqnarray}

such that $\langle y_i \rangle =0$ and $\langle y_i y_j \rangle_c = L^{-1}_{ij}$. This discrete pdf on labels is precisely the same as if we had made everything continuous from the get-go:

\begin{eqnarray}
E[y(x)] &=& \frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime) \\
K(x,x^\prime) &=& e^{-\vert x-x^\prime \vert^2/2\sigma^2} \\
P(y(x)) &=& \frac{1}{Z}e^{-\frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime)}
\end{eqnarray}

which is a Gaussian random field on the labels, $y(x)$, imposing an RBF correlation function between points $x$. When integrate the lagrangian in by parts we would get the continuous equivalent of $L_{ij}$, which is essentially $\nabla^2$ in some new space.

So why do we care about all of this? Well, it turns out that the algorithms people use to propagate labels work exactly like the Helmholtz equation. For instance, one of the easiest things you can do given labeled examples $L$, is to propagate or ``flow'' the $y$'s to unlabeled points by the following procedure:

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1} \mathbf{W} y_t
\end{eqnarray}

which, is the same as the helmholtz equation:

\begin{eqnarray}
\left(\frac{\partial}{\partial t}+\nabla^2 \right) y_t &=& 0 \\
y_{t+1}-y_t &=& -\nabla^2 y_t \\
y_{t+1} &=& \left(1-\nabla^2 \right) y_t
\end{eqnarray}

and now note, if we replace $\nabla^2$ with $1-D^{-1/2}\mathbf{W} D^{-1/2}$, we get

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1}\mathbf{W}  y_t
\end{eqnarray}

This is PRECISELY the update scheme -- in discrete form -- of Helmholtz dynamics, or heat diffusion. Notice that when $\frac{\partial}{\partial t}y_t =0$, we're going to have a Harmonic field, which is fancy word for saying the laplacian of the labels is zero. (this implies some other nice things, like all labels are the average of labels around them, the maximum of the labels y will always exist on the boundary, etc.) Zhu, Ghahramani and Lafferty explain this algorithm best when they say the labels are undergoing diffusion, with dirichlet -- fixed label constraints -- on the labeled points $\mathbf{x}_l$.

Although, we are doing things not in euclidean space but somewhere else, do to our choice of metric. Because for instance, we could have chosen $W_{ij}$ to be whatever we liked, as long as it goes inverse with distance. Let $g(x_i,x_j)$ be some metric, then we have more generally:

\begin{eqnarray}
W_{ij} &=& f\left[g(x_i,x_j) \right]
\end{eqnarray}

and so $g$ defines the space in which we're taking derivatives. Things don't have to be euclidean! 

Sunday, January 3, 2016

Facility Location as Adaptive K-means

In linear programming, a typical problem is the following: we have $N$ facilities, which we can choose to open or not open, in order to serve $M$ customers, who are strewn throughout the world. We would like to minimize the operational cost of serving our customers from these facilities, which can be summarized as:

\begin{eqnarray}
\mathrm{startup} \ \mathrm{costs} &=& \sum_f s_f y_f\\
y_f &=& 0,1
\end{eqnarray}

Where $y_f$ is a binary variable -- zero or one -- based on whether we choose to open facility $f$. $s_f$ is obviously the associated cost for opening that specific facility. We also have transit costs from serving our customers, which might be of the very simple form:

\begin{eqnarray}
\mathrm{transit} \ \mathrm{costs} &=& \sum_c \sum_f X_{cf} D(\mathbf{x}_c, \mathbf{x}_f)\\
X_{cf} &=& 0,1
\end{eqnarray}

Where $X_{cf}$ is a binary matrix that denotes whether customer $c$ was assigned to facility $f$. $D(x_1,x_2)$, is our distance metric, which could be Euclidean or "Manhattan" -- where we only take steps to the left or right, up or down, no diagonal lines between destinations -- or maybe just a query in google maps.

Adding these to costs together we get a loss function of sorts:

\begin{eqnarray}
J &=&  \sum_c \sum_f X_{cf} D(\mathbf{x}_c, \mathbf{x}_f)+ \sum_f s_f y_f
\end{eqnarray}

which we would like to minimize. There are some simple constraints on our binary variables that we can articulate mathematically. Say each facility has an associated production capacity $c_f$ and each customer has an associated demand $d_c$. Then we have to make sure that the assignments are not overworking our facilities:

\begin{eqnarray}
\sum_c d_c X_{cf} & \le & c_f y_f \ \forall \ f
\end{eqnarray}

Notice I've multiplied by $y_f$ on the RHS. Another constraint we need to have is that every customer is served by exactly one facility -- no redundancy:

\begin{eqnarray}
\sum_f X_{cf} &=& 1\ \forall \ c
\end{eqnarray}

We can also require more explicitly that facilities who are turned off do not get any assignments:

\begin{eqnarray}
X_{cf} & \le & y_f \ \ \ \forall \ c,f
\end{eqnarray}

That's a ton of constraints, but each of them are linear and so we have well defined linear program. Given $N$ facility locations and $M$ customer locations, with the associated capacities, demands, and start up costs, we can find the global optimum of this thing with $N(M+1)$ decision variables, $X$ and $y$.

Turns out this problem is very similar to K means. Because what we are essentially doing is breaking down our customers into "facility clusters". What if we promoted all customers to facilities themselves? (Or at least gave them the option). Then the distance function $D(\mathbf{x}_1,\mathbf{x}_2)$ is simply the square root of the $L_2$ norm associated with a mixture of Gaussians with a diagonal covariance matrix. Let $\mathbf{x}_2$ be the facility, or centroid of the $k^\mathrm{th}$ cluster, then we have:

\begin{eqnarray}
\mathrm{Let} \ \ \mathbf{x}_2 &=& \mu_k \\
\mathbf{\Sigma} &=& \mathbf{1} \\
\mathrm{Then} \ \ D(\mathbf{x}_1,\mathbf{x}_1) &=& \sqrt{\left(\mathbf{x}_1-\mathbf{\vec{\mu}}_k\right) \mathbf{\Sigma}^{-1}\left(\mathbf{x}_1-\mathbf{\vec{\mu}}_k\right)}
\end{eqnarray}

For a mixture of Gaussians density Estimation, our PDF is:

\begin{eqnarray}
P(\mathbf{x}) &=& \sum_{k=1}^K \mathcal{N}\left(\mu_k,\mathbf{\Sigma} \right)
\end{eqnarray}

and the log likelihood for our sequence of customers/facilities is:

\begin{eqnarray}
-\log \mathcal{L}(X \vert \lbrace \mu \rbrace_{k=1}^K )&=& \sum_k \sum_{ n \in c_k} \frac{\left(\mathbf{x}_n-\mathbf{\mu}_k \right)^2}{2\sigma^2}+K\sqrt{2\pi \sigma^2}
\end{eqnarray}

I've kept the $\sigma^2$ terms for clarity. Notice that the second term in this negative log likelihood, or loss function is the normalization factor: it penalizes a high level of total clusters $K$, preventing overfitting. This loss function -- apart from the quadratic nature of the summand -- is exactly like the objective in our facility location problem. The startup cost for every "facility" in this example would be

\begin{eqnarray}
s_f &=& \sqrt{2\pi \sigma^2} \ \ \forall f
\end{eqnarray}

And the variance of our gaussians we would set to unity: $\sigma \to 1$. Let's rewrite the negative log likelihood, but in decision variable language:

\begin{eqnarray}
J^\prime &=&  \frac{1}{2}\sum_i \sum_j X_{ij} D(\mathbf{x}_i, \mathbf{x}_j)^2+ \sum_j s_j y_j \\
s_i &=& \sqrt{2\pi}\\
\end{eqnarray}

We have not yet specified the capacity of our clusters or the "demand" of each data point. A simple choice would be to set some upper limit on the size of each cluster, and give every data point the same weight or demand:

\begin{eqnarray}
d_i &=& 1 \ \ \forall \ i \\
c_j &=& c_\mathrm{max} \ \ \forall \ j
\end{eqnarray}

This is still a linear program, just with a slightly different list of coefficients on the first term! To solve such a system we would need $N(N+1)$ decision variables, which could be prohibitive with large systems, but thanks to interior point solvers and very clever routines for integer programming we can actually run this "adaptive" K-means algorithm in polynomial time!