Friday, January 29, 2016

Entity Resolution through Markov Random Fields and Logistic Regression

So, along with label propagation, I've also been thinking about entity resolution, which is basically the same thing if you frame the problem correctly.

Let's say you have a set of labeled data:

\begin{eqnarray}
\lbrace \mathbf{x}_i, \mathbf{y}_i \rbrace_{i=1}^N
\end{eqnarray}

Where $y_i$ can be a class -- zero or one -- as we were talking about earlier, or a unique ID. What we would like to do is compare our datapoints pairwise and see whether the $y_i,y_j$'s are equal. This means that every pair becomes its own instance, and we'd like to assign it a ``two-peas-in-a-pod'' probability. One way of doing this is with our similarity matrix, mentioned before:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \mathbf{W}_{ij} = e^{-\alpha_n g_n(x_i,x_j)}
\end{eqnarray}

Where in the exponent we have a linear combination of metrics. They can be Euclidean, Minkowski, cosine, what have you -- each with a weight $\alpha_n$. (This is particularly useful with string comparison, as some metrics are more informative than others.) We can also use simple logistic regression:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \sigma\left(-\alpha_n g_n(x_i,x_j)\right) = \frac{1}{1+e^{\alpha_n g_n(x_i,x_j)}}
\end{eqnarray}

(It turns out that this probability is flipped the wrong way if we keep the negative sign in the exponent, which can be seen with a simple plot.) If we want to learn the optimal $\alpha_n$'s, we can use gradient descent on some specified objective function. The graph-based formulation is motivated by the ``Hidden Markov Random Field,'' which penalizes differing labels between close points -- close as specified by $g$:
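To make this concrete, here's a minimal Python sketch of both formulations. The metric functions and weights are made up for illustration (a Euclidean and a cosine distance); in practice you'd plug in whatever comparators suit your data, string metrics included.

```python
import numpy as np

def metrics(x_i, x_j):
    """Hypothetical distance metrics g_n(x_i, x_j); swap in whatever comparators you like."""
    cos = 1.0 - np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j) + 1e-12)
    return np.array([np.linalg.norm(x_i - x_j), cos])  # [Euclidean, cosine distance]

def match_probability(x_i, x_j, alpha, logistic=False):
    """P(y_i = y_j | x_i, x_j) from the weighted combination alpha_n g_n(x_i, x_j)."""
    z = np.dot(alpha, metrics(x_i, x_j))
    if logistic:
        return 1.0 / (1.0 + np.exp(z))   # sigma(-z): probability falls as distance grows
    return np.exp(-z)                    # similarity-matrix form W_ij
```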

\begin{eqnarray}
E &=& \sum_{i,j\neq i} W_{ij} (y_i-y_j)^2 = \sum_{i,j \neq i}  e^{-\alpha_n g_n(x_i,x_j)}(y_i-y_j)^2\\
&=& \sum_{i,j \neq i} P(y_i = y_j \vert x_i,x_j) (y_i-y_j)^2\\
&=& \mathcal{E}\left[(y_i-y_j)^2 \right]
\end{eqnarray}

We see that this energy $E$ is just the expectation value of the pairwise label distance, a certain type of empirical risk! ($E$ can also be treated as the log loss, or negative log probability, of the configuration $\lbrace y \rbrace$.)
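As a brute-force sketch (reusing the hypothetical metrics() defined above), the energy is just a double loop over pairs:

```python
def mrf_energy(X, y, alpha):
    """E = sum over i, j != i of exp(-alpha_n g_n(x_i, x_j)) * (y_i - y_j)^2."""
    E = 0.0
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j:
                w_ij = np.exp(-np.dot(alpha, metrics(X[i], X[j])))
                E += w_ij * (y[i] - y[j]) ** 2
    return E
```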

Similarly, for logistic regression we just have our log loss. Both objective functions are differentiable with respect to the metric weights $\alpha_n$, so if we want to LEARN what comparators between $x_i,x_j$ are important, we simply use gradient descent on our labeled examples! To extend labels/matches to unlabeled points, we use the pairwise probabilities specified above.
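Here's a minimal gradient-descent sketch for the logistic log loss, again assuming the hypothetical metrics() above and a list of labeled pairs (label 1 for a match, 0 otherwise):

```python
def learn_alpha(pairs, labels, n_metrics, lr=0.1, n_iter=500):
    """Learn metric weights alpha_n by gradient descent on the pairwise logistic log loss."""
    G = np.array([metrics(x_i, x_j) for x_i, x_j in pairs])  # shape (num_pairs, n_metrics)
    t = np.asarray(labels, dtype=float)                      # 1 = same entity, 0 = different
    alpha = np.zeros(n_metrics)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(G @ alpha))   # P(match) = sigma(-alpha_n g_n)
        grad = G.T @ (t - p)                  # gradient of the log loss w.r.t. alpha
        alpha -= lr * grad / len(pairs)
    return alpha
```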

If one implemented this label matching, it would be wise not to compare every pair of data instances -- which would be an $N^2$ blow-up -- but only some restricted subset of ``likely'' pairs. Whether we are comparing unique IDs in entity resolution or shared classes in label propagation, this algorithm/framework is the same!
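One standard way to restrict that subset (not spelled out above, but a common choice) is blocking: only records that share some cheap key ever get compared. A hypothetical sketch:

```python
from collections import defaultdict

def candidate_pairs(records, block_key):
    """Yield index pairs only within blocks of records sharing block_key(record),
    instead of comparing all N^2 pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                yield members[a], members[b]

# e.g. block person records on the first three letters of a (hypothetical) last-name field:
# pairs = candidate_pairs(people, lambda rec: rec["last_name"][:3].lower())
```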
