So, along with label propagation, I've also been thinking about entity resolution, which is basically the same thing if you frame the problem correctly.
Let's say you have a set of all labeled data:
\begin{eqnarray}
\lbrace \mathbf{x}_i, \mathbf{y}_i \rbrace_{i=1}^N
\end{eqnarray}
Where $y_i$ can be a class -- zero or one -- as we were talking about earlier, or a unique ID. What we would like to do is compare our datapoints pairwise and see whether $y_i$ and $y_j$ are equal. Every pair thus becomes an instance to score: we'd like to assign it a ``two-peas-in-a-pod'' probability. One way of doing this is with our similarity matrix, mentioned before:
\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \mathbf{W}_{ij} = e^{-\sum_n \alpha_n g_n(x_i,x_j)}
\end{eqnarray}
where the exponent is a linear combination of metrics $g_n$ -- Euclidean, Minkowski, cosine, what have you -- each with a weight $\alpha_n$. (This is particularly useful for string comparison, as some metrics are more informative than others.) We can also use simple logistic regression:
\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \sigma\left(-\sum_n \alpha_n g_n(x_i,x_j)\right) = \frac{1}{1+e^{\sum_n \alpha_n g_n(x_i,x_j)}}
\end{eqnarray}
(Note that the sign in the exponent is flipped relative to $\mathbf{W}_{ij}$: keeping the negative sign would make the probability increase with distance, which is the wrong way around, as a simple plot shows.) If we want to learn the optimal $\alpha_n$'s we can use gradient descent on some specified objective function; a small numerical sketch is given after the energy below. The graph-based formulation is motivated by the ``Hidden Markov Random Field'', which penalizes differing labels between close points -- closeness as specified by $g$:
\begin{eqnarray}
E &=& \sum_{i,j\neq i} \mathbf{W}_{ij} (y_i-y_j)^2 = \sum_{i,j \neq i} e^{-\sum_n \alpha_n g_n(x_i,x_j)}(y_i-y_j)^2\\
&=& \sum_{i,j \neq i} P(y_i = y_j \vert x_i,x_j) \, (y_i-y_j)^2\\
E &=& \mathcal{E}\left((y_i-y_j)^2 \right)
\end{eqnarray}
We see that this energy $E$ is just the expectation value of the pairwise squared label distance -- a certain type of empirical risk! ($E$ can also be treated as a log loss, or negative log probability, of the configuration $\lbrace y \rbrace$.)
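To make the notation concrete, here is a minimal numerical sketch of the pairwise weights, the logistic match probability, and the energy above. Everything in it -- the two toy metrics (absolute feature differences), the hand-picked $\alpha$'s, and the three-point dataset -- is hypothetical, chosen only for illustration.
\begin{verbatim}
import numpy as np

def metrics(xi, xj):
    # two toy "metrics": absolute differences of the two features
    return np.array([abs(xi[0] - xj[0]), abs(xi[1] - xj[1])])

alpha = np.array([1.0, 0.5])   # hand-picked weights, illustrative only

X = np.array([[0.0, 0.1],
              [0.1, 0.0],
              [2.0, 1.9]])
y = np.array([0, 0, 1])

def weight(xi, xj):
    # W_ij = exp(-sum_n alpha_n g_n(x_i, x_j))
    return np.exp(-alpha @ metrics(xi, xj))

def match_prob(xi, xj):
    # logistic form: 1 / (1 + exp(+sum_n alpha_n g_n(x_i, x_j)))
    return 1.0 / (1.0 + np.exp(alpha @ metrics(xi, xj)))

# energy: sum over ordered pairs of W_ij * (y_i - y_j)^2
E = sum(weight(X[i], X[j]) * (y[i] - y[j]) ** 2
        for i in range(len(X)) for j in range(len(X)) if j != i)
print(E)
\end{verbatim}
Nearby points with equal labels contribute nothing to $E$; nearby points with different labels are penalized heavily.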
Similarly, for logistic regression we just have our log loss. Both objective functions are differentiable with respect to the metric weights $\alpha_n$, so if we want to \emph{learn} which comparators between $x_i$ and $x_j$ are important, we simply run gradient descent on our labeled examples! To extend labels/matches to unlabeled points, we use the pairwise probabilities specified above.
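To spell out that gradient step, differentiating the two objectives with respect to a single weight -- writing $m_{ij} = 1$ if $y_i = y_j$ and $0$ otherwise for the usual Bernoulli log loss $\mathcal{L}$, and using $k$ as the dummy index inside the exponent -- gives
\begin{eqnarray}
\frac{\partial E}{\partial \alpha_n} &=& -\sum_{i,j\neq i} g_n(x_i,x_j)\, e^{-\sum_k \alpha_k g_k(x_i,x_j)}\,(y_i-y_j)^2\\
\frac{\partial \mathcal{L}}{\partial \alpha_n} &=& \sum_{i,j\neq i} \left(m_{ij} - P(y_i = y_j \vert x_i,x_j)\right) g_n(x_i,x_j)
\end{eqnarray}
and gradient descent simply iterates $\alpha_n \leftarrow \alpha_n - \eta \,\partial \mathcal{L}/\partial \alpha_n$ (or the same with $E$) for some step size $\eta$.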
Whether we are comparing unique IDs in entity resolution or shared classes in label propagation, the algorithm/framework is the same! If one implemented this label matching, though, it would be wise not to compare every pair of data instances -- an $O(N^2)$ blow-up -- but only some restricted subset of ``likely'' pairs.
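For instance, a toy ``blocking'' pass might look like the sketch below; the records and the blocking key (lower-cased first character) are made up purely for illustration.
\begin{verbatim}
from collections import defaultdict
from itertools import combinations

# hypothetical records; the blocking key is just an illustrative choice
records = ["Jon Smith", "John Smyth", "Jane Doe", "J. Smith", "Kate Doe"]

blocks = defaultdict(list)
for idx, rec in enumerate(records):
    blocks[rec[0].lower()].append(idx)

# only pairs sharing a block get scored with P(y_i = y_j | x_i, x_j)
candidate_pairs = [pair for ids in blocks.values()
                   for pair in combinations(ids, 2)]
print(candidate_pairs)   # 6 candidate pairs instead of all 10
\end{verbatim}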