Saturday, July 16, 2016

Estimating the Variance of the Survival Function: Full History vs. Snapshots

In survival analysis, the key quantity of interest is something called the survival function, $S(t)$, which is the probability that I live at least to time $t$:

\begin{eqnarray}
S(t) &=& P(T \ge t)
\end{eqnarray}

along with something called the hazard function, which is the probability that I die today, at time $t$, given that I've lived up until now:

\begin{eqnarray}
\lambda(t) &=& P(T=t \vert T \ge t) = P(t)/S(t)
\end{eqnarray}

This hazard is a conditional probability, and comes about because survival analysis and survival-like problems are implicitly \textit{sequential}.

When estimating $S(t)$ from data, one often uses the Kaplan--Meier estimator, a cumulative product built from the number of people who ``died'' at time $t$, $d_t$, and the number of people still ``alive'' (at risk) at time $t$, $n_t$:

\begin{eqnarray}
\hat{S}(t) &=& \prod_{t^\prime < t} \left(1- \frac{d_{t^\prime}}{n_{t^\prime}}\right)
\end{eqnarray}

This is actually just a cumulative product of time-step survival probabilities, or one minus the hazard:
\begin{eqnarray}
\hat{S}(t) &=& \prod_{t^\prime < t} \left(1- \lambda_{t^\prime} \right)\\
&=&  \prod_{t^\prime < t} p_{t^\prime}
\end{eqnarray}
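Before going on, here's a minimal numpy sketch of this cumulative product (the death counts $d_t$ and at-risk counts $n_t$ are made-up toy numbers):

```python
import numpy as np

# Toy death counts d_t and at-risk counts n_t at each time step
d = np.array([5, 3, 4, 2, 1])
n = np.array([100, 95, 92, 88, 86])

# Per-step survival probabilities p_t = 1 - d_t / n_t = 1 - hazard
p = 1.0 - d / n

# Kaplan-Meier estimate: cumulative product of the p_t's
S_hat = np.cumprod(p)
print(S_hat)
```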

If we were to ask ourselves, ``what's the variance of this estimator?'', we'd have to use some fancy tricks. The first is noticing that we don't have good ways of combining variances in a \textbf{product}, but we do have good ways of combining variances in a \textbf{sum}. So let's take the log transform of our estimator:

\begin{eqnarray}
\log\left(\hat{S}(t)\right) &=& \sum_{t^\prime < t} \log \left(1- \lambda_{t^\prime} \right)\\
&=&  \sum_{t^\prime < t} \log(p_{t^\prime})
\end{eqnarray}

And note that the variance of the log can be computed by a simple Taylor expansion of a random variable about its mean:

\begin{eqnarray}
X & \sim & ? \\
\langle X \rangle &=& \mu \\
\mathbf{Var}(X) &=& \sigma^2 \\
\log(X) &\approx & \log(\mu) + \frac{X-\mu}{\mu}+O\left((X-\mu)^2\right)\\
\mathbf{Var}\left(\log(X)\right)  &\approx& 0 + \frac{\mathbf{Var}(X)}{\mu^2}\\
&=& \frac{\sigma^2}{\mu^2}
\end{eqnarray}
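A quick Monte Carlo sanity check of this delta-method identity (the distribution and its parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 0.5             # arbitrary mean and standard deviation
X = rng.normal(mu, sigma, 1_000_000)

print(np.var(np.log(X)))          # empirical Var(log X)
print(sigma**2 / mu**2)           # delta-method prediction sigma^2 / mu^2
```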

So, applying this identity with $X=\hat{S}(t)$, we have:

\begin{eqnarray}
\mathbf{Var}\left(\log \hat{S}(t)\right) &=& \mathbf{Var}\left(\sum_{t^\prime < t} \log(p_{t^\prime})\right) \approx \frac{1}{\hat{S}(t)^2} \mathbf{Var}\left(\hat{S}(t)\right)
\end{eqnarray}

Using this transform on our formula above, we have

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2  \mathbf{Var}\left(\sum_{t^\prime < t} \log(p_{t^\prime})\right)
\end{eqnarray}
Luckily, if we assume independence, the variance of the sum is the sum of the variances, so we can treat each $p_t$ as an independent binomial estimate, with variance $p_t(1-p_t)/n_t$, where $n_t$ is the ``sample size'' of our survival curve at time $t$.

Working through some nasty algebra, and using the variance-of-the-log identity once more (this time on each $p_{t^\prime}$), we get:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2  \sum_{t^\prime < t} \mathbf{Var}\left( \log(p_{t^\prime})\right) \\
&=& \hat{S}(t)^2  \sum_{t^\prime < t} \frac{1}{p_{t^\prime}^2}\mathbf{Var}\left( p_{t^\prime}\right)\\
&=& \hat{S}(t)^2 \sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime}\,p_{t^\prime}}
\end{eqnarray}

(Plugging $\mathbf{Var}(p_{t^\prime})=p_{t^\prime}(1-p_{t^\prime})/n_{t^\prime}$ into the second line gives the third; and since $(1-p_{t^\prime})/(n_{t^\prime} p_{t^\prime}) = d_{t^\prime}/(n_{t^\prime}(n_{t^\prime}-d_{t^\prime}))$, this is just Greenwood's formula.)

We see that variance of the estimator goes like the cumulative sum of one over the sample size at each time $t$:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t)\right) &\sim & \sum_{t^\prime < t}  \frac{1}{n_{t^\prime}}
\end{eqnarray}
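Here's the full calculation -- Kaplan--Meier estimate plus the Greenwood-style variance -- as a self-contained numpy sketch on the same toy counts as before:

```python
import numpy as np

d = np.array([5, 3, 4, 2, 1])        # toy death counts d_t
n = np.array([100, 95, 92, 88, 86])  # toy at-risk counts n_t

p = 1.0 - d / n                      # per-step survival probabilities
S_hat = np.cumprod(p)                # Kaplan-Meier estimate

# Var(S_hat) ~ S_hat^2 * cumsum((1 - p) / (n p)), i.e. Greenwood's
# formula, since (1 - p) / (n p) = d / (n (n - d))
var_S = S_hat**2 * np.cumsum((1.0 - p) / (n * p))
print(S_hat, np.sqrt(var_S))         # the curve and its standard error
```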

Now, when dealing with very large data -- say billions of survival events -- it can be difficult to get these death counts as a function of time, due to a few implementation details and the resistance of cumulative sums to parallelization. So what people often do is estimate the survival curve on $M$ separate snapshots of the data, and then take the average of the snapshot estimates:
\begin{eqnarray}
\hat{S}_M(t) &=& \sum_{m=1}^M \frac{\hat{S}_m(t)}{M} \\
&=& \sum_{m=1}^M \frac{\prod_{t^\prime < t} \left(1-\lambda_{mt^\prime} \right)}{M}
\end{eqnarray}

This estimator will have the same mean as our ``full history'' estimator, but slightly different variance properties. As we know, the variance of a mean goes like one over the sample size:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}_M(t) \right) &=& \frac{1}{M}\mathbf{Var}\left(\hat{S}_m(t) \right)
\end{eqnarray}

But what's the variance of each snapshot estimate? Simply our old formula, with the snapshot population count $n_{mt}$ in place of $n_t$ -- or, in English, the number of people who were ``alive'' at time $t$ in snapshot $m$, rather than the total number of people who were alive at time $t$. Strictly, $n_{mt} < n_t$. If we assume our snapshots are evenly populated with ``alive'' people at each time $t$, we will have $n_{mt}M \approx n_t$.

And so, comparing the variance of our estimators, we see:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2 \sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime}\,p_{t^\prime}} \\
\mathbf{Var}\left(\hat{S}_M(t) \right) &=& \frac{1}{M}\hat{S}_M(t)^2 \sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime}\,p_{mt^\prime}}
\end{eqnarray}

Taking the ratio of the variances, and using the fact that the means are equal ($\hat{S}(t)=\hat{S}_M(t)$), we get:

\begin{eqnarray}
\frac{\mathbf{Var}\left(\hat{S}_M(t) \right)}{\mathbf{Var}\left(\hat{S}(t) \right)} &=& \frac{1}{M}\,\frac{\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime}\,p_{mt^\prime}}}{\sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime}\,p_{t^\prime}}}
\end{eqnarray}

Assuming equal sample size across snapshots, we can make the replacement $n_{mt} = n_t/M$, which cancels the factor of $1/M$ out front:

\begin{eqnarray}
\frac{\mathbf{Var}\left(\hat{S}_M(t) \right)}{\mathbf{Var}\left(\hat{S}(t) \right)} &\approx & \frac{\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{t^\prime}\,p_{mt^\prime}}}{\sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime}\,p_{t^\prime}}}
\end{eqnarray}

And, assuming $p_{mt} \approx p_t \ \forall \ t$, we get the very simple ratio:

\begin{eqnarray}
\frac{\mathbf{Var}\left(S_M(t) \right)}{\mathbf{Var}\left(S(t) \right)} &\approx & 1
\end{eqnarray}

How well does this mean we're doing? It means that the variances of the two methods are comparable -- which is surprising! If we want to probe deeper, and understand whether or not there really is a difference between the two sampling strategies, we have to closely inspect the cumulative sum:

\begin{eqnarray}
\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime}\,p_{mt^\prime}}
\end{eqnarray}
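If you'd rather see this numerically than trust the algebra, here's a hedged little simulation -- discrete lifetimes and censoring times drawn from made-up geometric distributions -- comparing the empirical variance of the full-history and snapshot-averaged estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, reps = 10_000, 10, 200
grid = np.arange(1, 11)

def km(T, C, grid):
    """Discrete-time Kaplan-Meier under right-censoring."""
    obs, died = np.minimum(T, C), T <= C
    p = [1 - ((obs == t) & died).sum() / (obs >= t).sum() for t in grid]
    return np.cumprod(p)

full, snap = [], []
for _ in range(reps):
    T = rng.geometric(0.15, N)      # made-up lifetimes
    C = rng.geometric(0.05, N)      # made-up censoring times
    full.append(km(T, C, grid))
    idx = np.array_split(np.arange(N), M)   # M disjoint snapshots
    snap.append(np.mean([km(T[i], C[i], grid) for i in idx], axis=0))

# variance ratio across replications: hovers around one
print(np.var(snap, axis=0) / np.var(full, axis=0))
```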

Wednesday, May 18, 2016

Fisher's Exact Test, Lift, and P-Values for Feature Exploration


Let's say you are asked to solve a binary classification problem ($y=0,1$) with very few training examples ($N<1000$) and quite a few possibly-predictive features ($d>1000$). The standard question of "how the heck do I feature select?" becomes very relevant -- and in particular, "how the heck do I feature select with so few training examples?!?"

For categorical features, one of the best ways to test for significance -- i.e., a non-null relationship between a feature column and a label column -- is Fisher's exact test, along with Laplace-smoothed lift.

Fisher's exact test is a combinatoric way of examining a contingency (or pivot) table. Let's say we have two columns $x$ and $y$, which both take on, in the simplest case, only two values: true and false. If we were to make a pivot table, we'd have the number of pairwise events in a 2x2 grid.

\begin{eqnarray}
\mathrm{pivot}(x,y) &=& \left( \begin{array}{cc} n_{00} & n_{01} \\
n_{10} & n_{11}
\end{array}\right)
\end{eqnarray}

Where $n_{11}$ corresponds to the number of times $x$ and $y$ were both true, $n_{00}$ corresponds to the number of times $x$ and $y$ were both false, $n_{10}$ the number of times $x$ was true and $y$ false, etc. (This can be done easily in pandas with something like \textit{pandas.crosstab(df[x], df[y])}.)

We can see that the sum of the entries $n_{11}+n_{01}+n_{10}+n_{00}=N$ equals the number of training examples, and that the sums over rows or columns equal the marginal counts: $n_{11}+n_{10}=R_1$ equals the number of times $x$ was true; $n_{01}+n_{11}=C_1$ equals the number of times $y$ was true; $n_{01}+n_{00}=R_0$ equals the number of times $x$ was false, etc.

What Fisher proposed is to take this matrix and ask: given the marginal counts $R_0,R_1,C_0,C_1$ -- which, if you think about it, correspond to the prior probabilities on $x$ and $y$: $P(x)=\frac{R_1}{N},P(y)=\frac{C_1}{N}$ -- how likely is the resulting contingency matrix if $x$ and $y$ are independent?

The most naive way to answer that question is to take the prior probabilities on $x$ and $y$ that are given to us by the data:

\begin{eqnarray}
P(x)=p &=& \frac{R_1}{N} \\
P(y)=q &=& \frac{C_1}{N}
\end{eqnarray}

and quote our good old multinomial:

\begin{eqnarray}
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} \left[pq\right]^{n_{11}}\left[p(1-q)\right]^{n_{10}}\left[(1-p)q\right]^{n_{01}}\left[(1-p)(1-q)\right]^{n_{00}}
\end{eqnarray}

which can be simplified to:

\begin{eqnarray}
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} (p)^{n_{10}+n_{11}}(1-p)^{n_{00}+n_{01}}(q)^{n_{11}+n_{01}}(1-q)^{n_{10}+n_{00}}\\
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} (p)^{R_1}(1-p)^{R_0}(q)^{C_1}(1-q)^{C_0}
\end{eqnarray}
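As a numerical sanity check, this null multinomial is easy to evaluate with scipy (the cell counts here are toy numbers):

```python
from scipy.stats import multinomial

# Toy 2x2 contingency counts, flattened as [n11, n10, n01, n00]
n = [12, 18, 8, 62]
N = sum(n)
p = (n[0] + n[1]) / N            # P(x) = R1 / N
q = (n[0] + n[2]) / N            # P(y) = C1 / N

# Null cell probabilities under independence
cell_p = [p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q)]
print(multinomial.pmf(n, N, cell_p))
```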


The probability distribution above assumes that there is no relationship between $x$ and $y$, and so, if we see a contingency table that is very unlikely given the above pdf, we know something is up! But how sure are we that the prior distribution estimates $p,q$ are correct? Our sample size $N$ was very small. That's a troubling question, which can be addressed by Laplace smoothing, which puts a uniform prior distribution on $P(x)$ and $P(y)$:

\begin{eqnarray}
P(x)=p &=& \frac{R_1+\alpha}{N+\alpha d_x}
\end{eqnarray}
where $d_x$ is the number of distinct values $x$ can take on -- in this case two. And similarly, for $y$ we'd have the prior:

\begin{eqnarray}
P(y)=q &=& \frac{C_1+\alpha}{N+\alpha d_y}
\end{eqnarray}

This helps things a little bit: $\alpha$ is a hyperparameter, between 0 and 1, that controls the ``strength'' of our uniform prior. But one might also worry whether a multinomial is even appropriate, given the highly discrete nature of our contingency table for very few datapoints $N$.

Fisher's exact test explicitly addresses this discreteness aspect through combinatorics.

Let's recall an experiment where one has a drunken man throw $N$ darts at a dartboard with $d$ cells. The number of different ways in which this drunken dart player can get $n_1$ darts in the first cell $n_2$ in the second, $n_3$ in the third, etc. is:

\begin{eqnarray}
W &=& \frac{N!}{n_1!n_2!\cdots n_d!}
\end{eqnarray}

Taking the log of this combinatoric factor and applying Stirling's approximation, we get $N$ times the Shannon entropy:

\begin{eqnarray}
\log W &\approx& -N\sum_{i=1}^d p_i \log p_i \\
p_i &=& \frac{n_i}{N}
\end{eqnarray}

This is all very interesting because, if one looks at the contingency table above, we need only promote our counts $n_{00},n_{01},n_{10},n_{11}$ to a compound index: $n_{1},n_2,n_3,n_4$ and we get the same formula:

\begin{eqnarray}
W &=& \frac{N!}{n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

This is the number of ways one can get a contingency table with the counts $n_{ij}$. But this is NOT a probability; it is simply a multiplicity count of some "state" in phase space, $n_{ij}$. (You'll see above that it's the normalization factor for the multinomial.) If we want to convert this multiplicity count to a probability, we have to be like Kolmogorov and divide by the multiplicity of the entire sample space $\Omega$. After all,

\begin{eqnarray}
P(x \in X) &=& \frac{\vert X \vert}{\vert \Omega \vert}
\end{eqnarray}

Where I'm using bars for ``multiplicity'', or the count of phase space cells within some region.

For our contingency table above, we can define precisely what the sample space $\Omega$ is: all contingency tables with the marginal sums $R_0,R_1$ and $C_0,C_1$. This can be written as a compound combinatoric factor:

\begin{eqnarray}
\vert \Omega \vert &=& \left(\begin{array}{c}
N \\ C_0
\end{array}\right)
\left(\begin{array}{c}
N \\ R_0
\end{array}\right)\\
&=& \frac{N! N!}{R_0!R_1!C_0!C_1!}
\end{eqnarray}

And so we have, doing our division:

\begin{eqnarray}
P(n_{00},n_{01},n_{10},n_{11} \vert R_1,R_0,C_1,C_0 ) &=& \frac{R_0!R_1!C_0!C_1!}{N! n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

Let the joint event $n_{00},n_{01},n_{10},n_{11}$ be specified by $\mathbf{n}$. Then we can write

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &=& \frac{R_0!R_1!C_0!C_1!}{N! n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

or, more generally, for non-binary categorical variables (check it!):

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &=& \frac{\prod_{i=1}^{d_x}R_i! \prod_{j=1}^{d_y} C_j!}{N! \prod_{i,j} n_{ij}!}
\end{eqnarray}
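For the 2x2 case this is exactly the (central) hypergeometric distribution, which makes for an easy numerical check (toy counts again):

```python
from math import factorial
from scipy.stats import hypergeom

n11, n10, n01, n00 = 12, 18, 8, 62
N = n11 + n10 + n01 + n00
R1, R0 = n11 + n10, n01 + n00
C1, C0 = n11 + n01, n10 + n00

# The combinatoric formula, verbatim
direct = (factorial(R0) * factorial(R1) * factorial(C0) * factorial(C1)) / (
    factorial(N) * factorial(n00) * factorial(n01)
    * factorial(n10) * factorial(n11))

# Same thing as a hypergeometric pmf: of the C1 "y = True" slots among
# N datapoints, how many land in the R1 "x = True" rows?
print(direct, hypergeom.pmf(n11, N, R1, C1))
```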

This is a very interesting formula, because it gives the precise, discrete probability of seeing some contingency table, conforming to marginal counts $\mathbf{R}$ and $\mathbf{C}$. With a little bit of algebra, one will see that this combinatoric probability converges to the multinomial we quoted above, by noting Stirling's approximation to leading order:

\begin{eqnarray}
N! &\approx & \left(\frac{N}{e}\right)^N \ \ \mathrm{as} \ \ N \to \infty
\end{eqnarray}

and so we get:

\begin{eqnarray}
P(n_{00},n_{01},n_{10},n_{11} \vert R_1,R_0,C_1,C_0 ) &\approx & \frac{(\frac{R_0}{e})^{R_0}(\frac{R_1}{e})^{R_1}(\frac{C_0}{e})^{C_0}(\frac{C_1}{e})^{C_1}}{(N/e)^N (\frac{n_{00}}{e})^{n_{00}}(\frac{n_{01}}{e})^{n_{01}}(\frac{n_{10}}{e})^{n_{10}}(\frac{n_{11}}{e})^{n_{11}}}
\end{eqnarray}

All the factors of $e$ cancel out, and, identifying the marginal frequencies with our priors,

\begin{eqnarray}
p \ = \ \frac{R_1}{N}, &\quad& 1-p \ = \ \frac{R_0}{N}\\
q \ = \ \frac{C_1}{N}, &\quad& 1-q \ = \ \frac{C_0}{N}
\end{eqnarray}

we can simplify to get:

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &\approx& \frac{N!}{\prod_{i,j} n_{ij}! }\left(\frac{R_0}{N}\right)^{R_0}\left(\frac{R_1}{N}\right)^{R_1}\left(\frac{C_0}{N}\right)^{C_0}\left(\frac{C_1}{N}\right)^{C_1} \\
&=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} (p)^{n_{10}+n_{11}}(1-p)^{n_{00}+n_{01}}(q)^{n_{11}+n_{01}}(1-q)^{n_{10}+n_{00}}
\end{eqnarray}

The same multinomial formula we found above!

This really isn't so surprising -- it says the combinatoric probability converges to the multinomial with fully continuous priors $p,q$ in the large-sample $N \to \infty$ limit -- but it is interesting to note.

Now, when quoting p-values, or significance tests for a relationship between $x,y$, Fisher would simply use the count of contingency tables with counts $\mathbf{n}$ more extreme than what's observed. For instance, let's say we observe a True/True $x,y$ occurrence that is higher than expected under the marginals: $n_{11}>N p q$, or $n_{11} > R_1 C_1 / N$. What's the sum of the probabilities of tables that have an even \textit{higher} $n_{11}$? This is called a one-tailed p-value significance test, and for the Fisher exact test and the multinomial method it corresponds to a simple sum.

I won't get too into the details of implementation now, but suffice it to say, scipy's got a Fisher test calculation all on its own (\textit{scipy.stats.fisher\_exact})!
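For instance, a one-tailed test on a toy 2x2 table:

```python
from scipy.stats import fisher_exact

# Rows: x = True/False; columns: y = True/False (toy counts)
table = [[12, 18],
         [8, 62]]

# alternative='greater' sums the probabilities of tables with an even
# larger n11 -- the one-tailed test described above
odds_ratio, p_value = fisher_exact(table, alternative='greater')
print(odds_ratio, p_value)
```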


------------------------------------------------------------------------------------------------------------------------

Now, what hasn't been mentioned is lift, and it relates directly to the Laplace-smoothed priors discussed earlier. Lift is simply:

\begin{eqnarray}
l(x\vert y) &=& \frac{P(x \vert y)}{P(x)} = \frac{P(x,y)}{P(x)P(y)}
\end{eqnarray}

or, the probability of $x$ taking on some value given $y$, relative to $x$ occurring independently. Lift is a number between zero and infinity, and basically answers: how many times more likely is $x$ to occur, given $y$? For low sample size $N < 1000$, it's probably a good idea to smooth the priors $P(x),P(y)$, giving us:

\begin{eqnarray}
l(x\vert y) &=&  \frac{P(x,y)}{P(x)P(y)} \\
&=& \frac{(N+\alpha d_x)(N+\alpha d_y)}{N}\frac{n_{xy}}{(n_x+\alpha)(n_y+\alpha)}
\end{eqnarray}

Where $n_x,n_y$ are the marginal event counts of $x$ and $y$, and $n_{xy}$ is the joint event count of $x,y$.
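A minimal sketch of this smoothed lift as a Python function (the function name and the default $\alpha=1$ are my own choices):

```python
def smoothed_lift(n_xy, n_x, n_y, N, d_x=2, d_y=2, alpha=1.0):
    """Laplace-smoothed lift P(x,y) / (P(x) P(y)).

    n_xy: joint count; n_x, n_y: marginal counts; N: sample size;
    d_x, d_y: number of distinct values x and y can take on.
    """
    p_xy = n_xy / N
    p_x = (n_x + alpha) / (N + alpha * d_x)
    p_y = (n_y + alpha) / (N + alpha * d_y)
    return p_xy / (p_x * p_y)

print(smoothed_lift(n_xy=12, n_x=30, n_y=20, N=100))
```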
------------------------------------------------------------------------------------------------------------------------

For those who are heavily implementation-focused, you'll note that this Fisher test can be done for any type of contingency table -- it doesn't have to be two binomial marginal distributions; they could easily be multinomials! -- but often the best way to do things is to one-hot encode the features with scikit-learn first, and then run simple Fisher exact tests on each of the pairwise 2x2 contingency tables for feature selection. I hope that helps!

Friday, January 29, 2016

Entity Resolution through Markov Random Fields and Logistic Regression

So, along with label propagation, I've also been thinking about entity resolution, which is basically the same problem if you frame it correctly.

Let's say you have a set of all labeled data:

\begin{eqnarray}
\lbrace \mathbf{x}_i, \mathbf{y}_i \rbrace_{i=1}^N
\end{eqnarray}

Where $y_i$ can be a class -- zero or one -- as we were talking about earlier, or a unique ID. What we would like to do is compare our datapoints pairwise and see if the $y_i,y_j$'s are equal. This means that every pair is a probability instance: we'd like to assign it a ``two-peas-in-a-pod'' probability. One way of doing this is with our similarity matrix, mentioned before:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \mathbf{W}_{ij} = e^{-\alpha_n g_n(x_i,x_j)}
\end{eqnarray}

Where in the exponent we have a linear combination of metrics, with summation over the repeated index $n$ implied. They can be Euclidean, Minkowski, cosine, what have you -- each with a weight $\alpha_n$. (This is particularly useful with string comparison, as some metrics are more informative than others.) We can also use simple logistic regression:

\begin{eqnarray}
P(y_i = y_j \vert x_i,x_j) &=& \sigma\left(-\alpha_n g_n(x_i,x_j)\right) = \frac{1}{1+e^{\alpha_n g_n(x_i,x_j)}}
\end{eqnarray}

(It turns out that this probability would be flipped the wrong way if we had kept the negative sign in the exponent, which can be seen from a simple plot.) If we want to learn the optimal $\alpha_n$'s, we can use gradient descent on some specified objective function. The graph-based formulation is motivated by a ``Hidden Markov Random Field'', which penalizes different labels between close points -- as specified by $g$:

\begin{eqnarray}
E &=& \sum_{i,j\neq i} W_{ij} (y_i-y_j)^2 = \sum_{i,j \neq i}  e^{-\alpha_n g_n(x_i,x_j)}(y_i-y_j)^2\\
&=& \sum_{i,j \neq i} P(y_i = y_j ) (y_i-y_j)^2\\
E&=& \mathcal{E}\left((y_i-y_j)^2 \right)
\end{eqnarray}

we see that this energy $E$ is just the expectation value of the pairwise label distance -- a certain type of empirical risk! ($E$ can also be treated as the log loss, or negative log probability, of the configuration $\lbrace y \rbrace$.)

Similarly, for logistic regression we just have our log loss. Both objective functions are differentiable with respect to the metric weights $\alpha_n$, so if we want to LEARN what comparators between $x_i,x_j$ are important, we simply use gradient descent on our labeled examples! To extend labels/matches to unlabeled points, we use the pairwise probabilities specified above.

If one implemented this label matching, it would be wise not to compare every pair of data instances -- which would be an $N^2$ blow-up -- but only some restricted subset of "likely" pairs. Whether we are comparing unique IDs in entity resolution or shared classes in label propagation, the algorithm/framework is the same!
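Here's a hedged sketch of the logistic-regression version in scikit-learn: each candidate pair becomes one training row whose features are the comparator metrics $g_n(x_i,x_j)$, and the learned coefficients (negated) play the role of the $\alpha_n$'s. The data, the all-pairs generation, and the absolute-difference comparators are all stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: 200 records, 3 numeric attributes, a true entity ID
X = rng.normal(size=(200, 3))
ids = rng.integers(0, 50, size=200)

# All candidate pairs, with one comparator metric g_n per attribute
pairs = [(i, j) for i in range(200) for j in range(i + 1, 200)]
G = np.array([np.abs(X[i] - X[j]) for i, j in pairs])
match = np.array([ids[i] == ids[j] for i, j in pairs], dtype=int)

# Fit P(y_i = y_j | g) = sigma(w . g + b); w plays the role of -alpha_n
clf = LogisticRegression().fit(G, match)
print(clf.coef_, clf.intercept_)
```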

Label Propagation in SSL through a Markov Random Walk

Szummer and Jaakkola (2002) used the exact same framework written up in the last post (label-propagation-and-semi-supervised.html) to propagate labels outwards from a training set via some distance measure. But they used a Markov random walk, with transition matrix:

\begin{eqnarray}
p_{ij} &=& \frac{W_{ij}}{\sum_{k} W_{ik}} \\
\mathbf{p} &=& \mathbf{D}^{-1}\mathbf{W}
\end{eqnarray}

Notice, this is exactly the same as our transition matrix from before. The best way to view $p_{ij}$ is as the conditional probability that a particle steps to position $j$, given that it was at position $i$ the moment before:

\begin{eqnarray}
p_{ij} &=& P(x_{t+1}=\mathbf{x}_j \vert x_{t}=\mathbf{x}_i)
\end{eqnarray}

Now, this is only a single step. By the Markov property, we can extend to any number of steps in time $t$:

\begin{eqnarray}
p^{(2)}_{ij} &=& P(x_{t+2}=\mathbf{x}_j \vert x_{t}=\mathbf{x}_i)\\
&=& \sum_{k} P(x_{t+2}=\mathbf{x}_j \vert x_{t+1}=\mathbf{x}_k)P(x_{t+1}=\mathbf{x}_k \vert x_{t}=\mathbf{x}_i)
\end{eqnarray}

or, in matrix form:

\begin{eqnarray}
p_{ij}^{(2)} &=& \sum_k p_{ik} p_{kj} = \left[\mathbf{p}^2\right]_{ij} \\
p_{ij}^{(t)} &=& \left[\mathbf{p}^t\right]_{ij}
\end{eqnarray}

This type of framework allows particles to traverse our ``graph'' of labeled and unlabeled datapoints. It is precisely the same as the Helmholtz algorithm given before, but instead of soft labels $y(x)$ being propagated, we have representative ``hard'' particles.
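In numpy, the $t$-step transition matrix is just a repeated matrix product; here's a toy sketch with an RBF weight matrix (the points and kernel width are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 2))                       # toy datapoints

# RBF weights, and the row-normalized transition matrix p = D^{-1} W
W = np.exp(-np.sum((x[:, None] - x[None]) ** 2, axis=-1) / 2.0)
np.fill_diagonal(W, 0.0)
p = W / W.sum(axis=1, keepdims=True)

p_t = np.linalg.matrix_power(p, 5)   # five-step transition probabilities
print(p_t.sum(axis=1))               # each row is still a distribution
```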

Label Propagation and Semi-Supervised Learning: Gaussian Random Field Method

So, recently I've been reading up on label propagation in semi-supervised learning, which is for when you have a great deal of data, but most of it is unlabeled. To put some notation on things, let's say we have a set $L$:

\begin{eqnarray}
L &:& \lbrace \mathbf{x}_n, y_n \rbrace_{n=1}^{\vert L \vert}
\end{eqnarray}

which is a set of pairs of input vectors $x$ and output labels $y$, be they scalar or categorical. And then we have a huge unlabeled set:

\begin{eqnarray}
U &:& \lbrace \mathbf{x}_n, \_\_  \rbrace_{n=1}^{\vert U \vert}
\end{eqnarray}

which we would like to infer on. Normally, this use case arises when the unlabeled set is much, much larger: $\vert L \vert \ll \vert U \vert$. If we are talking about classification, one way to view this problem is through clustering. If we assume that close vectors $\mathbf{x}$, under some metric, have close labels $y$, that might motivate a loss function of the form:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2
\end{eqnarray}

Where we're summing over all pairs of data points $i,j$, and weighting their difference in label with the matrix $W_{ij}$. For sanity's sake, $W_{ij}$ should be large when $x_i,x_j$ are close -- so $W_{ij}$ goes like one over the distance between $i,j$. For example:

\begin{eqnarray}
E(\lbrace y \rbrace) &=& \sum_{i,j \ne i} W_{ij} (y_i - y_j)^2\\
W_{ij} &=& e^{-\vert x_i - x_j \vert^2/2\sigma^2}
\end{eqnarray}

This weighting matrix is simply a function of the Euclidean metric, and actually reminds one of an RBF kernel or covariance function... Is there a connection here?

Absolutely.

If we frame our clustering/labeling problem as trying to minimize this loss function, or energy $E$, it means we can actually frame the likelihood of the labels with a Boltzmann distribution:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-\sum_{i,j \ne i} W_{ij} (y_i - y_j)^2}
\end{eqnarray}

(Where $Z$ is the partition function, summing over all configurations of labels.) This is extremely interesting, because if you do a little matrix algebra on our energy, you find that one can re-write the loss as:

\begin{eqnarray}
E = \sum_{i,j\ne i} W_{ij} (y_i - y_j)^2 &=& 2\sum_{i,j} y_i (D_{ij}-W_{ij}) y_j \\
&=& 2\, y_i L_{ij} y_j \\
D_{ii} &=& \sum_{j} W_{ij} \\
\mathbf{L} &=& \mathbf{D}-\mathbf{W}
\end{eqnarray}

The matrix $L_{ij}$ above (with $\mathbf{D}$ diagonal, and summation over repeated indices implied) is actually a close cousin of the Laplacian operator $\nabla^2$, but we have embedded things in a high-dimensional space because of the exponentiation. Notice that, absorbing the constant factor into $\mathbf{L}$, our likelihood on the configuration of labels now looks exactly like a Gaussian random field:

\begin{eqnarray}
P(\lbrace y \rbrace) &=& \frac{1}{Z}e^{-y_i L_{ij} y_j/2}
\end{eqnarray}

such that $\langle y_i \rangle =0$ and $\langle y_i y_j \rangle_c = L^{-1}_{ij}$. This discrete pdf on labels is precisely the same as if we had made everything continuous from the get-go:

\begin{eqnarray}
E[y(x)] &=& \frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime) \\
K(x,x^\prime) &=& e^{-\vert x-x^\prime \vert^2/2\sigma^2} \\
P(y(x)) &=& \frac{1}{Z}e^{-\frac{1}{2}\int dx dx^\prime y(x) y(x^\prime) K(x,x^\prime)}
\end{eqnarray}

which is a Gaussian random field on the labels, $y(x)$, imposing an RBF correlation function between points $x$. When we integrate the Lagrangian by parts, we get the continuous equivalent of $L_{ij}$, which is essentially $\nabla^2$ in some new space.

So why do we care about all of this? Well, it turns out that the algorithms people use to propagate labels work exactly like the Helmholtz equation. For instance, one of the easiest things you can do given labeled examples $L$, is to propagate or ``flow'' the $y$'s to unlabeled points by the following procedure:

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1} \mathbf{W} y_t
\end{eqnarray}

which is the same as the Helmholtz equation:

\begin{eqnarray}
\left(\frac{\partial}{\partial t}+\nabla^2 \right) y_t &=& 0 \\
y_{t+1}-y_t &=& -\nabla^2 y_t \\
y_{t+1} &=& \left(1-\nabla^2 \right) y_t
\end{eqnarray}

and now note that if we replace $\nabla^2$ with $1-\mathbf{D}^{-1}\mathbf{W}$ -- the random-walk-normalized graph Laplacian -- we get

\begin{eqnarray}
y_{t+1} &=& \mathbf{D}^{-1}\mathbf{W}  y_t
\end{eqnarray}

This is PRECISELY the update scheme -- in discrete form -- of Helmholtz dynamics, or heat diffusion. Notice that when $\frac{\partial}{\partial t}y_t =0$, we're going to have a harmonic field, which is a fancy way of saying the Laplacian of the labels is zero. (This implies some other nice things: every label is the average of the labels around it, the maximum of the labels $y$ will always exist on the boundary, etc.) Zhu, Ghahramani and Lafferty explain this algorithm best when they say the labels are undergoing diffusion, with Dirichlet boundary conditions -- fixed label constraints -- on the labeled points $\mathbf{x}_l$.
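Here's a minimal numpy sketch of that clamped diffusion (toy one-dimensional data, an arbitrary kernel width, and four labeled points):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])
y = np.zeros(40)
labeled = np.zeros(40, dtype=bool)
labeled[:2] = labeled[20:22] = True   # two labeled points per cluster

W = np.exp(-(x[:, None] - x[None]) ** 2 / 2.0)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)  # D^{-1} W

for _ in range(100):
    y = P @ y                          # diffuse: y_{t+1} = D^{-1} W y_t
    y[labeled] = [0, 0, 1, 1]          # Dirichlet: clamp labeled points
print(y.round(2))                      # unlabeled points pick up labels
```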

We are doing all of this not in Euclidean space but somewhere else, though, due to our choice of metric; we could have chosen $W_{ij}$ to be whatever we liked, as long as it goes inversely with distance. Let $g(x_i,x_j)$ be some metric; then, more generally, we have:

\begin{eqnarray}
W_{ij} &=& f\left[g(x_i,x_j) \right]
\end{eqnarray}

and so $g$ defines the space in which we're taking derivatives. Things don't have to be Euclidean!

Sunday, January 3, 2016

Facility Location as Adaptive K-means

In linear programming, a typical problem is the following: we have $N$ facilities, which we can choose to open or not open, in order to serve $M$ customers, who are strewn throughout the world. We would like to minimize the operational cost of serving our customers from these facilities, which can be summarized as:

\begin{eqnarray}
\mathrm{startup} \ \mathrm{costs} &=& \sum_f s_f y_f\\
y_f &=& 0,1
\end{eqnarray}

Where $y_f$ is a binary variable -- zero or one -- based on whether we choose to open facility $f$, and $s_f$ is the associated cost for opening that specific facility. We also have transit costs from serving our customers, which might be of the very simple form:

\begin{eqnarray}
\mathrm{transit} \ \mathrm{costs} &=& \sum_c \sum_f X_{cf} D(\mathbf{x}_c, \mathbf{x}_f)\\
X_{cf} &=& 0,1
\end{eqnarray}

Where $X_{cf}$ is a binary matrix that denotes whether customer $c$ was assigned to facility $f$, and $D(x_1,x_2)$ is our distance metric, which could be Euclidean or "Manhattan" -- where we only take steps left or right, up or down, with no diagonal lines between destinations -- or maybe just a query in Google Maps.

Adding these two costs together, we get a loss function of sorts:

\begin{eqnarray}
J &=&  \sum_c \sum_f X_{cf} D(\mathbf{x}_c, \mathbf{x}_f)+ \sum_f s_f y_f
\end{eqnarray}

which we would like to minimize. There are some simple constraints on our binary variables that we can articulate mathematically. Say each facility has an associated production capacity $c_f$, and each customer an associated demand $d_c$. Then we have to make sure that the assignments are not overworking our facilities:

\begin{eqnarray}
\sum_c d_c X_{cf} & \le & c_f y_f \ \forall \ f
\end{eqnarray}

Notice I've multiplied by $y_f$ on the RHS. Another constraint we need to have is that every customer is served by exactly one facility -- no redundancy:

\begin{eqnarray}
\sum_f X_{cf} &=& 1\ \forall \ c
\end{eqnarray}

We can also require more explicitly that facilities who are turned off do not get any assignments:

\begin{eqnarray}
X_{cf} & \le & y_f \ \ \ \forall \ c,f
\end{eqnarray}

That's a ton of constraints, but each of them is linear, and so we have a well-defined linear program. Given $N$ facility locations and $M$ customer locations, with the associated capacities, demands, and startup costs, we can find the global optimum of this thing with $N(M+1)$ decision variables, $X$ and $y$.
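As a sketch, here's the facility location program written out in PuLP, one of several open-source MILP front-ends (the locations, costs, and capacities are all toy random numbers):

```python
import numpy as np
import pulp

rng = np.random.default_rng(4)
N_f, M_c = 5, 20                                  # facilities, customers
fac = rng.normal(size=(N_f, 2))
cust = rng.normal(size=(M_c, 2))
D = np.linalg.norm(cust[:, None] - fac[None], axis=-1)  # transit costs
s = rng.uniform(1, 5, N_f)                        # startup costs
cap, dem = 10, 1                                  # uniform capacity/demand

prob = pulp.LpProblem("facility_location", pulp.LpMinimize)
y = [pulp.LpVariable(f"y{f}", cat="Binary") for f in range(N_f)]
X = [[pulp.LpVariable(f"x{c}_{f}", cat="Binary") for f in range(N_f)]
     for c in range(M_c)]

# Objective: transit costs plus startup costs
prob += (pulp.lpSum(float(D[c, f]) * X[c][f]
                    for c in range(M_c) for f in range(N_f))
         + pulp.lpSum(float(s[f]) * y[f] for f in range(N_f)))

for c in range(M_c):                  # every customer served exactly once
    prob += pulp.lpSum(X[c]) == 1
for f in range(N_f):                  # capacity, only if facility is open
    prob += pulp.lpSum(dem * X[c][f] for c in range(M_c)) <= cap * y[f]

prob.solve()
print([pulp.value(v) for v in y])     # which facilities to open
```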

Turns out this problem is very similar to K-means, because what we are essentially doing is breaking our customers down into "facility clusters". What if we promoted all customers to facilities themselves? (Or at least gave them the option.) Then the distance function $D(\mathbf{x}_1,\mathbf{x}_2)$ is simply the $L_2$ distance associated with a mixture of Gaussians with identity covariance matrix. Let $\mathbf{x}_2$ be the facility, or centroid, of the $k^\mathrm{th}$ cluster; then we have:

\begin{eqnarray}
\mathrm{Let} \ \ \mathbf{x}_2 &=& \vec{\mu}_k \\
\mathbf{\Sigma} &=& \mathbf{1} \\
\mathrm{Then} \ \ D(\mathbf{x}_1,\mathbf{x}_2) &=& \sqrt{\left(\mathbf{x}_1-\vec{\mu}_k\right)^\top \mathbf{\Sigma}^{-1}\left(\mathbf{x}_1-\vec{\mu}_k\right)}
\end{eqnarray}

For mixture-of-Gaussians density estimation (with equal mixing weights), our PDF is:

\begin{eqnarray}
P(\mathbf{x}) &=& \frac{1}{K}\sum_{k=1}^K \mathcal{N}\left(\mathbf{x} \,\vert\, \mu_k,\mathbf{\Sigma} \right)
\end{eqnarray}

and the log likelihood for our sequence of customers/facilities is:

\begin{eqnarray}
-\log \mathcal{L}(X \vert \lbrace \mu \rbrace_{k=1}^K )&=& \sum_k \sum_{ n \in c_k} \frac{\left(\mathbf{x}_n-\mathbf{\mu}_k \right)^2}{2\sigma^2}+K\sqrt{2\pi \sigma^2}
\end{eqnarray}

I've kept the $\sigma^2$ terms for clarity. Notice that the second term in this negative log likelihood, or loss function, comes from the normalization factor: it penalizes a high number of total clusters $K$, preventing overfitting. This loss function -- apart from the quadratic nature of the summand -- is exactly like the objective in our facility location problem. The startup cost for every "facility" in this example would be

\begin{eqnarray}
s_f &=& \sqrt{2\pi \sigma^2} \ \ \forall f
\end{eqnarray}

And the variance of our gaussians we would set to unity: $\sigma \to 1$. Let's rewrite the negative log likelihood, but in decision variable language:

\begin{eqnarray}
J^\prime &=&  \frac{1}{2}\sum_i \sum_j X_{ij} D(\mathbf{x}_i, \mathbf{x}_j)^2+ \sum_j s_j y_j \\
s_j &=& \sqrt{2\pi}
\end{eqnarray}

We have not yet specified the capacity of our clusters or the "demand" of each data point. A simple choice would be to set some upper limit on the size of each cluster, and give every data point the same weight or demand:

\begin{eqnarray}
d_i &=& 1 \ \ \forall \ i \\
c_j &=& c_\mathrm{max} \ \ \forall \ j
\end{eqnarray}

This is still a linear program, just with a slightly different list of coefficients on the first term! To solve such a system we would need $N(N+1)$ decision variables, which could be prohibitive for large systems -- and integer programming is NP-hard in the worst case -- but thanks to interior-point solvers and very clever routines for integer programming, we can actually run this "adaptive" K-means algorithm on sizable instances in reasonable time.