Saturday, July 16, 2016

Estimating the Variance of the Survival Function: Full History vs. Snapshots

In survival analysis, the key quantity of interest is something called the survival function, $S(t)$, which is the probability that I live at least until time $t$:

\begin{eqnarray}
S(t) &=& P(T \ge t)
\end{eqnarray}

along with something called the hazard function, which is the probability that I die at time $t$, given that I've lived up until then:

\begin{eqnarray}
\lambda(t) &=& P(T=t \vert T \ge t) = P(T=t)/S(t)
\end{eqnarray}
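
Before going further, here is a minimal numerical sketch of both definitions, assuming discrete lifetimes drawn from a geometric distribution (an assumed toy model, chosen because its hazard is constant):

\begin{verbatim}
import numpy as np

# Empirical survival and hazard functions from simulated discrete
# lifetimes. The geometric distribution is an assumed toy model;
# its hazard is constant and equal to the success probability, 0.1.
rng = np.random.default_rng(0)
T = rng.geometric(p=0.1, size=100_000)  # death times t = 1, 2, 3, ...

t_grid = np.arange(1, 31)
S = np.array([(T >= t).mean() for t in t_grid])           # S(t) = P(T >= t)
hazard = np.array([(T == t).mean() for t in t_grid]) / S  # P(T = t) / S(t)

print(hazard[:5])  # each entry should hover near the true hazard, 0.1
\end{verbatim}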

This hazard is a conditional probability, and comes about because survival analysis and survival-like problems are implicitly \textit{sequential}.

When estimating $S(t)$ from data, one often uses the Kaplan-Meier estimator, which is a cumulative product built from the number of people who ``died'' at time $t$, $d_t$, and the number of people who were still ``alive'' (at risk) at time $t$, $n_t$:

\begin{eqnarray}
\hat{S}(t) &=& \prod_{t^\prime < t} \left(1- \frac{d_{t^\prime}}{n_{t^\prime}}\right)
\end{eqnarray}

This is actually just a cumulative product of time-step survival probabilities, or one minus the hazard:
\begin{eqnarray}
\hat{S}(t) &=& \prod_{t^\prime < t} \left(1- \lambda_{t^\prime} \right)\\
&=&  \prod_{t^\prime < t} p_{t^\prime}
\end{eqnarray}
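
Here is a minimal sketch of that product, assuming we already have the death counts $d_t$ and at-risk counts $n_t$ on a discrete time grid (the counts below are made up for illustration):

\begin{verbatim}
import numpy as np

# Kaplan-Meier as a cumulative product of per-step survival
# probabilities; d and n are toy counts, not real data.
d = np.array([5, 8, 3, 6, 2])        # deaths at each time step
n = np.array([100, 95, 87, 84, 78])  # number "alive" (at risk) at each step

hazard = d / n         # lambda_t = d_t / n_t
p = 1.0 - hazard       # per-step survival probability p_t
S_hat = np.cumprod(p)  # running product of the p_t's
print(S_hat)
\end{verbatim}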

If we were to ask ourselves, ``what's the variance of this estimator?'', we'd have to use some fancy tricks. The first is noticing that we don't have good ways of combining variances in a \textbf{product}, but we do have good ways of combining variances in a \textbf{sum}. So let's take the log of our estimator:

\begin{eqnarray}
\log\left(\hat{S}(t)\right) &=& \sum_{t^\prime < t} \log \left(1- \lambda_{t^\prime} \right)\\
&=&  \sum_{t^\prime < t} \log(p_{t^\prime})
\end{eqnarray}
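
As a quick sanity check that the log really does turn the running product into a running sum (with assumed toy values for the $p_t$'s):

\begin{verbatim}
import numpy as np

# log of a cumulative product == cumulative sum of logs
p = np.array([0.95, 0.92, 0.97, 0.93, 0.96])  # assumed toy values
assert np.allclose(np.log(np.cumprod(p)), np.cumsum(np.log(p)))
\end{verbatim}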

And note that the variance of the log can be computed by a simple Taylor expansion of the random variable about its mean:

\begin{eqnarray}
X & \sim & ? \\
\langle X \rangle &=& \mu \\
\mathbf{Var}(X) &=& \sigma^2 \\
\log(X) &\approx & \log(\mu) + \frac{X-\mu}{\mu}+O\left((X-\mu)^2\right)\\
\mathbf{Var}\left(\log(X)\right)  &=& 0 + \frac{\mathbf{Var}(X)}{\mu^2}\\
&=& \frac{\sigma^2}{\mu^2}
\end{eqnarray}
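
We can check this delta-method identity numerically, with an assumed toy distribution that is reasonably concentrated around its mean:

\begin{verbatim}
import numpy as np

# Monte Carlo check of Var(log X) ~ Var(X) / mu^2.
rng = np.random.default_rng(1)
mu, sigma = 10.0, 0.5  # assumed toy parameters
X = rng.normal(mu, sigma, size=1_000_000)

print(np.var(np.log(X)))  # empirical variance of log(X)
print(sigma**2 / mu**2)   # delta-method approximation: 0.0025
\end{verbatim}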

So, applying this identity to our estimator, whose log is the sum above, we have:

\begin{eqnarray}
\mathbf{Var}\left(\log\left(\hat{S}(t)\right)\right) &=& \frac{1}{\hat{S}(t)^2} \mathbf{Var}\left(\hat{S}(t)\right)
\end{eqnarray}

Rearranging, and writing the log as the sum above, we have

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2\, \mathbf{Var}\left(\sum_{t^\prime < t} \log(p_{t^\prime})\right)
\end{eqnarray}
Luckily, if we assume independence, the variance of the sum is the sum of the variances, so we can treat each estimated $p_t$ as an independent binomial proportion, with variance $p_t(1-p_t)/n_t$, where $n_t$ is the ``sample size'' of our survival curve at time $t$.

Working through some nasty algebra, and another use of the variance-of-the-log identity, we get:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2  \sum_{t^\prime < t} \mathbf{Var}\left( \log(p_{t^\prime})\right) \\
&=& \hat{S}(t)^2  \sum_{t^\prime < t} \frac{1}{p_{t^\prime}^2}\mathbf{Var}\left( p_{t^\prime}\right)\\
&=& \hat{S}(t)^2 \sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime} p_{t^\prime}}
\end{eqnarray}
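
In code, this variance is a one-liner on top of the Kaplan-Meier sketch above; the counts are again assumed toys:

\begin{verbatim}
import numpy as np

# Greenwood-style variance from toy death counts d_t and
# at-risk counts n_t.
d = np.array([5, 8, 3, 6, 2])
n = np.array([100, 95, 87, 84, 78])

p = 1.0 - d / n
S_hat = np.cumprod(p)

# Var(S(t)) = S(t)^2 * sum_{t'} (1 - p_t') / (n_t' * p_t')
var_S = S_hat**2 * np.cumsum((1.0 - p) / (n * p))
print(np.sqrt(var_S))  # standard error of the curve at each time step
\end{verbatim}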

This is precisely Greenwood's formula, and we see that the variance of the estimator goes like the cumulative sum of one over the sample size at each time $t$:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t)\right) &\sim & \sum_{t^\prime < t}  \frac{1}{n_{t^\prime}}
\end{eqnarray}

Now, when dealing with very large data, say billions of survival events, it can be difficult to get these death counts as a function of time, due to a few implementation details and the resistance of cumulative sums to parallelization. So what people often do is to estimate the survival curve on each of $M$ snapshots of the data, and then take the average of the snapshot estimates:
\begin{eqnarray}
\hat{S}_M(t) &=& \sum_{m=1}^M \frac{\hat{S}_m(t)}{M} \\
&=& \sum_{m=1}^M \frac{\prod_{t^\prime < t} \left(1-\lambda_{mt^\prime} \right)}{M}
\end{eqnarray}
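
A minimal sketch of this snapshot strategy, assuming uncensored events that we shard into $M$ equal pieces (the helper km_curve and all parameters are made up for illustration):

\begin{verbatim}
import numpy as np

def km_curve(times, t_grid):
    """Kaplan-Meier curve from uncensored death times on a fixed grid."""
    n_t = np.array([(times >= t).sum() for t in t_grid])
    d_t = np.array([(times == t).sum() for t in t_grid])
    return np.cumprod(1.0 - d_t / np.maximum(n_t, 1))  # guard n_t = 0

rng = np.random.default_rng(2)
T = rng.geometric(p=0.1, size=100_000)  # assumed toy lifetimes
t_grid = np.arange(1, 31)

M = 10
shards = np.array_split(rng.permutation(T), M)
S_M = np.mean([km_curve(s, t_grid) for s in shards], axis=0)
S_full = km_curve(T, t_grid)
print(np.abs(S_M - S_full).max())  # the two estimates should be close
\end{verbatim}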

This estimator will have the same mean as our ``full history'' estimator, but slightly different variance properties. Assuming the snapshot estimates are independent and identically distributed, the variance of their mean goes like one over the number of snapshots:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}_M(t) \right) &=& \frac{1}{M}\mathbf{Var}\left(\hat{S}_m(t) \right)
\end{eqnarray}

But what's the variance of each snapshot estimate? Simply our old formula, with the per-snapshot population count $n_{mt}$ in place of $n_t$. Or, in English, the number of people who were ``alive'' at time $t$ in snapshot $m$, rather than the total number of people who were alive at time $t$. Strictly, $n_{mt} < n_t$ for $M > 1$. If we assume our snapshots are evenly populated with ``alive'' people at each time $t$, we will have $n_{mt} M \approx n_t$.

And so, comparing the variance of our estimators, we see:

\begin{eqnarray}
\mathbf{Var}\left(\hat{S}(t) \right) &=& \hat{S}(t)^2 \sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime} p_{t^\prime}} \\
\mathbf{Var}\left(\hat{S}_M(t) \right) &=& \frac{1}{M}\hat{S}_M(t)^2 \sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime} p_{mt^\prime}}
\end{eqnarray}

Taking the ratio of the variances, the prefactors cancel since the means are equal, $\hat{S}(t) = \hat{S}_M(t)$, and we get:

\begin{eqnarray}
\frac{\mathbf{Var}\left(\hat{S}_M(t) \right)}{\mathbf{Var}\left(\hat{S}(t) \right)} &=&  \frac{\frac{1}{M}\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime} p_{mt^\prime}}}{\sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime} p_{t^\prime}}}
\end{eqnarray}

Assuming equal sample sizes across snapshots, we can make the replacement $n_{mt} = n_t/M$, and the factors of $M$ cancel:

\begin{eqnarray}
\frac{\mathbf{Var}\left(\hat{S}_M(t) \right)}{\mathbf{Var}\left(\hat{S}(t) \right)} &\approx & \frac{\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{t^\prime} p_{mt^\prime}}}{\sum_{t^\prime < t} \frac{1-p_{t^\prime}}{n_{t^\prime} p_{t^\prime}}}
\end{eqnarray}

And, assuming $p_{mt^\prime} \approx p_{t^\prime}$ for all $t^\prime$, we get the very simple ratio:

\begin{eqnarray}
\frac{\mathbf{Var}\left(\hat{S}_M(t) \right)}{\mathbf{Var}\left(\hat{S}(t) \right)} &\approx & 1
\end{eqnarray}
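
We can also check this ratio by brute force, repeating both estimates over many simulated datasets (everything here is an assumed toy setup):

\begin{verbatim}
import numpy as np

# Monte Carlo comparison of the two estimators' variance at a fixed time.
rng = np.random.default_rng(3)
t_grid = np.arange(1, 21)
M, N, trials = 10, 10_000, 200  # snapshots, events per trial, repetitions

def km_curve(times):
    n_t = np.array([(times >= t).sum() for t in t_grid])
    d_t = np.array([(times == t).sum() for t in t_grid])
    return np.cumprod(1.0 - d_t / np.maximum(n_t, 1))

full, snap = [], []
for _ in range(trials):
    T = rng.geometric(p=0.05, size=N)  # assumed toy lifetimes
    full.append(km_curve(T)[-1])       # full-history estimate at t = 20
    shards = np.array_split(rng.permutation(T), M)
    snap.append(np.mean([km_curve(s)[-1] for s in shards]))

print(np.var(snap) / np.var(full))  # should come out near 1
\end{verbatim}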

What does this mean for us? It means that the variances of the two methods are comparable, which is surprising! If we want to probe deeper, and understand whether there really is a difference between the two sampling strategies, we would have to closely inspect the cumulative sum:

\begin{eqnarray}
\sum_{t^\prime < t} \frac{1-p_{mt^\prime}}{n_{mt^\prime} p_{mt^\prime}}
\end{eqnarray}