Wednesday, May 18, 2016

Fisher's Exact Test, Lift, and P-Values for Feature Exploration


Let's say you are asked to solve a binary classification problem ($y=0,1$) with very few training examples ($N<1000$) and quite a few possibly predictive features ($d>1000$). The standard question of "how the heck do I feature select?" becomes very relevant, and in particular, "how the heck do I feature select with so few training examples?!?".

For categorical features, one of the best ways to test for significance -- i.e. a non-null relationship between a feature column and a label column -- is Fisher's exact test, together with Laplace-smoothed lift.

Fisher's exact test is a combinatoric way of examining a contingency (or pivot) table. Let's say we have two columns $x$ and $y$, which both take on, in the simplest case, only two values: true and false. If we were to make a pivot table, we'd have the number of pair-wise events in a 2x2 grid.

\begin{eqnarray}
\mathrm{pivot}(x,y) &=& \left( \begin{array}{cc} n_{00} & n_{01} \\
n_{10} & n_{11}
\end{array}\right)
\end{eqnarray}

Where $n_{11}$ corresponds to the number of times $x$ and $y$ were both true, $n_{00}$ to the number of times $x$ and $y$ were both false, $n_{10}$ to the number of times $x$ was true and $y$ false, etc. (This can be done easily in pandas with something like \textit{pandas.crosstab(dataFrame.x, dataFrame.y)}, or an equivalent \textit{pivot_table} call.)
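
For concreteness, here's a minimal sketch of building such a table -- the DataFrame and its boolean columns are just made-up stand-ins:

import pandas as pd

# Toy data: two boolean columns (the column names are illustrative only).
df = pd.DataFrame({
    "x": [True, True, False, False, True, False, True, False],
    "y": [True, False, False, True, True, False, False, False],
})

# Counts of pair-wise events; rows index x, columns index y.
table = pd.crosstab(df["x"], df["y"])
print(table)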

We can see that the sum of the entries, $n_{11}+n_{01}+n_{10}+n_{00}=N$, equals the number of training examples, and that the sums over rows or columns equal the marginal counts: $n_{11}+n_{10}=R_1$ is the number of times $x$ was true; $n_{01}+n_{11}=C_1$ is the number of times $y$ was true; $n_{01}+n_{00}=R_0$ is the number of times $x$ was false, etc.

What Fisher proposed is to take this matrix and ask: given the marginal counts $R_0,R_1,C_0,C_1$ -- which, if you think about it, correspond to the prior probabilities on $x$ and $y$: $P(x)=\frac{R_1}{N},P(y)=\frac{C_1}{N}$ -- how likely is the resulting contingency table if $x$ and $y$ are independent?

The most naive way to answer that question is to take the prior probabilities on $x$ and $y$ that are given to us by the data:

\begin{eqnarray}
P(x)=p &=& \frac{R_1}{N} \\
P(y)=q &=& \frac{C_1}{N}
\end{eqnarray}

and quote our good old multinomial:

\begin{eqnarray}
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} \left[pq\right]^{n_{11}}\left[p(1-q)\right]^{n_{10}}\left[(1-p)q\right]^{n_{01}}\left[(1-p)(1-q)\right]^{n_{00}}
\end{eqnarray}

which can be simplified to:

\begin{eqnarray}
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} (p)^{n_{10}+n_{11}}(1-p)^{n_{00}+n_{01}}(q)^{n_{11}+n_{01}}(1-q)^{n_{10}+n_{00}}\\
P(\mathbf{n}) &=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!} (p)^{R_1}(1-p)^{R_0}(q)^{C_1}(1-q)^{C_0}
\end{eqnarray}
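
To make the formula concrete, here's a small sketch that evaluates this independence multinomial for an observed 2x2 table -- the helper name is mine, and I use log-factorials via scipy's gammaln just for numerical stability:

import numpy as np
from scipy.special import gammaln

def multinomial_null_prob(n00, n01, n10, n11):
    """P(table) under independent priors p = R1/N, q = C1/N estimated from the margins."""
    N = n00 + n01 + n10 + n11
    R1 = n10 + n11          # number of times x was true
    C1 = n01 + n11          # number of times y was true
    p, q = R1 / N, C1 / N
    log_coeff = gammaln(N + 1) - sum(gammaln(n + 1) for n in (n00, n01, n10, n11))
    log_cells = (n11 * np.log(p * q) + n10 * np.log(p * (1 - q))
                 + n01 * np.log((1 - p) * q) + n00 * np.log((1 - p) * (1 - q)))
    return np.exp(log_coeff + log_cells)

print(multinomial_null_prob(20, 5, 4, 11))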


The probability distribution above assumes that there is no relationship between $x$ and $y$, and so, if we see a contingency table that is very unlikely under this distribution, we know something is up! But how sure are we that the prior estimates $p,q$ are correct? Our sample size $N$ was very small. That's a troubling question, and one way to address it is Laplace smoothing, which puts a uniform prior on $P(x)$ and $P(y)$:

\begin{eqnarray}
P(x)=p &=& \frac{R_1+\alpha}{N+\alpha d_x}
\end{eqnarray}
where $d_x$ is the number of distinct values $x$ can take on -- in this case two. And similarly, for $y$ we'd have the prior:

\begin{eqnarray}
P(y)=q &=& \frac{C_1+\alpha}{N+\alpha d_y}
\end{eqnarray}
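
In code, these smoothed estimates are a one-liner (the helper name is mine, not a standard API):

def smoothed_prior(count, N, n_values, alpha=1.0):
    """Laplace-smoothed probability estimate: (count + alpha) / (N + alpha * n_values)."""
    return (count + alpha) / (N + alpha * n_values)

# e.g. x was true in R1 = 3 of N = 10 rows, and x is binary (d_x = 2):
p = smoothed_prior(3, 10, 2, alpha=1.0)   # 4/12 ~ 0.33, pulled toward 0.5 from the raw 0.30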

This helps a little bit: $\alpha$ is a hyperparameter between 0 and 1 that controls the "strength" of our uniform prior. But one might also worry whether using a multinomial is even appropriate, given the highly discrete nature of our contingency table when we have very few data points $N$.

Fisher's exact test explicitly addresses this discreteness aspect through combinatorics.

Let's recall the experiment where a drunken man throws $N$ darts at a dartboard with $d$ cells. The number of different ways this drunken dart player can get $n_1$ darts in the first cell, $n_2$ in the second, $n_3$ in the third, etc., is:

\begin{eqnarray}
W &=& \frac{N!}{n_1!n_2!\cdots n_d!}
\end{eqnarray}

Taking the log of this combinatoric factor and applying Stirling's approximation, we get $N$ times the Shannon entropy:

\begin{eqnarray}
\log W &\approx& -N\sum_{i=1}^d p_i \log p_i \\
p_i &=& \frac{n_i}{N}
\end{eqnarray}
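
Here's a quick numeric sanity check of that approximation, with made-up counts; the gap between the two numbers is the sub-leading Stirling correction, and it shrinks (relatively) as $N$ grows:

import numpy as np
from scipy.special import gammaln

counts = np.array([40, 25, 20, 15])                  # made-up cell counts n_i
N = counts.sum()
p = counts / N

log_W = gammaln(N + 1) - gammaln(counts + 1).sum()   # exact log(N! / prod n_i!)
entropy_term = -N * np.sum(p * np.log(p))            # N times the Shannon entropy

print(log_W, entropy_term)                           # ~125 vs ~132 here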

This is all very interesting because, if we look at the contingency table above, we need only promote our counts $n_{00},n_{01},n_{10},n_{11}$ to a compound index, $n_1,n_2,n_3,n_4$, and we get the same formula:

\begin{eqnarray}
W &=& \frac{N!}{n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

This is the number of ways one can get a contingency table with the counts $n_{ij}$. But this is NOT a probability. It is simply a multiplicity count of some "state" in phase space, $n_{ij}$. (You'll see above that it's the normalization factor of the multinomial.) If we want to convert this multiplicity count to a probability, we have to be like Kolmogorov and divide by the multiplicity of the entire sample space $\Omega$. After all,

\begin{eqnarray}
P(x \in X) &=& \frac{\vert X \vert}{\vert \Omega \vert}
\end{eqnarray}

where I'm using bars for "multiplicity", the count of phase-space cells within some region.

For our contingency table above, we can define precisely what the sample space $\Omega$ is: all contingency tables with the marginal sums $R_0,R_1$ and $C_0,C_1$. This can be written as a compound combinatoric factor:

\begin{eqnarray}
\vert \Omega \vert &=& \left(\begin{array}{c}
N \\ C_0
\end{array}\right)
\left(\begin{array}{c}
N \\ R_0
\end{array}\right)\\
&=& \frac{N! N!}{R_0!R_1!C_0!C_1!}
\end{eqnarray}
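
As a sanity check that this really is the size of the sample space, we can brute-force sum the multiplicity $W$ over every table consistent with some made-up margins and compare against the product of binomial coefficients:

from math import comb, factorial

N, R1, C1 = 12, 5, 7                       # made-up margins
R0, C0 = N - R1, N - C1

def multiplicity(n00, n01, n10, n11):
    return factorial(N) // (factorial(n00) * factorial(n01) * factorial(n10) * factorial(n11))

# Enumerate all tables with these margins: n11 alone determines the other cells.
total = sum(multiplicity(R0 - (C1 - n11), C1 - n11, R1 - n11, n11)
            for n11 in range(max(0, R1 + C1 - N), min(R1, C1) + 1))

print(total, comb(N, R0) * comb(N, C0))    # the two counts agree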

And so we have, doing our division:

\begin{eqnarray}
P(n_{00},n_{01},n_{10},n_{11} \vert R_1,R_0,C_1,C_0 ) &=& \frac{R_0!R_1!C_0!C_1!}{N! n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

Let the joint event $n_{00},n_{01},n_{10},n_{11}$ be specified by $\mathbf{n}$. Then we can write

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &=& \frac{R_0!R_1!C_0!C_1!}{N! n_{00}!n_{01}!n_{10}!n_{11}!}
\end{eqnarray}

or, more generally, for non-binary categorical variables (check it!):

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &=& \frac{\prod_{i=1}^{d_x}R_i! \prod_{j=1}^{d_y} C_j!}{N! \prod_{i,j} n_{ij}!}
\end{eqnarray}
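
Here's a sketch of that exact probability for the 2x2 case, cross-checked against scipy's hypergeometric distribution (the function name is mine; with the margins fixed, $n_{11}$ is hypergeometric, so the two numbers should match):

import numpy as np
from scipy.special import gammaln
from scipy.stats import hypergeom

def exact_table_prob(n00, n01, n10, n11):
    """P(table | margins) = R0! R1! C0! C1! / (N! * product of n_ij!)."""
    N = n00 + n01 + n10 + n11
    R0, R1 = n00 + n01, n10 + n11
    C0, C1 = n00 + n10, n01 + n11
    log_p = (sum(gammaln(m + 1) for m in (R0, R1, C0, C1))
             - gammaln(N + 1)
             - sum(gammaln(n + 1) for n in (n00, n01, n10, n11)))
    return np.exp(log_p)

n00, n01, n10, n11 = 20, 5, 4, 11
N, R1, C1 = n00 + n01 + n10 + n11, n10 + n11, n01 + n11

print(exact_table_prob(n00, n01, n10, n11))
print(hypergeom.pmf(n11, N, C1, R1))       # same number: n11 is hypergeometric given the margins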

This is a very interesting formula, because it gives the precise, discrete probability of seeing some contingency table conforming to the marginal counts $\mathbf{R}$ and $\mathbf{C}$. With a little bit of algebra, one can see that this combinatoric probability converges -- at least to leading exponential order, since the crude Stirling approximation below drops the $\sqrt{2\pi n}$ prefactors -- to the multinomial we quoted above, by noting:

\begin{eqnarray}
N! &\approx & \left(\frac{N}{e}\right)^N \quad (N \to \infty)
\end{eqnarray}

and so we get:

\begin{eqnarray}
P(n_{00},n_{01},n_{10},n_{11} \vert R_1,R_0,C_1,C_0 ) &\approx & \frac{(\frac{R_0}{e})^{R_0}(\frac{R_1}{e})^{R_1}(\frac{C_0}{e})^{C_0}(\frac{C_1}{e})^{C_1}}{(N/e)^N (\frac{n_{00}}{e})^{n_{00}}(\frac{n_{01}}{e})^{n_{01}}(\frac{n_{10}}{e})^{n_{10}}(\frac{n_{11}}{e})^{n_{11}}}
\end{eqnarray}

All the factors of $e$ cancel out. Re-applying Stirling's approximation to $N!/\prod_{i,j} n_{ij}!$ and writing $p=\frac{R_1}{N}$, $q=\frac{C_1}{N}$ -- so that, for the typical table under independence, $pq \approx \frac{n_{11}}{N}$, $p(1-q) \approx \frac{n_{10}}{N}$, $(1-p)q \approx \frac{n_{01}}{N}$, and $(1-p)(1-q) \approx \frac{n_{00}}{N}$ -- we can simplify to:

\begin{eqnarray}
P(\mathbf{n} \vert \mathbf{R},\mathbf{C} ) &\approx& \frac{N!}{\prod_{i,j} n_{ij}! }\left(\frac{R_0}{N}\right)^{R_0}\left(\frac{R_1}{N}\right)^{R_1}\left(\frac{C_0}{N}\right)^{C_0}\left(\frac{C_1}{N}\right)^{C_1} \\
&=& \frac{N!}{n_{00}!\ n_{10}!\ n_{11}!\ n_{01}!}\ (p)^{n_{10}+n_{11}}(1-p)^{n_{00}+n_{01}}(q)^{n_{11}+n_{01}}(1-q)^{n_{10}+n_{00}}
\end{eqnarray}

The same multinomial formula we found above!

This really isn't so surprising, as it says the combinatoric probability approaches the multinomial with the fully continuous priors $p,q$ in the large-sample $N \to \infty$ limit, but it is interesting to note.

Now, Fisher, when quoting p-values -- i.e., significance tests for a relationship between $x$ and $y$ -- would simply sum the probabilities of the contingency tables whose counts $\mathbf{n}$ are more extreme than what's observed. For instance, let's say we observe a true/true $x,y$ occurrence that is higher than expected under the marginals, $n_{11} > R_1 C_1 / N$ (equivalently, $n_{11} > N p q$): what's the sum of the probabilities of tables that have an even \textit{higher} $n_{11}$? This is called a one-tailed p-value significance test, and for both Fisher's exact test and the multinomial method it corresponds to a simple sum.

I won't get too into the details of implementation now, but suffice it to say, scipy has a Fisher exact test all on its own: \textit{scipy.stats.fisher_exact}!
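
For example -- the table values below are made up -- the one-tailed test described above is a one-liner with scipy.stats.fisher_exact, and we can also spell out the "simple sum" by hand with the hypergeometric survival function and check that the two p-values agree:

from scipy.stats import fisher_exact, hypergeom

# Observed 2x2 table: rows are x = (False, True), columns are y = (False, True).
n00, n01, n10, n11 = 20, 5, 4, 11
table = [[n00, n01],
         [n10, n11]]

# One-tailed test: is n11 higher than expected under independence?
odds_ratio, p_value = fisher_exact(table, alternative="greater")

# The same thing as an explicit sum over tables with an even higher n11.
N, R1, C1 = n00 + n01 + n10 + n11, n10 + n11, n01 + n11
p_by_hand = hypergeom.sf(n11 - 1, N, C1, R1)   # P(count >= n11) given the margins

print(p_value, p_by_hand)                      # these match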


------------------------------------------------------------------------------------------------------------------------

Now, what hasn't been mentioned yet is lift, and it relates directly to the Laplace-smoothed priors discussed earlier. Lift is simply:

\begin{eqnarray}
l(x\vert y) &=& \frac{P(x \vert y)}{P(x)} = \frac{P(x,y)}{P(x)P(y)}
\end{eqnarray}

or, the probability of $x$ taking on some value given $y$, relative to $x$ occurring on its own. Lift is a number between zero and infinity, and basically answers: how many times more likely is $x$ to occur given $y$? For low sample sizes ($N < 1000$), it's probably a good idea to smooth the priors $P(x),P(y)$, giving us:

\begin{eqnarray}
l(x\vert y) &=&  \frac{P(x,y)}{P(x)P(y)} \\
&=& \frac{(N+\alpha d_x)(N+\alpha d_y)}{N}\frac{n_{xy}}{(n_x+\alpha)(n_y+\alpha)}
\end{eqnarray}

where $n_x,n_y$ are the event counts of $x$ and $y$, $n_{xy}$ is the joint event count of $x,y$, and the smoothed priors from before appear in the denominator.
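
A minimal sketch of this smoothed lift (the function and argument names are mine):

def smoothed_lift(n_xy, n_x, n_y, N, d_x=2, d_y=2, alpha=1.0):
    """Lift P(x,y) / (P(x) P(y)) with Laplace-smoothed marginal probabilities."""
    p_xy = n_xy / N
    p_x = (n_x + alpha) / (N + alpha * d_x)
    p_y = (n_y + alpha) / (N + alpha * d_y)
    return p_xy / (p_x * p_y)

# e.g. x and y co-occurred 11 times, x occurred 15 times, y 16 times, out of N = 40 rows:
print(smoothed_lift(11, 15, 16, 40))   # > 1 suggests x is more likely when y is present
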
------------------------------------------------------------------------------------------------------------------------

For those who are heavily implementation-focused, you'll note that this Fisher test can be done for any type of contingency table -- it doesn't have to be two binomial marginal distributions; they could easily be multinomials! -- but often the best way to do things is to one-hot encode the categorical features first (e.g. with scikit-learn's OneHotEncoder), and then do simple Fisher exact tests on each of the pairwise 2x2 contingency tables for feature selection. I hope that helps!
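
As a closing sketch, here's roughly what that screening loop might look like -- the DataFrame, column names, and p-value threshold are all made up, and I use pandas' get_dummies instead of OneHotEncoder just to keep it short:

import pandas as pd
from scipy.stats import fisher_exact

# Made-up data: a couple of categorical features and a binary label.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red", "green", "blue"],
    "size":  ["S", "L", "L", "S", "S", "L", "S", "L"],
    "label": [1, 0, 1, 0, 0, 1, 0, 0],
})

X = pd.get_dummies(df[["color", "size"]])   # one-hot encode the categorical features
y = df["label"].astype(bool)

results = {}
for col in X.columns:
    table = pd.crosstab(X[col].astype(bool), y)
    if table.shape != (2, 2):               # skip degenerate (constant) columns
        continue
    _, p = fisher_exact(table.values, alternative="two-sided")
    results[col] = p

# Keep the features whose association with the label looks non-null.
selected = [col for col, p in sorted(results.items(), key=lambda kv: kv[1]) if p < 0.05]
print(sorted(results.items(), key=lambda kv: kv[1]))
print(selected)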