Sunday, October 15, 2017

P values for Logistic Regression

A friend recently asked how to compute a p-value for each coefficient fit by a sklearn logistic regression model. It boils down to computing the log-likelihood, the Fisher information matrix, and the Cramér-Rao bound:

\begin{eqnarray}
\frac{-\partial^2 \mathcal{L}}{\partial \beta_i \partial \beta_j} &=& F_{ij}\\
\mathrm{Var}\left[\beta_i\right] & \ge & \left(F^{-1}\right)_{ii}
\end{eqnarray}

For logistic regression, this gives:

\begin{eqnarray}
p_n \equiv P(y_n = 1 \vert x_n) &=& \frac{1}{1+e^{-\vec{\beta} \cdot \vec{x}_n}}\\
\mathcal{L} &=& \sum_{n=1}^N y_n \log p_n + (1-y_n) \log \left(1-p_n \right)\\
F_{ij} &=& \sum_{n=1}^N \frac{x_{ni}x_{nj}}{2+2 \cosh\left(\vec{\beta}\cdot \vec{x}_n \right)}
\end{eqnarray}
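As a rough illustration, here is a minimal NumPy sketch of that last formula. The names `X` (the design matrix, one row per sample), `beta` (the coefficient vector), and `fisher_information` are assumptions made for this post, not part of the linked code:

import numpy as np

def fisher_information(X, beta):
    # linear predictor beta . x_n for every sample
    z = X @ beta
    # per-sample weight p_n (1 - p_n) = 1 / (2 + 2 cosh(beta . x_n))
    w = 1.0 / (2.0 + 2.0 * np.cosh(z))
    # F = X^T diag(w) X, i.e. F_ij = sum_n w_n x_ni x_nj
    return (X * w[:, None]).T @ X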


Normally one assumes each $\beta$ coefficient is normally distributed, with mean $\hat{\beta}$, the maximum likelihood estimate (which for logistic regression is simply wherever the numerical optimizer stops, since there is no analytical solution), and variance:

\begin{eqnarray}
\sigma^2_{\beta_i} &\approx & \left(F^{-1}\right)_{ii}\\
\beta_i & \sim & \mathcal{N}\left(\hat{\beta_i}, \left(F^{-1}\right)_{ii} \right)
\end{eqnarray}
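Putting the pieces together, a hedged sketch of the resulting Wald-style p-values might look like the following, assuming the `fisher_information` helper above and a fitted coefficient vector `beta_hat`; the two-sided p-value comes from the normal approximation just described:

import numpy as np
from scipy import stats

def wald_p_values(X, beta_hat):
    # covariance estimate from the inverse Fisher information (Cramer-Rao)
    cov = np.linalg.inv(fisher_information(X, beta_hat))
    se = np.sqrt(np.diag(cov))            # standard errors sigma_i
    z = beta_hat / se                     # Wald z statistics
    p = 2.0 * stats.norm.sf(np.abs(z))    # two-sided p-values
    return se, z, p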

Python code is here, where I set the regularization parameter to be essentially zero and don't fit the intercept, so the results can be compared directly with the statsmodels output.
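I won't reproduce the linked script, but the setup it describes would look roughly like this: sklearn's LogisticRegression with a very large C (so the L2 penalty is essentially switched off) and fit_intercept=False, compared against an unpenalized statsmodels Logit fit. The toy dataset and variable names below are made up for illustration, and `wald_p_values` is the helper sketched above:

import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data, purely for illustration
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# essentially unregularized sklearn fit, no intercept
clf = LogisticRegression(C=1e12, fit_intercept=False, max_iter=1000).fit(X, y)
beta_hat = clf.coef_.ravel()

se, z, p = wald_p_values(X, beta_hat)
print(p)

# statsmodels' unpenalized fit and its p-values, for comparison
print(sm.Logit(y, X).fit(disp=0).pvalues)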