Tuesday, October 27, 2015

Dual Lagrangians and Ordinary Least Squares

The ordinary least squares problem can be set up as just a norm minimization problem:

\begin{eqnarray}
\mathrm{minimize} \ \ \frac{\vert \vert Ax-t \vert \vert^2}{2}+l \frac{x^T x}{2}
\end{eqnarray}

Where $x$ is some vector that lives in $D$-dimensional ``feature'' space, and the matrix $A$ is $N \times D$ -- essentially the design matrix $\Phi(X)$ common in the machine learning literature. We can write down the ``primal objective'' of this problem by writing down the ``Lagrangian'':

\begin{eqnarray}
L(x)&=& \frac{(Ax-t)^T (Ax-t)}{2}+l \frac{x^T x}{2} \ \ \ \mathrm{or,}\\
L(x,z,\nu) &=& \frac{z^T z}{2}+\nu^T \left(Ax-t-z \right)+l \frac{x^Tx}{2}
\end{eqnarray}

Where $\nu$ is a vector-valued Lagrange multiplier that \textbf{imposes} our constraint $z=Ax-t$. This reformulation of the problem is just one of the many simple tricks you can do with Lagrange multipliers. The canonical form of such a Lagrangian in optimization theory is the following:
\begin{eqnarray}
L(x,\lambda,\nu)&=& f_0(x)+ \lambda_i f_i(x)+\nu_j h_j(x)
\end{eqnarray}

where the Einstein summation convention is used. Pardon the confusion of $\lambda$'s: normally this Greek letter is used for regularization, but in this problem I am going to use $l$ as the regularization parameter and $\lambda$ as the Lagrange multiplier that imposes some inequality constraint. We require, for $x$ to be a feasible solution to the problem:

\begin{eqnarray}
h_i(x) &=& 0 \ \ \forall i \\
f_i(x) &\le & 0 \ \ \forall i
\end{eqnarray}

Notice that in this example we have no inequality constraints, so $\lambda$ will be irrelevant. If we were to minimize this ordinary least squares problem directly -- analytically solving what's called the primal objective -- we get:

\begin{eqnarray}
\partial_x L(x) &=& A^T(Ax-t)+l x =0 \\
x_\mathrm{min} &=& (A^TA+l)^{-1} A^T t
\end{eqnarray}

We see that the solution involves the inversion of a $D \times D$ matrix -- quite difficult if the length of our feature vector is large. This is what's called the ordinary least squares solution. We have let $l$ be non-zero in order to create something called a ``ridge'', which smooths out our solution.
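As a quick numerical sketch of this primal route (the design matrix, targets, and ridge parameter below are made-up example inputs, not anything from the text):

\begin{verbatim}
import numpy as np

# Primal ridge solution x = (A^T A + l*I)^{-1} A^T t : a D x D solve.
rng = np.random.default_rng(0)
N, D = 50, 5                           # number of data points, number of features
A = rng.normal(size=(N, D))            # design matrix Phi(X)
t = rng.normal(size=N)                 # targets
l = 0.1                                # ridge (regularization) parameter

x_primal = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)
print(x_primal)
\end{verbatim}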

Now, we can define the dual lagrangian as:

\begin{eqnarray}
g(\lambda, \nu)&=& \mathrm{min}_{x,z}\left(L(x,z,\lambda,\nu) \right)
\end{eqnarray}

which we can find by taking the proper derivatives and plugging in:

\begin{eqnarray}
\partial_x L(x,z,\lambda,\nu) &=& \nu^T A + l x^T = 0 \\
\Rightarrow x&=& -\frac{A^T \nu}{l}\\
\partial_z L(x,z,\lambda,\nu) &=& z^T-\nu^T = 0 \\
\Rightarrow z &=& \nu
\end{eqnarray}

So we get

\begin{eqnarray}
g(\lambda, \nu) &=& \frac{\nu^T \nu}{2}+\nu^T \left(-AA^T \nu/l - t - \nu \right) + \frac{\nu^T A A^T \nu}{2l}\\
&=& -\frac{1}{2}\nu^T \left(\frac{AA^T}{l}+1\right)\nu-\nu^Tt
\end{eqnarray}

Now, we define $\mathrm{max}(g)=d^*$ as the maximum of the dual, which is in general less than or equal to the minimum of the primal, $\mathrm{min}(L)=p^*$. When $d^*=p^*$, we have what's called strong duality. The conditions for this to be the case are called the KKT conditions, and boil down to essentially all derivatives being equal to zero at the solution:

\begin{eqnarray}
\partial_{x,z,\lambda,\nu}L=0
\end{eqnarray}

and all Lagrange multiplier terms being equal to zero when evaluated at the solution (complementary slackness). If we maximize our dual, we find:

\begin{eqnarray}
\partial_\nu g &=& - \left(\frac{AA^T}{l}+1 \right)\nu-t=0 \\
\Rightarrow \nu &=& -\left(AA^T/l+1 \right)^{-1} t
\end{eqnarray}

and, using our expression before, we find the dual $x$ is:

\begin{eqnarray}
x &=& -\frac{A^T \nu}{l} = A^T \left(AA^T+l \right)^{-1}t
\end{eqnarray}

These two solutions are equivalent if $d^*=p^*$, and in this case, they are! Note that the second solution involves the inversion of an $N \times N$ matrix, and may be a much better route if the number of features outweighs the number of data points, $N \ll D$. This dual solution is precisely the kernel ridge regression solution, because $AA^T$ is our kernel ``Gram matrix''. The model becomes:

\begin{eqnarray}
\mathrm{model} &=& Ax = AA^T \left(AA^T+l \right)^{-1}t\\
AA^T &=& K \\
\mathrm{model} &=& Ax = K \left(K+l \right)^{-1}t
\end{eqnarray}
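As a sketch of the dual route (same made-up $A$, $t$, and $l$ as the primal sketch above), showing that it only needs an $N \times N$ solve and agrees with the primal answer:

\begin{verbatim}
import numpy as np

# Dual / kernel ridge route: x = A^T (A A^T + l*I)^{-1} t, an N x N solve.
rng = np.random.default_rng(0)
N, D = 50, 5
A = rng.normal(size=(N, D))
t = rng.normal(size=N)
l = 0.1

K = A @ A.T                                       # linear-kernel Gram matrix
nu = -np.linalg.solve(K / l + np.eye(N), t)       # dual variable nu
x_dual = -A.T @ nu / l                            # recover x from nu
x_primal = np.linalg.solve(A.T @ A + l * np.eye(D), A.T @ t)

print(np.allclose(x_dual, x_primal))              # True: the two routes agree
model = K @ np.linalg.solve(K + l * np.eye(N), t) # fitted values A x
\end{verbatim}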

Friday, October 23, 2015

Gaussian Process and Defining a Stochastic Process (Connection with Rasmussen)

About a week ago I posted about how to define a stochastic process, and the conclusion was, if we wanted to write some function:

\begin{eqnarray}
f(x) \sim P[f(x)] &=& \frac{1}{Z}e^{-\frac{1}{2}\int dx \, dx^\prime f(x) K^{-1}(x,x^\prime)f(x^\prime)}
\end{eqnarray}

with mean zero and covariance $\mathrm{Cov}[f(x),f(x^\prime)]=K(x,x^\prime)$, we can decompose the kernel into basis functions through Mercer's theorem and write

\begin{eqnarray}
K(x,x^\prime) &=& \sum_n \lambda_n \phi_n(x) \phi_{n}(x^\prime) \\
f(x) &=& \sum_m \phi_m(x) \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m)
\end{eqnarray}

This is precisely the basis function regression we see so often in Rasmussen:

\begin{eqnarray}
f(x) &=& \vec{\mathbf{w}}\cdot \vec{\phi}(x) \\
\vec{\mathbf{w}}_m &=& \sqrt{2\lambda_m} \mathrm{erf}^{-1}(X_m)\\
\langle w_m w_n \rangle &=& 2 \sqrt{\lambda_m \lambda_n} \langle  \mathrm{erf}^{-1}(X_m)  \mathrm{erf}^{-1}(X_n)\rangle
\end{eqnarray}

Since $X_m$ is uniform on $[-1,1]$, $\mathrm{erf}^{-1}(X_m)$ is normal with mean zero and variance $1/2$, so the expectation above gives a Kronecker delta in $n,m$:

\begin{eqnarray}
\langle w_m w_n \rangle  &=& \lambda_m \delta^K_{mn}
\end{eqnarray}

(The factor of $1/2$ from the variance of $\mathrm{erf}^{-1}$ cancels the factor of two carried by the weights.) The basic point comes across: we can define a Gaussian process with an arbitrary kernel by expanding in basis functions $\phi(x)$, much like we did when moving from the univariate Gaussian to the multivariate Gaussian, with non-trivial covariance between the components of our random vector.
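A quick numerical check of that weight variance ($\lambda$ and the sample size below are arbitrary illustration values):

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

# For X uniform on [-1, 1], erfinv(X) ~ N(0, 1/2), so w = sqrt(2*lam)*erfinv(X)
# should have variance lam.
rng = np.random.default_rng(0)
lam = 0.7
X = rng.uniform(-1, 1, size=1_000_000)
w = np.sqrt(2 * lam) * erfinv(X)
print(w.var(), lam)    # the two numbers agree closely
\end{verbatim}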

In the case of the squared exponential kernel, the set of basis functions is formally infinite, but ultimately this doesn't matter for the conditional distribution,

\begin{eqnarray}
f(x^*) \vert X &=& \sum_{i} \alpha_i K(x^*, X_i)
\end{eqnarray}

which is a piece of wisdom from the representer theorem. For any regularized regression problem with quadratic cost -- which is the same thing as putting a prior on the weights $\vec{w}$ in our model and maximizing the Gaussian posterior -- we can represent our smooth solution completely in terms of the input data $X$.
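Here is a small sketch of that representer form for kernel ridge regression with a squared exponential kernel (the data, length scale, and ridge parameter are made up for illustration):

\begin{verbatim}
import numpy as np

# Representer form: f(x*) = sum_i alpha_i K(x*, X_i), with alpha = (K + l*I)^{-1} t.
rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=30)                  # training inputs
t = np.sin(X) + 0.1 * rng.normal(size=30)       # noisy targets

def kern(a, b, ell=0.5):
    # squared exponential kernel between two sets of 1-D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

l = 0.1
alpha = np.linalg.solve(kern(X, X) + l * np.eye(len(X)), t)   # dual weights

x_star = np.linspace(0, 5, 100)
f_star = kern(x_star, X) @ alpha   # prediction lives entirely in span{K(., X_i)}
\end{verbatim}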

This may have a simple connection, in the process view, to the "family of functions" picture: when we observe our stochastic process at discrete inputs $X$ with values $\vec{f}$, we are really hitting our original pdf with a series of Dirac delta functions.


Wednesday, October 14, 2015

Chernoff Bound, Union Bound, Political Polling and Base Rate Estimation


Let's say we are trying to estimate the base rates of some classes $C=\lbrace 1,2,3 \dots K \rbrace$. After polling $N$ people, each of whom will select one of these classes, how confident are we in our results, $P(c)$? The first step to answering this question is to simplify the problem and treat it as a sequence of binomials.

Let us at first restrict ourselves to two classes. $c=1$ means ``no'' or ``Democrat'' and $c=0$ means ``yes'', or ``Republican''. If we use a Bernoulli random variable $X=0,1$ to represent a single ``vote'' we get

\begin{eqnarray}
P(X \vert p ) &=& p^{I_{x=1}}(1-p)^{I_{x \ne 1}}\\
q &=& 1-p
\end{eqnarray}

Where $I$ is the indicator function. The expectation of this distribution is:

\begin{eqnarray}
\langle X \rangle &=& p \\
\langle X^2 \rangle_c = \mathrm{Var}(X)&=& pq
\end{eqnarray}

Given these first two cumulants, we can write the Markov bound:

\begin{eqnarray}
P(X > a) & \le & \frac{\langle X \rangle}{a}\\
& \le & \frac{p}{a}
\end{eqnarray}

and the improved Chebyshev bound:

\begin{eqnarray}
P( \vert X-p \vert \ge m\sqrt{pq}) & \le & \frac{pq}{m^2 pq}=\frac{1}{m^2}
\end{eqnarray}

Similarly, for a binomial random variable, a sequence of votes:

\begin{eqnarray}
X_N &=& \sum_{i=1}^N X_i \\
X_N & \sim & \left(\begin{array}{c}
N \\
X_N
\end{array}\right) p^{X_N}(1-p)^{N-X_N}
\end{eqnarray}

we get expectation and variance

\begin{eqnarray}
\langle X_N \rangle &=& Np \\
\langle X_N^2 \rangle_c = \mathrm{Var}(X_N)&=& Npq
\end{eqnarray}

So our bound becomes

\begin{eqnarray}
P(\vert X_N-Np \vert \ge m\sqrt{Npq}) \le \frac{1}{m^2}
\end{eqnarray}

But that seems repetitive. Let's define a new statistic, the estimator of the mean:

\begin{eqnarray}
\hat{\mu}_N &=& \frac{X_N}{N} = \frac{\sum_{i=1}^N X_i}{N} \\
\langle \hat{\mu}_N \rangle &=& p \\
\langle \hat{\mu}_N^2 \rangle_c = \mathrm{Var}(\hat{\mu}_N) &=& \frac{pq}{N}
\end{eqnarray}

Now we have an inequality that scales with the number of votes, or the survey size, $N$:

\begin{eqnarray}
P(\vert \frac{X_N}{N}-p \vert \ge a ) \le \frac{pq}{Na^2}
\end{eqnarray}

More generally, for the estimator of any mean we can write

\begin{eqnarray}
P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon) &\le & \frac{\sigma^2}{N \epsilon^2}
\end{eqnarray}

So, we can now make a few helpful statements. Our confidence that the estimator of the mean is within plus or minus $\epsilon$ of the true value is precisely $1-P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon)$, for a ``survey'' of $N$ participants. So let's see how accuracy $\epsilon$ scales with survey size and confidence. Writing confidence as:

\begin{eqnarray}
\mathrm{confidence}=1-\delta &=& 1-P(\vert \frac{X_N}{N}-\mu \vert \ge \epsilon)\\
\delta &\le & \frac{\sigma^2}{N\epsilon^2}\\
\epsilon &\le & \frac{\sigma}{\sqrt{N \delta}}
\end{eqnarray}

So we see that accuracy scales like one over the square root of survey size, and that if we want to be twice as confident with a given accuracy -- halve $\delta$ -- we'd better poll twice as many people. Sometimes in the real world, this implies a cost -- at least in terms of time! -- and so we'd like to do even better, if at all possible.

It turns out we can do much, much better by using something called the Chernoff Bound, which assumes not only pair-wise independence between our random votes $X_i$ -- as we assumed by writing $\mathrm{Var}(\frac{X_N}{N})=\frac{\sigma^2}{N}$ -- but independence across all the variables, for all possible combinations.

How do we do this? Rewrite the Markov bound with some free parameter:

\begin{eqnarray}
P(x \ge a ) &=& P(e^{sx} \ge e^{as}) \le \frac{\langle e^{sx}\rangle }{e^{sa}}
\end{eqnarray}

We see that the numerator is simply the moment generating function, evaluated at $s$. If we want to swap the inequality, we just send $s \to -s$:
\begin{eqnarray}
P(x \le a ) &=& P(e^{-sx} \ge e^{-as}) \le \frac{\langle e^{-sx} \rangle}{e^{-sa}}
\end{eqnarray}

For our Bernoulli distribution on a single ``vote'', we have:

\begin{eqnarray}
M(s)=\langle e^{sX_1} \rangle &=& \sum_{X} e^{sX} P(X) \\
&=& e^sp+1-p \\
M(s)&=& p(e^s-1)+1
\end{eqnarray}

Now, if we write down the moment generating function of the sum, we find, if the $X_i$'s are all independent:
\begin{eqnarray}
\langle e^{s X_N} \rangle &=& \langle e^{s \sum_{i} X_i} \rangle \\
&=& \langle \prod_{i=1}^N e^{s X_i} \rangle \\
M_N(s) &=& M(s)^N \\
&=& \left(1+p(e^{s}-1)\right)^N
\end{eqnarray}

Recall that we want to bound the probability of our random variable $X_N$ being outside some range, so we want to minimize the right-hand side of this inequality:

\begin{eqnarray}
P(X_N \ge a) &\le & M_N(s)e^{-sa}
\end{eqnarray}

Minimizing this expression is the same as minimizing its log, since both sides are positive:

\begin{eqnarray}
\log P(X_N \ge a) &\le & N \log \left(1+p(e^{s}-1)\right)-sa \ \le \ Np(e^{s}-1)-sa
\end{eqnarray}

where in the last step we used $\log(1+x) \le x$. Taking the derivative with respect to $s$ and setting it equal to zero, we get:

\begin{eqnarray}
s&=&\log \left(\frac{a}{Np} \right)
\end{eqnarray}

Which, writing the mean as $\mu=Np$, is the same thing as saying

\begin{eqnarray}
s&=&\log \left(\frac{a}{\mu} \right)
\end{eqnarray}

So now let's choose $a=\mu(1+\theta)$, so that $\theta$ is our fractional accuracy relative to the mean and $s=\log(1+\theta)$. Putting this into our equation gives:

\begin{eqnarray}
\log P(X \ge \mu(1+\theta))  &\le & \mu\left(\theta- (1+\theta)\log (1+\theta) \right) \\
P(X \ge \mu(1+\theta)) &\le & \left(\frac{e^{\theta}}{(1+\theta)^{1+\theta}} \right)^\mu
\end{eqnarray}

But, we could have Taylor expanded the logarithm above (for $0 \le \theta \le 1$), to get:

\begin{eqnarray*}
\log P(X \ge \mu(1+\theta))  &\le & \mu\left(\theta- (1+\theta)\log (1+\theta) \right)
\le -\mu \theta^2/3 \\
P(X \ge \mu(1+\theta)) &\le &e^{-\theta^2 \mu/3}
\end{eqnarray*}

So this is the upper bound. What about the lower? We just send $s \to -s$ to get

\begin{eqnarray}
P(X \le \mu(1-\theta)) & \le &  e^{-\theta^2 \mu/2}
\end{eqnarray}

So the lower bound, with the same Taylor expansion, does slightly better! We can now make a double-sided bound:

\begin{eqnarray}
P( \vert X-\mu \vert \ge \mu \theta ) &\le & 2e^{-\theta^2 \mu/3}
\end{eqnarray}

If we specify, once again $1-\delta =1-P( \vert X-\mu \vert \ge \mu \theta ) $ as our survey confidence, we can write
 \begin{eqnarray*}
 P( \vert X_N-pN \vert \ge Np \theta ) &\le & 2e^{-\theta^2 Np/3}
 \end{eqnarray*}

 Let $\theta=\frac{\epsilon}{p}$

 \begin{eqnarray}
  P( \vert X_N-pN \vert \ge N\epsilon )= P( \vert \frac{X_N}{N}-p \vert \ge \epsilon )&\le& 2e^{-\frac{N\epsilon^2}{3p}}
  \end{eqnarray}

Now for a few tricks. We want this failure probability to be at most $\delta$, and since $p \le 1$, a sufficient condition is:

\begin{eqnarray}
P( \vert \frac{X_N}{N}-p \vert \ge \epsilon ) &\le & 2e^{-\frac{N\epsilon^2}{3p}}\le 2e^{-\frac{N\epsilon^2}{3}} \le \delta\\
e^{N\epsilon^2/3}&\ge & \frac{2}{\delta} \\
N &\ge &\frac{3}{\epsilon^2} \log (\frac{2}{\delta})
\end{eqnarray}

 This result is known as the sampling theorem. In order to constrain our underlying ``base rate'' to $p = X_N/N \pm \epsilon$ with confidence $1-\delta$, we need to poll AT LEAST $N$ people. This is an extremely useful result, and has much better scaling properties than the Chebyshev bound, which would suggest:

 \begin{eqnarray}
 N & \ge & \frac{1}{\delta}\frac{\sigma^2}{\epsilon^2}
 \end{eqnarray}

We see that we pay just as heavily for accuracy $\epsilon$ in both cases, but very differently for confidence. Comparing the two requirements as $\delta$ gets small:

\begin{eqnarray}
\mathrm{Chebyshev:}\ \ N &\sim& \frac{\sigma^2}{\epsilon^2 }\frac{1}{\delta} \\
\mathrm{Chernoff:}\ \ N &\sim & \frac{3}{\epsilon^2 }\log \left(\frac{2}{\delta}\right)
\end{eqnarray}

The Chernoff bound is superior -- logarithmic scaling in $1/\delta$ rather than linear!
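A small sketch comparing the two sample-size requirements (the accuracy and confidence values are illustrative; the Chebyshev line uses the Bernoulli worst-case variance $\sigma^2 \le 1/4$):

\begin{verbatim}
import numpy as np

# Required poll sizes for accuracy eps and failure probability delta.
def n_chebyshev(eps, delta, sigma2=0.25):   # sigma^2 <= 1/4 for a Bernoulli
    return sigma2 / (delta * eps**2)

def n_chernoff(eps, delta):
    return 3.0 / eps**2 * np.log(2.0 / delta)

eps = 0.01
for delta in [0.1, 0.01, 0.001]:
    print(delta, n_chebyshev(eps, delta), n_chernoff(eps, delta))
# Chebyshev grows like 1/delta; Chernoff only like log(1/delta).
\end{verbatim}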

Such considerations are important, because often we are interested in constraining the maximum error of many different random variables -- say, $K$ class base rates, $P(c)=p_c$ -- with the same accuracy $\epsilon$. This can be accomplished by the union bound:

\begin{eqnarray}
P( X_1\cup X_2 \cup X_3 \cup \cdots \cup X_N) & \le & \sum_{i=1}^N P(X_i)
\end{eqnarray}

So that, if we want to write the probability that any of our mean base rate estimates, for some ``poll'' with $K$ possibilities, is off by more than $\epsilon$, we can write:

\begin{eqnarray*}
P( \vert \hat{\mu}_{N_1}-p_1 \vert \ge \epsilon \cup \vert \hat{\mu}_{N_2}-p_2 \vert \ge \epsilon \cup \dots \vert \hat{\mu}_{N_K}-p_K \vert \ge \epsilon ) & \le & \sum_{c=1}^K P(\vert \hat{\mu}_{N_c}-p_c \vert \ge \epsilon)\\
\delta &\le &  \sum_{c=1}^K \delta_c
\end{eqnarray*}

Assuming the same per-class bound for every class, and requiring the total failure probability to be at most $\delta$, we get:

\begin{eqnarray}
\sum_{c=1}^K \delta_c &\le & 2Ke^{-\frac{N\epsilon^2}{3}} \le \delta
\end{eqnarray}

which implies:

\begin{eqnarray}
N & \ge & \frac{3}{\epsilon^2}\log(\frac{2K}{\delta})
\end{eqnarray}

So to bound the maximum error of $K$ different statistics, or base rates, with the same accuracy $\epsilon$ and some overall confidence $1-\delta$, we essentially send $\delta \to \delta/K$ relative to a single binomial test; and, since the Chernoff bound scales logarithmically, this is not such a big deal!
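As a quick sketch of how gentle that logarithmic penalty is (the accuracy, confidence, and class counts below are illustrative):

\begin{verbatim}
import numpy as np

# Poll size needed to bound all K base-rate errors by eps with overall
# confidence 1 - delta, via the union bound: N >= (3/eps^2) log(2K/delta).
def n_union(eps, delta, K):
    return 3.0 / eps**2 * np.log(2.0 * K / delta)

eps, delta = 0.01, 0.05
for K in [1, 10, 100, 1000]:
    print(K, n_union(eps, delta, K))
# N grows only logarithmically with the number of classes K.
\end{verbatim}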

Tuesday, October 13, 2015

Streaky Shooter, Stephen Curry

So, recently I was given the following problem: let's say some basketball player has an a priori probability $P(y_1=1)=0.35$ of making a three point shot. This implies the probability of missing the three-pointer is $P(y_1=0)=0.65$. (Note that $y=1$ denotes ``make'' and $y=0$ ``miss''.) Pretty good odds, for an NBA player -- I think.

But let's say that our basketball player is streaky, in the sense that $P(y_2=1 \vert y_1=1)=0.6$, or the probability of making the next shot, given that he made the previous, is sixty percent. (This implies $P(y_2=0 \vert y_1=1)=0.4$.) And conversely, the probability of making the next shot, given that he missed the previous, is a bit lower than the prior, $P(y_2=1 \vert y_1=0)=0.3$, which implies $P(y_2=0 \vert y_1=0)=0.7$. This defines something called a Markov process, in the sense that the $n^\mathrm{th}$ shot only depends on the one before it.

\begin{eqnarray}
P(y_n \vert y_{n-1},y_{n-2},\dots y_2,y_1) &=& P(y_n \vert y_{n-1})
\end{eqnarray}

To repeat an old aphorism: ``the future depends on the past only through the present''. We can represent this conditional probability density with a matrix,

\begin{eqnarray}
P(y_n \vert y_{n-1}) &=& \mathbf{W}= \left(\begin{array}{cc}
.6 & .3 \\
.4 & .7 \\
\end{array}\right)
\end{eqnarray}

and the a priori probability density with a vector:

\begin{eqnarray}
P(y_1) &=& \left(\begin{array}{c}
0.35 \\
0.65
\end{array}\right)
\end{eqnarray}

If we want to write down the joint probability density on a few shots, we can use Bayes' chain rule:

\begin{eqnarray}
P(y_3, y_2,y_1) &=& P(y_3 \vert y_2, y_1) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

which by the Markov property simplifies to

\begin{eqnarray}
P(y_3, y_2,y_1) &=& P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

Marginalizing over $y_2$ we see that we have a matrix multiplication:

\begin{eqnarray}
\sum_{y_2} P(y_3, y_2,y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)\\
P(y_3,y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1) P(y_1)
\end{eqnarray}

and, dividing through by $P(y_1)$ is the same as conditioning on $y_1$, so we have a new conditional probability, which connects the third shot to the first:

\begin{eqnarray}
P(y_3 \vert y_1) &=& \sum_{y_2}  P(y_3 \vert y_2) P(y_2 \vert y_1)
\end{eqnarray}

This is often called the Chapman-Kolmogorov equation, and is the basis for much analysis in physics of the rate of change of probability distributions, through the master equation and the Fokker-Planck equation.

It's pretty easy to see that any conditional distribution can now be written as $n-1$ matrix contractions:

\begin{eqnarray}
P(y_n \vert y_1) &=& (\mathbf{W})^{n-1}
\end{eqnarray}

If we multiply by the vector $P(y_1)$, we get the distribution on $y_n$ alone, since

\begin{eqnarray}
P(y_2) &=& \sum_{y_1} P(y_2 \vert y_1)P(y_1)\\
&=& \mathbf{W}\cdot \vec{\mathbf{P}}_1 \\
P(y_n) &=& (\mathbf{W})^{n-1} \cdot \vec{\mathbf{P}}_1
\end{eqnarray}

But Markov matrices -- whose columns add to unity and whose entries are non-negative -- have some very useful properties (the Perron-Frobenius theorem). Their spectra -- or eigenvalues -- all lie within the unit circle in the complex plane, and there is guaranteed to be some eigenvector $\mathbf{\pi}$ with an eigenvalue $\lambda=1$. This eigenvector is called the stationary distribution, and after $n \to \infty$ time steps, we expect our distribution to converge to $\mathbf{\pi}$. For the matrix given above, this vector is

\begin{eqnarray}
\mathbf{\pi} &=& \left(\begin{array}{c}
0.428 \\
0.572
\end{array}\right)
\end{eqnarray}

So after sufficient shots, our shooter will converge to a 42.8 percent success rate, as compared to the a priori 35 percent -- or whatever you specified. This discrepancy is important, as the expected number of buckets given $m$ shots does not go like a binomial distribution

\begin{eqnarray}
\langle \sum_i y_i \rangle &=& m p_0 \\
p_0 &=& 0.35
\end{eqnarray}

but something slightly different. More like a binomial process with changing parameter $p$. As $m$ gets large we expect

\begin{eqnarray}
\langle \sum_i y_i \rangle &\approx & m \pi_0 \\
\pi_0 &=& \frac{3}{7} \approx 0.428
\end{eqnarray}
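A minimal sketch of this chain in code (the transition matrix and prior are the ones above; ordering the states as [make, miss] is my own convention):

\begin{verbatim}
import numpy as np

# Streaky-shooter Markov chain: columns of W are P(y_n | y_{n-1}), states [make, miss].
W = np.array([[0.6, 0.3],
              [0.4, 0.7]])
P1 = np.array([0.35, 0.65])            # a priori distribution on the first shot

# Distribution of the n-th shot: P(y_n) = W^(n-1) P(y_1)
for n in [1, 2, 5, 20]:
    print(n, np.linalg.matrix_power(W, n - 1) @ P1)

# Stationary distribution: the eigenvector of W with eigenvalue 1
vals, vecs = np.linalg.eig(W)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()
print(pi)                              # approximately [3/7, 4/7] = [0.429, 0.571]
\end{verbatim}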

Thursday, October 8, 2015

Brief Note on Information Gain Definition

After being confused for quite some time, I've realized that information gain and mutual information are essentially the same thing, even though the first is not usually written in a symmetric form. That mutual information is an intuitive measure of correlation can be seen by noting that, for a bivariate normal distribution, the mutual information between the two variables is:

\begin{eqnarray}
I(x,y) &=& IG(x \vert y) \\
&=& H(x)-H(x \vert y) \\
&\sim &  -\frac{1}{2}\log(1-\rho^2)
\end{eqnarray}



Where $\rho$ is the Pearson correlation, varying between minus one and one. So, when choosing to split on variables in a decision tree, or designing an experiment that probes some underlying model space, we obviously want to choose pairs $x,y$ that have $\vert \rho \vert \to 1$, in order to yield as much information as possible.
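A one-line numerical check of that closed form ($\rho$ here is an arbitrary illustration value, with unit-variance marginals):

\begin{verbatim}
import numpy as np

# For a bivariate normal with correlation rho and unit marginal variances,
# x|y has variance 1 - rho^2, so H(x) - H(x|y) = -0.5*log(1 - rho^2).
rho = 0.8
H_x = 0.5 * np.log(2 * np.pi * np.e)
H_x_given_y = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2))
print(H_x - H_x_given_y, -0.5 * np.log(1 - rho**2))   # the two expressions agree
\end{verbatim}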

Note On "Information Gain" and the Fisher Information

In my last post, http://rspeare.blogspot.com/2015/08/the-fisher-matrix-and-volume-collapse_31.html, I was talking about "volume" collapse in parameter space due to some data, $\vec{x}$. I'd like to relate this to information gain, which can be defined pretty simply as:

\begin{eqnarray}
H[p(\vec{\theta})] - H[p(\vec{\theta} \vert \vec{x})] &=& IG(\vec{\theta} \vert \vec{x})
\end{eqnarray}

Now, using Bayes' rule we can change what we've written in the second term above:

\begin{eqnarray}
H[p(\vec{\theta})] - H[\mathcal{L}(\vec{x} \vert \vec{\theta}) p(\vec{\theta})] &=& IG(\vec{\theta} \vert \vec{x})
\end{eqnarray}

And using the addition property of entropy, we can write:

\begin{eqnarray}
IG(\vec{\theta} \vert \vec{x}) &=& - H[\mathcal{L}(\vec{x} \vert \vec{\theta})]
\end{eqnarray}

But, with the fisher information matrix,

\begin{eqnarray}
\mathbf{F}_{ij} &=&\langle \frac{-\partial^2 \log \mathcal{L}(\mathbf{x} \vert \mathbf{\theta})}{\partial \theta_i \partial \theta_j} \rangle
\end{eqnarray}

we can estimate the covariance of the likelihood function, and therefore its entropy -- if we use the Laplace approximation and treat the likelihood as a Gaussian in parameter space:

\begin{eqnarray}
H[\mathcal{L}] &=& \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log \left( \vert \mathbf{F}_{ij} \vert^{-1} \right)
\end{eqnarray}

This means that our information gain on $\vec{\theta}$ given an experiment $\vec{x}$ is proportional to the logarithm of the determinant of the Fisher matrix.

\begin{eqnarray}
IG(\vec{\theta} \vert \vec{x} ) &\sim &\log \left( \vert \mathbf{F}_{ij}\vert \right)
\end{eqnarray}

And so, we now see intuitively why this is \textbf{called} the Fisher information. Our ``volume'' collapse on the variables of interest $\vec{\theta}$ given our experiment, is:

\begin{eqnarray}
e^{IG(\mathrm{\theta} \vert \mathrm{x})} & \sim & \sqrt{\mathrm{det}\vert \mathbf{F}_{ij} \vert}
\end{eqnarray}
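As a sketch of the sort of number this gives, here is the standard linear-Gaussian example (the model $t = A\theta + \mathrm{noise}$, the design matrix, and the noise variance are my own illustration, not from the post; for that model the Fisher matrix is $A^T A/\sigma^2$):

\begin{verbatim}
import numpy as np

# Laplace-approximated likelihood entropy in parameter space:
# H = (d/2) log(2 pi e) + 0.5 * log det(F^{-1}),  with F = A^T A / sigma^2.
rng = np.random.default_rng(4)
N, d = 100, 3
A = rng.normal(size=(N, d))
sigma2 = 0.5

F = A.T @ A / sigma2
sign, logdetF = np.linalg.slogdet(F)
H_laplace = 0.5 * d * np.log(2 * np.pi * np.e) - 0.5 * logdetF
print(logdetF, H_laplace)   # larger log det F <-> smaller entropy <-> more information gained
\end{verbatim}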




Wednesday, October 7, 2015

Defining Stochastic Processes

Van Kampen says that given a single stochastic variable $X$, we can define an infinite number of other stochastic variables by transformation. For example, let $X$ be uniformly distributed over the interval $[-1,1]$. We can then define a normal variable by the following transformation:

 \begin{eqnarray}
z = f(X) &=& \sqrt{2}\sigma \mathrm{erf}^{-1}\left( X \right) + \mu \\
z &\sim &  \mathcal{N}(\mu, \sigma)
 \end{eqnarray}

 Where $z$ is now drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. We can also do this for a multivariate Gaussian distribution. Say we want to construct:

 \begin{eqnarray}
\mathbf{x} \sim \mathcal{N}(\vec{\mu}, \Sigma) &=& \frac{1}{(2\pi)^{D/2} \vert \Sigma \vert^{1/2}}e^{-(\mathbf{x}-\mathbf{\mu} )\cdot \Sigma^{-1}\cdot (\mathbf{x}-\mathbf{\mu}) /2 }
 \end{eqnarray}

 We can create this random vector by taking the following steps. First, decompose the covariance matrix $\Sigma$ as:

 \begin{eqnarray}
 \Sigma^{-1} &=& U \Lambda^{-1} U^T
 \end{eqnarray}

Where $\Lambda$ is diagonal and $U$ is orthogonal. One sees that this is simply the eigendecomposition of our covariance matrix, which surely exists and has $\Lambda_{ii} \ge 0 \ \ \forall i$ since $\Sigma$ is positive semi-definite. Then, construct the following vector:

 \begin{eqnarray}
\vec{\mathbf{y}}_i &=& \vec{f}_i(X) =\sqrt{2 \Lambda_{ii}}  \mathrm{erf}^{-1}\left(X_i \right)
 \end{eqnarray}

I give the uniformly distributed random variable $X$ a subscript $i$ to denote independent draws. Then take linear combinations of the vector $\vec{y}$ and add the mean:

 \begin{eqnarray}
 \vec{x} &=& U \cdot \vec{y}+\vec{\mu}
 \end{eqnarray}

 So we have simply:

 \begin{eqnarray}
\vec{x}_i &=&  U_{ij} \sqrt{2 \Lambda_{jj}} \mathrm{erf}^{-1}\left(X_j \right)+\vec{\mu}_i \\
&=& U_{ij} \cdot f(X,j) + \vec{\mu}_i \\
&=& g(X,i)
 \end{eqnarray}

 So we see that we created our vector out of a linear combination of functions -- given by $U$ -- which is just another indexed function, $g(X,i)$.
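A sketch of this construction in code (the covariance matrix, mean, and sample count below are made-up illustration values):

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

# Multivariate normal samples x = U y + mu, with y_i = sqrt(2*Lambda_ii)*erfinv(X_i)
# and X_i independent uniforms on [-1, 1].
rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mu = np.array([1.0, -1.0])

lam, U = np.linalg.eigh(Sigma)              # Sigma = U Lambda U^T

X = rng.uniform(-1, 1, size=(200_000, 2))   # independent uniform draws
y = np.sqrt(2 * lam) * erfinv(X)            # columns are y_i ~ N(0, Lambda_ii)
samples = y @ U.T + mu                      # x = U y + mu, one sample per row

print(np.cov(samples, rowvar=False))        # close to Sigma
print(samples.mean(axis=0))                 # close to mu
\end{verbatim}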

 We can promote this discrete vector index to a continuous one, that of time $t$, and thus define a stochastic process, or a random function:

 \begin{eqnarray}
 Y_X(t) &=& f(X,t)
 \end{eqnarray}

 For every value of $t$ we have a random variable, and for every value $X=x$ we have a \textbf{realization} of the process, a function of time. Let's say we want a Gaussian random function with zero mean and covariance kernel $K$. The probability distribution for such a function is:

 \begin{eqnarray}
 P\left[\eta(t)\right] &=& \frac{1}{Z} e^{-\int dt_1 \int dt_2 \eta(t_1) K^{-1}(t_1,t_2) \eta(t_2)/2}
 \end{eqnarray}

Don't worry about the function normalization constant $Z$ for now. So far we have made no assumption on our covariance across time, encapsulated by the kernel $K(t,t^\prime)$. Let's assume ``stationarity'', in the sense that correlation only depends upon time difference $K(t,t^\prime)=K(t-t^\prime)$. Let's also make the kernel diagonal in that $\eta(t)$ is only correlated with itself at time $t$:

\begin{eqnarray}
K(t-t^\prime) &=& \delta^D(t-t^\prime) \\
 P\left[\eta(t)\right] &=& \frac{1}{Z} e^{-\int dt \eta(t)^2/2}
\end{eqnarray}

The covariance of this random function $\eta(t)$ is now:

\begin{eqnarray}
\langle \eta(t) \eta(t^\prime) \rangle &=& \delta^D(t-t^\prime)
\end{eqnarray}

This is essentially the functional ``unit'' normal. Can we create such a random function, $\eta(t)$ a stochastic process, using only $X$? The answer is yes, and we will need Mercer's theorem to do it. Let's say we want to create:

\begin{eqnarray}
\eta(t) \sim \mathcal{N}\left( \mu(t), K(t,t^\prime) \right)
\end{eqnarray}

Then we need to decompose our kernel into eigenfunctions using Mercer's theorem:

\begin{eqnarray}
K(t,t^\prime) &=& \sum_{n=1}^\infty \lambda_n \phi_n(t)\phi_n(t^\prime)
\end{eqnarray}

where

\begin{eqnarray}
\lambda_n \phi_n(t) &=& \int dt^\prime K(t,t^\prime)\phi_n(t^\prime) \\
\int \phi_n(t) \phi_m(t) dt &=& \delta^K_{mn}
\end{eqnarray}

The $\phi_m(t)$ are a possibly infinite set of orthonormal eigenfunctions of the kernel $K$ and $\lambda_m$ their eigenvalues. We now do a very similar trick to before, yet we promote vectors to functions. Let

\begin{eqnarray}
y_m=f(X,m) &=& \sqrt{2 \lambda_m}\mathrm{erf}^{-1}\left( X_m \right) \\
\eta(t) &=& \sum_{m=1}^\infty \phi_m(t) y_m + \mu(t) \\
\eta(t) &=& \sum_{m=1}^\infty \phi_m(t) \sqrt{2 \lambda_m} \mathrm{erf}^{-1}\left( X_m \right)  + \mu(t)
\end{eqnarray}
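On a finite grid of $t$ values the eigendecomposition of the kernel matrix stands in for the Mercer expansion, so the construction above can be checked numerically (the grid, length scale, and number of draws are illustrative choices):

\begin{verbatim}
import numpy as np
from scipy.special import erfinv

# Realizations eta(t) = sum_m phi_m(t) sqrt(2*lam_m) erfinv(X_m) for a
# squared-exponential kernel, with (lam_m, phi_m) from the eigendecomposition
# of K evaluated on a grid of t values.
rng = np.random.default_rng(2)
t = np.linspace(0, 5, 100)
K = np.exp(-0.5 * (t[:, None] - t[None, :])**2 / 0.5**2)

lam, phi = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)                    # drop tiny negative eigenvalues

X = rng.uniform(-1, 1, size=(5000, len(lam)))    # independent uniforms per realization
eta = (np.sqrt(2 * lam) * erfinv(X)) @ phi.T     # each row is one sample path eta(t)

print(np.cov(eta, rowvar=False)[:3, :3])         # close to K[:3, :3]
print(K[:3, :3])
\end{verbatim}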

This decomposition of a kernel into eigenfunctions actually has an intimate connection with quantum field theory: the normalization $Z$ for $P[\eta(t)]$ is in fact the square root of the determinant of $2\pi K$, which is a product over the eigenvalues:

\begin{eqnarray}
Z &=& \sqrt{\prod_{m=1}^\infty 2\pi \lambda_m }
\end{eqnarray}

Such is the trick Sidney Coleman used to do quite a few path integral calculations -- or at least that's how I understand it from a friend!