Sunday, April 27, 2014

Loop Diagrams and the Gram-Charlier Expansion

In response to the earlier question -- whether an arbitrary statistical distribution, with non-zero skewness, kurtosis, and higher-order cumulants, can be represented in terms of loop diagrams -- the answer is yes!

After looking at Zee's "child problem" in "Quantum Field Theory in a Nutshell", we see the following multidimensional integral:

\begin{eqnarray}
Z(J) &=&\int d^Nq\, \mathrm{Exp}\left(\frac{-\vec{q}\cdot \mathbf{A} \cdot \vec{q}}{2}+\vec{J}\cdot\vec{q}-\frac{\lambda}{4!}q^4 \right)
\end{eqnarray}

This integral can be described in terms of a Green's function series expansion in powers of J:

\begin{eqnarray}
Z(J) &=& \sum_{s=0}^\infty \frac{1}{s!}J_{i_1}J_{i_2}\cdots J_{i_s} G_{i_1i_2\cdots i_s} \\
G_{i\cdots j} &=& \int d^Nq \left(q_i \cdots q_j \right)\mathrm{Exp}\left(\frac{-\vec{q}\cdot \mathbf{A} \cdot \vec{q}}{2}-\frac{\lambda}{4!}q^4 \right) = \langle q_i \cdots q_j \rangle
\end{eqnarray}

Now G is a rank-s tensor. We interpret J physically as an excitation of the system, but in statistics it is just an offset to the total random vector q. The inverse of the matrix A plays the role of the covariance matrix, our multidimensional second cumulant -- a rank-2 tensor -- from before. We can immediately write down the two-point Green's function for this expansion to zeroth and first loop order:

\begin{eqnarray}
G_{ij} &=& \mathbf{A}^{-1}_{ij}-\lambda \left( \sum_n  \mathbf{A}^{-1}_{in} \mathbf{A}^{-1}_{jn} \mathbf{A}^{-1}_{nn} \right)
\end{eqnarray}

Which, in statistics-speak, reads:

\begin{eqnarray}
G_{ij} &=& \underline{\underline{c_2}}_{ij}-\lambda \left( \sum_n \underline{\underline{c_2}}_{in}\underline{\underline{c_2}}_{jn}\underline{\underline{c_2}}_{nn} \right)
\end{eqnarray}

Those underlines are messy, but we now see that we are taking into account the cross-correlation between two variables -- labeled by subscripts i and j -- as well as a "one-loop" term that allows for correlation with some intermediate variable n, even accounting for n's correlation with itself. I think this is the correct interpretation, and if we were to allow even more expansion, we'd have

\begin{eqnarray}
G_{ij} &=& \mathbf{A}^{-1}_{ij}-\lambda \left( \sum_n  \mathbf{A}^{-1}_{in} \mathbf{A}^{-1}_{jn} \mathbf{A}^{-1}_{nn} +\mathbf{A}^{-1}_{ij}\mathbf{A}^{-1}_{nn}\mathbf{A}^{-1}_{nn}\right)
\end{eqnarray}


for the two-point Green's function to one-loop order. The third term corresponds to a "disconnected" diagram, since the point -- or random variable -- n is in no way correlated with the initial "source" points i and j.
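Just to make the index bookkeeping concrete, here is a minimal numpy sketch of these contractions for a made-up 3x3 matrix A and coupling lambda. The numbers, and the omission of the usual diagrammatic symmetry-factor prefactors, are my own choices for illustration -- nothing here is fixed by the formula above.

import numpy as np

# Toy sketch: tree-level, connected one-loop, and disconnected one-loop
# contractions of the two-point Green's function, following the schematic
# formula above (symmetry-factor prefactors omitted).
A = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.5, 0.2],
              [0.1, 0.2, 1.0]])
lam = 0.1

Ainv = np.linalg.inv(A)     # plays the role of the covariance, c_2
d = np.diag(Ainv)           # the self-correlations A^{-1}_{nn}

tree = Ainv                                          # A^{-1}_{ij}
connected = np.einsum('in,jn,n->ij', Ainv, Ainv, d)  # sum_n A^{-1}_{in} A^{-1}_{jn} A^{-1}_{nn}
disconnected = Ainv * np.sum(d**2)                   # A^{-1}_{ij} sum_n (A^{-1}_{nn})^2

G_two_point = tree - lam * (connected + disconnected)
print(G_two_point)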

Now to write the total generating function, we would have something like

\begin{eqnarray}
Z(J) &=& \left(\frac{(2\pi)^N}{\vert \mathbf{A}\vert }\right)^{1/2}\left[1+\vec{J}\cdot \mathbf{A}^{-1}\cdot \vec{J} -\lambda J_i \left(\sum_n  \mathbf{A}^{-1}_{in} \mathbf{A}^{-1}_{jn} \mathbf{A}^{-1}_{nn} +\mathbf{A}^{-1}_{ij}\mathbf{A}^{-1}_{nn}\mathbf{A}^{-1}_{nn}\right)J_j \right]
\end{eqnarray}

It's a bit unclear where to go from here, but these Green's functions certainly correspond to expectation values of a random vector. The question is, when one introduces anharmonicity into the field -- for instance a non-zero fourth cumulant -- what happens to our expectation values? And how do we represent those new 2-point and 4-point Green's functions diagrammatically?

Measure for the Path Integral

Quick Reminder to self:

The measure for the path integral, normally written

\begin{eqnarray}
\langle q_F \vert e^{-iHT/\hbar}\vert q_i \rangle &=& \int Dq\, e^{\frac{i}{\hbar}\int L(q,\dot{q},t)dt}
\end{eqnarray}

is

\begin{eqnarray}
\int Dq &=& \lim_{N \to \infty}\left( \frac{m}{2\pi i \hbar\, \delta t} \right)^{N/2} \int \cdots \int dq_1 dq_2 \cdots dq_N
\end{eqnarray}
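As a concrete sketch of what this time-slicing means, here is a little Python snippet that evaluates the discretized action for a single trial path; the mass, potential, endpoints, and number of slices are arbitrary choices for illustration, not anything fixed by the formulas above.

import numpy as np

# Sketch: for one discretized path q_0, q_1, ..., q_N, evaluate
#   S = sum_k dt * [ m/2 * ((q_{k+1} - q_k)/dt)^2 - V(q_k) ],
# whose phase exp(i S / hbar) is what the path integral sums over
# (up to the measure prefactor above).
m, hbar = 1.0, 1.0
T, N = 1.0, 1000
dt = T / N

def V(q):                            # harmonic potential, chosen arbitrarily
    return 0.5 * q**2

def discretized_action(q):           # q holds N+1 points: q[0]=q_i, q[-1]=q_F
    kinetic = 0.5 * m * ((q[1:] - q[:-1]) / dt)**2
    potential = V(q[:-1])
    return np.sum(dt * (kinetic - potential))

q_path = np.linspace(0.0, 1.0, N + 1)   # one straight-line trial path
S = discretized_action(q_path)
print(S, np.exp(1j * S / hbar))         # this path's contribution to the sum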

Wednesday, April 23, 2014

PT and the Gram-Charlier

Is it possible to represent the Gram-Charlier expansion in terms of loop diagrams? Something to look into...

Monday, April 21, 2014

ANOVA and the "Temperature" of a data set

Typically, with analysis of variance, one chooses to weight all data points -- or deviations from our model -- equally in the chi-squared statistic, assuming

\begin{eqnarray}
\chi^2 &=& \sum_i^N \frac{(x_i-f(x_i))^2}{\sigma^2_i}\\
\sigma_i &=& \sigma \ \ \forall \ i
\end{eqnarray}

Which yields a very simple posterior distribution:

\begin{eqnarray}
P(\mu, \sigma^2 \vert D) &=& \frac{\left(2\pi \sigma^2\right)^{-N/2}e^{-\chi^2/2}P(\mu,\sigma^2)}{Z}
\end{eqnarray}

Where Z is our awful, marginalized normalizing factor from before. We can write our posterior distribution much like the partition function of a Bose-Einstein ensemble, and associate our chi-squared statistic -- or deviation from our model f(x) -- with an 'Energy':
\begin{eqnarray}
P(\mu, \sigma^2 \vert D) &=& \frac{e^{-\beta E(\mu,\sigma^2)}}{Z} \\
Z &=& \int \int e^{-\beta E(\mu,\sigma^2)} d\mu d\sigma^2 \\
\beta E(\mu,\sigma^2) &=& \sum_i^N \frac{(x_i-f(x_i))^2}{2\sigma^2}
\end{eqnarray}

In this case the variance of each and every data point x_i -- the denominator in our chi-squared expression above -- acts much like the "temperature" of our distribution. We can make things even clearer by putting in the chemical potential, or the energy associated with adding/replacing a member of our dataset:

\begin{eqnarray}
\mu_c &=& \mathrm{chemical}\ \ \mathrm{potential}\\
P(\mu, \sigma^2 \vert D) &=& \frac{e^{N\beta\mu_c-\beta E}}{Z} \\
\beta \mu_c &=& \log\left(\frac{1}{\sqrt{2\pi \sigma^2}} \right)
\end{eqnarray}

Which manifests itself as our extra normalizing factor for adding an extra data point -- exactly like the role our chemical potential mu plays in the Bose-Einstein distribution! (Pardon the repetition of mu.)
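Here is a small numerical sketch of this bookkeeping, using made-up data, a constant model f(x) = mu, and a fixed sigma (all assumptions for illustration), checking that the Gaussian log-likelihood really does split into N beta mu_c minus beta E:

import numpy as np

# Sketch: verify that the Gaussian log-likelihood equals N*beta*mu_c - beta*E.
rng = np.random.default_rng(0)
sigma = 2.0
f = 1.0                                      # constant model, f(x) = mu
x = rng.normal(f, sigma, size=50)            # made-up data set
N = x.size

beta_E = np.sum((x - f)**2) / (2 * sigma**2)             # the "energy"
beta_mu_c = np.log(1.0 / np.sqrt(2 * np.pi * sigma**2))  # "chemical potential" per point

log_like_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - f)**2 / (2 * sigma**2))
log_like_split = N * beta_mu_c - beta_E
print(log_like_direct, log_like_split)                   # the two agree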

Tuesday, April 15, 2014

Least Chi-Squared -- or Maximum Log-Likelihood -- Estimation of Parameters

Let's say you're given some data, a list of x and y values, which seem to be related by a model. One example is a perfect linear relationship with some scatter, where

\begin{eqnarray}
y_i \approx m x_i+b
\end{eqnarray}

Where $m$ is a best fit slope and $b$ a vertical offset. Since this is not a perfect relationship, we must introduce an error term into our model:

\begin{eqnarray}
y_i &=& m x_i+b + e_i\\
y_i &=& f(x_i) + e_i \\
f(x) &=& mx+b
\end{eqnarray}

$f(x_i)$ in this instance, represents our model, or, our "perceived" relationship between $x$ and $y$ values. Another example is when the $y_i$ values are just centered around some single value -- i.e. a random variable. Then we have:

\begin{eqnarray}
y_i &=& f(x_i) + e_i \\
f(x) &=& \mu
\end{eqnarray}

For a random variable $Y$ centered about zero, we might say $f(x)=0$. This is STILL a model!

The point is: if we want to make a "best guess" at the parameters within our model -- such as $m$ and $b$ above, or $\mu$ -- we are going to have to come up with some way of describing the error terms, $e_i$.

A reasonable assumption is that those error terms are drawn from some probability density; let's say a Gaussian:

\begin{eqnarray}
e_i &\sim & \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{e_i^2}{2\sigma^2}}
\end{eqnarray}

We can isolate the error term by subtracting the model $f(x_i)$ from our "output" value $y_i$. For a random variable centered about a true value $\mu$, i.e. $f(x)=\mu$, this results in:

\begin{eqnarray}
e_i &=& y_i - f(x_i) \\
P(y_i \vert \mu, \sigma^2) &=& \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}

Now, this is the probability of a single data point pair, $x_i,y_i$, assuming our model $f(x)$ and an error term described by this Gaussian distribution. If we were to count up all $N$ data point pairs and calculate the probability of observing the whole set $D:\{x_i, y_i\}$, we would take the product of the individual probabilities:

\begin{eqnarray}
P(D| \mu, \sigma^2) &=& \prod_i P(y_i \vert \mu, \sigma^2)\\
&=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}

Notice that this conditional probability has the same structure as our $P(D\vert H_i) $ term from Bayes Theorem, before. Our "hypothesis" in this case, is the mean and variance; or, the parameters of our model. We can construct posterior probability densities on those parameters in the usual way:

\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{P(D)}\\
&=& \frac{P(D \vert H_i)P(H_i)}{\sum_i P(D \vert H_i)P(H_i)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}

The denominator in these terms is simply a normalizing factor. If one wants to construct the posterior on a single parameter, say $\mu$, we merely perform what's called marginalization:

\begin{eqnarray}
P(\mu \vert D) &=& \int P(\mu, \sigma^2 \vert D) d\sigma^2
\end{eqnarray}

This equation reads "the probability of some mean $\mu$, given the data $D$, an array of x and y values".
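Here is a quick numerical sketch of that marginalization, assuming a flat prior, made-up data, and an arbitrary grid over $\mu$ and $\sigma^2$ -- none of these choices come from the formulas above.

import numpy as np

# Sketch: build P(mu, sigma^2 | D) on a grid (flat prior assumed) and
# marginalize over sigma^2 to get P(mu | D).
rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.5, size=40)            # made-up data, model f(x) = mu
N = x.size

mu_grid = np.linspace(1.0, 5.0, 200)
var_grid = np.linspace(0.5, 6.0, 200)
MU, VAR = np.meshgrid(mu_grid, var_grid, indexing='ij')

chi2 = np.sum((x[None, None, :] - MU[..., None])**2, axis=-1) / VAR
log_post = -0.5 * N * np.log(2 * np.pi * VAR) - 0.5 * chi2   # flat prior

post = np.exp(log_post - log_post.max())
dmu, dvar = mu_grid[1] - mu_grid[0], var_grid[1] - var_grid[0]
post /= post.sum() * dmu * dvar              # normalize: the denominator "Z"
post_mu = post.sum(axis=1) * dvar            # marginalize over sigma^2
print(mu_grid[np.argmax(post_mu)], x.mean()) # posterior peak sits near the sample mean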

Ok, so why do we care? Well, in Bayes Theorem above, we implicitly used something called a Likelihood function, which is the Likelihood of seeing the data, given some model and its parameters:
\begin{eqnarray}
\mathcal{L}(D \vert \mu, \sigma^2)=P(D \vert \mu, \sigma^2) &=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}

Now the term in our exponential above can be simplified. Defining the $\chi^2$ statistic for our data:

\begin{eqnarray}
\chi^2 &=& \sum_i \frac{\left(y_i - f(x_i)\right)^2}{\sigma^2}\\
\mathcal{L}(D \vert \mu, \sigma^2)&=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\frac{\chi^2}{2}}
\end{eqnarray}
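As a sanity check, here is a short sketch -- with made-up data, $\mu$, and $\sigma$ -- verifying that the product of the individual Gaussian probabilities matches the compact $(2\pi \sigma^2)^{-N/2} e^{-\chi^2/2}$ form:

import numpy as np

# Sketch: product of individual Gaussians vs. the compact likelihood form.
rng = np.random.default_rng(2)
mu, sigma = 0.5, 1.2
y = rng.normal(mu, sigma, size=20)            # made-up data
f = mu                                        # constant model, f(x) = mu
N = y.size

individual = np.exp(-(y - f)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
L_product = np.prod(individual)

chi2 = np.sum((y - f)**2) / sigma**2
L_compact = (2 * np.pi * sigma**2)**(-N / 2) * np.exp(-chi2 / 2)
print(L_product, L_compact)                   # the two agree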

Pretty cool right? So, maximizing Likelihood is minimizing $\chi^2$. We can take the log of this Likelihood function to write things in a more illuminating fashion:

\begin{eqnarray}
\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right]&=& -\frac{N}{2}\log (2\pi \sigma^2)-\frac{\chi^2}{2}
\end{eqnarray}

Taking the derivative with respect to $\mu$ and setting it equal to zero yields the critical value of $\mu$ -- let's call it $\hat{\mu}$ -- that gives maximum Likelihood:

\begin{eqnarray}
\frac{\partial}{\partial \mu}\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right] &=&-\frac{\partial}{\partial \mu}\chi^2/2\\
0&=&\frac{\sum_i x_i - N\hat{\mu}}{\sigma^2}\\
\hat{\mu} &=& \sum_i \frac{x_i}{N}
\end{eqnarray}

our standard "average". This is our estimator of the mean.

Now for an estimator of the variance -- or, in our older terms, the value of $\sigma$ for which Likelihood is maximized:

\begin{eqnarray}
\frac{\partial}{\partial \sigma}\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right] &=& 0\\
\frac{N}{\hat{\sigma}}&=&\frac{\sum_i (x_i-\mu)^2}{\hat{\sigma}^3}\\
\hat{\sigma^2} &=&\frac{\sum_i (x_i-\mu)^2}{N}
\end{eqnarray}

Interesting -- our standard sum of squares. Notice, however, that we've used our "true" mean $\mu$ in the above equation, not our estimator $\hat{\mu}$. In order to account for this, one can write

\begin{eqnarray}
\hat{\sigma^2} &=&\frac{\sum_i (x_i-\mu)^2}{N}\\
N\hat{\sigma^2} &=&\sum_i (x_i-\mu)^2\\
&=&\sum_i (x_i-\hat{\mu})^2+N(\hat{\mu}-\mu)^2\\
&=& Ns^2+N\left(\sum_i \frac{x_i-\mu}{N}\right)^2\\
&\approx & Ns^2+N\sum_i \frac{(x_i-\mu)^2}{N^2}\\
N\hat{\sigma^2} &=& Ns^2+\hat{\sigma^2}\\
\frac{(N-1)}{N}\hat{\sigma^2} &=& s^2
\end{eqnarray}

(In the fifth line, the cross terms of the squared sum vanish on average over many data sets, so the replacement holds in expectation.)

Here $s^2$ is a new estimator for the variance, one which requires no "true" information:
\begin{eqnarray}
s^2&=& \frac{\sum_i (x_i-\hat{\mu})^2}{N}
\end{eqnarray}

If we want to best emulate the "true information" estimator, we had best divide by $N-1$ instead of $N$. Another way to see this is that $s^2$ is a biased estimator: its expectation value, or average over an infinite number of data sets, is not $\sigma^2$ but in fact:

\begin{eqnarray}
\langle s^2 \rangle &=& \frac{N-1}{N}\sigma^2
\end{eqnarray}
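A quick Monte Carlo sketch of this bias -- the true mean, variance, sample size, and number of trials below are arbitrary choices -- averages the divide-by-$N$ estimator over many synthetic data sets:

import numpy as np

# Sketch: <s^2> over many data sets approaches (N-1)/N * sigma^2.
rng = np.random.default_rng(3)
mu_true, sigma_true, N, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(mu_true, sigma_true, size=(trials, N))
mu_hat = x.mean(axis=1, keepdims=True)
s2 = ((x - mu_hat)**2).mean(axis=1)           # the divide-by-N estimator
print(s2.mean(), (N - 1) / N * sigma_true**2) # the two are close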

To review, let us write our Posterior distribution one more time:

\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{\sum_i P(D \vert H_i)P(H_i)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{\mathcal{L}(D \vert \mu, \sigma^2)P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}\\
P(\mu, \sigma^2 \vert D) &=& \frac{(2\pi \sigma^2)^{-\frac{N}{2}} e^{-\frac{\chi^2}{2}}P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}

The factor $P(\mu,\sigma^2)$ on the right hand side is called a prior, and encodes our a priori information about the parameters $\mu$ and $\sigma^2$. What's nice about Bayes' construction is that the posterior from one data set can be used as the prior for another -- $D$ just goes into more initial information. (Perhaps we should have written the prior as

\begin{eqnarray}
P(\mu,\sigma^2 \vert \mathrm{old\ dataset})
\end{eqnarray}

to make this explicit.)

Stirling's Approximation

To take a quick detour, let us examine the following definition of the factorial:

\begin{eqnarray}
N! &=& \int_0^\infty x^N e^{-x}dx
\end{eqnarray}

One way to prove this is to write

\begin{eqnarray}
I(a) &=& \int_0^\infty e^{-ax}dx = \frac{1}{a}
\end{eqnarray}

and take the derivative underneath the integral sign, to write:

\begin{eqnarray}
I^\prime(a) &=& \frac{\partial}{\partial a} \int_0^\infty e^{-ax}dx \\
&=& \int_0^\infty -x e^{-ax}dx \\
&=& \frac{-1}{a^2}
\end{eqnarray}

and more generally,

\begin{eqnarray}
\frac{\partial^n I(a)}{\partial a^n} &=& (-1)^n \int_0^\infty x^n e^{-ax}dx = \frac{(-1)^n n!}{a^{n+1}}
\end{eqnarray}

Setting $a=1$ we find

\begin{eqnarray}
\Gamma[n+1] &=& \int_0^\infty x^n e^{-x}dx =  n!
\end{eqnarray}

Now let's examine this integral in the limit $n \to \infty$. We can bring the factor of $x^n$ up into the exponential and write the resulting exponent as $f(x)$:

\begin{eqnarray}
n! &=& \int_0^\infty e^{-x+n \log x}dx\\
&=& \int_0^\infty e^{f(x)}dx\\
f(x) &=& -x + n \log x
\end{eqnarray}

Now $f(x)$ is an absurdly large -- or high-valued -- function for large $n$, and so we can approximate this integral as only "counting" contributions around the maximum of $f(x)$. We find the position of this maximum in the normal way:

\begin{eqnarray}
f^\prime &=& -1 + \frac{n}{x} = 0 \\
x_0 &=& n
\end{eqnarray}

Taking a look at our second derivative
\begin{eqnarray}
f^{\prime \prime}\vert_{x_0} &=& -\frac{n}{x^2} = -\frac{1}{n} < 0
\end{eqnarray}

we see that $x_0$ is the position of a true maximum. Expanding out our $f(x)$ with a Taylor approximation:

\begin{eqnarray}
n! &\approx& \int_0^\infty e^{f(x_0)+f^\prime(x) \vert_{x_0}(x-x_0)+f^{\prime \prime}(x) \vert_{x_0}\frac{(x-x_0)^2}{2}}dx
\end{eqnarray}

We see that the first derivative term is zero by construction, and we are left with a constant times a Gaussian,

\begin{eqnarray}
n! &\approx& e^{-n+n\log n}\int_0^\infty e^{-\frac{(x-x_0)^2}{2n}}dx\\
&\approx& n^n e^{-n}\int_0^\infty e^{-\frac{(x-n)^2}{2n}}dx
\end{eqnarray}

Now this integral is tricky, because we are taking the limit as $n \to \infty$, which means that, essentially, the middle of our Gaussian distribution is far afield from $x=0$. Since the integral of any Gaussian $e^{-x^2/(2\sigma^2)}$ is $\sqrt{2 \pi \sigma^2}$, we can approximate the integral above by the "full" $-\infty < x < \infty$ integration, because the center of the distribution, $x_0=n$, is far to the positive side of zero. This yields, with $\sigma^2=n$:

\begin{eqnarray}
n! &\approx& n^n e^{-n}\sqrt{2\pi n}
\end{eqnarray}

Which is Stirling's approximation!
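A quick numerical check of the approximation, for a few arbitrarily chosen values of $n$:

import math

# Sketch: compare n! with n^n e^(-n) sqrt(2 pi n).
for n in (5, 10, 20, 50):
    exact = math.factorial(n)
    stirling = n**n * math.exp(-n) * math.sqrt(2 * math.pi * n)
    print(n, exact / stirling)                # ratio approaches 1 as n grows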

Now if we use this approximation to examine the binomial distribution in the same limit:

\begin{eqnarray}
P(k;N) &=& \frac{N!}{k! (N-k)!}p^k (1-p)^{N-k}
\end{eqnarray}
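As a rough sketch of where this is headed, we can apply Stirling's approximation to the log of this binomial probability and compare with the exact value; the particular $N$, $k$, and $p$ below are arbitrary illustration choices.

import math

# Sketch: Stirling-approximated vs. exact log of P(k; N) = C(N,k) p^k (1-p)^(N-k).
def log_stirling(n):
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

N, k, p = 100, 40, 0.35
log_tail = k * math.log(p) + (N - k) * math.log(1 - p)

log_exact = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1) + log_tail
log_approx = log_stirling(N) - log_stirling(k) - log_stirling(N - k) + log_tail
print(log_exact, log_approx)                  # close when N, k, and N-k are large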