Tuesday, April 15, 2014

Least Chi-Squared -- or Maximum Log-Likelihood -- Estimation of Parameters

Let's say you're given some data, a list of $x$ and $y$ values, which seem to be related by a model. One example is a roughly linear relationship with some scatter, where

\begin{eqnarray}
y_i \approx m x_i+b
\end{eqnarray}

where $m$ is a best-fit slope and $b$ a vertical offset. Since this is not a perfect relationship, we must introduce an error term into our model:

\begin{eqnarray}
y_i &=& m x_i+b + e_i\\
y_i &=& f(x_i) + e_i \\
f(x) &=& mx+b
\end{eqnarray}
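
As a concrete illustration, here is a minimal sketch of drawing data from this kind of model (numpy is assumed; the slope, intercept, and noise level are made-up values, not anything from a real data set):

```python
# A minimal sketch of the noisy linear model y_i = m*x_i + b + e_i.
# numpy assumed; the slope, intercept, and noise level are made-up values.
import numpy as np

rng = np.random.default_rng(0)

m_true, b_true, sigma = 2.0, 1.0, 0.5      # hypothetical "true" parameters
x = np.linspace(0.0, 10.0, 50)             # the x_i values
e = rng.normal(0.0, sigma, size=x.size)    # error terms e_i ~ N(0, sigma^2)
y = m_true * x + b_true + e                # y_i = f(x_i) + e_i
```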

$f(x_i)$, in this instance, represents our model -- our "perceived" relationship between the $x$ and $y$ values. Another example is when the $y_i$ values are simply centered around some single value -- i.e., when $y$ is a random variable. Then we have:

\begin{eqnarray}
y_i &=& f(x_i) + e_i \\
f(x) &=& \mu
\end{eqnarray}

For a random variable $Y$ centered about zero, we might say $f(x)=0$. This is STILL a model!

The point is this: if we want to make a "best guess" at the parameters within our model -- such as $m$ and $b$ above, or $\mu$ -- we are going to have to come up with some way of describing the error terms, $e_i$.

A reasonable assumption is that those error terms are drawn from some probability density; let's say a Gaussian:

\begin{eqnarray}
e_i &\sim & \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{e_i^2}{2\sigma^2}}
\end{eqnarray}

We can isolate the error term by subtracting the model $f(x_i)$ from the "output" value $y_i$. For a random variable centered about a true value $\mu$, i.e. $f(x)=\mu$, this gives:

\begin{eqnarray}
e_i &=& y_i - f(x_i) \\
P(y_i \vert \mu, \sigma^2) &=& \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}

Now, this is the probability of a single data point pair, $(x_i,y_i)$, assuming our model $f(x)$ and an error term described by this Gaussian distribution. If we consider all $N$ data point pairs and calculate the probability of observing the whole set $D=\{(x_i,y_i)\}$, we take the product of the individual probabilities:

\begin{eqnarray}
P(D| \mu, \sigma^2) &=& \prod_i P(y_i \vert \mu, \sigma^2)\\
&=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}
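
As a quick numerical check (a sketch assuming numpy and the constant model $f(x)=\mu$; the data are simulated, not real), the product of the individual Gaussian densities agrees with the closed-form expression above:

```python
# The likelihood of the whole data set: product of individual Gaussian densities
# versus the closed-form expression. numpy assumed; data are simulated.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
y = rng.normal(mu, sigma, size=20)       # pretend data, with f(x) = mu
N = y.size

resid = y - mu                           # y_i - f(x_i)
per_point = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
product_form = np.prod(per_point)

closed_form = (2 * np.pi * sigma**2) ** (-N / 2) * np.exp(-np.sum(resid**2) / (2 * sigma**2))

print(np.allclose(product_form, closed_form))   # True
```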

Notice that this conditional probability, $P(D \vert \mu, \sigma^2)$, has the same structure as the $P(D\vert H_i)$ term from Bayes' Theorem, before. Our "hypothesis," in this case, is the mean and variance -- that is, the parameters of our model. We can construct posterior probability densities on those parameters in the usual way:

\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{P(D)}\\
&=& \frac{P(D \vert H_i)P(H_i)}{\sum_j P(D \vert H_j)P(H_j)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}

The denominator in these expressions is simply a normalizing factor. If we want to construct the posterior on a single parameter, say $\mu$, we merely perform what's called marginalization:

\begin{eqnarray}
P(\mu \vert D) &=& \int P(\mu, \sigma^2 \vert D) d\sigma^2
\end{eqnarray}

This equation reads "the probability of some mean $\mu$, given the data $D$, an array of x and y values".
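
Numerically, if the posterior has been tabulated on a grid, marginalization is just a sum along the $\sigma^2$ axis times the grid spacing. A small sketch (the grid and the stand-in "posterior" here are made up purely to show the mechanics):

```python
# Marginalizing a gridded posterior P(mu, sigma^2 | D) over sigma^2.
# The grid and the stand-in posterior are made up for illustration only.
import numpy as np

mu_grid = np.linspace(-2.0, 2.0, 201)
var_grid = np.linspace(0.1, 4.0, 200)
dmu = mu_grid[1] - mu_grid[0]
dvar = var_grid[1] - var_grid[0]

MU, VAR = np.meshgrid(mu_grid, var_grid, indexing="ij")
post = np.exp(-MU**2 / (2 * 0.25) - (VAR - 1.0)**2 / (2 * 0.5))  # stand-in posterior shape
post /= post.sum() * dmu * dvar                                  # normalize the 2-D density

post_mu = post.sum(axis=1) * dvar    # P(mu | D): integrate out sigma^2
print(post_mu.sum() * dmu)           # ~1.0, as a normalized density should give
```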

OK, so why do we care? Well, in Bayes' Theorem above, we implicitly used something called a Likelihood function, which is the Likelihood of seeing the data, given some model and its parameters:
\begin{eqnarray}
\mathcal{L}(D \vert \mu, \sigma^2)=P(D \vert \mu, \sigma^2) &=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\sum_i \frac{\left(y_i - f(x_i)\right)^2}{2\sigma^2}}
\end{eqnarray}

Now the term in the exponential above can be simplified by defining the $\chi^2$ statistic for our data:

\begin{eqnarray}
\chi^2 &=& \sum_i \frac{\left(y_i - f(x_i)\right)^2}{\sigma^2}\\
\mathcal{L}(D \vert \mu, \sigma^2)&=& (2\pi \sigma^2)^{-\frac{N}{2}} e^{-\frac{\chi^2}{2}}
\end{eqnarray}

Pretty cool, right? So, maximizing Likelihood is the same as minimizing $\chi^2$ (at fixed $\sigma$). We can take the log of this Likelihood function to write things in a more illuminating fashion:

\begin{eqnarray}
\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right]&=& -\frac{N}{2}\log (2\pi \sigma^2)-\frac{\chi^2}{2}
\end{eqnarray}
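
As a sanity check (a sketch assuming numpy and scipy are available; the data values are made up), the $\chi^2$ form of the log-Likelihood agrees with summing Gaussian log-densities directly:

```python
# log-Likelihood via chi^2 versus a direct sum of Gaussian log-densities.
# numpy and scipy assumed; the data values are made up.
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 1.9, 1.4, 0.9])
mu, sigma = 1.0, 0.5
N = y.size

chi2 = np.sum((y - mu) ** 2) / sigma**2
loglike_chi2 = -0.5 * N * np.log(2 * np.pi * sigma**2) - 0.5 * chi2
loglike_direct = norm.logpdf(y, loc=mu, scale=sigma).sum()

print(np.allclose(loglike_chi2, loglike_direct))   # True
```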

Taking the derivative of this log-Likelihood with respect to $\mu$ and setting it equal to zero gives the critical value of $\mu$ -- let's call it $\hat{\mu}$ -- that maximizes the Likelihood:

\begin{eqnarray}
\frac{\partial}{\partial \mu}\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right] &=&-\frac{\partial}{\partial \mu}\frac{\chi^2}{2}\\
0&=&\frac{\sum_i y_i - N\hat{\mu}}{\sigma^2}\\
\hat{\mu} &=& \sum_i \frac{y_i}{N}
\end{eqnarray}

This is just the standard "average" -- our estimator of the mean.
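
A quick numerical confirmation (a sketch with simulated data and a brute-force grid search, not a recommended fitting routine): maximizing the log-Likelihood over $\mu$ lands on the sample mean.

```python
# Maximizing the log-Likelihood over mu (sigma fixed) recovers the sample mean.
# Simulated data and a simple grid search, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=100)
sigma = 1.0

mu_grid = np.linspace(0.0, 6.0, 10001)
chi2 = ((y[None, :] - mu_grid[:, None]) ** 2).sum(axis=1) / sigma**2
loglike = -0.5 * y.size * np.log(2 * np.pi * sigma**2) - 0.5 * chi2

mu_hat = mu_grid[np.argmax(loglike)]
print(mu_hat, y.mean())   # agree to within the grid spacing
```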

Now for an estimator of the variance -- or, in other words, the value of $\sigma$ for which the Likelihood is maximized:

\begin{eqnarray}
\frac{\partial}{\partial \sigma}\log\left[\mathcal{L}(D \vert \mu, \sigma^2)\right] &=& 0\\
\frac{N}{\hat{\sigma}}&=&\frac{\sum_i (y_i-\mu)^2}{\hat{\sigma}^3}\\
\hat{\sigma}^2 &=&\frac{\sum_i (y_i-\mu)^2}{N}
\end{eqnarray}
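
And the analogous check for $\sigma$ (again a sketch with simulated data and a grid search; the true mean is taken as known here, as in the derivation): the maximizing $\hat{\sigma}^2$ is the mean squared deviation about $\mu$.

```python
# Maximizing the log-Likelihood over sigma, with the true mean mu known,
# gives sigma_hat^2 = sum (y_i - mu)^2 / N. Simulated data, grid search.
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.0
y = rng.normal(mu_true, 2.0, size=500)
N = y.size

ss = np.sum((y - mu_true) ** 2)                 # sum of squared deviations
sig_grid = np.linspace(0.5, 4.0, 20001)
loglike = -0.5 * N * np.log(2 * np.pi * sig_grid**2) - 0.5 * ss / sig_grid**2

sig_hat = sig_grid[np.argmax(loglike)]
print(sig_hat**2, ss / N)   # agree to within the grid spacing
```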

Interesting -- $\hat{\sigma}^2$ is our standard sum of squares, divided by $N$. Notice, however, that we've used the "true" mean $\mu$ in the above equation, not our estimator $\hat{\mu}$. To account for this, one can write the following, where the step that expands $(\hat{\mu}-\mu)^2$ into $\sum_i (y_i-\mu)^2/N^2$ holds on average over data sets (the cross terms vanish in expectation):

\begin{eqnarray}
\hat{\sigma}^2 &=&\frac{\sum_i (y_i-\mu)^2}{N}\\
N\hat{\sigma}^2 &=&\sum_i (y_i-\mu)^2\\
&=&\sum_i (y_i-\hat{\mu})^2+N(\hat{\mu}-\mu)^2\\
&=& Ns^2+N\left(\sum_i \frac{y_i}{N}-\mu\right)^2\\
N\hat{\sigma}^2 &=& Ns^2+N\sum_i \frac{(y_i-\mu)^2}{N^2}\\
N\hat{\sigma}^2 &=& Ns^2+\hat{\sigma}^2\\
\frac{(N-1)}{N}\hat{\sigma}^2 &=& s^2
\end{eqnarray}

where we now have a new estimator for the variance, one that requires no "true" information:
\begin{eqnarray}
s^2&=& \frac{\sum_i (y_i-\hat{\mu})^2}{N}
\end{eqnarray}

If we want to best emulate the "true information" estimator, we had best divide by $N-1$ instead of $N$. Another way to put this is that $s^2$ is a biased estimator: its expectation value, or average over an infinite number of data sets, is not $\sigma^2$ but in fact:

\begin{eqnarray}
\langle s^2 \rangle &=& \frac{N-1}{N}\sigma^2
\end{eqnarray}
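
This bias is easy to see numerically. A sketch (simulated data sets; the true variance and sample size are made up): averaging $s^2$ over many data sets gives roughly $\frac{N-1}{N}\sigma^2$, while dividing by $N-1$ instead removes the bias.

```python
# Monte Carlo check of the bias of s^2 = sum (y_i - mu_hat)^2 / N.
# Simulated data sets; the true variance and sample size are made up.
import numpy as np

rng = np.random.default_rng(1)
sigma2_true, N, trials = 4.0, 5, 200000

data = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
s2 = data.var(axis=1, ddof=0)            # divide by N   (biased)
s2_corrected = data.var(axis=1, ddof=1)  # divide by N-1 (unbiased)

print(s2.mean(), (N - 1) / N * sigma2_true)   # ~3.2 vs 3.2
print(s2_corrected.mean(), sigma2_true)       # ~4.0 vs 4.0
```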

To review, let us write our Posterior distribution one more time:

\begin{eqnarray}
P(H_i \vert D) &=& \frac{P(D \vert H_i)P(H_i)}{\sum_j P(D \vert H_j)P(H_j)}\\
P(\mu, \sigma^2 \vert D) &=& \frac{\mathcal{L}(D \vert \mu, \sigma^2)P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}\\
P(\mu, \sigma^2 \vert D) &=& \frac{(2\pi \sigma^2)^{-\frac{N}{2}} e^{-\frac{\chi^2}{2}}P(\mu, \sigma^2)}{\int \int P(D \vert \mu, \sigma^2)P(\mu, \sigma^2)d\mu d\sigma^2}
\end{eqnarray}

The factor $P(\mu,\sigma^2)$ on the right hand side is called a prior, and it encodes our a priori information about the parameters $\mu$ and $\sigma^2$. What's nice about Bayes' construction is that the posterior from one data set can be used as the prior for the next -- the old data simply becomes part of our initial information. (Perhaps we should have written the prior as

\begin{eqnarray}
P(\mu,\sigma^2 \vert \mathrm{old\ dataset})
\end{eqnarray}

to make that explicit.)
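
To tie everything together, here is a sketch of the whole recipe on a grid (simulated data, a flat prior over the plotted range, and numerical normalization and marginalization; all of the specific numbers are illustrative choices, not part of any real analysis):

```python
# The full recipe on a grid: Gaussian likelihood x flat prior, normalized
# numerically, then marginalized over sigma^2 to get P(mu | D).
# Simulated data; grid ranges and the flat prior are illustrative choices.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(1.5, 0.8, size=30)
N = y.size

mu_grid = np.linspace(0.5, 2.5, 400)
var_grid = np.linspace(0.1, 2.5, 400)
dmu = mu_grid[1] - mu_grid[0]
dvar = var_grid[1] - var_grid[0]
MU, VAR = np.meshgrid(mu_grid, var_grid, indexing="ij")

chi2 = ((y[None, None, :] - MU[..., None]) ** 2).sum(axis=-1) / VAR
log_like = -0.5 * N * np.log(2 * np.pi * VAR) - 0.5 * chi2

prior = np.ones_like(MU)                          # flat prior P(mu, sigma^2)
post = np.exp(log_like - log_like.max()) * prior  # unnormalized posterior (max subtracted for stability)
post /= post.sum() * dmu * dvar                   # normalize (the denominator integral)

post_mu = post.sum(axis=1) * dvar                 # marginalize: P(mu | D)
print(mu_grid[np.argmax(post_mu)], y.mean())      # posterior peak sits near the sample mean
```

With a flat prior, the peak of the marginal posterior sits at the sample mean (to within the grid spacing), just as the maximum-Likelihood argument above suggests.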
