Tuesday, September 8, 2015

Collaborative Filtering as AutoEncoding

After doing some reading, it seems that Collaborative Filtering -- the process of minimizing the residuals $Y_{ij}-\mathbf{X}_{il}\mathbf{\Theta}_{jl}$ by alternately fitting for $\mathbf{\Theta}_{jl}$ and $\mathbf{X}_{il}$ -- is the same as sparse autoencoding, or using a restricted Boltzmann machine. I will have more to say about this, in particular regarding Lasso -- L1 regularization -- regression and non-negative matrix factorization.

Second Order Collaborative Filtering: Playing with latent feature dimension

So, after playing around with things, I find -- unsurprisingly, since the winner of the Netflix contest used a very high-dimensional representation of feature vectors $\vec{x} \in \mathcal{R}^D$, $D \approx 1000$ -- that increasing the dimension of the feature vectors improves training fit substantially. Even with a fairly high regularization parameter of $\lambda = 1$ from the last post, I get the following results for $D=200$:

As you can see, we get a much tighter regression fit on the given ratings matrix $Y_{ij}$, at the cost of extra computation. Inverting the Hessian of the cost function -- which, thank goodness, turns out to be only $D \times D$ due to diagonality in the other degrees of freedom -- takes a great deal of time for high dimension $D$, so we are left with a trade-off between goodness of fit and computation time.

This algorithm is a second order ``batch'' method, taking in all the data at once. It will be interesting to see how things can be made incremental, or ``online'', so that data is taken in bit by bit and our matrices $\mathbf{X}_{il}$, $\mathbf{\Theta}_{jl}$ are updated bit by bit.
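
For concreteness, here is one way an ``online'' flavor could look -- a hypothetical stochastic gradient step applied to a single observed rating $(i, j, y)$ at a time. The function name, shapes, and step size are illustrative, not the batch algorithm described above.

```python
import numpy as np

# Hypothetical sketch of an "online" update: one stochastic gradient step for
# a single observed rating (i, j, y).  X is (n_m, D), Theta is (n_u, D).
def sgd_step(X, Theta, i, j, y, lam=1.0, alpha=0.005):
    err = y - X[i] @ Theta[j]             # residual for this one rating
    dx = err * Theta[j] - lam * X[i]      # negative gradient w.r.t. x^(i)
    dtheta = err * X[i] - lam * Theta[j]  # negative gradient w.r.t. theta^(j)
    X[i] += alpha * dx
    Theta[j] += alpha * dtheta

# toy usage
rng = np.random.default_rng(0)
X, Theta = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
sgd_step(X, Theta, i=2, j=1, y=4.0)
```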

Saturday, September 5, 2015

Recommender Systems -- Second Order Collaborative Filtering

So I've been working with recommender systems recently, which take in some incomplete set of product ratings, represented by a matrix $Y$

\begin{eqnarray}
Y_{ij} &=& \left(\begin{array}{cc}
\leftarrow \mathrm{ratings\ for\ movie\ 1} \rightarrow \\
\leftarrow \mathrm{ratings\ for\ movie\ 2} \rightarrow \\
\vdots \\
\leftarrow \mathrm{ratings\ for\ movie}\ n_m \rightarrow \end{array}\right)
\end{eqnarray}

and come up with a model to ``recommend'' similar products -- in this case we'll call them movies -- to users. This is essentially a regression problem, but we don't have the vector inputs $\vec{x}$. What's very cool about collaborative filtering is that we can solve the problem by guessing the regression inputs $\vec{x}$ and weights $\vec{\theta}$ for each movie and each individual, minimizing the cost function, and still get a working model! Even more astoundingly, the ``feature vectors'' $\vec{x}$ that emerge might contain some important information for categorizing products.

Let's say we have $n_m$ movies in our database and $n_u$ users: then this $Y$ matrix is $n_m \times n_u$, and for every entry where the $j^\mathrm{th}$ user has not rated the $i^{\mathrm{th}}$ movie, we put a question mark (?) in our matrix. Correspondingly, we can make a mask matrix $R_{ij}$, which has entries of 1 wherever we have a rating and 0 wherever we don't. When modeling $Y_{ij}$, we normally use a linear model, giving each movie a feature vector $\vec{x}^{(i)}$ and each individual a regression coefficient vector $\vec{\theta}^{(j)}$, both of a dimension of our choosing -- let's call it $D$. We have

\begin{eqnarray}
\vec{x}^{(i)} & \in & \mathcal{R}^D \ \ \forall i \\
\vec{\theta}^{(j)} & \in & \mathcal{R}^D \ \ \forall j \\
Y_{ij} &=& \vec{x}^{(i)}  \cdot \vec{\theta}^{(j)} + \epsilon_{ij}
\end{eqnarray}


And our error matrix, $\epsilon_{ij}$, is drawn from some probability distribution. This model can be written more concisely -- or perhaps more clearly -- as a matrix multiplication:

\begin{eqnarray}
Y_{ij} &=& \mathbf{X}_{il}  \mathbf{\Theta}_{jl} + \epsilon_{ij}
\end{eqnarray}

where we use the Einstein summation convention above. If we assume independent Gaussian errors, our cost function -- proportional to the negative log likelihood -- becomes,

\begin{eqnarray}
2 J\left( \mathbf{X}_{il},  \mathbf{\Theta}_{jl} \right) &=& \sum_{i,j \in R_{ij}} \left[ Y_{ij} - \mathbf{X}_{il}  \mathbf{\Theta}_{jl} \right]^2
\end{eqnarray}

And our sum above only includes the errors for movies for which we have actually received ratings. If we want to regularize our model, we can put simple priors on our movie feature vectors $\vec{x}$ and individual regression vectors $\vec{\theta}$ to get:

\begin{eqnarray}
2J\left( \mathbf{X}_{il},  \mathbf{\Theta}_{jl} \right) &=& \sum_{i,j \in R_{ij}} \left[ Y_{ij} - \mathbf{X}_{il}  \mathbf{\Theta}_{jl} \right]^2+\lambda \left(\mathbf{X}_{il} \mathbf{X}_{il}  + \mathbf{\Theta}_{jl} \mathbf{\Theta}_{jl}  \right)
\end{eqnarray}
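
To make this concrete, here is a minimal numpy sketch of the masked, regularized cost above, using a tiny made-up ratings matrix in which np.nan plays the role of the question marks and a boolean mask plays the role of $R_{ij}$. All names, shapes, and values are illustrative.

```python
import numpy as np

# Tiny made-up ratings matrix (np.nan marks the '?' entries) and its mask R.
Y = np.array([[5.0, 4.0, np.nan],
              [np.nan, 1.0, 2.0],
              [4.0, np.nan, 5.0]])      # n_m x n_u ratings
R = ~np.isnan(Y)                        # True wherever a rating exists
n_m, n_u = Y.shape
D = 2                                   # latent feature dimension
lam = 1.0                               # regularization parameter lambda

rng = np.random.default_rng(0)
X = rng.normal(size=(n_m, D))           # movie feature vectors x^(i)
Theta = rng.normal(size=(n_u, D))       # user regression vectors theta^(j)

def cost(X, Theta, Y, R, lam):
    """J, where 2J = sum over rated (i,j) of (Y_ij - x^(i).theta^(j))^2
                     + lam * (||X||_F^2 + ||Theta||_F^2)."""
    resid = (X @ Theta.T - np.where(R, Y, 0.0)) * R   # zero out unrated entries
    return 0.5 * (np.sum(resid**2) + lam * (np.sum(X**2) + np.sum(Theta**2)))

print(cost(X, Theta, Y, R, lam))
```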

The regularization term in the cost above is just the sum of two traces. Now, what's interesting about collaborative filtering is that, a priori, we don't know our movie feature vectors, and we don't know what the regression weights are either -- we only choose their dimension and hope things work out OK! Since this cost function is quadratic in $\mathbf{X}_{il}$ for fixed $\mathbf{\Theta}_{jl}$ (and vice versa), the Newton-Raphson method should optimize each in a ``single'' step. We can compute the derivatives of this cost function with respect to the matrices in the following way:

\begin{eqnarray}
\frac{\partial J}{\partial \mathbf{X}_{i^\prime l^\prime}} &=& -\sum_{j \in R_{i^\prime j}} \left[ Y_{i^\prime j} - \mathbf{X}_{i^\prime l}  \mathbf{\Theta}_{jl} \right]  \mathbf{\Theta}_{jl^\prime}+ \lambda \mathbf{X}_{i^\prime l^\prime} \\
\frac{\partial^2 J}{\partial \mathbf{X}_{i^\prime l^\prime} \partial \mathbf{X}_{i l}} &=& \sum_{j \in R_{i^\prime j}} \mathbf{\Theta}_{jl} \mathbf{\Theta}_{jl^\prime}\delta^K_{i,i^\prime}+ \lambda  \delta^K_{l,l^\prime}\delta^K_{i,i^\prime}
\end{eqnarray}

Where $\delta^K$ is the Kronecker delta. A similar expression pops out for the partial derivatives with respect to $\mathbf{\Theta}$, but the important point to note with Newton-Raphson is that we seek to find the root of the gradient. Normally, when setting $f(x)$ to zero, one writes:

\begin{eqnarray}
f(x+\delta x) &=& f(x) + \delta x \frac{\partial f}{\partial x}+O(\delta x^2)
\end{eqnarray}

And so, to leading order, we drive $f(x)$ to zero if we update according to

\begin{eqnarray}
\frac{-f(x)}{\left( \frac{\partial f}{\partial x} \right)} &=& \delta x
\end{eqnarray}
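
As a quick scalar illustration of this update rule (a toy example, separate from the recommender problem), take $f(x) = x^2 - 2$, whose positive root is $\sqrt{2}$:

```python
# Scalar Newton-Raphson: repeatedly apply dx = -f(x) / f'(x).
def newton(f, fprime, x, n_iter=6):
    for _ in range(n_iter):
        x += -f(x) / fprime(x)
    return x

print(newton(lambda x: x**2 - 2, lambda x: 2*x, x=1.0))   # ~1.41421356
```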

For the cost function $J$ we are doing the same thing, except with matrices:

\begin{eqnarray}
f \to f_{il} &=& \frac{\partial J}{\partial \mathbf{X}_{il}}
\end{eqnarray}

So our update takes the form of a double matrix multiplication (and inversion):

\begin{eqnarray}
\delta \mathbf{X}_{i l} &=& - \frac{\partial J}{\partial \mathbf{X}_{i^\prime l^\prime}} \left( \frac{\partial^2 J}{\partial \mathbf{X}_{i^\prime l^\prime} \partial \mathbf{X}_{i l}} \right)^{-1}
\end{eqnarray}

A similarly ugly expression for $\mathbf{\Theta}_{jl} $ applies.
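
In code, the update for $\mathbf{X}$ is less ugly than it looks, because the Hessian above is block diagonal in the movie index $i$: each row $\vec{x}^{(i)}$ gets its own $D \times D$ linear system. Here is a sketch reusing the toy Y, R, X, Theta, and lam from the earlier snippet, solving each small system instead of forming an explicit inverse.

```python
import numpy as np

def newton_update_X(X, Theta, Y, R, lam):
    """One full Newton step in X with Theta held fixed (sketch)."""
    n_m, D = X.shape
    X_new = X.copy()
    for i in range(n_m):
        rated = R[i]                                   # users who rated movie i
        Th = Theta[rated]                              # (n_rated, D)
        y = Y[i, rated]                                # their ratings
        grad = -(y - Th @ X[i]) @ Th + lam * X[i]      # dJ/dX_i
        hess = Th.T @ Th + lam * np.eye(D)             # d^2 J / dX_i dX_i
        X_new[i] = X[i] - np.linalg.solve(hess, grad)  # Newton step
    return X_new
```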

Now, I find that updating the matrices $X$ and $\Theta$ simultaneously according to the above second order rule causes the model to diverge, not converge -- which I suppose makes sense, because the gradient of $J$ with respect to $\mathbf{X}_{il}$ changes with $\mathbf{\Theta}_{jl}$. A more clever setup would be to apply the second order updates alternately to $X$ and $\Theta$ -- i.e. something that is truly collaborative. This alternately sets the gradients $\frac{\partial J}{\partial \mathbf{X}_{il}}$ and $\frac{\partial J}{\partial \mathbf{\Theta}_{jl}}$ to zero and, over time, hopefully causes the model to converge to a minimum of the cost $J$.
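
A sketch of that alternating scheme, assuming a newton_update_Theta obtained from newton_update_X by symmetry (swap the roles of the two matrices and transpose $Y$ and $R$):

```python
def newton_update_Theta(Theta, X, Y, R, lam):
    # Same algebra with the roles of X and Theta exchanged.
    return newton_update_X(Theta, X, Y.T, R.T, lam)

for it in range(20):
    X = newton_update_X(X, Theta, Y, R, lam)          # Newton step in X, Theta fixed
    Theta = newton_update_Theta(Theta, X, Y, R, lam)  # Newton step in Theta, X fixed
    print(it, cost(X, Theta, Y, R, lam))              # cost should decrease
```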

One could also use a first order method -- which is mathematically much easier -- where one must choose some arbitrary step size $\alpha$ -- or ``learning rate'' -- and hope that things work with a parameter update of the form

\begin{eqnarray}
\delta \mathbf{X}_{i l} &=& - \alpha \frac{\partial J}{\partial \mathbf{X}_{i l}}
\end{eqnarray}
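
A sketch of this simultaneous first order update, vectorized with the mask $R$ and restarted from a fresh random initialization (the step size and iteration count are just the illustrative values quoted below):

```python
X = rng.normal(size=(n_m, D))                       # fresh start
Theta = rng.normal(size=(n_u, D))
alpha = 0.005                                       # hand-tuned learning rate

for it in range(100):
    E = (X @ Theta.T - np.where(R, Y, 0.0)) * R     # prediction minus rating, masked
    grad_X = E @ Theta + lam * X                    # dJ/dX
    grad_Theta = E.T @ X + lam * Theta              # dJ/dTheta
    X, Theta = X - alpha * grad_X, Theta - alpha * grad_Theta

print(cost(X, Theta, Y, R, lam))
```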

I find that it takes thousands of iterations for a first order method -- choosing some step size $\alpha$ and hopefully getting lucky -- to achieve the same performance that a second order method reaches in 20 iterations. Here are the learning curves for a collaborative second order method, after 20 iterations:


And for a ``simultaneous'' first order method, after 100 iterations:

The data for these plots came from Andrew Ng's course on Machine Learning, using a regularization parameter $\lambda \approx .01$ and a step size $\alpha \approx .005$. With this code fully implemented and vectorized in Python, I find that it takes only a few seconds to achieve an almost perfect fit! (And by that I mean the improvement in the cost function $J$ per iteration is less than one part in one thousand.)

This example shows the power of second order methods when the cost function is truly second order in the parameters being updated. The one bottleneck that could appear in this example is the inversion of what are essentially $n_m + n_u$ separate $D \times D$ matrices, which goes like $O\left((n_m+n_u)D^3 \right)$. If one chooses the feature space to be very high-dimensional, with a great number of users and movies, the vectorized second order algorithm may slow down quite a bit. (But there is no replacing being able to set a matrix of derivatives to zero in a single step!) With a first order method, separate linear regression problems of this type can take a great deal of time to converge, and in general the step size should not be the same for all regression coefficients $\mathbf{\Theta}_{jl}$.

I wonder how this approach could be parallelized, or distributed, choosing the same dimension $D$ for many clusters. After all, the number of movies on Netflix is prohibitively large for such a model, even if it converges on my laptop in a few seconds for $n_m \sim 2000$ and $n_u \sim 1000$. 


Tuesday, September 1, 2015

Equivalence of Shannon Entropy and Logarithm of Posterior Partition function (under Laplace Approximation)

Now, the integral of the un-normalized posterior -- the likelihood times the prior -- over parameter space is called the evidence:

\begin{eqnarray}
P(D) = Z &=& \int d\vec{\theta} \mathcal{L}(D \vert \vec{\theta})P(\vec{\theta})
\end{eqnarray}

If we approximate the un-normalized posterior as a Gaussian -- which I've recently learned from Christopher Bishop is called the Laplace approximation -- by taking logarithmic derivatives, then we get:

\begin{eqnarray}
P(D \vert \vec{\theta})P(\vec{\theta}) &\approx & P(D \vert \vec{\theta}_{\mathrm{MAP}})P(\vec{\theta}_{\mathrm{MAP}})\mathrm{exp}\left[-\frac{1}{2}(\vec{\theta}-\vec{\theta}_{\mathrm{MAP}})_i\left(F_{ij}+\Sigma_{ij}^{-1}\right)(\vec{\theta}-\vec{\theta}_{\mathrm{MAP}})_j \right]
\end{eqnarray}
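
(For reference, the multivariate Gaussian integral we are about to use is

\begin{eqnarray}
\int d^D u \ \mathrm{exp}\left[-\frac{1}{2}u_i A_{ij} u_j \right] &=& (2\pi)^{D/2} \vert A_{ij} \vert^{-1/2}
\end{eqnarray}

with $A_{ij} = F_{ij}+\Sigma_{ij}^{-1}$ in our case.)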

We already know what the integral of this Gaussian factor over parameter space will be (dropping the prefactor $P(D \vert \vec{\theta}_{\mathrm{MAP}})P(\vec{\theta}_{\mathrm{MAP}})$, which does not depend on $\vec{\theta}$):

\begin{eqnarray}
\log Z &=& \frac{D}{2}\log(2\pi)-\frac{1}{2}\log \left( \vert F_{ij}+\Sigma_{ij}^{-1} \vert\right)
\end{eqnarray}

Comparing this with our measured entropy of the posterior, we see we're off by just a constant:

\begin{eqnarray}
H\left[ P(\vec{\theta} \vert D) \right] &=&  \frac{D}{2}\log(2\pi)-\frac{1}{2}\log \left( \vert F_{ij}+\Sigma_{ij}^{-1} \vert\right) + \frac{D}{2}
\end{eqnarray}

This is another example of why a statistical mechanics interpretation of $Z$, our normalization of the posterior, is right on point. Its logarithm -- up to an additive constant, which can be thrown away -- is equal to the entropy of our distribution, which is a common point of wisdom in statistical mechanics. So in conclusion, under the Laplace approximation -- writing our posterior as a Gaussian by expanding in the exponential and collecting the first and second cumulants, $\vec{\theta}_{\mathrm{MAP}}$ and $\left(F_{ij}+\Sigma_{ij}^{-1}\right)^{-1}$ -- we get:

\begin{eqnarray}
\log Z &=& H\left[P(\vec{\theta} \vert D )\right] + \mathrm{const}
\end{eqnarray}
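
As a quick numerical sanity check of that statement, here is a sketch with a made-up $2 \times 2$ precision matrix standing in for $F_{ij}+\Sigma_{ij}^{-1}$, computing both $\log Z$ and the entropy by brute-force grid integration rather than from the closed forms:

```python
import numpy as np

# Check that log Z and the Gaussian entropy differ by the constant D/2,
# using a made-up 2x2 precision matrix A in place of F + Sigma^{-1}.
D = 2
A = np.array([[2.0, 0.3],
              [0.3, 1.5]])                  # positive definite precision matrix

lin = np.linspace(-8.0, 8.0, 801)
dx = lin[1] - lin[0]
T1, T2 = np.meshgrid(lin, lin, indexing="ij")
quad = A[0, 0]*T1**2 + 2*A[0, 1]*T1*T2 + A[1, 1]*T2**2
unnorm = np.exp(-0.5 * quad)                # Laplace-approximate (un-normalized) posterior

Z = unnorm.sum() * dx**2                    # grid estimate of the Gaussian integral
log_Z = np.log(Z)

p = unnorm / Z                              # normalized posterior on the grid
H = -np.sum(p * np.log(p)) * dx**2          # grid estimate of the differential entropy

print(log_Z, 0.5*D*np.log(2*np.pi) - 0.5*np.log(np.linalg.det(A)))  # should agree
print(H - log_Z, 0.5*D)                     # the additive constant is D/2
```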