## Sunday, February 17, 2013

### Collaborative Filtering

Collaborative filtering is a very popular technique being employed these days by recommendation systems such as those powering Amazon or Netflix. The idea is using existing user-item ratings (e.g. for movies or products) and learned relationships between users and items to predict what other items users might be interested in. It turns out that collaborative filtering is more accurate and scalable than other methods, such as profile-based recommendations, i.e. trying to recommend based on what the users describe their profile as. The most famous dataset for collaborative filtering comes from the Netflix Prize competition, which provided a set of 100 million user-movie ratings over a span of about six years; the goal was to have the most accurate predictions on a set of 1.4 million ratings that were not released. Although Netflix has evolved as a company since the competition, the dataset led to an incredible wealth of research in collaborative filtering which has certainly benefited the many other recommendation systems out there.

One interesting paper in this field that came from the winning team described how to integrate temporal dynamics into the predictions. It observes that the time when a user rates a movie is actually crucial for prediction. This piece of data was often ignored for the first 1-2 years of the content, but the methods used are able to capture real-life effects very well. To describe how temporal dynamics are incorporated, we'll first need some background on one of the most used modeling techniques for the Netflix Prize, latent factor models (also known as factor analysis or matrix factorization).

Consider the rating matrix $M$ where $M_{ui}$ is the rating that user $u$ would give to movie $i$, ignoring the time component for now. We assume that users and movies are characterized by $f$ latent factors $p_u, q_i \in \mathbb{R}^f$ so that $M_{ui} = \langle p_u, q_i \rangle$. For example, we can think of the $k$-th component of these factors as how much a user likes action movies and how much action a movie has, respectively. Thus when $p_{uk}$ and $q_{ik}$ correlate positively the rating will be be higher than when they do not. Note that the training set $S = \{ r_{ui}(t) \}$ (user-movie ratings with times) provided by Netflix is a very sparse subset of the entries in $M$ so the goal is to use that sparse subset to interpolate the rest of $M$, which in the latent factor model amounts to learning the appropriate $p_u$ and $q_i$ from the training set. This can be done through stochastic gradient descent on the regularized objective function to solve for

$$\min_{q, p} \sum_{r_{ui} \in S} (r_{ui} - \langle p_u, q_i \rangle)^2$$
There are a few additional details, including baseline predictors (to account for certain users or movies to have intrinsically higher or lower ratings) and regularization, but the crux of the predictive power comes from the latent factor representation. Once the $p$ and $q$ vectors are learned, prediction becomes trivial.

You'll notice, however, that the time domain is completely ignored in the described model. In fact, such a model makes the assumption that movie characteristics and user preferences do not change over time. While the former may be true, the latter most certainly is not, and accounting for that can improve upon the existing model greatly. So instead of $p_{uk}$ we have the time-dependent preference function $p_{uk}(t)$; by learning a function here, we allow for variation in a user's preferences over time. It's important, however, to have $p_{uk}(t)$ be structured in some way; otherwise, the number of variables is far too large for the size of the training set and we would most likely be overfitting. One of the time-based models from the paper is to let

$$p_{uk}(t) = p_{uk} + \alpha_{uk} \cdot \text{dev}_u(t)$$
where $\text{dev}_u(t) = \text{sign}(t - t_u) \cdot |t - t_u|^{\beta}$ and $t_u$ is the mean date of rating for user $u$. This assumes that a user's preference for one of the latent factors tends to either increase or decrease over time rather than oscillating. The $\alpha_{uk}$ and $\beta$ parameters determine the direction and shape of this variation. The authors use cross validation to choose the best value of $\beta$ (it turns out to be 0.4) and learn the parameters $p_{uk}$ and $\alpha_{uk}$. Since the model only has twice as many parameters per user than the non-temporal model, it is still efficient to learn. Again, there are additional details which can be found in the paper, but the main idea is to model variation in a user's preferences using the structured $p_{uk}(t)$ function.

Not only does adding temporal dynamics to the latent factor model increase the score on the Netflix Prize from 6.3% to 7.5% improvement over the Cinematch baseline (which, if you've worked on it at all, is an absolutely huge difference), but the author was able to draw several neat insights as a result of it. He was able to identify the potential cause behind the observed global increase in average ratings around 2004, which turns out to be improved recommendations by the existing Netflix system causing people to tend to watch movies that they are expected to enjoy. Moreover, he observed that the older a movie is, the higher its ratings tend to be, which is a product of the fact that people tend to select old movies more carefully than new ones. Temporal dynamics are very interesting because they are intuitive to people (as these observations show) but often difficult to model computationally. The work in the paper, however, shows that such dynamics are perhaps even more important than we would expect, and can be leveraged to significantly improve recommendation systems.