## Sunday, February 10, 2013

### Pegasos (Part 2)

Last time, I briefly introduced classification problems and the support vector machine (SVM) model. For convenience, the SVM optimization problem is

$$\min_{\textbf{w}} \frac{1}{m} \sum_{(\textbf{x}, y) \in S} \ell(\textbf{w}; (\textbf{x}, y)) + \frac{\lambda}{2} ||\textbf{w}||^2$$
where $S$ is the training set, $\textbf{w}$ is the linear predictor, $\ell$ is the hinge loss, and $\lambda$ is the regularization constant. Pegasos in an algorithm/solver for this problem that has some very unique properties. As mentioned previously, Pegasos is a stochastic subgradient method so it involves many iterations of repeatedly computing subgradients for the objective function on subsets of the training set. It can be described succinctly as follows (as shown in the linked paper). First, fix a constant $k$ and a number of iterations $T$. Then for $t = 1, 2, \ldots, T$, do the following:
1. Choose $A_t \subseteq S$ where $|A_t| = k$. These examples are sampled i.i.d. from $S$.
2. Let $A_t^+ = \{ (\textbf{x}, y) \in A_t : y \langle \textbf{w}_t, \textbf{x}\rangle < 1 \}$, the set of examples in $A_t$ which have a positive hinge loss.
3. Set the learning rate $\eta_t = \frac{1}{\lambda t}$.
4. Set $$\textbf{w}_{t + \frac{1}{2}} = (1 - \eta_t \lambda) \textbf{w}_t + \frac{\eta_t}{k} \sum_{(\textbf{x}, y) \in A_t^+} y \textbf{x}$$ which is a subgradient update step (the objective function is not differentiable).
5. Set $$\textbf{w}_{t+1} = \min \left\{1, \frac{1}{\sqrt{\lambda} ||\textbf{w}_{t+\frac{1}{2}}||} \right\} \textbf{w}_{t+\frac{1}{2}}$$ to scale the vector back if the regularization cost would be too high.

The amazing thing to note here is that the algorithm is extremely simple to describe, requiring only a few computations in each iteration. Also, it might seem counter-intuitive to choose just a subset of the training examples and ignore the rest, but that turns out to be a key component in the runtime analysis of Pegasos. The authors provide proofs and analysis which show that its runtime is $\tilde{O}(d / (\lambda \epsilon))$ where $d$ is the maximum number of non-zero features in each example (dimensionality of the training data) and $\epsilon$ is the desired accuracy (the $\tilde{O}$ notation here ignores logarithmic factors). This result implies that the size of the training set does not influence the runtime of the algorithm, unlike all previous SVM solvers, which means that it is extremely suitable for cases in which the training set is enormous.

In a follow-up paper, an even more remarkable result is shown. The authors develop some theoretical machinery around the relationship between training set size and runtime for solving the optimization problem, which leads to the result that Pegasos' runtime should actually decrease with the size of the training set. And in their empirical studies, it does! It's not often that you come across algorithms which improve their runtime when given more data, so this is quite an interesting find. It definitely makes you wonder how machine learning algorithms should behave under different sizes of training sets.