## Sunday, June 9, 2013

### SVM Probabilities

In a previous post, I described the basics of support vector machines (SVMs) in the context of binary classification problems. The original formulation of SVMs allows only for pure classification, i.e. for a test instance we return only $+1$ or $-1$, rather than a probabilistic output. This is fine if your loss function is symmetric so false positives and false negatives are equally undesirable, but that is often not the case and you may be willing to trade one for the other. Having probabilistic outputs allows you to choose what point along the receiver operating characteristic (ROC curve) you want to be, which is an important part in practical applications of machine learning. Fortunately, people have found ways of adapting the output of SVMs to produce reliable probability estimates; the one I will be discussing is known as Platt's method, developed by John Platt at Microsoft Research.

Recall that the output of SVM training is a predictor $\textbf{w} \in \mathbb{R}^n$ where classification is done by computing $h(\textbf{x}) = \text{sign}(\langle \textbf{w}, \textbf{x} \rangle)$ for some instance $\textbf{x}$. In that sense, the quantity $f(\textbf{x}) = \langle \textbf{w}, \textbf{x} \rangle$ is already some sort of continuous spectrum between the two classes (like a probability would be). Platt's method is simply a post-processing step after SVM training which fits a function to map the value of $f$ to probabilities. So the questions that remain are what kind of function should be fit and how should the fit be done? Platt observed that the class-conditional probability densities, i.e. $p(f | y = \pm 1)$, between the SVM margins appear to be exponential (see Figure 1 in the linked paper for a nice graph of this), which after applying Bayes' rule leads to fitting a parametrized sigmoid function of the form

$$P(y = 1 | f) = \frac{1}{1 + \exp(Af + B)}$$
The parameters $A$ and $B$ can then be trained using a number of optimization algorithms with either a hold-out set or cross-validation to maximize likelihood (or, as typically done, log-likelihood).

There are additional details such as regularization, choice of algorithm for fitting, and numerical stability, but at a high level producing probability estimates from SVM output is just mapping the value of $f$ to a probability. This is convenient because it means the actual SVM implementation can remain untouched, requiring only an additional step at the end. Moreover, this functionality is provided in the libsvm library, which makes it easily accessible. In particular, I was able to leverage it to apply SVM training to the Kaggle contest I mentioned last time, which uses the AUC metric for scoring and thus requires ranking rather than simple binary classification.