Autoboxing is a feature of Java which allows the compiler to automatically translate between primitives and their boxed object forms, e.g. from ints to Integers and vice versa. Since Java generics can only parametrize on real classes, autoboxing is convenient when you want to use collections like List<Integer>. Instead of having to wrap all values with the Integer class, the Java compiler will do that for you. While autoboxing is known to have performance issues in some cases (constantly creating objects as you call methods with primitives will have some overhead), there are even more subtle dangers that you should be aware of. Consider the following code:
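The code itself is not included in this post, so here is a minimal reconstruction (the specific keys and values are my own assumptions):

```java
import java.util.HashMap;
import java.util.Map;

public class AutoboxingDemo {
    public static void main(String[] args) {
        // A map keyed by boxed longs (hypothetical keys and values)
        Map<Long, String> map = new HashMap<Long, String>();
        map.put(1L, "foo");
        map.put(2L, "bar");
        // Note the int literal: 1 is autoboxed to an Integer, not a Long
        map.remove(1);
        System.out.println(map);
    }
}
```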
It puts two values into a map, removes one, and then prints the content of the map. Simple enough, but here's the output:
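With the reconstruction above, the program prints something like:

```
{1=foo, 2=bar}
```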
Well that's interesting; somehow "foo" did not get removed from the map. Looking closely at the Java API for Map, we see that the remove method takes an Object as its parameter (the reasons for this are subtle and discussed in various places). So when we call remove on the map, the value 1 gets autoboxed to... an Integer. Unfortunately, the map uses Longs for keys, and the Integer 1 is not equal to the Long 1, so remove does not find the key and does nothing. But since the argument to remove has type Object, the code compiles fine and you are blissfully unaware of the impending doom until your system crashes and burns horribly (or you catch the mistake with a unit test!). Sad times. And all of this is fixed by writing "1L" (primitive long) instead of just "1" (primitive int).
This subtle interaction between the collection APIs and autoboxing should make any Java programmer very anxious. It is difficult to spot such mistakes in code, especially because you expect the compiler to catch this kind of error. The best solution I have for avoiding this problem is being very rigid about passing the correct type of primitive; if a value is an int, don't ever store it in a long, and vice versa. This can be somewhat annoying (personally, I find "1L" much uglier to read than "1") but may save you some headaches in the future. Alternatively, don't use boxed primitives in collections, but that seems like an even harder standard to maintain. Regardless, add autoboxing to the list of things in Java that you have to think about, even though the whole point of the feature is that you shouldn't have to.
Sunday, February 24, 2013
Implicit Values
Scala implicits are an interesting feature that could certainly fill up multiple blog posts, but in this one I'm just going to touch on how implicit values work at a high level and why you might use them. In the context of values and parameters, the implicit keyword can mean two things:
- the value or parameter is implicit in the current scope; or
- the parameter list is implicit and can be omitted provided there are corresponding implicit values in the scope of the method call.
For example, consider the following methods:
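The original snippet is missing from this post; a plausible reconstruction of the four methods (the bodies and string values are assumptions) looks like this:

```scala
def foo(implicit s: String) = println(s)

// bar's own parameter is implicit, so it satisfies foo's implicit parameter list
def bar(implicit s: String) = foo

// baz instead declares an implicit value inside its body
def baz() = {
  implicit val s: String = "hello from baz"
  foo
}

// the implicit parameter can still be passed explicitly
def qux() = foo("hello from qux")
```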
Here, foo is an example of a method whose parameter list (just one parameter in this case) is implicit. That means we can call it without any parameters if we have an implicit String in scope. The bar and baz methods show two ways of doing this: bar declares its parameter implicit just like foo, which causes it to be implicit in the current scope, while baz declares an additional implicit value that lives in the scope. Lastly, the qux method demonstrates that you can still pass the parameter explicitly.
Depending on your background and preferences, this might seem like either one of the most useful language constructs or one of the scariest. Implicits are very powerful for building frameworks like Play (more on this later) but at the same time can result in code that is very hard to understand due to the difficulty of tracing the usages of state through the code. Because Scala is a type-safe language, however, you at the very least get static type checking so that you don't accidentally end up in a complete mess of improperly typed implicits. For example, these two methods will not compile (thankfully):
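Reconstructed under the same assumptions as above, the offending methods would look roughly like this:

```scala
// error: could not find implicit value for parameter s: String
def foobar() = foo

// 'implicit' marks the whole parameter list, so both s and t are candidates
// and the compiler reports ambiguous implicit values at the call to foo
def fubar(implicit s: String, t: String) = foo
```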
The foobar method pretty clearly does not have an implicit String in the current scope, which means foo cannot be called. The second case with fubar is a little more subtle, both in syntax and in semantics. The implicit keyword is applied to a parameter list rather than a parameter (the reasons for this are not clear to me), so both String parameters are implicit in the scope of the method. The result, however, is that the implicit String passed to foo is now ambiguous; the Scala compiler won't do anything crazy like arbitrarily choose one, but will instead give you an error saying that you tried to do something very dangerous. So implicits are Scala's type-safe way of pretending that the language is more dynamic than it really is.
In general, I don't really advocate using implicits because it's quite easy to get yourself into trouble doing so, but there are cases in which it makes sense. As I mentioned earlier, the Play framework for building web applications is one instance where the use of implicits is convenient. When handling a web request, there are some things which are conceptually always in scope, such as the actual HTTP request or the currently authenticated user (as an option, of course). It can be burdensome to pass these values around constantly, especially between views, and the pain is noticeable if you're coming from something like Django. By leveraging implicits, you can just declare the variables you expect to be in scope at the top of the view and use them freely; as long as they are always declared implicit, the compiler will do the rest for you.
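As a rough sketch of the pattern (using hypothetical Request and User types rather than Play's actual classes):

```scala
// stand-ins for a framework's request and authenticated user
case class Request(path: String)
case class User(name: String)

// a "view" that needs the request and the current user without explicit plumbing
def profileView(title: String)(implicit request: Request, user: Option[User]): String =
  s"$title for ${user.map(_.name).getOrElse("guest")} at ${request.path}"

// a "controller" declares the implicits once; every call in its body picks them up
def showProfile(implicit request: Request, user: Option[User]): String =
  profileView("Profile")
```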
To summarize, implicits allow Scala programmers to have the illusion of working with a more dynamic language. They are powerful but can be dangerous, and as such should be used sparingly.
Sunday, February 17, 2013
Collaborative Filtering
Collaborative filtering is a very popular technique employed these days by recommendation systems such as those powering Amazon or Netflix. The idea is to use existing user-item ratings (e.g. for movies or products) and the relationships learned between users and items to predict what other items users might be interested in. It turns out that collaborative filtering is more accurate and scalable than other methods, such as profile-based recommendations, i.e. recommending based on how users describe themselves in their profiles. The most famous dataset for collaborative filtering comes from the Netflix Prize competition, which provided a set of 100 million user-movie ratings over a span of about six years; the goal was to have the most accurate predictions on a set of 1.4 million ratings that were not released. Although Netflix has evolved as a company since the competition, the dataset led to an incredible wealth of research in collaborative filtering which has certainly benefited the many other recommendation systems out there.
One interesting paper in this field that came from the winning team described how to integrate temporal dynamics into the predictions. It observes that the time at which a user rates a movie is actually crucial for prediction. This piece of data was often ignored during the first year or two of the competition, but the methods in the paper are able to capture real-life temporal effects very well. To describe how temporal dynamics are incorporated, we'll first need some background on one of the most used modeling techniques for the Netflix Prize: latent factor models (also known as factor analysis or matrix factorization).
Consider the rating matrix $M$ where $M_{ui}$ is the rating that user $u$ would give to movie $i$, ignoring the time component for now. We assume that users and movies are characterized by $f$ latent factors $p_u, q_i \in \mathbb{R}^f$ so that $M_{ui} = \langle p_u, q_i \rangle$. For example, we can think of the $k$-th component of these factors as how much a user likes action movies and how much action a movie has, respectively. Thus when $p_{uk}$ and $q_{ik}$ correlate positively the rating will be higher than when they do not. Note that the training set $S = \{ r_{ui}(t) \}$ (user-movie ratings with times) provided by Netflix is a very sparse subset of the entries in $M$, so the goal is to use that sparse subset to interpolate the rest of $M$, which in the latent factor model amounts to learning the appropriate $p_u$ and $q_i$ from the training set. This can be done through stochastic gradient descent on the regularized objective function to solve for
$$\min_{q, p} \sum_{r_{ui} \in S} (r_{ui} - \langle p_u, q_i \rangle)^2$$
There are a few additional details, including baseline predictors (to account for certain users or movies having intrinsically higher or lower ratings) and regularization, but the crux of the predictive power comes from the latent factor representation. Once the $p$ and $q$ vectors are learned, prediction becomes trivial.
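For concreteness, a common form of the per-rating stochastic gradient updates (with a learning rate $\gamma$ and regularization constant $\lambda$, neither of which is specified here) is

$$e_{ui} = r_{ui} - \langle p_u, q_i \rangle, \qquad p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u), \qquad q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i)$$

applied repeatedly over the ratings in $S$ until convergence.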
You'll notice, however, that the time domain is completely ignored in the described model. In fact, such a model makes the assumption that movie characteristics and user preferences do not change over time. While the former may be true, the latter most certainly is not, and accounting for that can improve upon the existing model greatly. So instead of $p_{uk}$ we have the time-dependent preference function $p_{uk}(t)$; by learning a function here, we allow for variation in a user's preferences over time. It's important, however, to have $p_{uk}(t)$ be structured in some way; otherwise, the number of variables is far too large for the size of the training set and we would most likely be overfitting. One of the time-based models from the paper is to let
$$p_{uk}(t) = p_{uk} + \alpha_{uk} \cdot \text{dev}_u(t)$$
where $\text{dev}_u(t) = \text{sign}(t - t_u) \cdot |t - t_u|^{\beta}$ and $t_u$ is the mean date of rating for user $u$. This assumes that a user's preference for one of the latent factors tends to either increase or decrease over time rather than oscillating. The $\alpha_{uk}$ and $\beta$ parameters determine the direction and shape of this variation. The author uses cross validation to choose the best value of $\beta$ (it turns out to be 0.4) and learns the parameters $p_{uk}$ and $\alpha_{uk}$. Since the model only has twice as many parameters per user as the non-temporal model, it is still efficient to learn. Again, there are additional details which can be found in the paper, but the main idea is to model variation in a user's preferences using the structured $p_{uk}(t)$ function.
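As a rough illustration (assuming time is measured in days, which is my assumption rather than something stated in the post): with $\beta = 0.4$, a rating made 100 days after user $u$'s mean rating date has

$$\text{dev}_u(t) = \text{sign}(100) \cdot |100|^{0.4} \approx 6.3$$

while one made 100 days before it has $\text{dev}_u(t) \approx -6.3$, so the drift term $\alpha_{uk} \cdot \text{dev}_u(t)$ grows smoothly and monotonically as ratings move away from $t_u$.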
Not only does adding temporal dynamics to the latent factor model increase the score on the Netflix Prize from 6.3% to 7.5% improvement over the Cinematch baseline (which, if you've worked on it at all, is an absolutely huge difference), but the author was able to draw several neat insights as a result of it. He was able to identify the potential cause behind the observed global increase in average ratings around 2004, which turns out to be improved recommendations by the existing Netflix system causing people to tend to watch movies that they are expected to enjoy. Moreover, he observed that the older a movie is, the higher its ratings tend to be, which is a product of the fact that people tend to select old movies more carefully than new ones. Temporal dynamics are very interesting because they are intuitive to people (as these observations show) but often difficult to model computationally. The work in the paper, however, shows that such dynamics are perhaps even more important than we would expect, and can be leveraged to significantly improve recommendation systems.
Wednesday, February 13, 2013
Reading InputStreams
The Java InputStream is one of the core I/O classes in the JDK and is used pretty much whenever you want to do I/O in your Java programs, e.g. reading from the disk or network. It has a variety of read methods that allow you to read a single byte or many bytes at a time, which are typically called in some sort of loop until you get a return value of -1 indicating the end of the stream. Consider the following example:
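The example code is missing from the post; a minimal reconstruction (the structure is an assumption, and it needs a generous heap size to run) might be:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadExample {
    static final int SIZE = 256 * 1024 * 1024; // 256MB

    public static void main(String[] args) throws IOException {
        // write 256MB into an in-memory buffer
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(new byte[SIZE]);

        // read it back with a single read call
        InputStream in = new ByteArrayInputStream(out.toByteArray());
        byte[] buffer = new byte[SIZE];
        int read = in.read(buffer);
        System.out.println("Read " + read + " bytes.");
    }
}
```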
Here, we write 256MB worth of data into a ByteArrayOutputStream, which just allocates a byte array to store all of those bytes, and then read it back all at once using a ByteArrayInputStream, which just wraps a byte array with the InputStream API. Unsurprisingly, this program results in the message: "Read 268435456 bytes." Simple enough, but let's see what happens when you decide you want to compress the bytes when writing them out and decompress them when reading them back (this is common when you have to write large amounts of easily-compressible data to the disk or network).
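A reconstruction of the compressed version (again an assumption; the exact byte count it reports will depend on the data and the zlib defaults) could be:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class CompressedReadExample {
    static final int SIZE = 256 * 1024 * 1024; // 256MB

    public static void main(String[] args) throws IOException {
        // compress 256MB of (highly compressible) zero bytes into memory
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DeflaterOutputStream out = new DeflaterOutputStream(bytes);
        out.write(new byte[SIZE]);
        out.close(); // finish the deflater and flush the remaining compressed data

        // decompress with a single read call
        InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        byte[] buffer = new byte[SIZE];
        int read = in.read(buffer);
        System.out.println("Read " + read + " bytes.");
    }
}
```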
Now we're wrapping the ByteArrayOutputStream with a DeflaterOutputStream, which compresses the data as it's written out, and the ByteArrayInputStream with an InflaterInputStream, which decompresses the data as it's read in. These streams do indeed invert each other correctly, but now this program prints: "Read 512132 bytes." That's strange, because we expected to get the same number of bytes back after compression followed by decompression. Digging into the contract provided by the InputStream API, you can find the following statement: "Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. This method blocks until input data is available, end of file is detected, or an exception is thrown." What that means is that an InputStream does not provide any guarantee about how many bytes a single read will return, even if, as in this case, all of the data is "available" in memory. The InflaterInputStream is most likely designed to inflate data in chunks and be efficient regardless of the underlying InputStream it is reading from. Taking this fact into account, the final example produces the expected output:
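The final example is also missing; one way to write it (a sketch that reuses the compressed bytes from the previous snippet) is:

```java
// read until the stream reports end-of-stream (-1), counting bytes as we go
InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes.toByteArray()));
byte[] chunk = new byte[8192];
long total = 0;
int read;
while ((read = in.read(chunk)) != -1) {
    total += read;
}
System.out.println("Read " + total + " bytes.");
```

This prints "Read 268435456 bytes." because the loop keeps going until the stream is actually exhausted.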
The great thing about all of this is that, if your data is small enough, the second example will actually work properly. Thus even testing may not catch the bug, which can lead to a lot of unfortunate situations. So the lesson to be learned is to always wrap InputStream reads in a loop and only consider it done once you see that -1!
Sunday, February 10, 2013
Pegasos (Part 2)
Last time, I briefly introduced classification problems and the support vector machine (SVM) model. For convenience, the SVM optimization problem is
$$\min_{\textbf{w}} \frac{1}{m} \sum_{(\textbf{x}, y) \in S} \ell(\textbf{w}; (\textbf{x}, y)) + \frac{\lambda}{2} ||\textbf{w}||^2$$
where $S$ is the training set, $\textbf{w}$ is the linear predictor, $\ell$ is the hinge loss, and $\lambda$ is the regularization constant. Pegasos is an algorithm/solver for this problem with some unique properties. As mentioned previously, Pegasos is a stochastic subgradient method, so it repeatedly computes subgradients of the objective function on small subsets of the training set. It can be described succinctly as follows (as shown in the linked paper). First, fix a constant $k$ and a number of iterations $T$. Then for $t = 1, 2, \ldots, T$, do the following (a code sketch of these steps appears after the list):
- Choose $A_t \subseteq S$ where $|A_t| = k$. These examples are sampled i.i.d. from $S$.
- Let $A_t^+ = \{ (\textbf{x}, y) \in A_t : y \langle \textbf{w}_t, \textbf{x}\rangle < 1 \}$, the set of examples in $A_t$ which have a positive hinge loss.
- Set the learning rate $\eta_t = \frac{1}{\lambda t}$.
- Set $$\textbf{w}_{t + \frac{1}{2}} = (1 - \eta_t \lambda) \textbf{w}_t + \frac{\eta_t}{k} \sum_{(\textbf{x}, y) \in A_t^+} y \textbf{x}$$ which is a subgradient update step (the objective function is not differentiable).
- Set $$\textbf{w}_{t+1} = \min \left\{1, \frac{1}{\sqrt{\lambda} ||\textbf{w}_{t+\frac{1}{2}}||} \right\} \textbf{w}_{t+\frac{1}{2}}$$ to scale the vector back if the regularization cost would be too high.
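Here is a concrete, if naive, sketch of these steps in Scala; the dense-array representation and the API are my own choices, not the paper's:

```scala
import scala.util.Random

object Pegasos {
  // Train a linear SVM with Pegasos on examples (x, y) where y is -1.0 or +1.0.
  def train(examples: IndexedSeq[(Array[Double], Double)],
            lambda: Double, k: Int, iterations: Int, seed: Long = 0L): Array[Double] = {
    val n = examples.head._1.length
    val w = Array.fill(n)(0.0)
    val rand = new Random(seed)

    for (t <- 1 to iterations) {
      // Choose A_t: k examples sampled uniformly at random from S
      val batch = IndexedSeq.fill(k)(examples(rand.nextInt(examples.length)))
      // A_t^+: the examples in the batch with positive hinge loss, i.e. y <w, x> < 1
      val violators = batch.filter { case (x, y) => y * dot(w, x) < 1.0 }
      val eta = 1.0 / (lambda * t)

      // Subgradient step: w <- (1 - eta * lambda) w + (eta / k) * sum of y x over A_t^+
      for (i <- 0 until n) w(i) *= (1.0 - eta * lambda)
      for ((x, y) <- violators; i <- 0 until n) w(i) += (eta / k) * y * x(i)

      // Projection step: scale w back into the ball of radius 1 / sqrt(lambda)
      val norm = math.sqrt(dot(w, w))
      if (norm > 0) {
        val scale = math.min(1.0, 1.0 / (math.sqrt(lambda) * norm))
        for (i <- 0 until n) w(i) *= scale
      }
    }
    w
  }

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum
}
```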
The amazing thing to note here is that the algorithm is extremely simple to describe, requiring only a few computations in each iteration. Also, it might seem counter-intuitive to choose just a subset of the training examples and ignore the rest, but that turns out to be a key component in the runtime analysis of Pegasos. The authors provide proofs and analysis which show that its runtime is $\tilde{O}(d / (\lambda \epsilon))$ where $d$ is the maximum number of non-zero features in each example (dimensionality of the training data) and $\epsilon$ is the desired accuracy (the $\tilde{O}$ notation here ignores logarithmic factors). This result implies that the size of the training set does not influence the runtime of the algorithm, unlike all previous SVM solvers, which means that it is extremely suitable for cases in which the training set is enormous.
In a follow-up paper, an even more remarkable result is shown. The authors develop some theoretical machinery around the relationship between training set size and runtime for solving the optimization problem, which leads to the result that Pegasos' runtime should actually decrease with the size of the training set. And in their empirical studies, it does! It's not often that you come across algorithms which improve their runtime when given more data, so this is quite an interesting find. It definitely makes you wonder how machine learning algorithms should behave under different sizes of training sets.
Monday, February 4, 2013
Pegasos (Part 1)
The problem of classification is core to machine learning; given a dataset of labelled examples, e.g. images of known digits, can a model be learned so that future instances can be classified automatically? Whether dealing with vision, audio, or natural language, classification problems are abundant and have many practical applications. It is not surprising that researchers are always coming up with new ways of learning and modeling data to better tackle classification. Support vector machines (SVMs) are a commonly used method for solving such problems that performs very well; as such, algorithms for learning SVMs and the theory behind SVMs are both hot topics in machine learning research. One of the most interesting such algorithms (in my opinion) is Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, which exhibits a unique and counter-intuitive relationship between classification error, training set size, and runtime.
Before I discuss the ideas behind Pegasos, first let me present some background on classification problems and SVMs. The formal definition of the classification problem is as follows. Suppose you are given a training set $S = \{(\textbf{x}_i, y_i)\}_{i=1}^m$ of size $m$ where $\textbf{x}_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{-1, +1\}$ is the classification label (for simplicity, we are assuming a binary classification problem which SVMs are most naturally suited for). For example, each $\textbf{x}_i$ might be the representation of an image after whitening, and $y_i$ might indicate whether the image contains a mug or not. We can think of the training examples as being drawn from some unknown probability distribution $P(\textbf{X}, Y)$ that we want to estimate. The goal is to find a classifier $h: \mathbb{R}^n \rightarrow \{-1, +1\}$ which minimizes $\Pr[h(\textbf{x}) \neq y]$ for $(\textbf{x}, y)$ drawn from $P$, i.e. the probability of a classification error given a random sample from the true distribution. Intuitively, a learning algorithm will attempt to build a model for $P$ using $S$ as a representative sample of the distribution.
The idea behind SVMs is to model the classifier $h$ as a separating hyperplane in $\mathbb{R}^n$ where points on one side of the hyperplane are classified as $-1$ and points on the other side are classified as $+1$. One way of representing this is to let $h(\textbf{x}) = \text{sign}(\langle \textbf{w}, \textbf{x} \rangle)$ for some predictor $\textbf{w} \in \mathbb{R}^n$. Then we can form the optimization problem typically used when learning an SVM:
$$\min_{\textbf{w}} \frac{1}{m} \sum_{(\textbf{x}, y) \in S} \ell(\textbf{w}; (\textbf{x}, y)) + \frac{\lambda}{2} ||\textbf{w}||^2$$
where $\ell(\textbf{w}; (\textbf{x}, y)) = \max\{0,~1 - y \langle \textbf{w}, \textbf{x} \rangle\}$ is the hinge loss. The hinge loss is zero when $y \langle \textbf{w}, \textbf{x} \rangle \geq 1$, i.e. when the sample is classified correctly with a sufficient margin, and it increases linearly as $y \langle \textbf{w}, \textbf{x} \rangle$ decreases, penalizing training samples that are "close" to the hyperplane or on the wrong side. The first term in the optimization function represents the error for the classifier, while the second term is for regularization, i.e. adding a penalty to large values of $\textbf{w}$ to prevent overfitting the training set.
One of the most useful features of SVMs is that the optimization problem can be solved without depending on the representation of the data itself, but only on the inner product between $\textbf{w}$ and the $\textbf{x}_i$'s. This leads to the idea of the kernel trick for SVMs, which means that instead of operating on the raw representation of the data, which may not be linearly separable, we map the data into a high dimensional feature space. This space may even be infinite-dimensional, e.g. with a Gaussian kernel, but as long as the inner product in that space is easy to compute, the optimization problem can be solved. An important detail to note is that, when applying the kernel trick, the vector $\textbf{w}$ also lives in the high dimensional feature space. But thanks to the Karush-Kuhn-Tucker conditions, it is known that $\textbf{w}$ can always be represented as a linear combination of the $\textbf{x}_i$'s, so we can represent $\textbf{w}$ as a set of coefficients. Thus the kernel trick allows SVMs to efficiently perform nonlinear classification of data and makes them much more powerful.
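In symbols (the coefficients $\alpha_i$ and the kernel $K$ are notation introduced here rather than in the post), writing $\textbf{w} = \sum_{i=1}^m \alpha_i \textbf{x}_i$ in the feature space gives

$$h(\textbf{x}) = \text{sign}(\langle \textbf{w}, \textbf{x} \rangle) = \text{sign}\left( \sum_{i=1}^m \alpha_i \langle \textbf{x}_i, \textbf{x} \rangle \right) = \text{sign}\left( \sum_{i=1}^m \alpha_i K(\textbf{x}_i, \textbf{x}) \right)$$

so the classifier only ever needs the coefficients and kernel evaluations, never explicit coordinates in the high dimensional space.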
Researchers have been working on solving the SVM optimization problem as efficiently as possible since it was introduced. There have been a number of techniques developed, including interior point methods, decomposition methods such as the SMO algorithm, and gradient methods. Typically, interior point methods have runtimes which grow as $m^3$, while other algorithms can be anywhere from quadratic to linear in $m$ (the interior point methods compensate by having much weaker dependence on the accuracy of the solution). Pegasos is a stochastic subgradient method which falls roughly into the family of gradient methods, but it turns out that the runtime necessary for it to achieve a fixed accuracy does not depend on $m$ (in the case of a linear kernel). This might seem strange at first, but if you think about it, adding more training examples should make it easier for a learning algorithm to achieve the same accuracy. We'll see next time how Pegasos works at a high level and the implications of the algorithm on the relationship between training set size and runtime.