Wednesday, March 13, 2013


A couple of days ago, I went to a talk hosted by the Bay Area Scala Enthusiasts on the topic of "Scalable and Flexible Machine Learning with Scala." The talk focused on using Scala with Hadoop to run machine learning algorithms like PageRank, k-means clustering, and decision trees with very little code. It was interesting to hear how the speakers (from LinkedIn and eBay) use machine learning in practice and why they feel Scala is particularly suitable for the task given today's libraries and frameworks. In particular, they showed an impressive library called Scalding, which came out of Twitter and is designed to make writing Hadoop workflows in Scala feel natural.

Let's start with a quick code snippet taken from the Scalding website.
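The embedded snippet did not survive in this copy of the post. As a stand-in, here is a sketch of the fields-based word count job as I recall it from the Scalding README of that period. Treat it as a reconstruction rather than a verbatim quote; the whitespace-only tokenization in particular is an assumption, and it needs the Scalding dependency on the classpath to compile.

```scala
import com.twitter.scalding._

// Reconstructed sketch of the Scalding word count job (fields-based API).
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                       // read the input file line by line
    .flatMap('line -> 'word) { line: String =>  // emit one 'word per token in each 'line
      line.split("""\s+""")                     // assumption: split on whitespace only
    }
    .groupBy('word) { _.size }                  // count occurrences of each word
    .write(Tsv(args("output")))                 // writing output is what triggers execution
}
```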

This piece of code is the implementation of word count, the prototypical MapReduce example, on Hadoop. Besides being impressively short, it is very close to Scala Collections syntax, which means it comes naturally to anyone who has programmed in Scala extensively. The idea is to take code that operates on collections and easily translate it into a MapReduce job that can run on Hadoop. This makes a lot of sense, since MapReduce is essentially based on the functional programming paradigm.
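To make the comparison concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The object name, the `tokenize` helper, and its lowercase-and-split rule are my own assumptions for illustration.

```scala
object CollectionsWordCount {
  // Split a line into lowercase words (assumed tokenization rule).
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Same shape as the MapReduce job: flatMap to words, group, then count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

Calling `CollectionsWordCount.wordCount(Seq("the cat", "The dog"))` counts "the" twice and the other two words once each. The `flatMap` and `groupBy` steps line up one-to-one with the map and reduce phases of the Hadoop job.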

Of course, it's important that the resulting MapReduce job be efficient. The first two method calls, flatMap and groupBy, don't actually do any work themselves; instead, they simply create flows, an abstraction from the Cascading library on which Scalding is built. Much like other tools that hide the details of MapReduce, Cascading and Scalding construct execution flows that are only optimized and executed when you write some output. This is essential because eagerly executing each map and reduce step as it is declared would lead to inefficiencies in the overall job (e.g. several map steps in a row would each make their own pass over the data instead of being combined into a single pass over the input).
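The deferred-execution idea can be illustrated with a plain Scala `Iterator`, whose `map` is also lazy. This is only an analogy under my own assumptions (the object name and the call counter are hypothetical); it is not how Cascading actually plans jobs.

```scala
object LazyPipelineSketch {
  // Returns (calls made before output, calls made after output, result).
  def run(): (Int, Int, List[Int]) = {
    var calls = 0

    // Declaring the two map steps does no work yet; it only describes the pipeline.
    val pipeline = List(1, 2, 3).iterator
      .map { x => calls += 1; x + 1 }
      .map { x => calls += 1; x * 2 }

    val callsBeforeOutput = calls // still 0: nothing has executed

    // Forcing the output runs both steps in a single pass over the input.
    val result = pipeline.toList

    (callsBeforeOutput, calls, result)
  }
}
```

`LazyPipelineSketch.run()` returns `(0, 6, List(4, 6, 8))`: no function calls happen while the pipeline is being described, and each element flows through both map steps in one traversal once the output is requested.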

So why Scalding over something like Pig, which is a dedicated language for Hadoop? One major reason is that custom methods are a huge pain in Pig: the language is not expressive enough, so you have to write them outside of it as user-defined functions. If you integrate Hadoop into Scala, then you get all of Scala and whatever dependencies you want for free. Furthermore, Pig does not have a compilation step, so it is very easy to hit a type error somewhere in the middle of your MapReduce job, whereas with Scalding you would catch it before running the job. In practice, this is a big deal because MapReduce jobs take nontrivial amounts of time and compute resources. Pig has its uses, but Scalding brings a lot to the table in the areas where Pig is lacking.
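Both points can be sketched with ordinary Scala collections. Everything named here (`PageView`, `isBounce`, the 5000 ms threshold) is hypothetical and chosen only for illustration; the point is that custom logic is just a method in the same file, and that the compiler rejects a type mistake before anything runs.

```scala
object CustomLogicSketch {
  // Hypothetical record type for illustration.
  final case class PageView(user: String, durationMs: Long)

  // Custom logic is an ordinary Scala method, written in the same language as the pipeline.
  def isBounce(view: PageView): Boolean = view.durationMs < 5000L

  def bouncesPerUser(views: Seq[PageView]): Map[String, Int] =
    views
      .filter(isBounce)
      .groupBy(_.user)
      .map { case (user, userViews) => (user, userViews.size) }

  // A line such as
  //   views.map(_.durationMs.toUpperCase)
  // does not compile, because Long has no toUpperCase method.
  // The mistake surfaces at build time rather than partway through a job.
}
```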

Running Hadoop jobs is becoming very easy, and Scalding is a great example of how you can implement moderately complex machine learning algorithms that run on clusters with just a handful of lines of code. If there exists a Java library that does most of the hard work, even better; you can integrate it right into your Scalding code. While you probably still have to understand Hadoop pretty well to get the optimal performance out of your cluster, people who are less concerned about resource utilization or who don't have absurdly large datasets benefit greatly from abstractions on top of Hadoop. And everyone wins when it is easy to write testable, robust code even when you are not an expert. This is certainly the direction for many distributed computing and machine learning use cases going forward, and it will be really cool to see what new technologies are developed to meet the demand.

The slides for the talk can be found here.
