Sunday, March 31, 2013


A major complaint about Java in the context of today's programming languages is that you need to go through the full process of writing a class with a main method, compiling it, and then executing it just to run a single line of code. As such, one of the most useful features of Scala is the REPL (read-eval-print loop) because it gives you that scripting language feel and allows for very quick experimentation. I decided to play around a bit with the innards of the Scala REPL, whose implementation lives in the scala-compiler jar. Because the classes used to build the REPL are exposed, there are some interesting possibilities for programmatically examining and altering the environment.

The main class that the REPL uses is ILoop, which I imagine stands for "interpreter loop" or something to that effect. It is quite simple to embed the normal REPL into our own program, like this:

If you run this code, you'll be presented with the standard prompt that you see when you launch the REPL (note that without the jline library you won't get command history, tab completion, etc, but the compile-and-run functionality is there). It is running with the same classpath as your program, so any classes you have defined are available in the interpreter. Let's do something a bit more complex; what if we want to programmtically initialize the interpreter environment, e.g. to have certain variables pre-defined or classes pre-imported? We can extend the ILoop class and override the postInitialization() method where we can handle the newly created IMain object, which is the actual interpreter:

Here, we bind the variable SomeConstant to have value 42, pre-import the Foo class, and print out the list of symbols defined by the interpreter. Because we have access to the interpreter object, it is easy to see how you could set up the proper environment with the interpreter and make Domain-Specific Languages (DSLs) even more natural. When you start the custom ILoop from above, we see the following output:

So our constant is defined, the Foo class is imported, and we see that the two defined symbols in the interpreter environment are SomeConstant and this special $intp variable. It turns out that this is a reference to the actual interpreter object that we had earlier, which lets us do crazy meta things like interpret a line of code which interprets a line of code. Here's a transcript from the interpreter environment:

So we see that the implementation of the Scala REPL is quite extensible in terms of altering the environment from code. I think there are a lot of applications of this, especially when using Scala outside of the traditional software engineering environment.

Sunday, March 24, 2013

Memory-Mapped Binary Search

Following up on my last post, I looked into implementing a simple proof-of-concept application using FileChannels. I settled on the following use case: suppose you have a very large file which is a sorted list of strings and you want to perform binary search on it. As mentioned last time, the Sorted String Table data structure, which is roughly a generalization of this, is often manipulated through memory-mapped files. The reason why its convenient to do so is because, regardless of how large the file is, we can map its contents to (virtual) memory and treat it as a big in-memory array, letting the OS take care of paging. Then we can easily perform binary search to test membership in the list.

Let's start by formalizing the data structure. We have two files, the index file and the data file with the actual strings. The former is a list of offsets where strings start (and correspondingly end) in the data file, while the latter is a concatenation of all the strings in sorted order. We will assume we can read the entire index into memory, but we could just as easily memory-map that file as well. When doing the binary search, the index will tell us where in the data file to read strings from to do the comparisons. Here's the code in Scala:

We start off by mapping the data file into memory, from which we obtain a MappedByteBuffer that operates as our in-memory array. The bulk of the work happens in the findString() method which does the binary search; to read from the memory-mapped file, we position the buffer at the location specified by the index and read the string from that position. From there, it is as if we just had a String in memory, and we do the necessary comparisons for the binary search logic.

Again, the biggest win is that we can manipulate very large files without doing any complex memory management to make sure we do not exceed our heap size or the machine's memory. You may notice that there are a few limitations of the above code due to the Java ByteBuffer API. A single ByteBuffer can only represent up to 2GB because it uses integers everywhere instead of longs, so the code will only work for files less than 2GB in size (in which case you often do not exceed the machine's memory). Fortunately, it is not too hard to chunk the file into 2GB blocks and map each chunk to its own ByteBuffer; the code changes slightly, but mostly in additional implementation details.

This proof-of-concept binary search shows how FileChannels and MappedByteBuffers can be leveraged to efficiently solve problems in which your data exceeds your memory limit without adding enormous amounts of complexity to your code.

Wednesday, March 20, 2013

Memory-Mapped Files

Using memory-mapped files is a common technique for improving I/O performance. The concept is pretty simple: take a portion of a file (usually a page worth, or 4KB) and map its contents to a segment of virtual memory so that your program can access that segment as if were normal memory. The mapping is managed by the operating system, and the actual physical memory used is typically the OS page cache. The OS then naturally handles syncing the page back to disk after writes. There are a couple of key benefits you get when you do this:
  • reading from a file no longer requires a system call or a copy from kernel space to user space;
  • you can perform random access and updates to the contents of the file;
  • memory can be shared between multiple processes that need to read the same file;
  • you can manipulate contents of extremely large files in memory; and
  • if your process dies, the OS will usually still flush the written contents to disk.
As such, there are often performance benefits from using memory-mapped files over traditional I/O when used in the correct situations. So all of the above you would typically learn in an operating systems course, but when might you want to use a memory-mapped file outside of writing operating systems?

First, let me introduce the Java interface to memory-mapped files. Since as a Java programmer you don't have access to most low-level operations like mapping files into memory, the functionality has to be built into the language itself. It is done so in the form of the FileChannel, which is part of the new I/O (nio) package. Here's an example of how you might map a portion of a file into memory and write some bytes using a FileChannel:

When we map a file into memory, we are given a MappedByteBuffer which we can then read from and write to assuming we have opened the file in the proper mode. In the example, we set the position to 100 and write four bytes; this only touches memory, but the changes will be flushed to disk by the OS (a flush can also be triggered manually from the FileChannel). The size  You can check out this stack overflow question and this blog post for details about FileChannel performance relative to normal Java I/O and even C.

One neat use for memory-mapped files is taking data structures on disk and manipulating them in memory. For example, suppose you have an extremely large bloom filter that you cannot or do not want to load into the JVM heap. Since a bloom filter is a compact and regular data structure, using memory-mapped files to access it is simply a matter of figuring out the offset at which you want to read or write in the MappedByteBuffer. This is especially useful if you are ingesting a lot of data into the bloom filter as you will be doing many writes to different portions of the large file, so it's best to leave the complex memory management to the OS. As another example, Cassandra, a popular NoSQL data store, also uses memory-mapped files for the caching behavior to handle their Sorted String Table data structures.

Memory-mapped files are a convenient feature provided by operating systems in order to simplify the management of resources on disk and improve I/O performance. Even in Java, when you might think such low-level I/O management is not necessary or possible, there are standard libraries to take advantage of memory-mapped files because they are that useful. So if you ever have to write an I/O-intensive application, consider whether you can leverage the OS to simplify your own system.

Wednesday, March 13, 2013


A couple of days ago, I went to a talk hosted by the Bay Area Scala Enthusiasts on the topic of "Scalable and Flexible Machine Learning with Scala." The talk primarily focused on using Scala with Hadoop to very easily run machine learning algorithms like PageRank, k-means clustering, and decision trees. It was interesting to hear about how the speakers (from LinkedIn and eBay) used machine learning in practice and why they feel Scala is particularly suitable for the task in the context of today's libraries and frameworks. In particular, they showed an impressive library called Scalding that came out of Twitter which is designed to make writing Hadoop workflows in Scala very natural.

Let's start with a quick code snippet taken from the Scalding website.

This piece of code is the implementation of word count, the prototypical MapReduce example, on Hadoop. Besides being impressively short, it is very close to Scala Collections syntax which means it comes naturally to anyone who has programmed in Scala extensively. The idea is to be able to take code the operates on collections and easily translate it into a MapReduce job that can run on Hadoop, which makes a lot of sense since MapReduce is essentially based on the functional programming paradigm.

Of course, it's important that the resulting MapReduce job be efficient. The first two method calls, flatMap and groupBy, don't actually do any work themselves; instead, they simply create flows, an abstraction from the Cascading library which Scalding is built on top of. Much like other languages which attempt to hide the details of MapReduce, Cascading and Scalding construct job/execution flows which are only optimized and executed when you decide to write some output. This is essential because any sort of on-demand execution of map and reduce steps would necessarily lead to inefficiencies in the overall job (e.g. multiple map steps in a row would not be combined and executed on the input all at once).

So why Scalding over something like Pig, which is a dedicated language for Hadoop? One major reason is that it is a huge pain to use custom methods in Pig because you have to write them outside of the language as it is not expressive enough. If you integrate Hadoop into Scala, then you get all of Scala and whatever dependencies you want for free. Furthermore, Pig does not have a compilation step, so it is very easy to get type errors somewhere in the middle of your MapReduce job, whereas with Scalding you would catch them before running the job. In practice, this is a big deal because MapReduce jobs take nontrivial amounts of time and compute resources. Pig has its uses, but Scalding brings a lot to the table in the areas where Pig is lacking.

Running Hadoop jobs is becoming very easy, and Scalding is a great example of how you can implement moderately complex machine learning algorithms that run on clusters with just a handful of lines of code. If there exists a Java library that does most of the hard work, even better; you can integrate it right into your Scalding code. While you probably still have to understand Hadoop pretty well to get the optimal performance out of your cluster, people who are less concerned about resource utilization or who don't have absurdly large datasets benefit greatly from abstractions on top of Hadoop. And everyone wins when it is easy to write testable, robust code even when you are not an expert. This is certainly the direction for many distributed computing and machine learning use cases going forward, and it will be really cool to see what new technologies are developed to meet the demand.

The slides for the talk can be found here.

Thursday, March 7, 2013

Dependency Management

Managing dependencies properly in software is a problem that I've run into time and time again regardless of what project I'm working on. By dependency management, I mean specifying which internal and third-party dependencies your code needs. There are many systems for doing this, such as Ivy and Maven in the Java world and SBT in the Scala world (I'm sure other languages have equivalent facilities). These work pretty well in general; for example, if I'm using Maven as my build tool and I need to depend on Google's Guava library, I can find it in Maven Central, add it to my configuration file, and be done with it. The existence of such repositories is one of the greatest developments in software engineering and has probably reduced the amount of code that you need to write by a few orders of magnitude. Inevitably, however, I always encounter situations in which dependency management causes major headaches, in particular due to inconsistent version requirements.

For example, suppose you are working on a project that requires third-party libraries A and B. Both of them are on Maven Central, so you happily put them into your build configuration and are on your way. Well, not quite. Maven is smart in that it recursively pulls dependencies so that you actually have all of the JARs that you need to run your application. But it turns out that library B was last updated five years ago while library A is brand new, and they both depend on a third library C which has had several releases over the last five years. So B specifies version 1.1 of C as a dependency while A uses the latest-and-greatest version 2.0-alpha. To make things even better, the authors of library C decided to break backwards-compatibility with the major version change from 1.x to 2.x, so now you're stuck with two incompatible versions of the same library on your classpath. Maven will most likely complain because of this, but even if you can coerce it to compile your application (which technically will work), upon running the code you will see scary things like NoSuchMethodErrors and IllegalAccessErrors.

So if you find yourself in such a situation, what can you do? One course of action is to decide on either version 1.1 or 2.0-alpha of library C, find the source of A or B, and build your own version of it after changing the dependency on C. This is quite error-prone because you are not familiar with the third-party code and consequently don't have a full understanding of the scope of the dependency on C. It is also time-consuming to dive into the details of implementations that are supposed to be abstracted away from you and mess with bits and pieces of the internals. I have gone through this process a couple of times (when dealing with ubiquitous libraries like Apache HttpClient and Apache Commons), and it's never been fun.

The problem is that there is nobody who is at fault here; the authors of the libraries are all handling their releases in reasonable ways, and you may very well be the first one who has wanted both libraries A and B in the same application. When nobody does anything wrong and you end up with such a nasty situation, something seems wrong. I recently came across a practice that somewhat mitigates the pain: the authors of the Apache Commons Math library used unique package names when they went from version 2 to 3 (org.apache.commons.math vs org.apache.commons.math3) which means that you can safely have them both on your classpath and think of them as completely separate libraries. But it's not clear whether the generic problem of dependency management is actually "solvable" without having all developers adhere to some rigorous backwards-compatibility standard -- certainly an impossible task.

P.S. There are a few other things that I may as well complain about while I'm on this topic. If you run your own continuous integration system and your code is modular to any degree, you'll most likely run into incompatible versions of your own JARs when certain builds break. This is very annoying and a big killer of developer productivity. Additionally, Scala takes dependency management to a whole new level by having libraries be associated with a specific version of the language, so Maven will complain if you have some dependencies built against Scala 2.10.0 while others are built against Scala 2.9.2. Since people don't always update their libraries in a timely fashion, the process of upgrading your own version of Scala can be painful.

Sunday, March 3, 2013

Anonymized Social Network Data

Social network analysis is a popular research area these days thanks to the rapid growth of networks like Facebook, Google+, and Twitter. Data for subsets of these networks is readily available, e.g. here, for researchers to analyze and discover interesting properties. Sometimes, these datasets are anonymized to protect the privacy of users while preserving the structure of the network; for example, instead of nodes being labelled with names, they are often replaced with a random ID. An interesting question that arises is whether this is secure, that is, can an attacker extract information from data anonymized in this way? Several years ago, it was shown that individuals could be identified from the anonymized Netflix Prize dataset, so this question is important for the research community to consider. In this paper, it is shown that an active attacker can quite easily defeat the naive random ID anonymization.

Let $G$ be the anonymized network with $n$ nodes that an attacker has access to. In this context, an active attacker is someone who has "user-level access" to the network prior to the creation of the anonymized dataset. For example, if $G$ is the anonymized Facebook network, an attacker can create accounts and friend people arbitrarily before $G$ is created such that those nodes and edges will be reflected in $G$. The resulting attack is quite simple. Let $H = \{ x_1, x_2, \ldots, x_k \}$ be the set of "fake" nodes created by the attacker and $W = \{ w_1, w_2, \ldots, w_b \}$ be the set of existing nodes that the attacker wants to compromise (in particular, learn about edges between $w_i$ and $w_j$). It is shown that with just $k = \Theta( \log{n} )$ attacker nodes, you can, with high probability, perform an attack efficiently on $b = \Theta( \log^2{n} )$ of the "real" nodes in the network. The attack is based on generating $H$ in such a way that it can easily be recovered from $G$ and connecting $H$ to $W$ so that each node in $W$ can be uniquely identified.

There are a few properties of $H$ that need to be guaranteed. There cannot be another induced subgraph in $G$ which is isomorphic to $H$, so that we can identify $H$, and there must be an efficient algorithm to find $H$ within $G$. To generate $H$, first include the edges $(x_i, x_{i+1})$ for $i = 1, 2, \ldots, k-1$ and then include every other edge with probability $1/2$. Then each $w_j$ is connected to a small, unique subset of nodes in $H$ so that we can recover all of the $w_j$'s once we find $H$. Furthermore, each $x_i$ has edges added randomly to nodes in $G - H$ until it reaches degree $\Delta_i = O(\log{n})$. Based on the structure of $H$ and the degrees, we can now recover it from $G$ using a search over short walks in $G$. Starting from each node, consider it as a potential $x_1$ and then, if that passes the degree requirements, consider all of its neighbors as potential $x_2$'s. For each neighbor that passes the degree requirements and the internal structure constraints of $H$, continue the walk, and repeat this until a walk with $k$ nodes is found. Through some delicate analysis, the authors show that, with high probability, there will be a unique walk that satisfies all the constraints (corresponding to $H$ itself) and that the search tree will have size $O(n^{1+\epsilon})$ so the algorithm is tractable on large datasets.

The paper describes experiments run on a LiveJournal dataset with over 4 million nodes and 70 million edges. With just $k = 7$ attacker nodes, they are able to successfully recover $H$ with roughly 90% probability, and it only goes up as $k$ increases. This means that it is simple to perform such attacks in practice and, perhaps more importantly, that the attacks have a sufficiently small effect in the network to avoid detection. The end result of the work is captured quite nicely by a quote from the paper, highlighting the observation that simple anonymization is not enough: "true safeguarding of privacy requires mathematical rigor, beginning with a clear description of what it means to compromise privacy, what are the computational and behavioral capabilities of the adversary, and to what information does it have access."