Sunday, March 24, 2013

Memory-Mapped Binary Search

Following up on my last post, I looked into implementing a simple proof-of-concept application using FileChannels. I settled on the following use case: suppose you have a very large file which is a sorted list of strings and you want to perform binary search on it. As mentioned last time, the Sorted String Table data structure, which is roughly a generalization of this, is often manipulated through memory-mapped files. The reason why its convenient to do so is because, regardless of how large the file is, we can map its contents to (virtual) memory and treat it as a big in-memory array, letting the OS take care of paging. Then we can easily perform binary search to test membership in the list.

Let's start by formalizing the data structure. We have two files, the index file and the data file with the actual strings. The former is a list of offsets where strings start (and correspondingly end) in the data file, while the latter is a concatenation of all the strings in sorted order. We will assume we can read the entire index into memory, but we could just as easily memory-map that file as well. When doing the binary search, the index will tell us where in the data file to read strings from to do the comparisons. Here's the code in Scala:

We start off by mapping the data file into memory, from which we obtain a MappedByteBuffer that operates as our in-memory array. The bulk of the work happens in the findString() method which does the binary search; to read from the memory-mapped file, we position the buffer at the location specified by the index and read the string from that position. From there, it is as if we just had a String in memory, and we do the necessary comparisons for the binary search logic.

Again, the biggest win is that we can manipulate very large files without doing any complex memory management to make sure we do not exceed our heap size or the machine's memory. You may notice that there are a few limitations of the above code due to the Java ByteBuffer API. A single ByteBuffer can only represent up to 2GB because it uses integers everywhere instead of longs, so the code will only work for files less than 2GB in size (in which case you often do not exceed the machine's memory). Fortunately, it is not too hard to chunk the file into 2GB blocks and map each chunk to its own ByteBuffer; the code changes slightly, but mostly in additional implementation details.

This proof-of-concept binary search shows how FileChannels and MappedByteBuffers can be leveraged to efficiently solve problems in which your data exceeds your memory limit without adding enormous amounts of complexity to your code.

No comments:

Post a Comment