Ternary Search: Memory-Mapped Files

Using memory-mapped files is a common technique for improving I/O performance. The concept is pretty simple: take a portion of a file (in units of pages, typically 4 KB each) and map its contents to a segment of virtual memory so that your program can access that segment as if it were normal memory. The mapping is managed by the operating system, and the physical memory backing it is typically the OS page cache. The OS then takes care of writing modified pages back to disk. There are a few key benefits you get when you do this:

once the mapping is established, reading from the file no longer requires a system call per read or a copy from kernel space to user space;
you can perform random access and updates to the contents of the file;
memory can be shared between multiple processes that need to read the same file;
you can manipulate contents of extremely large files in memory; and
if your process dies, the OS will usually still flush the written contents to disk.

As such, memory-mapped files often outperform traditional I/O when used in the right situations. All of the above is what you would typically learn in an operating systems course, but when might you want to use a memory-mapped file outside of writing operating systems?

First, let me introduce the Java interface to memory-mapped files. Since as a Java programmer you don't have direct access to most low-level operations like mapping files into memory, the functionality has to be built into the standard library. It is exposed through FileChannel, which lives in java.nio.channels as part of the New I/O (NIO) packages. Here's an example of how you might map a portion of a file into memory and write some bytes using a FileChannel:
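A minimal sketch of such a mapping might look like the following. The class name MappedWriteExample, the method writeFourBytes, the 4096-byte mapping size, and the byte values written are all assumptions chosen for illustration; only the FileChannel and MappedByteBuffer calls come from the standard library.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class; the names and sizes are illustrative assumptions.
class MappedWriteExample {

    // Maps the first `size` bytes of the file and writes four bytes at offset 100.
    static void writeFourBytes(Path path, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Grow the file explicitly so the mapped region lies fully inside it.
            if (channel.size() < size) {
                channel.write(ByteBuffer.wrap(new byte[] {0}), size - 1L);
            }
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, size);

            // These writes only touch memory; the OS writes the page back later.
            buffer.position(100);
            buffer.put(new byte[] {1, 2, 3, 4});

            // Optional: ask for the modified pages to be written out now.
            buffer.force();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapped", ".bin");
        path.toFile().deleteOnExit();
        writeFourBytes(path, 4096);
        byte[] contents = Files.readAllBytes(path);
        System.out.println(contents.length + " bytes, byte at offset 100 = " + contents[100]);
    }
}
```

The file is opened for both reading and writing because a READ_WRITE mapping requires both, and the file is grown before mapping since the behavior of mapping a region beyond the end of the file is left unspecified by the API documentation.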

When we map a file into memory, we are given a MappedByteBuffer, which we can then read from and write to, assuming we have opened the file in the appropriate mode. In the example, we set the position to 100 and write four bytes; this only touches memory, but the OS will flush the changes to disk (a flush can also be triggered manually by calling force() on the MappedByteBuffer; FileChannel.force() is not guaranteed to cover changes made through a mapped buffer). You can check out this Stack Overflow question and this blog post for details about FileChannel performance relative to normal Java I/O and even C.

One neat use for memory-mapped files is taking data structures on disk and manipulating them in memory. For example, suppose you have an extremely large Bloom filter that you cannot or do not want to load into the JVM heap. Since a Bloom filter is a compact and regular data structure, using memory-mapped files to access it is simply a matter of figuring out the offset at which you want to read or write in the MappedByteBuffer. This is especially useful if you are ingesting a lot of data into the Bloom filter, since you will be doing many writes to different portions of the large file, so it's best to leave the complex memory management to the OS. As another example, Cassandra, a popular NoSQL data store, also uses memory-mapped files, relying on the OS caching behavior to handle its Sorted String Table (SSTable) data structures.
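To make the Bloom filter idea concrete, here is a rough sketch of a filter whose bit array lives in a memory-mapped file. Everything about it is an assumption made for illustration: the class name MappedBloomFilter, the String keys, and the simple double-hashing scheme derived from hashCode() (a real filter would likely use a stronger hash). A single MappedByteBuffer is also limited to 2 GB, so a truly huge filter would need several mappings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a Bloom filter whose bits are stored in a mapped file.
class MappedBloomFilter {
    private final MappedByteBuffer bits;
    private final long numBits;
    private final int numHashes;

    MappedBloomFilter(Path path, int numBytes, int numHashes) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Grow a new or short file so the whole bit array fits inside it.
            if (channel.size() < numBytes) {
                channel.write(ByteBuffer.wrap(new byte[] {0}), numBytes - 1L);
            }
            // The mapping stays valid after the channel is closed.
            this.bits = channel.map(FileChannel.MapMode.READ_WRITE, 0, numBytes);
        }
        this.numBits = numBytes * 8L;
        this.numHashes = numHashes;
    }

    // Illustrative double hashing: index_i = (h1 + i * h2) mod numBits.
    private long bitIndex(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (Integer.reverse(h1) * 0x9E3779B9) | 1;
        return Math.floorMod((long) h1 + (long) i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(key, i);
            int offset = (int) (bit >>> 3);      // byte that holds the bit
            int mask = 1 << (int) (bit & 7);     // position of the bit in that byte
            bits.put(offset, (byte) (bits.get(offset) | mask));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(key, i);
            int offset = (int) (bit >>> 3);
            int mask = 1 << (int) (bit & 7);
            if ((bits.get(offset) & mask) == 0) {
                return false;                    // definitely never added
            }
        }
        return true;                             // added, or a false positive
    }

    // Asks for modified pages to be written back to the file.
    void flush() {
        bits.force();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("bloom", ".bin");
        path.toFile().deleteOnExit();
        MappedBloomFilter filter = new MappedBloomFilter(path, 1 << 20, 3);
        filter.add("hello");
        filter.flush();
        System.out.println(filter.mightContain("hello"));
    }
}
```

Each add or lookup touches only a handful of bytes at computed offsets, so the JVM heap never holds the filter itself; which pages stay resident is left entirely to the OS.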

Memory-mapped files are a convenient feature provided by operating systems in order to simplify the management of resources on disk and improve I/O performance. Even in Java, when you might think such low-level I/O management is not necessary or possible, there are standard libraries to take advantage of memory-mapped files because they are that useful. So if you ever have to write an I/O-intensive application, consider whether you can leverage the OS to simplify your own system.


Wednesday, March 20, 2013
