Sunday, May 12, 2013


In the last post of a series on tree-based index data structures, I will be talking about Bw-trees, a new form of B-tree designed to be extremely efficient for flash storage. Bw-trees are logically about the same as B+ trees, and it is a new low-level design and implementation that allow them to perform much better. There are three key aspects to the design which give Bw-trees great performance with multi-core processors and large in-memory caches backed by flash storage:
  • operations are latch-free, in that they do not acquire locks and cause threads to yield;
  • updates to the tree are done in a way that greatly reduces cache misses; and
  • data is persisted via a log-structured store, which leverages fast sequential writes and fast random reads of flash I/O.
The first two requirements are quite tricky to implement properly, and much of the paper is devoted to how to ensure correctness in a multi-threaded situation without locks. There is still the need for some sort of atomic operation, however, and the authors leverage the compare-and-swap (CAS) instruction heavily. For example, suppose you want to update a node in the tree (e.g. adding or removing a key). Assuming it has already been brought into memory in a "page," the term used by the authors to represent a node, you could update it in-place, which is most common. There are two downsides to the natural choice: firstly, there must necessarily be some sort of latch-based concurrency control, and secondly, the cache lines that are affected will need to be invalidated. As such, this is actually quite a poor approach from a performance perspective in a highly concurrent system. The authors use instead a method of "delta updates," where each modification to the page is prepended to the page itself (i.e. the update points to the original page). Then the "location" of the page is updated via a CAS instruction to be the location of the update, causing all further reads of that page to include the update. Further updates to the page form a linked list where the last element is the original page.

This design does bypass both of the drawbacks of an in-place update, but also has its own deficiencies. The first is that, once enough updates are accumulated for a page, the performance will degrade significantly as reading a single page results in many pointer traversals. To solve this problem, pages are periodically consolidated when they reach a certain number of delta updates, allowing the cost of cache misses to be incurred only at this time. The page consolidation is done using another CAS instruction to swap out the old page location for the new one. Another complexity introduced by delta updates is what happens when pages are evicted from memory and persisted. Bw-trees actually persist only the delta updates since the original page itself is immutable. The cost then comes from reading the page back into memory at a later time, but thanks to the excellent random read performance of flash storage, this is not a problem, whereas it would have been virtually impossible for a design like this to succeed using spinning disks. So the combination of all of the "environmental conditions," i.e. multi-core processors and flash storage, cause this choice of delta updates to be much preferable to in-place updates.

There are many more details to making the Bw-tree work correctly and efficiently, but I touched upon what I thought were the most interesting high-level design choices. Other topics include the subtleties of making structural updates to the tree work with only CAS instructions and garbage collecting old pages in a safe way, so if you're interested check out the paper. The last thing to mention is the authors' performance results, which are quite impressive to say the least. On their datasets, they see up to an order of magnitude speed increase over BerkeleyDB, which is one of the best performing key/value stores out there. Perhaps even more amazingly, the Bw-tree was able to out-perform a skip list (a latch-free data structure) on a small dataset in memory, which the authors attribute to the superior cache hit rates enabled by the design.

The Bw-tree is a neat evolution of the B+ tree that shows how the design with the best performance is a function of the current hardware. The attention focused on the latch-free aspect and effectiveness of the processor cache highlight how, as we rely more and more on multi-core processors and highly concurrent workloads, we may need to consider novel ways of building systems that can leverage all of the capabilities available in the hardware.

No comments:

Post a Comment