Sunday, May 26, 2013


Refactoring is a common technique in the day-to-day work of a software engineer. It is crucial for producing readable, extensible, and maintainable code in large projects. If you are a user of Eclipse, then you are probably familiar with its various refactoring capabilities that allow you to safely make sweeping changes to a codebase. Eclipse is one of the most commonly used integrated development environments (IDEs) due to the prevalence of Java and, as such, produces a lot of data about how people refactor. This study, done in 2009, analyzes data collected from Eclipse and other sources to reveal some interesting facts about refactoring and the tools for doing so. Here are some of the highlights:
  • The developers building refactoring tools don't refactor in the same way as "normal" users. As expected, those building the tools use the complex refactorings more frequently. For the average person, "rename" constitutes the majority of their usage (myself probably included).
  • People typically refactor in "batches." That is, there will often be multiple refactorings of the same kind performed in short succession. Interestingly, this appears not to be true for the "move" tool, whose average batch contains just one refactoring, perhaps because you typically move an already-cohesive group of items all at once.
  • Programmers do not configure refactoring tools, i.e. they tend to leave all of the default options. Unfortunately, the study did not have enough information to determine why; in my mind there are three potential reasons: the defaults could be correct for most cases, programmers may prefer to make minor modifications manually after the refactoring, or status quo bias may simply be at work.
  • Refactoring is often left out of commit messages and done in conjunction with other work. While this is not ideal, I will admit that I am often guilty of it. While implementing some new functionality, it is very common to come across existing code which you can refactor to make the new code easier to write. Especially if the refactoring is minor, breaking your flow in order to separate the refactoring out into its own commit can be too much overhead.
Trying to improve the efficiency of a software engineer is a very interesting challenge. In some sense, it is a psychology problem to figure out how programmers understand code and what kinds of tools complement that understanding. As we continue reducing the cost of translating from conceptual models to code, development will become more efficient and accessible. Studies like this are important for bringing to light the important truths of how we program.

Thursday, May 23, 2013

Structural Typing

Scala tries very hard to feel like a dynamic language despite the fact that it is rooted in the statically-typed Java Virtual Machine (JVM). One of the ways in which it does so is through a language feature called structural typing. If you have ever used Python, Ruby, JavaScript, or a number of other languages, you are probably familiar with duck typing; that is, "[if it] walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck." Because these languages are dynamically-typed, you can write code that calls any method on an object, and at run-time the code can potentially crash if it turns out that the object does not have that method (i.e. it is not a duck). The beauty of the statically-typed world is that this is (nearly) impossible because the compiler will ensure that everything checks out before allowing you to continue. The drawback, then, is that even if you know an object has a particular method, unless its type at compile-time has that method you cannot call it. Structural typing is a way of alleviating this restriction.

Let's look at a quick example with two Java classes: PrintStream and PrintWriter. As it turns out, they both have the methods println(String s) and flush(). Unfortunately, they share no common superclass or interface that has these methods, so if we wanted to write a method that relies only on an object's ability to print and flush, we would need two separate such methods, one for PrintStream and one for PrintWriter. The pure Java solution is to add an interface with println(String s) and flush() and make both PrintWriter and PrintStream implement it; that would work if you owned both classes or could modify the core Java libraries, but for most people that is not an option. Structural typing is essentially a way to emulate that behavior without actually modifying the classes. Let's see how this looks:
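A sketch of how this looks in Scala (the Printer alias and printAndFlush() match the description that follows; everything else is illustrative):

```scala
import java.io.{PrintStream, PrintWriter}
import scala.language.reflectiveCalls // enables structural-type method calls

// A structural type: any object with these two methods conforms, regardless of
// its class hierarchy.
type Printer = { def println(s: String): Unit; def flush(): Unit }

def printAndFlush(out: Printer, msg: String): Unit = {
  out.println(msg) // checked at compile time, dispatched via reflection at run time
  out.flush()
}

// Works with both classes, despite their lack of a common interface.
printAndFlush(new PrintStream(System.out), "hello from a PrintStream")
printAndFlush(new PrintWriter(System.out), "hello from a PrintWriter")
```

Note the `reflectiveCalls` import: recent Scala versions make you explicitly acknowledge that calls through a structural type are dispatched reflectively.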

So we define a new "type" called Printer, which has exactly those two methods on it. Then we can write the printAndFlush() method that takes in a Printer and knows at compile-time that both the println and flush methods must exist on any object that is passed in. At the end we call printAndFlush() once with a PrintStream and once with a PrintWriter, achieving our goal. Again, the important thing to realize is that type safety is still guaranteed at compile-time, i.e. any type passed in does indeed have those two methods, so it is impossible for printAndFlush() to throw an exception saying that one of the methods does not exist.

An interesting question is how Scala manages to make structural types work in the JVM, which is based on a purely nominal type system (like Java itself). The method signature must have a named type for its parameter where no such type exists, so what do you do? The answer is a little underwhelming: it turns out that the Scala compiler throws away the structural type information after verifying type safety and the method signature in the bytecode takes in an Object. That means that the actual method call is dispatched via reflection and thus has a number of problems ranging from performance to reliability, which is quite disappointing. Fortunately, Java 7 has a new feature known as invokedynamic which is designed to help with dynamic method invocations much like those introduced by structural types, so we can hope that in the future reflection will no longer be needed to support this feature. Scala and other JVM-based languages really push the limits of what the JVM can do, and it's great to see that it is evolving to meet these challenges; one can imagine that someday a language which is not Java becomes the primary interface to the JVM, leveraging the decades of hard work that have gone into it.

Tuesday, May 21, 2013

Portable Native Client

A few years ago, Google released Native Client (NaCl), which is a sandbox for running untrusted, native code downloaded from the Internet. Its purpose is to allow browser-based applications to have the benefits of native applications, e.g. improved performance and the use of threads, in a secure way. One natural use case is for games, which are typically some of the most performance-intensive applications and can really benefit from being written in a low-level language. A major drawback of running native code, however, is that it is not portable across different instruction set architectures (ISAs), and the original NaCl supported only x86. This is a big contrast from the web world, where all browsers can run JavaScript, and in some ways can be considered "backwards" as performance becomes less important and portability more so.

Since then, Google has added support for other ISAs, but it has been the developer's responsibility to make sure they build, test, and maintain their application across all of them, which is again counter to the trend of development today. However, Google was not ready to let NaCl go, and they recently came out with Portable Native Client (PNaCl) to address this issue. PNaCl adds another layer of indirection in order to reduce the burden of portability on the developer. The main tool they leverage is LLVM, a compiler infrastructure that operates independently of the source language and target architecture. Instead of deploying code compiled directly for each of the ISAs, developers compile to LLVM bitcode, an intermediate representation (IR) that is ISA-independent. The LLVM project has tools for translating the IR to a variety of target ISAs, so the browser does this translation after downloading the IR (i.e. only once the ISA is known). The native code produced then runs in the NaCl sandbox, maintaining all of the necessary security features for running untrusted code. In this way, there is no longer any need for developers to worry about the ISA of the machine running the browser, and the burden is shifted to NaCl itself. This is a huge win for portability and is made possible by the fact that LLVM is able to nearly match the performance of direct compilers such as GCC.

PNaCl and LLVM are great examples of the famous quote: "All problems in computer science can be solved by another level of indirection." If we think about it at a high level, it's a pretty impressive feat end-to-end, essentially allowing code written in C/C++ and compiled once to be downloaded over the web and run on (almost) any machine securely and with good performance. Portability being the major feature NaCl lacked, I am curious to see whether more people start writing applications for PNaCl now that it is much more compelling, although browser support is still limited to Chrome.

Sunday, May 12, 2013


In the last post of a series on tree-based index data structures, I will be talking about Bw-trees, a new form of B-tree designed to be extremely efficient for flash storage. Bw-trees are logically much the same as B+ trees; it is their new low-level design and implementation that allow them to perform much better. There are three key aspects to the design which give Bw-trees great performance with multi-core processors and large in-memory caches backed by flash storage:
  • operations are latch-free, in that they do not acquire locks and cause threads to yield;
  • updates to the tree are done in a way that greatly reduces cache misses; and
  • data is persisted via a log-structured store, which leverages fast sequential writes and fast random reads of flash I/O.
The first two requirements are quite tricky to implement properly, and much of the paper is devoted to ensuring correctness in a multi-threaded setting without locks. There is still the need for some sort of atomic operation, however, and the authors lean heavily on the compare-and-swap (CAS) instruction. For example, suppose you want to update a node in the tree (e.g. adding or removing a key). Assuming it has already been brought into memory as a "page," the term the authors use for a node, the most common approach would be to update it in place. There are two downsides to this natural choice: firstly, it necessarily requires some sort of latch-based concurrency control, and secondly, the affected cache lines will need to be invalidated. As such, it is actually quite a poor approach from a performance perspective in a highly concurrent system. The authors instead use a method of "delta updates," where each modification to the page is prepended to the page itself (i.e. the update points to the original page). The "location" of the page is then updated via a CAS instruction to be the location of the update, causing all further reads of that page to include the update. Further updates to the page form a linked list whose last element is the original page.

This design does avoid both of the drawbacks of an in-place update, but it has its own deficiencies. The first is that, once enough updates accumulate for a page, performance degrades significantly because reading a single page results in many pointer traversals. To solve this problem, pages are periodically consolidated once they reach a certain number of delta updates, so the cost of cache misses is incurred only at that time. The consolidation is done using another CAS instruction to swap out the old page location for the new one. Another complexity introduced by delta updates is what happens when pages are evicted from memory and persisted. Bw-trees actually persist only the delta updates, since the original page itself is immutable. The cost then comes from reading the page back into memory at a later time, but thanks to the excellent random-read performance of flash storage, this is not a problem, whereas a design like this would have been virtually impossible to pull off on spinning disks. So the combination of "environmental conditions," i.e. multi-core processors and flash storage, makes delta updates much preferable to in-place updates.
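The prepend-and-CAS scheme and the consolidation step can be sketched with an atomic reference standing in for the mapping-table entry (a hypothetical illustration: all names here are mine, and a flat key/value map stands in for a real Bw-tree page):

```scala
import java.util.concurrent.atomic.AtomicReference

// A "page" is either a consolidated base page or a delta record pointing at
// the rest of the chain; both are immutable.
sealed trait Page
case class Base(entries: Map[String, Int]) extends Page
case class Delta(key: String, value: Int, next: Page) extends Page

class MappingEntry(initial: Map[String, Int]) {
  private val slot = new AtomicReference[Page](Base(initial))

  // Publish an update by prepending a delta record with a single CAS;
  // retry if a concurrent writer got in first.
  @annotation.tailrec
  final def update(key: String, value: Int): Unit = {
    val old = slot.get()
    if (!slot.compareAndSet(old, Delta(key, value, old))) update(key, value)
  }

  // Readers walk the delta chain; the newest record for a key wins.
  def get(key: String): Option[Int] = {
    @annotation.tailrec
    def loop(p: Page): Option[Int] = p match {
      case Delta(k, v, next) => if (k == key) Some(v) else loop(next)
      case Base(entries)     => entries.get(key)
    }
    loop(slot.get())
  }

  // Consolidation: fold the chain into a fresh base page and swap it in
  // with one CAS. If the CAS fails, a concurrent writer won; retry later.
  def consolidate(): Unit = {
    val old = slot.get()
    def fold(p: Page, acc: Map[String, Int]): Map[String, Int] = p match {
      case Base(e)           => e ++ acc // values already seen in deltas win
      case Delta(k, v, next) => fold(next, if (acc.contains(k)) acc else acc + (k -> v))
    }
    slot.compareAndSet(old, Base(fold(old, Map.empty)))
  }
}
```

Because every page and delta is immutable, readers never see a half-applied update: they either observe the chain before the CAS or after it.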

There are many more details to making the Bw-tree work correctly and efficiently, but I touched upon what I thought were the most interesting high-level design choices. Other topics include the subtleties of making structural updates to the tree work with only CAS instructions and garbage collecting old pages in a safe way, so if you're interested check out the paper. The last thing to mention is the authors' performance results, which are quite impressive to say the least. On their datasets, they see up to an order of magnitude speed increase over BerkeleyDB, which is one of the best performing key/value stores out there. Perhaps even more amazingly, the Bw-tree was able to out-perform a skip list (a latch-free data structure) on a small dataset in memory, which the authors attribute to the superior cache hit rates enabled by the design.

The Bw-tree is a neat evolution of the B+ tree that shows how the best-performing design is a function of the current hardware. The attention paid to latch-freedom and to the effectiveness of the processor cache highlights how, as we rely more and more on multi-core processors and highly concurrent workloads, we may need to consider novel ways of building systems that leverage all of the capabilities available in the hardware.

Wednesday, May 8, 2013

B+ trees

As a short follow-up to my last post on B-trees, this post is about a variant that is more popular in practice, the B+ tree. There are two primary differences between the data structures:
  • internal nodes in B+ trees do not store data (only keys); and
  • leaf nodes in B+ trees form a linked list in the order of the keys.
The reasoning behind this is that, as discussed previously, the limiting factor in looking up a key in an index is the depth of the tree. Each additional step requires an extra disk seek, so minimizing the depth is the most important factor in improving performance. By moving all of the data to the leaves, the internal nodes can have much higher fan-out, thus reducing the depth of the tree. Additionally, the linked list property makes it much easier to traverse keys sequentially, as is common when doing range queries or index scans in a database.
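As a rough illustration of the linked-list property, here is a minimal (hypothetical) sketch of scanning a range by walking the leaf chain, with bare Int keys standing in for full records:

```scala
// Each B+ tree leaf holds sorted keys and a pointer to the next leaf in key order.
case class Leaf(keys: Vector[Int], next: Option[Leaf])

// Collect all keys in [lo, hi] by walking the chain from the first relevant leaf;
// no further descents from the root are needed once the scan starts.
def rangeScan(start: Leaf, lo: Int, hi: Int): Vector[Int] = {
  var out = Vector.empty[Int]
  var cur: Option[Leaf] = Some(start)
  // Stop as soon as a leaf's smallest key exceeds the upper bound.
  while (cur.exists(_.keys.headOption.exists(_ <= hi))) {
    out ++= cur.get.keys.filter(k => lo <= k && k <= hi)
    cur = cur.get.next
  }
  out
}
```

In a plain B-tree the same scan would have to repeatedly climb back up to parent nodes to find each successor, costing extra seeks.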

B-trees and B+ trees are excellent at using block-oriented secondary storage like spinning hard drives, but the hardware landscape is changing. Next time, I'll discuss a new variant of the B-tree which is optimized for flash storage and solid-state drives.

Sunday, May 5, 2013


The B-tree is one of the fundamental data structures used by database indexes. It is essentially a generalization of the binary search tree where each node can contain more than two children. I never got around to actually learning how a B-tree is implemented and how all the operations work, so I figured I would take a blog post to lay it all out. The main tuning parameter of B-trees is the degree, which controls how many children each node has. If the degree is $2d$, then each internal node (except potentially the root) has between $d$ and $2d$ children, inclusive. An internal node with $k$ children has $k-1$ separator values which denote the ranges that the children cover, e.g. if the separator values are $\{2, 5, 8\}$ then the children could cover the ranges $(-\infty, 2)$, $(2, 5)$, $(5, 8)$, and $(8, \infty)$, assuming we know nothing about how the parent has already been bounded. Leaves which are not also the root have the same requirement on the number of keys, i.e. between $d-1$ and $2d-1$. Additionally, all leaves are at the same depth in the tree. We'll see how these properties are maintained through all of the operations on the tree.

First, let's start with the easy operation, looking up a key in the B-tree. Similarly to a binary search tree, we start at the root and traverse downwards by picking the appropriate child. This is done by choosing the child whose range contains the key we are searching for; if $d$ is sufficiently large, we can do a binary search within the node to find the right child. Next, consider insertion. Again, we traverse down the tree by choosing the appropriate child until we reach a leaf. If the leaf is not full (it can contain a maximum of $2d-1$ keys like internal nodes), then we simply add the key to the leaf. If it is full, then we consider all $2d$ keys, split them into two groups of keys using one of the median keys $m$, and insert $m$ into the parent node. If the parent is not full, then we are done. Otherwise, we repeat the process; if the root is full and has a key inserted into it, we split those keys as before and create a new root that has two children. The latter is the only case in which the height of the tree increases, and since we create a new root, all leaves are still at the same depth.
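The lookup procedure just described can be sketched as follows (a hypothetical minimal node over Int keys with no data payloads; a real implementation would binary-search within the node rather than scan linearly):

```scala
// An internal node has keys.length + 1 children; a leaf has none.
case class BNode(keys: Vector[Int], children: Vector[BNode]) {
  def isLeaf: Boolean = children.isEmpty
}

def search(node: BNode, key: Int): Boolean = {
  // Index of the first separator >= key; -1 means key exceeds all separators.
  val i = node.keys.indexWhere(key <= _)
  if (i >= 0 && node.keys(i) == key) true // key found in this node
  else if (node.isLeaf) false             // nowhere left to descend
  else {
    // Descend into the child whose range contains the key; the last child
    // covers everything greater than the final separator.
    val childIdx = if (i == -1) node.keys.length else i
    search(node.children(childIdx), key)
  }
}
```

For example, with a root holding separator 5 above leaves {1, 3} and {7, 9}, searching for 7 descends into the right leaf and succeeds, while searching for 4 descends into the left leaf and fails.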

Finally, we come to deletion, which always seems to be the hardest operation. We begin by finding the key in the tree. If the key is in an internal node, we can delete it in the same way we would delete from a binary search tree. Consider the two children separated by that key; choose either the smallest key in the right child or the largest key in the left child as the new separator value and delete that key from the leaf. So now we have reduced deletion to just deleting from leaves. If the leaf has $d$ or more keys, we simply remove the key and return. Otherwise, we need to consider rebalancing the tree to maintain the property that all non-root nodes have between $d-1$ and $2d-1$ keys. To rebalance, look at the immediate left and right siblings of the node which has too few keys (if they exist). If one of them has $d$ or more keys, then move the closest key from the sibling to the current node and update both nodes and the parent's separator value (it is somewhat of a "rotation"). Otherwise, take one of the immediate siblings, which must have $d-1$ keys, combine it with the current node, and move the separator value from the parent down into the new combined node. If the parent now has too few keys, we repeat the process. If we reach the root and it has only two children which are subsequently combined, the height of the tree decreases. Again, this leaves all of the leaves at the same depth.

So that is a basic implementation of a B-tree, and there are certainly many optimizations that can be made to reduce the number of times you have to retrieve nodes. The last important discussion is why B-trees are better than binary search trees for databases. The reason is that B-trees are designed to leverage the performance properties of spinning disks. Disks have very high seek latency due to the time it takes for the mechanical arm to move to the correct location, but they have relatively good throughput once the arm is in place (see here for more details). As such, disk-backed data structures benefit from reading a block of data at once rather than a very small amount. In the case of B-trees, the node size is often chosen to be exactly the size of a disk block to maximize performance; as a result, the data structure has much smaller depth than a binary search tree, resulting in many fewer disk seeks and significantly better performance.