## Sunday, May 5, 2013

### B-trees

The B-tree is one of the fundamental data structures used by database indexes. It is essentially a generalization of the binary search tree where each node can contain more than two children. I never got around to actually learning how a B-tree is implemented and how all the operations work, so I figured I would take a blog post to lay it all out. The main tuning parameter of B-trees is the degree, which controls how many children each node has. If the degree is $2d$, then each internal node (except potentially the root) has between $d$ and $2d$ children, inclusive. An internal node with $k$ children has $k-1$ separator values which denote the ranges that the children cover, e.g. if the separator values are $\{2, 5,8\}$ then the children could cover the ranges $(-\infty, 2)$, $(2, 5)$, $(5, 8)$, and $(8, \infty)$, assuming we know nothing about how the parent has already been bounded. Leaves which are not also the root have the same requirement on the number of keys, i.e. between $d-1$ and $2d-1$. Additionally, all leaves are at the same depth in the tree. We'll see how these properties are maintained through all of the operations on the tree.

First, let's start with the easy operation, looking up a key in the B-tree. Similarly to a binary search tree, we start at the root and traverse downwards by picking the appropriate child. This is done by choosing the child whose range contains the key we are searching for; if $d$ is sufficiently large, we can do a binary search within the node to find the right child. Next, consider insertion. Again, we traverse down the tree by choosing the appropriate child until we reach a leaf. If the leaf is not full (it can contain a maximum of $2d-1$ keys like internal nodes), then we simply add the key to the leaf. If it is full, then we consider all $2d$ keys, split them into two groups of keys using one of the median keys $m$, and insert $m$ into the parent node. If the parent is not full, then we are done. Otherwise, we repeat the process; if the root is full and has a key inserted into it, we split those keys as before and create a new root that has two children. The latter is the only case in which the height of the tree increases, and since we create a new root, all leaves are still at the same depth.

Finally, we come to deletion, which always seems to be the hardest operation. We begin by finding the key in the tree. If the key is in an internal node, we can delete it in the same way that you would delete from a binary search tree. Consider the two children separated by that key; choose either the smallest key in the right child or the largest key in the left child as the new separator value and delete that key from the leaf. So now we have reduced deletion to just deleting from leaves. If the leaf has $d$ or more keys, we simply remove it and return. Otherwise, we need to consider rebalancing the tree to maintain the property that all nodes have between $d-1$ and $2d-1$ keys. To rebalance, look at the immediate left and right siblings of the node which has too few keys (if they exist). If one of them has $d$ or more keys, then move the closest key from the sibling to the current node and update both nodes and the parent's separator value (it is somewhat of a "rotation"). Otherwise, take one of the immediate siblings, which must have $d-1$ elements, combine it with the current node, and move the separator value from the parent to the new combined node. If the parent now has too few keys, we repeat the process. If we reach the root and it has only two children which are subsequently combined, the height of the tree decreases. Again, this leaves all of the leaves at the same depth.

So that is a basic implementation of a B-tree, and there are certainly many optimizations that can be made to reduce the number of times you have to retrieve nodes. But the last important discussion is why B-trees are better than binary search trees for databases. And the reason is because B-trees are designed to leverage the performance properties of spinning disks. Disks have very high seek latency due to the time it takes for the mechanical arm to move to the correct location, but they have relatively good throughput once the arm is in place (see here for more details). As such, disk-backed data structures benefit more from reading a block of data at once rather than a very small amount. In the case of B-trees, the size of the nodes are often chosen to be exactly the size of a disk block to maximize performance; as such, the data structure has much smaller depth than a binary search tree, resulting in many fewer disk seeks and significantly better performance.