## Sunday, January 20, 2013

### Erasure Coding

Durability is a critical feature of distributed storage systems. Because these storage systems are typically built on a large number of machines using commodity hardware, individual component failures are expected and common; thus durability comes from how fault tolerance is designed into the system. The original Google File System (GFS) was among the earliest systems built at a scale where these problems started to manifest, and one of the key features was that all data in GFS was redundantly stored in three copies. Combined with the proper distribution of these three copies so they don't fail simultaneously, this design allowed Google to scale out their storage system without losing data. While GFS was quite a landmark piece of technology, we've come a long way since then, and people are constantly devising new ways to achieve durability while optimizing other important features of distributed storage systems like latency, storage overhead, and reconstruction cost.

Storing identical copies of data is a simple strategy for redundancy, but more involved techniques such as erasure coding can greatly reduce the storage overhead while providing the same durability guarantees. The idea behind erasure coding is to encode a message of $K$ symbols using $N > K$ symbols such that you can reconstruct the original message using only a subset of the $N$ symbols. For example, you can think of storing three copies of data as erasure coding with $K = 1$ and $N = 3$, where as long as you have one of the three copies you can reconstruct the original data. Another common example is encoding $K$ bits using $N = K+1$ bits, where the extra bit stores the parity/XOR of the first $K$ bits; losing any single bit will still allow you to recover the original message. These examples demonstrate how erasure codes are redundant and prevent losing the message even when individual symbols are lost (e.g. a machine storing data dies).
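To make the parity example concrete, here's a minimal sketch in Python (the function names are my own, purely for illustration) of a $K+1$ parity code: any single lost block, whether data or parity, is just the XOR of the $K$ surviving blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode_with_parity(blocks):
    """Append one parity block: the XOR of all K data blocks (so N = K + 1)."""
    return blocks + [xor_blocks(blocks)]

def recover(survivors):
    """Given any K of the N blocks (data or parity), XOR them to
    reconstruct the single missing block."""
    return xor_blocks(survivors)

data = [b"abcd", b"efgh", b"ijkl"]      # K = 3 data blocks
encoded = encode_with_parity(data)      # N = 4 blocks stored
# Lose block 1 (b"efgh"); rebuild it from the other three.
assert recover([encoded[0], encoded[2], encoded[3]]) == b"efgh"
```

Note that the same `recover` call also rebuilds a lost parity block, since XOR is its own inverse.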

The most common erasure codes are Reed-Solomon codes, which are present in CDs, DVDs, and barcodes to ensure that these media are resistant to partial damage. Reed-Solomon codes extend the idea of the parity bit by leveraging polynomial interpolation. Given a message of $K$ symbols, consider the polynomial of degree (at most) $K-1$ with the symbols as the coefficients. Evaluate the polynomial at $N > K$ points known to both the sender and the receiver, and send the results of those evaluations. As long as $K$ of the resulting symbols are received, the polynomial can be interpolated and thus the original message reconstructed. The details of the algebra on polynomial rings and an efficient decoding algorithm are beyond the scope of this post, but suffice it to say that people have figured out how to use Reed-Solomon codes very effectively. It shouldn't come as a surprise then that the new GFS (discussed here) uses them and the Hadoop Distributed File System (HDFS) has a module which adds the functionality.
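The polynomial trick can be sketched in a few lines of Python. This is a toy version over the prime field GF(257) using Lagrange interpolation (all names are illustrative; production Reed-Solomon implementations work over GF(2^8) with much faster decoders), but it shows the core property: any $K$ of the $N$ evaluations recover the coefficients, i.e. the message.

```python
P = 257  # a small prime; all arithmetic below is over the finite field GF(P)

def poly_eval(coeffs, x):
    """Evaluate a polynomial (coefficients low-to-high) at x, mod P."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result

def encode(message, n):
    """Treat the K message symbols as coefficients of a degree-(K-1)
    polynomial and publish its value at n agreed-upon points x = 1..n."""
    return [(x, poly_eval(message, x)) for x in range(1, n + 1)]

def poly_mul_linear(poly, a):
    """Multiply a coefficient list by (x - a), mod P."""
    out = [0] * (len(poly) + 1)
    for t, c in enumerate(poly):
        out[t] = (out[t] - a * c) % P
        out[t + 1] = (out[t + 1] + c) % P
    return out

def decode(points, k):
    """Lagrange-interpolate the unique degree-(k-1) polynomial through any
    k surviving (x, y) points; its coefficients are the original message."""
    pts = points[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(pts):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(pts):
            if j != i:
                basis = poly_mul_linear(basis, xj)
                denom = denom * (xi - xj) % P
    # divide by denom via Fermat's little theorem: a^(P-2) = a^(-1) mod P
        scale = yi * pow(denom, P - 2, P) % P
        for t, c in enumerate(basis):
            coeffs[t] = (coeffs[t] + scale * c) % P
    return coeffs

message = [5, 200, 37]        # K = 3 symbols, each < P
shares = encode(message, 5)   # N = 5 evaluations
# Lose shares 1 and 3; any K = 3 survivors recover the full message.
assert decode([shares[0], shares[2], shares[4]], 3) == message
```

With $K = 3$ and $N = 5$, this tolerates the loss of any two symbols at a storage overhead of $5/3$, versus $3$ for triplication.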

The story doesn't end there, though. Last summer, a paper came out of Microsoft Research and the Windows Azure Storage (WAS) team that describes a new technique called local reconstruction codes (LRC) used in WAS, Microsoft's equivalent of GFS and HDFS. One of the downsides of Reed-Solomon codes in storage systems is that, if one of your original $K$ symbols is lost, you must read a full $K$ other symbols in order to recover that one symbol. When these symbols are chunks distributed across machines, it can be costly to do that many accesses. LRC can be thought of as an extension of Reed-Solomon codes which adds an additional parameter that determines how many local groups there are within a message. It facilitates the reconstruction of lost symbols by leveraging additional redundancy introduced at the local group level. The work demonstrates that the technique is more efficient both in terms of storage overhead and reconstruction cost than Reed-Solomon codes when both methods are required to provide the same fault tolerance as storing three copies of the data (which is still the standard for systems today).
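The reconstruction-cost savings of local groups can be illustrated with a deliberately simplified sketch: XOR-only local parities, omitting the global parities that give real LRC its full fault tolerance (so this toy version survives only one failure per group). All names here are hypothetical, not from the paper or WAS.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def add_local_parities(data, num_groups):
    """Split the data fragments into local groups and append one XOR
    parity per group. (Real LRC also keeps global parities for extra
    fault tolerance; this sketch shows only the local-read savings.)"""
    size = len(data) // num_groups
    groups = [data[i * size:(i + 1) * size] for i in range(num_groups)]
    return groups, [xor_blocks(g) for g in groups]

def reconstruct(group, parity, lost_index):
    """Rebuild one lost fragment from its local group only: this reads
    len(group) - 1 survivors plus the local parity, rather than K other
    fragments as plain Reed-Solomon reconstruction would."""
    survivors = [f for i, f in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity])

data = [bytes([b]) * 4 for b in range(6)]        # K = 6 data fragments
groups, parities = add_local_parities(data, 2)   # 2 local groups of 3
# Lose fragment 1 of group 0; only 3 reads are needed to rebuild it,
# instead of 6 with a (6, k) Reed-Solomon code.
assert reconstruct(groups[0], parities[0], 1) == data[1]
```

The point of the local group structure is exactly this ratio: a single-fragment failure, by far the most common case, touches only its group rather than the whole stripe.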

The theory behind LRC is really cool (and my two-sentence summary doesn't do it justice, so read the paper if you're at all interested), but the additional practical discussions in the paper are equally interesting. Some of the key points that I found particularly enlightening were the following:
- When retrieving a fragment (unit of a file which is erasure coded) from WAS, a request to reconstruct the fragment is issued simultaneously, so that if the fragment lives on a machine that is under heavy load it's possible that the reconstruction is more efficient and returns first. I imagine that this has a nice effect on the long tail of retrieval times.
- As files are being written, they are initially stored as three redundant copies, but once a file is "sealed" and can no longer be written to, the system will asynchronously erasure code it and remove the redundant copies. This is so the erasure coding doesn't have any performance impact on the "critical path of client writes."
- Even the process of erasure coding must be fault tolerant. The WAS system stores progress information for files that are being erasure coded so that fragments which are complete don't need to be encoded again after a failure. Furthermore, they employ many cyclic redundancy checks during the process and perform extensive randomized reconstruction tests prior to committing to the erasure-coded version and deleting the original copies.
Erasure coding is a powerful technique for providing durability in distributed storage systems and seems to be becoming the standard. LRC is a neat innovation in this space, and it will be exciting to see how these technologies evolve as these types of systems become more and more prevalent.