## Sunday, March 16, 2014

### Distributed Systems and the CAP Theorem

In the field of distributed systems, the CAP theorem is an important result that often guides the design of such systems. The theorem states that a distributed system cannot satisfy consistency, availability, and partition tolerance simultaneously (see the Wikipedia article for definitions of these three properties). In practice, this typically means that, in the case of a network partition, there is some trade-off between consistency and availability. For example, HBase chooses consistency in that case, while Cassandra chooses availability (with eventual consistency). These days, many services have extremely high availability requirements, leading to the popularity of systems that sacrifice strong consistency for availability; having worked with Cassandra myself, however, it is clear that eventual consistency introduces a layer of complexity on both the client and server side that can be difficult to overcome. I was recently introduced to a blog post on how to beat the CAP theorem (it's lengthy, but well worth the read) written by Nathan Marz, the creator of Storm. He outlines an approach to building distributed systems that is intended to simplify the nature of eventual consistency.

First, let me preface the discussion. The article was met with some criticism, but from what I can tell it is mostly people not understanding his ideas combined with the provocative and ambiguous title. There's no such thing as "beating" the CAP theorem in the sense of violating it through some ingenious design; it's a theorem because someone proved it is always true and cannot be violated. The point being made is that we can address the CAP theorem in a way that doesn't lead us down a road of unmanageable complexity and, consequently, systems that are not robust and reliable.

The basic principle that the design leverages is that "data is inherently immutable." This is because data is always associated with a point in time; you can think of data as a fact that was true at that time. So while the balance of your bank account might change from $10$ to $20$ from time $T$ to time $T+1$, the two pieces of data, namely that your balance was $10$ at time $T$ and $20$ at time $T+1$, are forever true. In my experience, starting with this kind of definition sets you up for success because immutability is good, and it simplifies everything. From here, a distributed system is merely an accumulation of data that exposes methods of querying the data, where a query can be any arbitrary computation over all the data. The flexibility you have in querying the data is determined by what the system has chosen to expose, ranging from limited queries (e.g. a plain key-value store) to very expressive queries (e.g. SQL).

Now that we've gotten the mental model of distributed systems out of the way, let's take a look at the key piece of the design that let us "beat" the CAP theorem. Instead of treating all data homogeneously as most systems do, separate the data into two layers: the batch layer, say everything up until the last hour, and the real-time layer, i.e. everything from the last hour. Queries are then sent to both layers of the data and subsequently merged to produce the final result. Since queries are typically too slow to run across the entire batch layer all the time, we precompute "views" of the batch layer that allow queries to be quickly answered. For example, in a key-value store, the data is the history of all inserts, updates, and deletes, but we can precompute a view which is the current map of keys to values, which lets us answer any query by key quickly. Do this every hour, flushing the real-time layer as the view becomes available to query, and we have a system that only transiently depends on the real-time layer.

So what does this buy us? First of all, the batch layer is very simple. You have all of the data, you compute something from it, and you never have to deal with concurrency or anything of that sort. But we still need the real-time layer, which is going to be complex, so have we really saved anything? Let's think about failure (and if you're working with distributed systems, you should always be thinking about failure), both in terms of systems and humans. Failures will occur, and given the complexity of these real-time systems and the application code interacting with them, it's not unusual to corrupt or lose data in unexpected ways. The batch layer is essentially a reliable fallback mechanism in case of a catastrophe (which, as Marz recalls in an incident, can be as simple as running out of disk space somewhere). By isolating the complex real-time layer from the batch layer that is the ultimate source of truth, you protect yourself against these failures.

We can summarize this entire design exercise by the simple principle that we started out with: data is immutable (and immutability is good). Whether you're programming in a multithreaded environment or building a distributed data processing system, leverage immutability as much as you can. It simplifies things and protects you against your own mistakes. And again, I highly recommend checking out the full post to understand more of his thinking.