Sunday, March 16, 2014

Distributed Systems and the CAP Theorem

In the field of distributed systems, the CAP theorem is an important result that often guides the design of such systems. The theorem states that a distributed system cannot satisfy consistency, availability, and partition tolerance simultaneously (see the Wikipedia article for definitions of these three properties). In practice, this typically means that, in the case of a network partition, there is some trade-off between consistency and availability. For example, HBase chooses consistency in that case, while Cassandra chooses availability (with eventual consistency). These days, many services have extremely high availability requirements, leading to the popularity of systems that sacrifice strong consistency for availability; having worked with Cassandra myself, however, it is clear that eventual consistency introduces a layer of complexity on both the client and server side that can be difficult to overcome. I was recently introduced to a blog post on how to beat the CAP theorem (it's lengthy, but well worth the read) written by Nathan Marz, the creator of Storm. He outlines an approach to building distributed systems that is intended to simplify the nature of eventual consistency.

First, let me preface the discussion. The article was met with some criticism, but from what I can tell it is mostly people not understanding his ideas combined with the provocative and ambiguous title. There's no such thing as "beating" the CAP theorem in the sense of violating it through some ingenious design; it's a theorem because someone proved it is always true and cannot be violated. The point being made is that we can address the CAP theorem in a way that doesn't lead us down a road of unmanageable complexity and, consequently, systems that are not robust and reliable.

The basic principle that the design leverages is that "data is inherently immutable." This is because data is always associated with a point in time; you can think of data as a fact that was true at that time. So while the balance of your bank account might change from $10$ to $20$ from time $T$ to time $T+1$, the two pieces of data, namely that your balance was $10$ at time $T$ and $20$ at time $T+1$, are forever true. In my experience, starting with this kind of definition sets you up for success because immutability is good, and it simplifies everything. From here, a distributed system is merely an accumulation of data that exposes methods of querying the data, where a query can be any arbitrary computation over all the data. The flexibility you have in querying the data is determined by what the system has chosen to expose, ranging from limited queries (e.g. a plain key-value store) to very expressive queries (e.g. SQL).

Now that we've gotten the mental model of distributed systems out of the way, let's take a look at the key piece of the design that let us "beat" the CAP theorem. Instead of treating all data homogeneously as most systems do, separate the data into two layers: the batch layer, say everything up until the last hour, and the real-time layer, i.e. everything from the last hour. Queries are then sent to both layers of the data and subsequently merged to produce the final result. Since queries are typically too slow to run across the entire batch layer all the time, we precompute "views" of the batch layer that allow queries to be quickly answered. For example, in a key-value store, the data is the history of all inserts, updates, and deletes, but we can precompute a view which is the current map of keys to values, which lets us answer any query by key quickly. Do this every hour, flushing the real-time layer as the view becomes available to query, and we have a system that only transiently depends on the real-time layer.

So what does this buy us? First of all, the batch layer is very simple. You have all of the data, you compute something from it, and you never have to deal with concurrency or anything of that sort. But we still need the real-time layer, which is going to be complex, so have we really saved anything? Let's think about failure (and if you're working with distributed systems, you should always be thinking about failure), both in terms of systems and humans. Failures will occur, and given the complexity of these real-time systems and the application code interacting with them, it's not unusual to corrupt or lose data in unexpected ways. The batch layer is essentially a reliable fallback mechanism in case of a catastrophe (which, as Marz recalls in an incident, can be as simple as running out of disk space somewhere). By isolating the complex real-time layer from the batch layer that is the ultimate source of truth, you protect yourself against these failures.

We can summarize this entire design exercise by the simple principle that we started out with: data is immutable (and immutability is good). Whether you're programming in a multithreaded environment or building a distributed data processing system, leverage immutability as much as you can. It simplifies things and protects you against your own mistakes. And again, I highly recommend checking out the full post to understand more of his thinking.

Sunday, March 2, 2014

Matrix Sketching

Last time, I wrote about a clever algorithm for approximating the histogram for a stream using bounded memory. The post was motivated by this paper, which is an extension of that algorithm to a problem that seems unrelated at first glance, which is matrix sketching. The matrix sketching problem is as follows: given the rows of a large matrix $A \in \mathbb{R}^{n \times m}$ as a stream, produce a matrix $B \in \mathbb{R}^{l \times m}$ where $l << n$ which is a "good" approximation for $A$ when multiplied with vectors. Specifically, the algorithm in the paper achieves the following result: if $l = 1/\epsilon$, then it produces a matrix $B$ such that, for any unit vector $x$,

$||Ax||^2 \ge ||Bx||^2 \ge ||Ax||^2 - \epsilon ||A||_f^2$

where $||A||_f$ is the Frobenius norm of $A$. Here you can see the parallels with the frequency approximation algorithm; the error is a function of how "big" the stream is, which in this case is the Frobenius norm of the input matrix.

The algorithm works as follows: start with $B$ as a $l \times m$ matrix of all zeroes, and for each input row $A_i$ do the following update:
1. Set $B_l = A_i$ (the last row of $B$).
2. Compute the singular value decomposition (SVD) of $B$, so we obtain $B = U \Sigma V$ with the standard assumption that the diagonal values of $\Sigma$ are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_l$.
3. "Remove" the smallest singular value from $\Sigma$ by letting

$\bar{\Sigma} = \sqrt{\max(\Sigma - I_l \sigma_l^2, 0)}$

where $I_l$ is the $l \times l$ identity matrix.
4. Then set $B = \bar{\Sigma}V$ (note that the last row of $B$ is all zeroes after this step because the last row of $\bar{\Sigma}$ is all zeroes by construction).
At the end, just output the current value of $B$. I won't go into any of the proof details (they can be found in the paper), but it's interesting to try to understand what the algorithm is doing intuitively. The SVD can be thought of (very loosely) as breaking down a matrix into three transformations applied to a multidimensional space: a rotation ($V$), followed by a scaling along the axes ($\Sigma$), and lastly another rotation ($U$). So the singular values are the scaling factors of the matrix in orthogonal directions, and we are removing the smallest one from each of these directions equally. As a result, we only lose a fraction of the accuracy in any particular direction (i.e. $Bx$ versus $Ax$ for a specific $x$) compared to the Frobenius norm of $A$ as a whole.

This would be pretty cool even if it was a purely theoretical result since it ties two problems together in a very elegant way, but it gets better. The paper goes on to explore the algorithm's performance with some experiments and observes that the accuracy is quite an improvement over existing techniques for matrix sketching. Moreover, computing the SVD is somewhat expensive, so the author describes how the algorithm can be parallelized as well as a way to reduce the computational cost of the SVD step and only slightly relaxing the guarantees. It's a very nice paper that spans both the theoretical and practical domains for the matrix sketching problem.

Saturday, February 15, 2014

Streaming Frequency Approximation

It's been a while, but I finally motivated myself to write another blog post! Today's topic is approximating the frequency of items in a stream. With all the data that is generated today, streaming algorithms have become very popular as one of the efficient ways to process datasets that are far too large to fit in memory. One common data mining problem is determining the frequency of items in data, i.e. building a histogram. Consider a stream $X = x_1, x_2, \ldots, x_n$ of $n$ items, each of which belongs to one of $m$ classes $c_1, c_2, \ldots, c_m$. If $m$ is small, the problem is simple, as we simple maintain the entire histogram in $O(m)$ memory and output the exact result. The interesting case is when there are an extremely large number of distinct classes, the number of which is usually unknown. In this case, we will present an algorithm originally from this paper that approximates each value in the histogram within $\epsilon n$ using $O(1/\epsilon)$ memory.

At any point while processing the stream, the in-memory state is a partial histogram with at most $k = 1/\epsilon$ elements (class/count pairs). When processing an element $x_i$, we use the following procedure with three cases:
1. $x_i$ is already in the histogram: increment its count;
2. we have fewer than $k$ elements in the histogram: add $x_i$ with count $1$;
3. the histogram already has $k$ other elements: while the histogram has $k$ elements, decrease the count of each element by $1$ and evict any elements with zero count, and then add $x_i$ with count $1$.
That's the entire algorithm, and it's easy to prove that it achieves the desired approximation as well. The only case in which we "lose" information is case 3, where the count of each of $k$ elements is decreased by $1$. But we can clearly decrease counts at most $n/k = \epsilon n$ times, since that would necessarily reduce all counts to zero. Thus, using the partial histogram and outputting the frequency of any class not in the histogram as zero, we are guaranteed that each count is within $\epsilon n$ of the actual count. In particular, this means that the algorithm will be guaranteed to output a positive count for any class that occurs with frequency $\epsilon n$ or more in the stream.

I've always found this problem to be particularly beautiful because of the simplicity of the result, the algorithm, as well as the proof. Moreover, it's a very practical problem with many applications. To make things even more interesting, a recent paper shows how this concept can be generalized to do matrix sketching (i.e. approximating a matrix with a smaller one). I hope to get around to writing about that at some point as well.