Analytics

Saturday, February 15, 2014

Streaming Frequency Approximation

It's been a while, but I finally motivated myself to write another blog post! Today's topic is approximating the frequency of items in a stream. With all the data that is generated today, streaming algorithms have become very popular as one of the efficient ways to process datasets that are far too large to fit in memory. One common data mining problem is determining the frequency of items in data, i.e. building a histogram. Consider a stream $X = x_1, x_2, \ldots, x_n$ of $n$ items, each of which belongs to one of $m$ classes $c_1, c_2, \ldots, c_m$. If $m$ is small, the problem is simple, as we simple maintain the entire histogram in $O(m)$ memory and output the exact result. The interesting case is when there are an extremely large number of distinct classes, the number of which is usually unknown. In this case, we will present an algorithm originally from this paper that approximates each value in the histogram within $\epsilon n$ using $O(1/\epsilon)$ memory.

At any point while processing the stream, the in-memory state is a partial histogram with at most $k = 1/\epsilon$ elements (class/count pairs). When processing an element $x_i$, we use the following procedure with three cases:
  1. $x_i$ is already in the histogram: increment its count;
  2. we have fewer than $k$ elements in the histogram: add $x_i$ with count $1$;
  3. the histogram already has $k$ other elements: while the histogram has $k$ elements, decrease the count of each element by $1$ and evict any elements with zero count, and then add $x_i$ with count $1$.
That's the entire algorithm, and it's easy to prove that it achieves the desired approximation as well. The only case in which we "lose" information is case 3, where the count of each of $k$ elements is decreased by $1$. But we can clearly decrease counts at most $n/k = \epsilon n$ times, since that would necessarily reduce all counts to zero. Thus, using the partial histogram and outputting the frequency of any class not in the histogram as zero, we are guaranteed that each count is within $\epsilon n$ of the actual count. In particular, this means that the algorithm will be guaranteed to output a positive count for any class that occurs with frequency $\epsilon n$ or more in the stream.

I've always found this problem to be particularly beautiful because of the simplicity of the result, the algorithm, as well as the proof. Moreover, it's a very practical problem with many applications. To make things even more interesting, a recent paper shows how this concept can be generalized to do matrix sketching (i.e. approximating a matrix with a smaller one). I hope to get around to writing about that at some point as well.