At any point while processing the stream, the in-memory state is a partial histogram with at most k = 1/\epsilon elements (class/count pairs). When processing an element x_i, we use the following procedure with three cases:
- x_i is already in the histogram: increment its count;
- we have fewer than k elements in the histogram: add x_i with count 1;
- the histogram already has k other elements: while the histogram has k elements, decrease the count of each element by 1 and evict any elements with zero count, and then add x_i with count 1.
That's the entire algorithm, and it's easy to prove that it achieves the desired approximation as well. The only case in which we "lose" information is case 3, where the count of each of k elements is decreased by 1. But we can clearly decrease counts at most n/k = \epsilon n times, since that would necessarily reduce all counts to zero. Thus, using the partial histogram and outputting the frequency of any class not in the histogram as zero, we are guaranteed that each count is within \epsilon n of the actual count. In particular, this means that the algorithm will be guaranteed to output a positive count for any class that occurs with frequency \epsilon n or more in the stream.
I've always found this problem to be particularly beautiful because of the simplicity of the result, the algorithm, as well as the proof. Moreover, it's a very practical problem with many applications. To make things even more interesting, a recent paper shows how this concept can be generalized to do matrix sketching (i.e. approximating a matrix with a smaller one). I hope to get around to writing about that at some point as well.
I've always found this problem to be particularly beautiful because of the simplicity of the result, the algorithm, as well as the proof. Moreover, it's a very practical problem with many applications. To make things even more interesting, a recent paper shows how this concept can be generalized to do matrix sketching (i.e. approximating a matrix with a smaller one). I hope to get around to writing about that at some point as well.