Sunday, March 2, 2014

Matrix Sketching

Last time, I wrote about a clever algorithm for approximating the histogram for a stream using bounded memory. The post was motivated by this paper, which is an extension of that algorithm to a problem that seems unrelated at first glance, which is matrix sketching. The matrix sketching problem is as follows: given the rows of a large matrix $A \in \mathbb{R}^{n \times m}$ as a stream, produce a matrix $B \in \mathbb{R}^{l \times m}$ where $l << n$ which is a "good" approximation for $A$ when multiplied with vectors. Specifically, the algorithm in the paper achieves the following result: if $l = 1/\epsilon$, then it produces a matrix $B$ such that, for any unit vector $x$,

$||Ax||^2 \ge ||Bx||^2 \ge ||Ax||^2 - \epsilon ||A||_f^2$

where $||A||_f$ is the Frobenius norm of $A$. Here you can see the parallels with the frequency approximation algorithm; the error is a function of how "big" the stream is, which in this case is the Frobenius norm of the input matrix.

The algorithm works as follows: start with $B$ as a $l \times m$ matrix of all zeroes, and for each input row $A_i$ do the following update:
1. Set $B_l = A_i$ (the last row of $B$).
2. Compute the singular value decomposition (SVD) of $B$, so we obtain $B = U \Sigma V$ with the standard assumption that the diagonal values of $\Sigma$ are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_l$.
3. "Remove" the smallest singular value from $\Sigma$ by letting

$\bar{\Sigma} = \sqrt{\max(\Sigma - I_l \sigma_l^2, 0)}$

where $I_l$ is the $l \times l$ identity matrix.
4. Then set $B = \bar{\Sigma}V$ (note that the last row of $B$ is all zeroes after this step because the last row of $\bar{\Sigma}$ is all zeroes by construction).
At the end, just output the current value of $B$. I won't go into any of the proof details (they can be found in the paper), but it's interesting to try to understand what the algorithm is doing intuitively. The SVD can be thought of (very loosely) as breaking down a matrix into three transformations applied to a multidimensional space: a rotation ($V$), followed by a scaling along the axes ($\Sigma$), and lastly another rotation ($U$). So the singular values are the scaling factors of the matrix in orthogonal directions, and we are removing the smallest one from each of these directions equally. As a result, we only lose a fraction of the accuracy in any particular direction (i.e. $Bx$ versus $Ax$ for a specific $x$) compared to the Frobenius norm of $A$ as a whole.

This would be pretty cool even if it was a purely theoretical result since it ties two problems together in a very elegant way, but it gets better. The paper goes on to explore the algorithm's performance with some experiments and observes that the accuracy is quite an improvement over existing techniques for matrix sketching. Moreover, computing the SVD is somewhat expensive, so the author describes how the algorithm can be parallelized as well as a way to reduce the computational cost of the SVD step and only slightly relaxing the guarantees. It's a very nice paper that spans both the theoretical and practical domains for the matrix sketching problem.