Sunday, April 21, 2013

Spanner TrueTime

Spanner is one of Google's recent developments in the area of large-scale distributed systems, and its design from the very beginning was intended for deployment at the global scale. Not only that, but because many consumers of systems similar to Spanner (e.g. BigTable and Megastore) wanted better consistency guarantees, the authors also built externally-consistent distributed transactions. This level of consistency is quite rare among the popular NoSQL databases out there, with many of them opting for eventual consistency, so it is quite refreshing. Moreover, Spanner uses schemas which consist of "semi-relational tables" (they require primary keys) and provides a "SQL-based query language," which is interesting given the direction that other distributed databases are generally heading. The technical paper for Spanner can be found here.

What appears to be the most revolutionary concept behind Spanner is how it deals with time. Managing time in distributed systems is a well-known difficulty and has been for many decades (see Leslie Lamport's paper), so it might not be surprising that the combination of strong consistency and global scale ends up focusing a great deal on time. Google's approach to solving this problem has two parts; the first is formalizing time uncertainty and working with it, rather than around it. They introduce TrueTime, which is an API that the machines in their system use to quantify their own uncertainty regarding the time. It has just three methods, which are, TT.after(), and TT.before(). The first returns an interval that specifies the earliest and latest times that it could be right now, while the latter two allow you to query whether it is definitely after or before a point in time.

It turns out that by using TrueTime and a lot of Paxos, Spanner is able to guarantee a level of consistency that is stronger than that of many other distributed systems while still being very efficient. For example, you can implement snapshot isolation for read-only transactions and point-in-time reads by ensuring that all replicas can "safely" read at a certain timestamp (i.e. a later timestamp has been committed). And during transactional writes, the commit is delayed so that it occurs at a time known to be after the recorded timestamp, as specified by TrueTime, which prevents out-of-order writes. The details of how the authors manage all of this can be the found in the paper, but suffice it to say that TrueTime on top of Paxos is what allows all of these consistency guarantees to be made.

Lastly, it should be pointed out that with the Spanner design, if the clock skew of the individual machines is high, performance will suffer greatly. Commits on read-write transactions will be forced to block for long periods and thus limit latency and throughput. And so the second part of Spanner time management is in the hardware: GPS and atomic clocks. By using clocks that give you much better performance in terms of skew along with a custom clock synchronization protocol (instead of NTP), Spanner is able to maintain very low levels of uncertainty about the time. Their novel approach to time accounts for uncertainty so that they can reason formally about consistency but heavily leverages clock accuracy for performance. The fact that Google is able to deploy this type of system globally (it is powering their new advertising backend) speaks a lot to the affordability of accurate clocks today.

All in all, Spanner is a unique distributed system that brings together a set of ideas and even some hardware to solve a very hard problem. It will be interesting to see if the allure of a high-performance system that provides strong consistency starts drawing the attention of the NoSQL community and whether the TrueTime API becomes more common as a way to manage time uncertainty.

No comments:

Post a Comment