Sunday, June 30, 2013

Dremel Data Model

It is common knowledge that analyzing large datasets efficiently can benefit greatly from column-oriented storage. That is, instead of storing all the data for a single row together like a traditional database would, separate the columns out and store all of the data for each column together (see the Wikipedia link for an example). The benefits of this are twofold: (1) queries that only access a subset of the columns can reduce the amount of data that has to be retrieved, and (2) compression algorithms often perform better on homogeneous data, e.g. a column with many values of the same type. As such, data stores such as Cassandra and HBase have become widely used when dealing with records that have a large number of columns. Given that both of these projects were based on the original ideas of Google's BigTable, it should be no surprise that Google has continued leveraging the power of column-oriented storage to deal with the scale of data that it processes. Dremel is a project that has been used at Google for a number of years now; it takes the idea of column-oriented storage, generalizes it, and then optimizes it in order to perform queries over tens to hundreds of terabytes of data in seconds.

The key to Dremel's performance is how the data is represented and consequently laid out on disk. First of all, they generalize having a bunch of columns per record to a hierarchical structure of nested columns that supports repeated and optional fields. That may sound familiar because it is exactly the format of protocol buffers, which is the language-independent binary representation of data that Google generally uses, so it is a natural choice for the type of data Dremel should support. Column-oriented storage is simple in the normal case of one (potentially optional) value per column per record since you can easily figure out which value corresponds to which record, but it gets trickier when you allow for the full protocol buffer specification. Consider the following example schema (a simplified version of what is presented in the paper):

This represents a web document, which has an ID and a set of names associated with it; each name is a URL that points to that document and the languages associated with the URL. The generalization of column-oriented storage to handle nested columns is that each leaf field in the hierarchy is stored separately. For example, all values of "docId" are stored together, and all values of "" are stored together since both are leaf fields in the schema for a document. Let's look at two example records which we will assume are stored in the order presented:

The "docId" field is simple and only needs to store the values "10" and "20" since it is a top-level, required field. The "" field, on the other hand, will need to store the values "us", NULL, NULL, "gb", NULL. We will see why these NULL values are necessarily shortly, but they are basically placeholders for potential values of ""

It should be clear that because of the repeated and optional values we cannot only store the values above without additional metadata, since doing so would lead to the loss of record boundaries. To solve this, Dremel introduces two metadata fields known as the "repetition level" and the "definition level." These are logically attached to every column value in order to preserve record boundary information. The repetition level handles repeated fields and indicates the depth in the hierarchy that was repeated between the current column value and the previous one (0 is the root, meaning a new document). The definition level handles optional fields and indicates the depth in the hierarchy that actually exists for the current value; this is only relevant for NULL values. For the "" field, Dremel would store the following:


In row 1 of the table, rep = 0 because it is a new record, and def = 3 because all 3 levels of the hierarchy ("name", "language", and "country") are defined, which is why we have a non-NULL value. In row 2, rep = 2 because we are repeating "name.language" when going from the previous row to this one, and def = 2 because only "name" and "language" are defined so the value is NULL. In row 3, rep = 1 because we repeated at the "name" level, and def = 1 because only "name" is defined. In row 4, rep = 1 because we again repeated at the "name" level, and def = 3 because we have a real value. Lastly, row 5 has rep = 0 because it is a new record, and def = 1 because only "name" is defined. The meanings of the repetition and definition levels can be confusing at first, but if you understand how they are derived in the above table, you should be able to convince yourself that it is a lossless representation. Dremel then optimizes to the bit-level how much data they need to actually store, such as omitting representation and definition levels whenever possible and using as few bits as necessary to store the values.

It should be no surprise that Google's solution for interactive queries on huge datasets involves minimizing the amount of data that needs to be retrieved from disk, a concept that database engineers have employed for decades. Since the volume of data being collected today far outpaces amount that can be read off of disk in a few seconds, there seem to be few options in the analytics space for efficient queries other than designing to reduce what needs to be read. But one can certainly imagine more complex queries that involve a large number of fields or even joining two fields that would be extremely valuable yet essentially impossible given the limitations of today's technology. This area seems ripe for additional work in pushing the boundaries of what types of queries are possible on the huge datasets that companies are collecting, but Dremel is definitely a solid approach that Google has derived a lot of value from.

No comments:

Post a Comment