The key to Dremel's performance is how the data is represented and consequently laid out on disk. First of all, they generalize having a bunch of columns per record to a hierarchical structure of nested columns that supports repeated and optional fields. That may sound familiar because it is exactly the format of protocol buffers, which is the language-independent binary representation of data that Google generally uses, so it is a natural choice for the type of data Dremel should support. Column-oriented storage is simple in the normal case of one (potentially optional) value per column per record since you can easily figure out which value corresponds to which record, but it gets trickier when you allow for the full protocol buffer specification. Consider the following example schema (a simplified version of what is presented in the paper):
This represents a web document, which has an ID and a set of names associated with it; each name is a URL that points to that document and the languages associated with the URL. The generalization of column-oriented storage to handle nested columns is that each leaf field in the hierarchy is stored separately. For example, all values of "docId" are stored together, and all values of "name.language.country" are stored together since both are leaf fields in the schema for a document. Let's look at two example records which we will assume are stored in the order presented:
The "docId" field is simple and only needs to store the values "10" and "20" since it is a top-level, required field. The "name.language.country" field, on the other hand, will need to store the values "us", NULL, NULL, "gb", NULL. We will see why these NULL values are necessarily shortly, but they are basically placeholders for potential values of "name.language.country."
It should be clear that because of the repeated and optional values we cannot only store the values above without additional metadata, since doing so would lead to the loss of record boundaries. To solve this, Dremel introduces two metadata fields known as the "repetition level" and the "definition level." These are logically attached to every column value in order to preserve record boundary information. The repetition level handles repeated fields and indicates the depth in the hierarchy that was repeated between the current column value and the previous one (0 is the root, meaning a new document). The definition level handles optional fields and indicates the depth in the hierarchy that actually exists for the current value; this is only relevant for NULL values. For the "name.language.country" field, Dremel would store the following:
# | value | rep | def |
---|---|---|---|
1 | "us" | 0 | 3 |
2 | NULL | 2 | 2 |
3 | NULL | 1 | 1 |
4 | "gb" | 1 | 3 |
5 | NULL | 0 | 1 |
In row 1 of the table, rep = 0 because it is a new record, and def = 3 because all 3 levels of the hierarchy ("name", "language", and "country") are defined, which is why we have a non-NULL value. In row 2, rep = 2 because we are repeating "name.language" when going from the previous row to this one, and def = 2 because only "name" and "language" are defined so the value is NULL. In row 3, rep = 1 because we repeated at the "name" level, and def = 1 because only "name" is defined. In row 4, rep = 1 because we again repeated at the "name" level, and def = 3 because we have a real value. Lastly, row 5 has rep = 0 because it is a new record, and def = 1 because only "name" is defined. The meanings of the repetition and definition levels can be confusing at first, but if you understand how they are derived in the above table, you should be able to convince yourself that it is a lossless representation. Dremel then optimizes to the bit-level how much data they need to actually store, such as omitting representation and definition levels whenever possible and using as few bits as necessary to store the values.
It should be no surprise that Google's solution for interactive queries on huge datasets involves minimizing the amount of data that needs to be retrieved from disk, a concept that database engineers have employed for decades. Since the volume of data being collected today far outpaces amount that can be read off of disk in a few seconds, there seem to be few options in the analytics space for efficient queries other than designing to reduce what needs to be read. But one can certainly imagine more complex queries that involve a large number of fields or even joining two fields that would be extremely valuable yet essentially impossible given the limitations of today's technology. This area seems ripe for additional work in pushing the boundaries of what types of queries are possible on the huge datasets that companies are collecting, but Dremel is definitely a solid approach that Google has derived a lot of value from.
No comments:
Post a Comment