One of the most interesting and unfamiliar features of Scala for Java programmers is the Option. The easiest way to think about it is as an explicit representation of an object that may be null. For example, let's say you have a method getString() which returns a String that may be null. In Java you would do:
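(The original code isn't preserved in this copy of the post, so the Java and Scala fragments below are rough reconstructions of the comparison; the printing is made up for illustration.)

```java
String s = getString();
if (s != null) {
    System.out.println("Length: " + s.length());
} else {
    System.out.println("No string");
}
```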
while in Scala you would do:
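(Again a sketch rather than the original snippet.)

```scala
Option(getString()) match {
  case Some(s) => println("Length: " + s.length)
  case None    => println("No string")
}
```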
where we are leveraging pattern matching. The Option() apply method takes an object and turns it into either a Some (which is generic on the type) or a None depending on whether the object is null. So by using Options you are enforcing an actual type distinction between null and non-null instances of the object. This might not seem like a big deal, but if done thoroughly, you could end up with code in which you are guaranteed that an object not wrapped by an option is non-null. That means you could get rid of NullPointerExceptions! Anyone who has programmed long enough in Java knows how much of a win that would be. Unfortunately, because Scala projects typically have to integrate with a lot of existing Java code, we can't quite reach that ideal state. But if you're diligent in wrapping the return values of third-party libraries in options when the result may be null, you can enforce this checking in your own code.
Besides the compile-time distinction between null and non-null instances, options are designed in such a way that they play very nicely with the functional side of Scala. In fact, an option behaves like a collection containing either zero or one element. That means you can do all sorts of interesting things such as:
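The original list of examples isn't shown here; the following are typical operations of the kind the post is describing (the variable names are my own):

```scala
val maybeName: Option[String] = Option(getString())

maybeName.foreach(n => println("Hello, " + n))   // run a side effect only if a value is present
val length = maybeName.map(_.length)             // transform the value, staying inside the Option
val name   = maybeName.getOrElse("anonymous")    // fall back to a default value
val upper  = maybeName.filter(_.nonEmpty).map(_.toUpperCase) // chain operations, no null checks
```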
In each of these cases you would have to do an annoying null check before doing what you actually wanted to do. Scala's functional constructs for operating on collections provide concise ways of handling options without you ever having to mention null or None.
Wednesday, January 30, 2013
Sunday, January 27, 2013
Watching StarCraft
As a regular spectator of professional StarCraft 2 matches, I was quite intrigued when I came across this paper (published at a CS conference!) discussing why people enjoy watching the game. The paper presents a couple of interesting results from collecting personal accounts and then formulates a theory around "information asymmetry" to explain the enjoyment derived from watching StarCraft. It presents nine categories of spectators that were derived from the personal accounts (in any context of watching the game, e.g. professional matches or watching a friend play), their descriptions paraphrased as follows.
- The Bystander: uninvested in spectating, just happens to be watching.
- The Curious: wants to learn more about the game.
- The Inspired: watches as motivation to play (like the pros).
- The Pupil: wants to learn more about the game, primarily to improve his own play.
- The Unsatisfied: watches because he can't play.
- The Entertained: derives entertainment value from watching.
- The Assistant: helps the player.
- The Commentator: improves the experience for other spectators.
- The Crowd: because everyone else is doing it.
A spectator can naturally fall into multiple categories depending on how he is invested in the game, but these categories represent the vast majority of reasons why people watch StarCraft. This categorization doesn't really provide anything new, though, as spectators of sports can for the most part be categorized in the same way. It seems more like an explanation of how StarCraft evokes the same spectrum of spectators.
What is more interesting from the paper is the discussion of "information asymmetry" as one of the fundamental factors contributing to the popularity of spectating StarCraft 2. In many other games, the players have all of the information that the spectators have and more (e.g. what they plan to do next). Fighting games, for example, exhibit this property, as the players and spectators watch the same screen; thus the primary value comes from watching the players perform difficult tasks. StarCraft has that aspect, too, but what makes it different is that the spectators have information that the players (individually) do not have. Spectators have knowledge of both players' actions and current positions within the game, while the players only know their own and have some approximation of their opponent's. This means that spectators can build up anticipation around when players learn information about their opponent and how they react to it. Because the players have information that the spectators don't, and the spectators have information that the individual players don't, there is an information asymmetry that contributes to the enjoyment of watching StarCraft.
This theory makes a lot of sense to me, and I think it explains why watching StarCraft is an enjoyable activity even if you don't play or in the absence of social factors (e.g. friends watching, rooting for a team), whereas the same might not be true of some other games or sports.
Thursday, January 24, 2013
Sampling a Normal
The normal distribution is one of the most commonly used probability distributions because many quantities in the real world turn out to be approximately normally distributed as a result of the central limit theorem. The standard normal distribution has a probability density function of
$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$$
For something so common, its density function is not particularly nice to deal with; its cumulative distribution cannot be expressed in terms of elementary functions. This makes it a bit tricky to sample from as well, since the easiest way to sample a random variable is to sample a value from the uniform distribution on $[0, 1]$ and then invert the cumulative distribution function. Fortunately, people have come up with very clever ways of sampling a normal, and I'll provide a derivation of one, called the Box-Muller transform, here.
Consider the joint distribution of two independent normal variables:
$$f(x, y) = f(x)f(y) = \frac{1}{2\pi} e^{-\frac{1}{2}(x^2+y^2)}$$
This is a distribution across all points in the two-dimensional plane given by their $(x, y)$ coordinates. If we instead consider the polar representation of points, we can transform the given joint distribution via the change of variables formula. Using $x = r\cos{\theta}$ and $y = r\sin{\theta}$ gives (most calculations omitted for brevity)
$$f(r, \theta) = f(x, y) \left| \begin{array}{cc} \cos{\theta} & -r\sin{\theta} \\ \sin{\theta} & r\cos{\theta} \end{array} \right| = f(x, y) \cdot r = \frac{1}{2\pi} r e^{-\frac{1}{2} r^2}$$
Integrating out $\theta$ over the range $[0, 2\pi)$ leaves us with something that we can obtain the cumulative distribution for:
$$f(r) = r e^{-\frac{1}{2} r^2} \Rightarrow F(r) = \int_0^r f(t) dt = \int_0^r t e^{-\frac{1}{2} t^2} dt = 1 - e^{-\frac{1}{2} r^2}$$
Now we can apply the inversion technique to sample $r$; let $U_1$ be a random variable drawn from a uniform distribution on $[0, 1]$. We can compute $r = F^{-1}(1-U_1) = \sqrt{-2\ln{U_1}}$ (we decide to swap $U_1$ with $1-U_1$ to make the result nicer since the two have the same distribution). Notice that we can sample $\theta$ as well because it's actually uniformly distributed on $[0, 2\pi)$, so if $U_2$ is another uniform random variable on $[0, 1]$ then we can take $\theta = 2 \pi U_2$. Putting it all together, we get two (independent) sampled normals
$$x = r\cos{\theta} = \sqrt{-2\ln{U_1}} \cos{(2 \pi U_2)} \\ y = r\sin{\theta} = \sqrt{-2\ln{U_1}} \sin{(2 \pi U_2)}$$
To wrap things up and make it a bit more practical, here's a Java snippet of this in action.
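The snippet itself isn't included in this copy of the post; below is a minimal sketch of the transform (the class and method names are my own).

```java
import java.util.Random;

public class BoxMuller {
    private static final Random RANDOM = new Random();

    /** Returns two independent samples from the standard normal distribution. */
    public static double[] sampleStandardNormals() {
        // nextDouble() is in [0, 1); nudge u1 away from 0 so that log(u1) stays finite.
        double u1 = Math.max(RANDOM.nextDouble(), Double.MIN_VALUE);
        double u2 = RANDOM.nextDouble();
        double r = Math.sqrt(-2.0 * Math.log(u1));
        double theta = 2.0 * Math.PI * u2;
        return new double[] { r * Math.cos(theta), r * Math.sin(theta) };
    }

    public static void main(String[] args) {
        double[] samples = sampleStandardNormals();
        System.out.println(samples[0] + " " + samples[1]);
    }
}
```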
Monday, January 21, 2013
Steiner Trees
I don't usually pay much attention to theory and algorithms results, but I recently came across a paper that proves a new approximation bound for the Steiner tree problem, which is a simple enough problem in graph theory to interest me. The Steiner tree problem is as follows: given an undirected, weighted graph $G(V, E)$ and a subset $R \subseteq V$ of its vertices, find the minimum-weight connected subgraph of $G$ that contains all the vertices in $R$. When $R = V$, it reduces to the classical minimum spanning tree problem, which has a variety of efficient solutions. The Steiner tree problem, however, is known to be NP-complete, so most of the work on it has gone into finding the best approximation algorithms. Although the best results are complex and require the appropriate theory background to understand, there are a couple of nice facts about the Steiner tree problem that can be summarized easily.
The first is that the minimum-weight Steiner tree for $R$ in $G$ has the same cost as the minimum-weight Steiner tree for $R$ in $G^c$, where $G^c$ is the metric closure of $G$. $G^c$ is obtained by creating a complete graph on $V$ where the weight of the edge between $u$ and $v$ is the weight of the shortest path between $u$ and $v$ in $G$. To show this, let $OPT$ and $OPT^c$ be the costs of the minimum-weight Steiner trees for $R$ in $G$ and $G^c$, respectively. We know that $OPT^c \leq OPT$ because every edge of $G$ appears in $G^c$ with weight no larger, so any Steiner tree in $G$ can be directly translated to a Steiner tree in $G^c$ that is no more expensive. But we can also show that $OPT \leq OPT^c$ by observing that for any tree in $G^c$, we can replace each edge by the corresponding shortest path in $G$. The resulting subgraph has equal cost and contains all the vertices in $R$, so the minimum-weight Steiner tree in $G$ costs at most $OPT^c$. Since $OPT = OPT^c$, we can assume without loss of generality that we are working with a complete graph whose distances form a metric (in particular, they satisfy the triangle inequality).
Secondly, we'll describe a simple 2-approximation algorithm for the Steiner tree problem. Consider the minimum spanning tree of $G_R$ (the subgraph induced by $R$), which exists because of the assumption that we have a complete graph. If we let $MST$ be its weight, we claim that $MST < 2 \cdot OPT$. To show this, start with a depth-first traversal of the minimum-weight Steiner tree $T$. This traversal visits all of the vertices in $R$ and uses each edge of $T$ exactly twice (once on the way "down" and once coming back "up"), so it has weight $2 \cdot OPT$. For two consecutive vertices $u, v \in R$ in the traversal, replace the path in the traversal by a direct edge from $u$ to $v$, which by the triangle inequality is no more expensive than the existing path. After doing this for all such pairs, the result is a cycle visiting all vertices in $R$; removing any edge from this cycle leaves a spanning tree of $G_R$, so (with positive edge weights) the cycle is strictly more expensive than the minimum spanning tree of $G_R$, and we have $MST < 2 \cdot OPT$. Thus outputting the minimum spanning tree of $G_R$ is a 2-approximation for the Steiner tree problem.
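To make the classical results concrete, here's a small sketch of my own (not from the post) of the 2-approximation: build the metric closure with Floyd-Warshall, then take a minimum spanning tree over the terminals with Prim's algorithm and return its weight.

```java
import java.util.Arrays;

public class SteinerApprox {
    /**
     * dist[i][j] is the weight of edge (i, j) in G, or Double.POSITIVE_INFINITY if absent;
     * R holds the indices of the terminal vertices. Returns the weight of the 2-approximate tree.
     */
    public static double approximateSteinerWeight(double[][] dist, int[] R) {
        int n = dist.length;
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) d[i] = Arrays.copyOf(dist[i], n);
        for (int i = 0; i < n; i++) d[i][i] = 0;
        // Metric closure: all-pairs shortest paths (Floyd-Warshall).
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    d[i][j] = Math.min(d[i][j], d[i][k] + d[k][j]);
        // Prim's algorithm on the complete graph induced by the terminals R.
        boolean[] inTree = new boolean[R.length];
        double[] best = new double[R.length];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0;
        double total = 0;
        for (int iter = 0; iter < R.length; iter++) {
            int u = -1;
            for (int v = 0; v < R.length; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < R.length; v++)
                if (!inTree[v]) best[v] = Math.min(best[v], d[R[u]][R[v]]);
        }
        return total; // expanding each MST edge into its shortest path yields the actual subgraph
    }
}
```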
The above two results are classical and taught in algorithms classes; once you get beyond the 2-approximation, the required machinery becomes much more advanced. Although I can't claim to fully understand the techniques described in the paper, their approach is based on a formulation of the Steiner tree problem as a linear program (LP), which is essentially a set of linear constraints with a linear cost function to optimize. In combinatorial optimization problems, however, the variables usually have to take on discrete/integer values (e.g. choose this edge or not), whereas linear programs allow them to take on any real values. The technique of LP relaxation takes the original problem and "relaxes" the discrete/integer constraint to formulate a proper linear program; the cost of doing so is that the solution may no longer be optimal, which is why this leads to approximation algorithms rather than exact ones. But linear programs have polynomial-time algorithms, so the approximations can be computed efficiently.
The core of the algorithm proposed in the paper seems to be an iterative use of LP relaxation. In each iteration, they solve the current instance of their LP (called the "directed-component cut relaxation"). Using the solution, they sample a "component" randomly and reduce it to a single node, so the graph gets smaller with each step. They repeat this process some number of times and then connect the components by computing the minimum spanning tree on the remaining vertices in $R$. This algorithm results in a remarkable $\ln{4} + \epsilon < 1.39$ approximation for the Steiner tree problem, improving on the previous best result of $1.55$.
Sunday, January 20, 2013
Erasure Coding
Durability is a critical feature of distributed storage systems. Because these storage systems are typically built on a large number of machines using commodity hardware, individual component failures are expected and common; thus durability comes from how fault tolerance is designed into the system. The original Google File System (GFS) was among the earliest systems built at a scale where these problems started to manifest, and one of the key features was that all data in GFS was redundantly stored in three copies. Combined with the proper distribution of these three copies so they don't fail simultaneously, this design allowed Google to scale out their storage system without losing data. While GFS was quite a landmark piece of technology, we've come a long way since then, and people are constantly devising new ways to achieve durability while optimizing other important features of distributed storage systems like latency, storage overhead, and reconstruction cost.
Storing identical copies of data is a simple strategy for redundancy, but more involved techniques such as erasure coding can greatly reduce the storage overhead while providing the same durability guarantees. The idea behind erasure coding is to encode a message of $K$ symbols using $N > K$ symbols such that you can reconstruct the original message using only a subset of the $N$ symbols. For example, you can think of storing three copies of data as erasure coding with $K = 1$ and $N = 3$, where as long as you have one of the three copies you can reconstruct the original data. Another common example is encoding $K$ bits using $N = K+1$ bits where the extra bit stores the parity/XOR of the first $K$ bits; losing any single bit will still allow you to recover the original message. These examples demonstrate how erasure codes are redundant and prevent losing the message even when individual symbols are lost (e.g. a machine storing data dies).
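As a tiny illustration of that parity example (my own sketch, not from the post), one extra XOR byte lets you recover any single lost byte:

```java
import java.util.Arrays;

public class ParityErasureCode {
    /** Encodes K data bytes into N = K + 1 bytes by appending their XOR parity. */
    static byte[] encode(byte[] data) {
        byte[] encoded = Arrays.copyOf(data, data.length + 1);
        byte parity = 0;
        for (byte b : data) parity ^= b;
        encoded[data.length] = parity;
        return encoded;
    }

    /** Recovers the symbol at lostIndex by XOR-ing the remaining N - 1 symbols. */
    static byte recover(byte[] encoded, int lostIndex) {
        byte value = 0;
        for (int i = 0; i < encoded.length; i++) {
            if (i != lostIndex) value ^= encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode("durable".getBytes());
        System.out.println(recover(encoded, 3) == encoded[3]); // true: any one byte is recoverable
    }
}
```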
The most common erasure codes are Reed-Solomon codes, which are present in CDs, DVDs, and barcodes to ensure that these media are resistant to partial damage. Reed-Solomon codes extend the idea of the parity bit by leveraging polynomial interpolation. Given the message of $K$ symbols, consider the polynomial of degree (at most) $K-1$ with the symbols as the coefficients. Evaluate the polynomial at $N > K$ points known to both the sender and the receiver, and send the results of those evaluations. As long as $K$ of the resulting symbols are received, the polynomial can be interpolated and thus the original message reconstructed. The details of the algebra on polynomial rings and an efficient decoding algorithm are beyond the scope of this post, but suffice it to say that people have figured out how to use Reed-Solomon codes very effectively. It shouldn't come as a surprise, then, that the new GFS (discussed here) uses them and the Hadoop Distributed File System (HDFS) has a module which adds the functionality.
The story doesn't end there, though. Last summer, a paper came out of Microsoft Research and the Windows Azure Storage (WAS) team that describes a new technique called local reconstruction codes (LRC) used in WAS, Microsoft's equivalent of GFS and HDFS. One of the downsides of Reed-Solomon codes in storage systems is that, if one of your original $K$ symbols is lost, you must read a full $K$ other symbols in order to recover that one symbol. When these symbols are chunks distributed across machines, it can be costly to do that many accesses. LRC can be thought of as an extension of Reed-Solomon codes which adds an additional parameter that determines how many local groups there are within a message. It facilitates the reconstruction of lost symbols by leveraging additional redundancy introduced at the local group level. The work demonstrates that the technique is more efficient both in terms of storage overhead and reconstruction cost than Reed-Solomon codes when both methods are required to provide the same fault tolerance as storing three copies of the data (which is still the standard for systems today).
The theory behind LRC is really cool (and my two-sentence summary doesn't do it justice, so read the paper if you're at all interested), but the additional practical discussions in the paper are equally interesting. Some of the key points that I found particularly enlightening were the following:
- When retrieving a fragment (unit of a file which is erasure coded) from WAS, a request to reconstruct the fragment is issued simultaneously, so that if the fragment lives on a machine that is under heavy load it's possible that the reconstruction is more efficient and returns first. I imagine that this has a nice effect on the long tail of retrieval times.
- As files are being written, they are initially stored as three redundant copies, but once a file is "sealed" and can no longer be written to, the system will asynchronously erasure code it and remove the redundant copies. This is so the erasure coding doesn't have any performance impact on the "critical path of client writes."
- Even the process of erasure coding must be fault tolerant. The WAS system stores progress information for files that are being erasure coded so that fragments which are complete don't need to be encoded again after a failure. Furthermore, they employ many cyclic redundancy checks during the process and perform extensive randomized reconstruction tests prior to committing to the erasure-coded version and deleting the original copies.
Erasure coding is a powerful technique for providing durability in distributed storage systems and seems to be becoming the standard. LRC is a neat innovation in this space, and it will be exciting to see how these technologies evolve as these types of systems become more and more prevalent.
Thursday, January 17, 2013
Concurrency with REST
Representational State Transfer (REST) is one of the main architectural models that web services use to communicate with each other. While it is not tied to HTTP, it is most often used on top of HTTP for various web APIs (e.g. Amazon S3, Stripe, Netflix, Zendesk, Greendizer). A very common pattern when writing REST APIs is the management of entities through the four operations of create, read, update, and delete (CRUD). For example, these entities could be buckets in Amazon S3, customers in Stripe, or tickets in Zendesk. The CRUD operations map very naturally to HTTP verbs (POST, GET, PUT, and DELETE, respectively), so you end up with an interface that is clean and easy to understand for developers integrating with your service.
One problem that inevitably arises because we are working in a distributed setting is how to deal with concurrency. What kind of semantics should be guaranteed for clients of the API who are concurrently operating on the same entities? Reads are innocuous here because they don't modify state; creates and deletes are not particularly worrisome either as they can only happen once each. This leaves us with just the update (i.e. conventional write) operation to think about. It's easy to see how two unsynchronized updates can clobber each other in ways that are undesirable, leading to the lost update problem. But remember that HTTP is stateless, so locking a resource is quite dangerous: the client who obtains the lock may never successfully issue another request to release it. If we can't lock, then how do we solve this problem? Enter optimistic concurrency control.
Optimistic concurrency control is based on the assumption that in most cases there is not contention for a resource (hence the "optimistic"). It relies on versioning the entities in the system such that every time an update is made to an entity its version changes, e.g. a counter that increments on each update. When a client gets an entity, he is also given the current version information for that entity; if he then wishes to update the entity, he must provide the same version information that was received. Before performing the update, the server checks to make sure that the version provided by the client matches the version on the server (i.e. nobody else has performed an update) and only proceeds with the update if that is the case. These simple steps ensure that no updates are lost and that the system is robust to client failure; the performance is also good provided the resource is not highly contended.
HTTP has a built-in feature for this version information called the ETag header which is used for caching and conditional requests. When a client gets an entity from the server, the ETag header in the response will be populated with the version information. In practice, the version information is typically the result of applying a cryptographic hash function to either the data itself or some unique version identifier. If the client subsequently decides to update that entity, he should provide the version information in an If-Match header in the update request (a conditional PUT). This allows the server to do proper validation of the client's update as per the optimistic locking protocol and return the appropriate status code if the version no longer matches. If the client gets a response that indicates a version mismatch, he can simply get the entity again and perform the update as desired, the important point being that the other update is acknowledged.
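As a rough client-side sketch of the protocol (the URL and payload are made up, and error handling is omitted):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ConditionalUpdate {
    public static void main(String[] args) throws Exception {
        URL resource = new URL("https://api.example.com/tickets/42");

        // GET the entity; the ETag header carries its current version.
        HttpURLConnection get = (HttpURLConnection) resource.openConnection();
        String etag = get.getHeaderField("ETag");

        // Conditional PUT: the server applies the update only if the version still matches.
        HttpURLConnection put = (HttpURLConnection) resource.openConnection();
        put.setRequestMethod("PUT");
        put.setRequestProperty("If-Match", etag);
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
            out.write("{\"status\": \"closed\"}".getBytes(StandardCharsets.UTF_8));
        }

        if (put.getResponseCode() == 412) {
            // 412 Precondition Failed: another update happened first.
            // Re-read the entity, reapply the change, and retry the conditional PUT.
        }
    }
}
```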
The use of the ETag header for concurrency control is not necessary in all REST APIs, but it can be an easy way to relieve clients of worrying about complex synchronization themselves.
Sunday, January 13, 2013
Fantasy Proleague
Fantasy Proleague is a fantasy league for StarCraft 2 that I've been participating in for the last month or so. I played fantasy baseball for a few seasons in the past, and the concept has always been an interesting optimization problem to me. Recently, the problem of finding the best possible fantasy team for the previous round was posed, and I decided to take a stab at it. I'll briefly describe the rules and then outline the approach I took.
Rules
A fantasy Proleague team consists of six players and one Proleague team. Each player and team you can acquire has an initial cost at the beginning of the round. You begin the season with 30 points to allocate to those six players and one team however you like, with one restriction: you must have one player of each race. Each week, for the four weeks in the round, players and teams accumulate points based on their performance and their costs adjust accordingly. At the end of each week, you are permitted to make two trades, which means taking a player from your team and swapping it for a different player from the pool. A trade is valid so long as the current cost of the player you are trading is at least the current cost of the player you are acquiring and the race requirement is still satisfied. One of the trades may be used on the team as well, with the same cost restriction. You lose one point for every trade you make. The goal is, naturally, to maximize the number of points scored by your team over the course of the round.
Model
Modeling the data is simple; each team has a cost and a point gain for each week (eight total values). The costs represent the cost of the team at the beginning of the week, and the point gains correspond to how many points the team gained in the week that followed. Players are represented in the same way except they additionally have a race.
I'll note that this problem essentially boils down to search optimization; even without trading it is a knapsack problem, and trading only complicates things further. Luckily, there are only 89 players, eight teams, and four weeks, so a handful of optimizations can be applied to make the search space manageable. The remainder of this post breaks down into two sections: optimizing the initial players/team and optimizing the trades. For most of this discussion, I'm going to ignore the race requirement as it makes the problem much more difficult; fortunately, we still end up with a provably optimal solution (although not in the most elegant way).
Initial Team
Reducing the set of possible initial teams greatly limits the search space, so any pruning we can do before even considering trades is extremely helpful. The key here is identifying which players are strictly superior to which others. We'll define player A to be strictly superior to player B if the following conditions hold (defined similarly for teams):
- player A's initial cost is at most player B's initial cost;
- player A's cumulative points are at least player B's cumulative points at the end of each week;
- player A's cost is at least player B's cost at the end of each week; and
- there is some week for which player A's cumulative points or cost is strictly greater than player B's cumulative points or cost, respectively.
If all of those conditions hold, it's pretty easy to see that you should never choose player B over player A in your initial team. Whatever you do with player B in the future can be done with player A, and player A has some strict advantage over player B that you may be able to leverage (note that this is not true with the race requirement, because player B may have a race that is necessary for your team). So when choosing initial teams, we enforce that we never choose a player unless all players who are strictly superior to the player in question are on the team already, e.g. if there are six or more other players strictly superior to him, he will never be chosen on the initial team. This relatively simple optimization reduces the initial space from a few billion teams to on the order of 100,000.
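As a sketch of how the pruning check might be coded (the class layout and names are my own):

```java
public class Superiority {
    static class Player {
        double initialCost;      // cost at the beginning of the round
        double[] weeklyCost;     // cost at the end of each week
        double[] weeklyPoints;   // cumulative points at the end of each week
    }

    /** Returns true if a is strictly superior to b per the four conditions above. */
    static boolean strictlySuperior(Player a, Player b) {
        if (a.initialCost > b.initialCost) return false;            // condition 1
        boolean strictSomewhere = false;
        for (int w = 0; w < a.weeklyPoints.length; w++) {
            if (a.weeklyPoints[w] < b.weeklyPoints[w]) return false; // condition 2
            if (a.weeklyCost[w] < b.weeklyCost[w]) return false;     // condition 3
            if (a.weeklyPoints[w] > b.weeklyPoints[w] || a.weeklyCost[w] > b.weeklyCost[w]) {
                strictSomewhere = true;                              // condition 4
            }
        }
        return strictSomewhere;
    }
}
```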
Trading
To reduce the space of trading, we need a very similar concept to strict superiority. Player A is superior from week w to player B if, from week w onward, the following conditions hold:
- player A's cumulative points from week w are at least player B's cumulative points from week w at the end of each week; and
- player A's cost is at least player B's cost at the end of each week.
Then, when deciding whether to trade player A for player B at the beginning of week w, we only consider it if player A's cost that week is at least player B's (required), player B is not already on the team (required), and player A is not superior from week w to player B after subtracting one from B's points (to account for the trade cost). So for each week we can build the list of potential trades for each player after applying this filter.
We can further reduce the search space by the following observation. Suppose we are considering trading player A at the beginning of week w. If players B and C both satisfy the above criteria but player B is superior from week w to player C, there is no reason to explore the space resulting from trading player A for player C since trading for player B is always a better choice. For example, this means that when making the final trade, you only need to consider the player that scores the highest in that last week among all possible players you can trade for, which is a pretty intuitive result.
Result
After applying the optimizations described, I was able to exhaustively search for all potential optimal teams in less than an hour. The best team (for those who actually played) can be found in this forum post. Note that the team there does indeed satisfy the race requirement, which is the last point to discuss. I implemented all of the optimizations ignoring the race requirement because it's much more difficult to determine when a player should always be chosen over another when the race factors in. It turns out, however, that the best team ignoring the race requirement altogether scores the same number of points as the best team I found which satisfies the race requirement. That proves that the team is indeed optimal, although not in the most satisfying way. I've continued to think about a good way of modeling the race requirement without eliminating most of the benefit from the optimizations but have come up empty so far. Regardless, I found interesting theoretical value in the process as well as a practical result; we'll see how it deals with the next round of data once it's in.
Friday, January 11, 2013
Covariant Return Types
The topic of this post is covariant return types in the context of the builder pattern and how you can use the builder pattern with inheritance in Java. To illustrate what the concept of covariant return types is, consider the following example (roughly taken from Wikipedia):
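The example itself isn't reproduced in this copy, but it goes roughly like this:

```java
class A {}
class B extends A {}

class C {
    A foo() { return new A(); }
}

class D extends C {
    @Override
    B foo() { return new B(); } // covariant return type: B is a subtype of A
}
```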
Here the D class overrides the foo method of C, but it doesn't return the same type (B versus A). Since B is a subclass of A, however, this satisfies the property of the return type being covariant, as we are still guaranteed that the return type of foo is "an A." So why is this useful? Suppose you have a Resource class. The builder pattern gives you some nice syntax like this to chain methods together when constructing an object:
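The original snippet isn't shown; here's a sketch of the idea (the fields other than name are my own inventions):

```java
class Resource {
    private String name;
    private int size;
    private String owner;

    Resource withName(String name)   { this.name = name;   return this; }
    Resource withSize(int size)      { this.size = size;   return this; }
    Resource withOwner(String owner) { this.owner = owner; return this; }

    public static void main(String[] args) {
        // Chained construction, assigning only the fields we care about:
        Resource resource = new Resource().withName("my-resource").withSize(42);
    }
}
```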
It's especially useful when the list of fields is large and you often only want to assign a subset of them. In that case constructors don't cut it (you'd need exponentially many of them to cover all cases...) so the builder pattern provides a convenient alternative that avoids having to pass in a bunch of null values. The key to making this work is having the setters return the "this object" and thus enable chaining. Simple enough, but when you introduce inheritance it gets a bit tricky.
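(The next snippet isn't preserved either; this is a reconstruction of the natural first attempt, building on the Resource sketch above.)

```java
class ExpiringResource extends Resource {
    private long expirationTime;

    ExpiringResource withExpirationTime(long expirationTime) {
        this.expirationTime = expirationTime;
        return this;
    }

    public static void main(String[] args) {
        // Does not compile: withName returns a Resource, which has no withExpirationTime method.
        ExpiringResource r = new ExpiringResource()
            .withName("temp-resource")
            .withExpirationTime(System.currentTimeMillis() + 60_000);
    }
}
```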
Now we have a subclass ExpiringResource which is just a Resource with an expiration time. Although this code looks natural, it doesn't compile because the withName method returns a Resource and not an ExpiringResource, which doesn't have the withExpirationTime method. Fortunately, we can leverage covariant return types to make this work:
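(Again a reconstruction rather than the original code.)

```java
class ExpiringResource extends Resource {
    private long expirationTime;

    @Override
    ExpiringResource withName(String name) {
        super.withName(name);
        return this; // covariant return type: ExpiringResource instead of Resource
    }

    ExpiringResource withExpirationTime(long expirationTime) {
        this.expirationTime = expirationTime;
        return this;
    }

    public static void main(String[] args) {
        // Now the chain compiles and keeps the more specific type throughout.
        ExpiringResource r = new ExpiringResource()
            .withName("temp-resource")
            .withExpirationTime(System.currentTimeMillis() + 60_000);
    }
}
```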
We are overriding the withName method so that it returns an ExpiringResource instead of a normal Resource (a covariant return type). Then we can successfully chain the methods together and make our example compile. It's not shown here, but you'd typically do the same with the other two methods on Resource as well. For these simple setters, it's a bit unfortunate that we have to add so much code in the subclasses, but you can imagine if those methods had other side effects or validation that it would be worthwhile to share that code. To wrap up, Java's support for covariant return types allows us to use the builder pattern while maintaining a sensible inheritance hierarchy.
Sunday, January 6, 2013
Mockito (Part 2)
Last time, I briefly introduced Mockito, a Java/Scala mocking framework that uses the bytecode generation library cglib to provide a very simple and natural way of describing mocking behavior. We saw that we could use cglib to create dynamic subclasses and intercept their methods in ways that seemed to be close to how something like Mockito might work. This time I'll provide a full implementation of what I am calling MiniMockito, which allows the straightforward when/thenReturn mocking behavior seen here (same example as last time):
Let's take a look at a few method interceptors we'll use to get there (apologies for the large chunk of code):
Here, the MockIntercepted trait is what we'll mix in to the return of methods, e.g. the value of someClass.foo, the first time they are called (or until a stub return value is provided). Because this trait is mixed into a dynamically-generated class, the methods don't actually have implementations anywhere. That's where the StubInterceptor comes in; for the two methods on the MockIntercepted trait, we have predefined return values that give context about which MockInterceptor we're using and which method was stubbed out. These two pieces of information are necessary for thenReturn to set the stub return value.
Lastly, the most important piece is the MockInterceptor itself, which is the method interceptor for the actual mock object. When there is no stub return value set yet, the MockInterceptor will dynamically generate a subclass which mixes in MockIntercepted and uses the StubInterceptor. The subclass carries the context of the method call so we know which method to stub later. The one piece remaining is when the methodReturns map is updated with the stub return values; once it is populated, future calls to the method will always return the stored value! So here's the rest:
The Stubbing object is what actually takes the method and the stub return value and puts them in the methodReturns map. It is returned by the call to when and has context about the mock interceptor and stubbed method because of the MockIntercepted trait we mixed in earlier. The remaining pieces of the implementation are relatively straightforward and included for completeness.
So, to summarize, what does all this code do?
- The mock method dynamically generates a subclass of the mocked class (SomeClass) which uses the MockInterceptor.
- The MockInterceptor dynamically generates a subclass of the return class (StringInt) that mixes in the MockIntercepted trait and uses the StubInterceptor to implement those methods.
- A call to when uses the MockIntercepted trait that was mixed in to create a Stubbing that knows about the corresponding MockInterceptor and the method that was stubbed (foo).
- Calling thenReturn on the Stubbing populates the map maintained by the MockInterceptor to intercept future calls to the method and return the stub value.
As a final note, this implementation is naturally really hacked together in order to keep it short (sort of, anyway). There are numerous details left out and assumptions made that Mockito naturally does a much better job of handling (e.g. I didn't really figure out a robust way to deal with the lack of no-arg constructors in the mocked class), but it touches upon the basic building blocks offered by cglib that Mockito leverages to achieve its magical behavior. Hope this provided some insight into how this awesome library works!
Friday, January 4, 2013
Mockito (Part 1)
Mockito is a really neat Java/Scala mocking framework that I recently came across. I'm familiar with mocking frameworks that use dynamic proxies to mock interfaces, which are already pretty magical in their use of reflection to intercept method calls. They also have many uses beyond testing and are a good tool to understand. Mockito goes beyond that in its magic, in ways that I previously thought impossible in the Java language. Consider the following example:
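The example isn't reproduced in this copy of the post. Here's an equivalent written against the real Mockito API so that it runs as-is (the post notes its own snippet uses slightly different semantics, and the contents of SomeClass and StringInt are my guesses):

```scala
import org.mockito.Mockito.{mock, when}

class StringInt(val str: String, val num: Int)
class SomeClass { def foo: StringInt = new StringInt("foo", 1) }

object MockitoExample extends App {
  val someClass = mock(classOf[SomeClass])
  when(someClass.foo).thenReturn(new StringInt("bar", 2))

  val result = someClass.foo
  println(result.str + " " + result.num) // prints "bar 2"
}
```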
This code outputs "bar 2" as you might expect from reading it (it's not quite the same semantics as Mockito, but very close). The interesting things to note here are the fact that SomeClass is just a normal class, not an interface, and the when/thenReturn paradigm (these are static methods in Mockito, along with mock). Looking at the code, it's clear that the call to foo is intercepted/overridden to return something more than just a StringInt object. Otherwise, calling when on it would not be able to accomplish what this code does.
But we are in a type-safe language, so mock needs to return an object which is a subclass of SomeClass that intercepts the foo method call to return an object which is a subclass of StringInt that contains enough information to mock out future calls to foo (e.g. which method was called and what the mocked return value should be). In other words, to not break type safety, we need to create subclasses of these classes that can add on the extra required functionality. Because these subclasses don't exist at compile-time, their bytecode needs to be generated at run-time, which turns out to be possible. The dynamic nature in which this happens is the reason that we can take (almost) any class and create a mock for it with very few assumptions about the original class.
cglib is a bytecode manipulation library whose purpose is to "extend Java classes and implement interfaces at run-time." The Enhancer in particular allows you to create an instance of a dynamically-generated class which can:
- subclass any non-final class,
- have callbacks for method interception, and
- implement additional interfaces.
For example:
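A sketch along those lines in Scala (the toy SomeClass and StringInt definitions are repeated so the snippet stands alone; cglib's classes live in the net.sf.cglib.proxy package):

```scala
import java.lang.reflect.Method
import net.sf.cglib.proxy.{Enhancer, MethodInterceptor, MethodProxy}

class StringInt(val str: String, val num: Int)
class SomeClass { def foo: StringInt = new StringInt("foo", 1) }

object EnhancerExample extends App {
  val enhancer = new Enhancer()
  enhancer.setSuperclass(classOf[SomeClass])
  // Intercept every call on the dynamically generated subclass.
  enhancer.setCallback(new MethodInterceptor {
    override def intercept(obj: AnyRef, method: Method, args: Array[AnyRef],
                           proxy: MethodProxy): AnyRef =
      if (method.getName == "foo") new StringInt("bar", 2) // substitute our own return value
      else proxy.invokeSuper(obj, args)                    // otherwise run the real method
  })

  val proxied = enhancer.create().asInstanceOf[SomeClass]
  val result = proxied.foo
  println(result.str + " " + result.num) // prints "bar 2"
}
```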
Using SomeClass from the previous example, we dynamically generate a subclass for it that uses a method interceptor to return a different object. This code also outputs "bar 2" and gives us a little bit of insight into how these dynamically-generated subclasses can be used to get the mocking behavior in Mockito. If we can create a method interceptor that somehow keeps track of the when/thenReturn calls, we should be able to substitute the return value in place of invoking the actual method.
Stay tuned for part 2 which will demonstrate how this is done and a very minimal implementation of Mockito functionality.
Edit: Part 2 is here.
Tuesday, January 1, 2013
Message Queues and RPC
With the arrival of a new year, I'd like to bring up an old question in distributed systems research. What is the "best" way for applications running on separate machines to communicate with each other? Specifically, I'm interested in the case in which you have several different servers running individual pieces of your system architecture that need to communicate (rather than, for example, a client-server model). In my experience, I've come across two basic methods for doing this, namely message queues and remote procedure calls (RPC). Examples of software/libraries include RabbitMQ, HornetQ, and ActiveMQ for message queues, and Avro and Thrift for RPC.
What's the difference?
The key difference between message queues and RPC is the level of abstraction they provide the programmer. With message queues, it is generally left to the developer to serialize and deserialize messages to/from the queue and decide how to process each type of message; you work with a basic send/receive paradigm and as a result have a lot of control over the communication. RPC, on the other hand, tries to hide this complexity by making a call to a remote machine look like a call to a local object (hence the name). It's often the case that RPC functionality is packaged with serialization libraries (e.g. Avro and Thrift) because it relies on some serialization format in order to automatically handle the translation of a method call on one machine to a method call on another.
If that's where the discussion ended, it would be a pretty clear win for RPC. Abstraction is a programmer's greatest tool, and it can be argued that someone developing an application shouldn't need to know whether a method being called is a local or remote object. The convenience associated with this approach has been attributed as one of the reasons why RPC is so popular (it's used extensively in Google's infrastructure, and Thrift came out of Facebook) despite some serious criticisms from almost 20 years ago. It's my intention to attempt to make sense out of arguments that people have made for and against RPC and discuss what I believe to be important features of message queues that RPC lacks.
Theory vs practice
Many of the discussions about RPC focus on what I'll call "pure RPC." What I mean by that is the concept of making local and remote method calls completely indistinguishable. It's clear that this is impossible given the four concerns of latency, memory access, partial failure, and concurrency brought up here. Instead, I'd like to focus on the practical RPC libraries that people work with and what they offer. In most such cases, the programmer does know that they're making a method call to an external service by nature of understanding the architecture and having to define the interfaces/data structures to be serialized. This has already broken the abstraction, which takes away some of the benefits of RPC, but there's still something useful about having your services represented by interfaces and calling methods on them.
What RPC can do
Scalability is one concern that has been brought up about RPC. Message queues are usually an isolated service which provides a decoupling between the sender and receiver, which means that it's really easy for multiple senders/receivers to operate using the same queue and scale out horizontally. With RPC, it's not quite a "built-in" feature, but those who claim that you can't scale it should have a word with Google and Facebook. Since method calls over RPC are typically directed at a single endpoint, it might seem like it won't scale horizontally, but putting a load balancer in front of your endpoints goes a long way. If you make any assumptions about state it gets a bit trickier, but that's a problem that both message queues and RPC have to deal with.
Robustness is a second problem people like to talk about with RPC; by robustness I mean what semantics we can guarantee on the execution of a method, e.g. exactly once, at most once, or at least once. Generally, you have two possibilities: if you try a method call once only, you get "at most once" semantics (which is typically very hard to reason about); if you retry method calls until you hear a success, you get "at least once" semantics (which is also hard to reason about, but less so). That might seem problematic, but again message queues suffer from the same issue even with acknowledgements and confirmation of delivery.
Message queues and asynchronous communication
This blog post provides a list of ten important features of message queues that make them a good choice. I will highlight three in particular that differentiate message queues from RPC. The first is durability; many message queues provide support for message persistence. When you send a message to the queue, it will write it to disk before acknowledging your message so that in the event of a failure the message will not be lost. This is difficult to do using RPC because there is no intermediary and it doesn't make a lot of sense to replay a method call when you no longer have the context (e.g. stack) in which it was made.
Another property of message queues that makes them compelling is the fact that messaging is inherently asynchronous. After a message is sent to the queue, the process should not block waiting for a response; instead it can provide a callback or use continuations (e.g. Jetty). This frees up resources while waiting on expensive I/O operations, an important aspect of building systems that perform well under load. Although there's nothing inherent in the idea of RPC that it has to be synchronous, that's typically how it's used because object-oriented programming has taught us all to write procedural code with "blocking" method calls.
The last important feature that I want to discuss is a corollary of asynchronous communication, which is elasticity. Suppose that a system is suddenly subjected to a hundred times the typical load, but only for a brief period. With an RPC-based architecture, there would be many threads all blocked on method calls to remote services as they slowly processed the increased volume of requests; as many know, this is an easy way to kill a system or at the very least make it non-responsive. A message queue helps absorb sudden load spikes because it allows senders to asynchronously send a message and then free their resources. The messages accumulate in the queue (they are designed to handle this) while the recipients process them, and every component in the system behaves normally. Again, a continuation system would let you return control to the original sender once its message has been processed.
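To make the contrast concrete, here's a toy in-process sketch using plain java.util.concurrent (not a real message broker) of a producer handing a burst of messages to a queue and moving on while a slower consumer drains it:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // Consumer: drains the queue at its own pace.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take();
                    Thread.sleep(10); // simulate slow processing
                    System.out.println("processed " + message);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: a burst of load is absorbed by the queue instead of blocking the caller.
        for (int i = 0; i < 1_000; i++) {
            queue.put("message-" + i);
        }
        System.out.println("producer is done enqueueing and free to do other work");
        Thread.sleep(2_000); // let some messages drain before the demo exits
    }
}
```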
Conclusion
Both message queues and RPC solve the problem of communicating between applications running on different machines. The discussion about which method is "better" is a good way to evaluate what is the right choice in specific situations. Empirical evidence suggests that you can build robust, scalable systems with either one, so it's hard to say if there's necessarily a wrong choice here. Message queues provide some benefits that RPC lacks, though, and I would argue that those are important enough that they should be addressed in some other way if an RPC approach is taken.