Monday, September 23, 2013

Erlang and Riak

Today I learned that Riak, a distributed database based on Amazon's Dynamo, is written in Erlang. This is really cool because I have some personal experience with Apache Cassandra, another Dynamo-like distributed database, written in Java. Specifically, when I worked with Cassandra, I encountered multiple concurrency bugs that were very difficult to track down and sometimes quite insidious. That's not to fault the writers of Cassandra; it is a complex, multithreaded system with a lot of moving parts, which is exactly the type of code that led me to write about the difficulty of multithreading. Being written in Erlang, however, Riak should be able to escape some of the reliability issues that arise when dealing with shared, mutable state in Java. I am particularly curious whether Riak is more stable than Cassandra, but that information seems hard to come by.

I did stumble across this blog post evaluating the choice of Erlang for Riak five years later. The folks at Basho (the company behind Riak) seem to be happy with it, and their rationale for choosing Erlang lines up pretty well with my expectations. Two of the more interesting benefits they describe are that "a 'many-small-heaps' approach to garbage collection would make it easier to build systems not suffering from unpredictable pauses in production" and "the ability to inspect and modify a live system at run-time with almost no planning or cost." The former hits home again with my Cassandra experience, as Java garbage collection on a 16GB heap is not nearly as predictable as one would hope, and the latter is certainly valuable for fast-paced startups and the (in)famous "move fast, break things" philosophy. In the end, I am just happy to see that there are startups out there that have sought alternatives to concurrent programming hell and built some neat systems as a result.

Tuesday, September 17, 2013


One of the core principles of designing good software systems is composability: it should be easy to combine and reuse different components. This has all sorts of benefits, but one of the most important is that it becomes possible for programmers to understand a complex codebase as the composition of simple parts and their interactions. Without some degree of composability and modularity, it would not be possible to build reliable systems over the course of years involving hundreds of engineers, because the code simply becomes too complex to understand. Fortunately, people have gotten better at doing this, and a key part of making components easy to compose is the concept of idempotence: the simple idea that, when possible, functions should be written such that calling them multiple times in succession with the same arguments does not change the result. For example, putting a value in a set is idempotent, while appending to a list is not, since in the latter case the list keeps growing with each append.
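To make the set-versus-list contrast concrete, here is a tiny sketch (the class name is mine, not from the post):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// The set-versus-list contrast, made concrete: repeated set adds are
// idempotent, repeated list appends are not.
public class IdempotenceDemo {
    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>();
        set.add(7);
        set.add(7);                // repeating the add changes nothing
        System.out.println(set);   // [7]

        List<Integer> list = new ArrayList<>();
        list.add(7);
        list.add(7);               // each append keeps growing the list
        System.out.println(list);  // [7, 7]
    }
}
```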

Here's an example of an interface which is not guaranteed to be idempotent (BadService) and the resulting attempt to deal with two instances of that interface:
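The snippet that originally accompanied this post is no longer available; the following is a hypothetical reconstruction, so everything beyond the names BadService and BadServiceManager is invented for illustration:

```java
// Hypothetical reconstruction: a start/stop interface with no idempotence
// guarantee, so callers must never start twice or stop before starting.
interface BadService {
    void start(); // throws IllegalStateException if already started
    void stop();  // throws IllegalStateException if not running
}

class CrashyService implements BadService {
    private boolean started = false;

    public void start() {
        if (started) throw new IllegalStateException("already started");
        started = true;
    }

    public void stop() {
        if (!started) throw new IllegalStateException("not started");
        started = false;
    }

    public boolean isStarted() { return started; }
}

// Because start/stop are not idempotent, the manager must mirror each
// service's state and guard every call; this bookkeeping multiplies with
// every additional component it has to manage.
class BadServiceManager {
    private final BadService first;
    private final BadService second;
    private boolean firstStarted = false;
    private boolean secondStarted = false;

    BadServiceManager(BadService first, BadService second) {
        this.first = first;
        this.second = second;
    }

    public void start() {
        if (!firstStarted) {
            first.start();
            firstStarted = true;
        }
        if (!secondStarted) {
            second.start();
            secondStarted = true;
        }
    }

    public void stop() {
        if (firstStarted) {
            first.stop();
            firstStarted = false;
        }
        if (secondStarted) {
            second.stop();
            secondStarted = false;
        }
    }
}
```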

Here, the BadServiceManager is responsible for starting and stopping two BadServices. As you can see, it requires some very ugly state management that only gets worse as you bring more and more components together. With idempotence guarantees, the natural way for this to look is as follows:
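Again the original snippet is lost; a sketch of the idempotent counterpart (all names here are my own) could look like:

```java
// Sketch of the idempotent counterpart (hypothetical names): start and stop
// are safe to call any number of times.
interface GoodService {
    void start(); // no-op if already started
    void stop();  // no-op if already stopped
}

class QuietService implements GoodService {
    private boolean started = false;

    public void start() { started = true; }  // idempotent
    public void stop()  { started = false; } // idempotent

    public boolean isStarted() { return started; }
}

// With idempotence guaranteed by the interface, the manager keeps no state
// at all: it simply forwards the calls, however many times it is invoked.
class GoodServiceManager {
    private final GoodService first;
    private final GoodService second;

    GoodServiceManager(GoodService first, GoodService second) {
        this.first = first;
        this.second = second;
    }

    public void start() {
        first.start();
        second.start();
    }

    public void stop() {
        first.stop();
        second.stop();
    }
}
```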

In terms of local systems, reduced complexity is the primary reason to want idempotence. If that is not compelling enough, as soon as we start working with distributed systems the benefits become very clear-cut. Consider a remote procedure call (RPC) made by machine A to machine B. Machine A receives a network timeout while making the call and is now in a predicament: did machine B execute the call or not? This is a very difficult question to answer generically, but idempotence can help. In the case of a network timeout, machine A simply retries the RPC until it receives a successful response from machine B, because idempotence guarantees that repeated calls will not affect the state. This greatly simplifies the job of the RPC client and makes it very natural for distributed components to communicate with each other robustly.
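The client-side retry logic can be sketched as follows (the RPC machinery itself is elided; Callable stands in for the remote call, and the names are mine):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Sketch of retry-on-timeout. This is only safe because the call is
// idempotent: repeating it cannot change the outcome on machine B.
public class RetryingRpcClient {
    public static <T> T callIdempotent(Callable<T> rpc, int maxAttempts) throws Exception {
        TimeoutException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (TimeoutException e) {
                // Outcome unknown: B may or may not have executed the call.
                // Idempotence means we can simply try again.
                last = e;
            }
        }
        throw last;
    }
}
```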

Understanding idempotence is an effective way to improve the composability and reliability of code. Unfortunately, it is not always easy to make functions idempotent and the restriction can make interfaces less clean. The benefits, however, manifest themselves very clearly when you do not need to deal with the unpredictability and complexities of calling non-idempotent interfaces, especially in a distributed setting.

Sunday, September 8, 2013

Actors and Error Handling

In addition to the basic model of actors, Erlang has another core language construct that is intended to help programmers build reliable systems. It has built-in support for various forms of error handling in the context of Erlang processes. For example, a single process will often rely on the assumption that various other processes that it depends on are available and will stop functioning properly if any of those processes fail. In many other languages, because the concept of disjoint processes is not fundamental, handling such failures is generally ad-hoc and often ignored by programmers until something unexpected happens. But in the actor model and especially when blurring the line between local and remote processes, it is necessary to treat external process failures as an expected outcome that can ideally be recovered from. As such, Erlang provides three primitives with which to build robust systems of processes: links, traps, and monitors.

Links are a way of specifying a hard dependency between two processes: if one of them dies, the other cannot function and thus should die as well. They can be used to set up groups of processes which correspond to some meaningful entity in the system that requires each of the individual processes to be running properly. When processes are linked and one dies, it sends a special 'EXIT' message to the other that isn't received or caught by the normal messaging and exception handling constructs (instead killing the receiving process) unless specific measures are taken. This leads us to the concept of traps, a method by which you can explicitly handle an exit message from a linked process. If two processes are linked and neither traps exit messages, they will die together. But if, for example, you have a process dedicated to monitoring a group of other processes, it can choose to receive exit messages and act accordingly when they arrive, e.g. restarting the entire group of processes. Links and traps allow programmers to specify entities in a system that continue to operate after failures.

The final piece of Erlang error handling is the ability to monitor other processes. Monitors are similar to links except that they are unidirectional and stackable, meaning that you can compose them more easily and that they work better with libraries. Moreover, you do not need to trap the 'DOWN' messages that a monitored process sends; they are received as normal messages, since there is no implicit process kill involved (being unidirectional, monitors do not tightly couple the two processes). Monitors are a more flexible way for processes to communicate their errors to each other without linking their fates together. With these three simple primitives for error handling, it is possible to construct systems that respond well to failure and can self-recover, leading to claims such as Erlang creator Joe Armstrong's nine 9's of reliability in the Ericsson AXD301.
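Mainstream languages lack these primitives, but a rough analogue of a monitor plus a restart strategy can be sketched in Java, with an uncaught-exception handler standing in for the 'DOWN' message (a loose illustration of the idea, not Erlang's actual semantics; all names are mine):

```java
// Loose Java analogue of monitoring: the supervisor is notified when a
// worker thread dies and applies a naive restart-on-failure strategy.
// A real supervisor would also cap or back off restarts.
public class Supervisor {
    private final Runnable task;

    public Supervisor(Runnable task) {
        this.task = task;
    }

    public Thread spawnMonitored() {
        Thread worker = new Thread(task);
        // Rough stand-in for receiving a 'DOWN' message from the worker.
        worker.setUncaughtExceptionHandler((thread, error) -> {
            System.out.println("worker died: " + error + "; restarting");
            spawnMonitored();
        });
        worker.start();
        return worker;
    }
}
```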

As with the actor model, Scala's Akka library provides functionality similar to Erlang's error handling through its concept of supervision. In some sense, Akka supervision is a stricter version of links and monitors: every actor must be supervised by some other actor, its parent, so that the set of all actors forms a hierarchy (the root actor is provided by the library). The supervising actor can employ different strategies for monitoring the lifecycle of each of its child actors, which often involves restarting failed ones. While not quite as flexible as Erlang error handling, the way Akka enforces the actor hierarchy leads programmers to naturally structure their systems as a hierarchical tree of components that can either function independently of failures in other parts of the hierarchy or propagate and handle errors as necessary. Building concurrent and distributed systems requires serious consideration of errors and failures, yet there are often no good tools for doing so. Erlang's ideas and Akka's adaptation of them are examples of how this problem can be tackled at a first-class level.

Sunday, September 1, 2013

Clojure Agents

Continuing the exploration of Clojure's concurrency utilities leads us to agents, a built-in method of managing independent, asynchronous change to a single value. We can think of this as a parallel to a single actor that controls some private state which can only be accessed by sending messages to the actor. Agents are a little bit different in that they invert the control over how the state is modified: instead of an actor receiving a predefined set of messages which have different effects on the state, agents allow the sender to specify how the state is mutated. The agent itself is only responsible for serializing the mutations and allows the state to be read by anyone. Controlling access to state through agents is preferable to software transactional memory (STM) when asynchronous updates are acceptable, as you can avoid the performance concerns of STM. In this post, I will walk through an example of using agents for tracking metrics for some samples.
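The gist that originally accompanied this post is gone; the following is a reconstruction inferred from the walkthrough below, so treat the details as illustrative rather than the original code:

```clojure
;; Reconstruction of the lost gist; details are inferred from the prose,
;; so they are illustrative rather than the original code.
(def global-metrics (agent {:sum 0 :sum-squares 0 :count 0}))

(defn update-metrics [metrics value]
  ;; (Thread/sleep 1) ; uncomment to see why the await below is necessary
  (-> metrics
      (update-in [:sum] + value)
      (update-in [:sum-squares] + (* value value))
      (update-in [:count] inc)))

(defn record-sample [value]
  (send global-metrics update-metrics value))

(defn get-metrics []
  (await global-metrics)
  (let [{:keys [sum sum-squares count]} @global-metrics
        mean (/ sum count)]
    {:mean mean
     :variance (- (/ sum-squares count) (* mean mean))}))

(doall (pmap record-sample (range 10000)))
(println (get-metrics)) ; {:mean 9999/2, :variance 33333333/4}
```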

The agent, global-metrics, is initialized as a map with three values: the sum of all samples, the sum of the squares of all samples, and the count of samples. By updating these three values for each sample, we can compute the resulting mean and variance at any point in time. The two "public" functions here are record-sample and get-metrics, whose purposes should be clear from their names. Because we want recording metrics to be thread-safe, we route all updates through the agent with (send global-metrics update-metrics value). Send tells the agent to process this update at some point in the future, and the update-metrics function takes the sample value and updates the three values in the map. Then, to read the metrics in get-metrics, we first call (await global-metrics) to ensure that all actions sent from the current thread have finished; this prevents the race where we read the metrics before a sample we just recorded has been processed (since agents process actions asynchronously). Finally, we read the current value of global-metrics and return the computed mean and variance.

The final two expressions exercise these functions: we record samples from 0 to 10000 in parallel and then get the mean and variance. Running this, we see the expected result: {:mean 9999/2, :variance 33333333/4}. To understand agents a bit more, I included a commented-out Thread/sleep in update-metrics to demonstrate why the await call is necessary. If we add the sleep, we still obtain the correct result, but it takes 10 seconds. If we add the sleep and also remove the await from get-metrics, we get something arbitrary like {:mean 221/15, :variance 18404/225} from whatever subset of samples happened to be recorded at that time (the result is nondeterministic). Thus it is essential that we allow the actions from the current thread to finish before attempting to read the metrics. Other than that, there is no synchronization visible in this code; it is hidden away by the agent abstraction.

Clojure offers a wide array of methods for dealing with shared, mutable state. Agents are one of the core tools the language provides for isolating who can update a piece of state, so that there is no burden on the programmer to manage it. They also play well with the STM: sending actions to agents within a transaction does nothing until the transaction is committed, so each action is executed only a single time regardless of whether the transaction is forced to roll back. In the actor model with Erlang, we were forced by the immutability constraint to have a dedicated actor for managing any shared, mutable state, and Clojure agents are a specialization of that concept designed for simplicity of code. Overall, Clojure seems to have taken a very open-minded approach to the concurrency problem and presents a handful of unique solutions mixed together in a nice way.