Sunday, July 28, 2013

Erlang Executor

Last time, I wrote a crude blocking queue in Erlang to better understand the actor model and the functional paradigm for managing state. This time, I will be building on the blocking queue to implement a simple executor service that emulates the Java concept using actors. To be clear, the goal of the exercise is not to see how multithreading concepts from Java can be applied in Erlang, which would probably go against the entire point of using the actor model, but rather to develop a sense for how the familiar Java abstractions would manifest themselves in Erlang. That way, when confronted with a problem for which the natural solution would be a blocking queue or an executor service, it is easier to see how the solution can be framed in terms of actors. Without further ado, the code:
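A minimal sketch of the executor, assuming the queue from the last post is a module named blocking_queue exporting new/1, put/2, and take/1 (all module, function, and message names here are illustrative):

```erlang
-module(executor).
-export([new/1, submit/2, get/1]).

%% Create the executor: a blocking queue of tasks plus NumWorkers
%% processes that each loop in worker/1. The capacity of 10 is arbitrary.
new(NumWorkers) ->
    Queue = blocking_queue:new(10),
    [spawn(fun() -> worker(Queue) end) || _ <- lists:seq(1, NumWorkers)],
    Queue.

%% Submit a task: spawn a future process running wait/0, enqueue the
%% task along with its future, and hand the future back to the caller.
submit(Executor, Task) ->
    Future = spawn(fun wait/0),
    blocking_queue:put(Executor, {Task, Future}),
    Future.

%% Block until the future's result is available.
get(Future) ->
    Future ! {get, self()},
    receive
        {result, Result} -> Result
    end.

%% Each worker repeatedly takes a task off the queue, runs it, and
%% sends the result to the task's future.
worker(Queue) ->
    {Task, Future} = blocking_queue:take(Queue),
    Future ! {done, Task()},
    worker(Queue).

%% Phase 1: block until the worker sends the result.
wait() ->
    receive
        {done, Result} -> wait(Result)
    end.

%% Phase 2: answer any number of get requests with the result.
wait(Result) ->
    receive
        {get, From} ->
            From ! {result, Result},
            wait(Result)
    end.
```

Assuming these names, usage would look like: E = executor:new(4), F = executor:submit(E, fun() -> 6 * 7 end), and executor:get(F) would then return 42, blocking first if no worker has finished the task yet.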

Again, the public interface consists of just three simple functions: new(), submit(), and get(). These functions create a new executor with a specified number of threads, submit a task, and get the result of a task, respectively. When the executor is created, we initialize a blocking queue and spawn a number of processes corresponding to how many "threads" the client wants, each of which runs the worker() function. As in Java, worker() repeatedly takes tasks off the queue and runs them.

The interesting part of this is how the concept of futures is implemented. A future is a proxy to the result of a task submitted to an executor, and its primary feature is the ability to block until that result is available, which is the purpose of the get() function. In Java, this would typically be implemented using a semaphore, but here the future is another Erlang process that is spawned when the task is submitted. This process runs the wait() function, which has two phases. First, it blocks until the result of the task is received (the "done" message is sent by the worker that executes the task). Once it has the result, it responds to an arbitrary number of "get" messages sent by the get() function, sending back the result. So while the result is not yet available, calls to get() will block on receiving the "done" message, giving us the equivalent of futures. And that is pretty much all there is to implementing the executor!

The relationship between shared, mutable state and Erlang actors is becoming quite clear from these examples. The restrictions of Erlang imply that, any time you want to deal with shared, mutable state, you must spawn a process which is responsible for managing that state. Anyone who wants to read or modify that state can only do so through messages sent to that process, naturally eliminating any possibility of race conditions and unsynchronized access to data. These are useful guarantees to have as a developer, so the benefits of the actor model and functional programming are clear. With that said, it has definitely been trickier for me to design these programs, having used imperative programming techniques for so many years. But after being plagued by synchronization issues, I'm open to the possibility that trading off the intuitive nature of imperative programming and threads is worthwhile for the superior concurrency model.

Thursday, July 25, 2013

Erlang Actors

Following my post on how multithreading is hard, I spent some time learning about the alternatives, one of the most promising of which is the actor model. I have read a lot about how Erlang is a good language for building concurrent systems like message queues (RabbitMQ is the prime example), and actors are a core part of Erlang, so I decided to check it out more extensively (I found this tutorial to be useful). Erlang is functional, so you have to think carefully about how to manage "state" if you are coming from a traditional imperative programming background like I am. To guide my learning, I wrote a very primitive blocking queue in Erlang that is bounded and supports only blocking puts and takes. As it turns out, this looks very different from how you might write one in Java using synchronization primitives like locks and condition variables. It's probably easiest to start out with the code in order to guide the discussion:
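A minimal sketch of such a queue (module, function, and message names are illustrative):

```erlang
-module(blocking_queue).
-export([new/1, put/2, take/1]).

%% Create a bounded queue: spawn a listener process that owns the state
%% (the capacity and the list of queued values) and return its pid.
new(Capacity) ->
    spawn(fun() -> listen(Capacity, []) end).

%% Send a put request and block until the listener acknowledges it.
put(Pid, Value) ->
    Pid ! {put, Value, self()},
    receive
        put_done -> ok
    end.

%% Send a take request and block until the listener sends back a value.
take(Pid) ->
    Pid ! {take, self()},
    receive
        {take_done, Value} -> Value
    end.

%% Decide what to listen for based on the current state: an empty queue
%% accepts only puts, a full queue only takes, and anything in between
%% accepts both. Unmatched messages simply wait in the mailbox.
listen(Capacity, Queue) ->
    Length = length(Queue),
    if
        Length =:= 0 ->
            receive
                {put, Value, From} -> handle_put(Capacity, Queue, Value, From)
            end;
        Length =:= Capacity ->
            receive
                {take, From} -> handle_take(Capacity, Queue, From)
            end;
        true ->
            receive
                {put, Value, From} -> handle_put(Capacity, Queue, Value, From);
                {take, From} -> handle_take(Capacity, Queue, From)
            end
    end.

%% Acknowledge the put and loop with the value appended.
handle_put(Capacity, Queue, Value, From) ->
    From ! put_done,
    listen(Capacity, Queue ++ [Value]).

%% Reply with the head of the queue and loop with the rest.
handle_take(Capacity, [Head | Rest], From) ->
    From ! {take_done, Head},
    listen(Capacity, Rest).
```

From the shell, Q = blocking_queue:new(2), blocking_queue:put(Q, hello), blocking_queue:take(Q) would return hello; a put on a full queue or a take on an empty one simply blocks in its receive until the listener is ready for it.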

To start off, we have the new() function which returns a new "instance" of the blocking queue. As it turns out, the function actually just returns an Erlang process ID, which refers to the process that is created by the spawn() call (I like to think of it as a "listener"). This newly-created process is running the listen() function, which we will get to a bit later, but the important thing to understand is that the process is the only place where the "state" of the queue is managed. No state is returned to the caller of new() because everything is immutable in Erlang, so it does not make sense for different actors to actually be carrying around the "state" of the queue. You may already notice that functional programming and the actor model make you think quite differently.

So now we have the setup: whoever creates the queue has a reference to the process ID of the queue's listener, and meanwhile the listener is running as its own process. Next are the put() and take() functions, which follow the same pattern. They each send a message to the listener (the exclamation mark syntax) and then wait for a response (the receive ... end syntax). At a high level, that's all there is to the actor model; processes all run independently of each other except that they can send messages to and receive messages from other processes. Using receive gives us the blocking semantics because the listener will not send the reply message until it has successfully put the new value or taken a value off the queue. Thus we have built-in synchronization without having to mess around with low-level synchronization primitives.

Finally, we get to the meat of the program, the listen() function. The first thing it does is decide what it should listen for, and this is based on what is currently in the queue. Depending on whether the queue is empty, full, or neither, we can receive puts, takes, or both, respectively. If the queue is empty, for example, the receive block only recognizes puts, so the process will ignore the take messages until it gets into a state in which it is able to process them (causing all callers of take() to wait in their respective receive blocks). Lastly, you'll notice that both handle_put() and handle_take() end with a recursive call to listen(). This is, again, because Erlang is functional and everything is immutable. We are not able to use something like a while loop around the receives because that would require that the queue is modified in each iteration, so instead we just call listen() again with the new state of the queue.

So there it is, a simple blocking queue implemented in a functional style with actors. It required a lot more thinking than I expected to come up with a solution using these unfamiliar tools, but in the end I started realizing their benefits. By taking shared, mutable state out of the picture you can avoid worrying about most of the issues that plague multithreaded programming. Questions that are still open in my mind include whether the actor model is rich enough to capture the complexity of real applications and what the performance implications are of writing code in this way. It has definitely been worthwhile to explore actors through Erlang, though, and I plan on checking out Akka, which brings actors to Scala.

Thursday, July 18, 2013

Garbage Collection and Scope

Garbage collection (GC) is one of those programming language features that you are not supposed to have to think about, but anyone who has tried to optimize the performance of Java applications will know that managing memory is often the most effective way of speeding things up. The less time the garbage collector spends running, the more time your program has to do its actual work. I recently came across an aspect of code design that turns out to have a significant impact on memory performance. Consider the following program:
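A sketch of the experiment along the lines described below; the method names are assumptions, and the arrays here are 64MB rather than the 256MB discussed in the text, just to keep the demo modest:

```java
import java.util.ArrayList;
import java.util.List;

public class GcScope {
    // 64MB of raw bytes (downscaled from the 256MB discussed in the text)
    static final int NUM_BYTES = 64 * 1024 * 1024;

    // handle1: the byte[] is passed down as a parameter, so a reference to
    // it lives on the call stack (a GC root) while handleInts runs
    static int handle1(byte[] bytes) {
        return handleInts(deserialize(bytes));
    }

    // handle2: the byte[]'s scope ends as soon as deserialization finishes,
    // so it can be collected while handleInts runs
    static int handle2() {
        return handleInts(deserialize(new byte[NUM_BYTES]));
    }

    // "deserialize" the raw bytes into ints, 4 bytes per int (big-endian)
    static int[] deserialize(byte[] bytes) {
        int[] ints = new int[bytes.length / 4];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = ((bytes[4 * i] & 0xff) << 24)
                    | ((bytes[4 * i + 1] & 0xff) << 16)
                    | ((bytes[4 * i + 2] & 0xff) << 8)
                    | (bytes[4 * i + 3] & 0xff);
        }
        return ints;
    }

    // "process" the ints: allocate 1MB arrays until OutOfMemoryError, so
    // the count estimates how much free memory the JVM has at this point
    static int handleInts(int[] ints) {
        List<byte[]> hog = new ArrayList<>();
        try {
            while (true) {
                hog.add(new byte[1024 * 1024]);
            }
        } catch (OutOfMemoryError e) {
            int count = hog.size();
            hog.clear(); // let the hoarded arrays be collected again
            return count;
        }
    }

    public static void main(String[] args) {
        System.out.println("handle1 allocations: " + handle1(new byte[NUM_BYTES]));
        // handle2 typically manages more allocations, though a modern JIT's
        // liveness analysis may let handle1's byte[] be collected too
        System.out.println("handle2 allocations: " + handle2());
    }
}
```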

It is a toy program, but it is designed to emulate the following sequence of events (in both the handle1 and handle2 methods):
  • you receive a raw stream of bytes from somewhere, e.g. the network;
  • you deserialize the bytes into something meaningful, e.g. a web request; and
  • you do some processing with the deserialized data, e.g. handling a web request by doing some database operations and returning a response.
Here we are getting the bytes by just allocating a 256MB array and then "deserializing" the bytes into integers. And lastly, the "processing" consists of just repeatedly allocating 1MB arrays until we hit an error (in particular, we are expecting an OutOfMemoryError). The idea is that the handleInts method figures out roughly how much free memory the JVM has at the point at which it is called (forcing GC as necessary) in order to evaluate the relative efficiency of the two versions of code.

It turns out that handle2 is about 256MB more efficient than handle1 as measured by the number of allocations done in handleInts, despite the fact that the two methods are functionally identical. The reason is scope, although not quite in the traditional programming language sense. We have two representations of the data in the code, one as an array of bytes and the other as an array of integers, each of which is about 256MB in size. Conceptually, once we have deserialized the bytes as integers, we can discard the array of bytes, i.e. allow it to be garbage collected. And that is exactly what happens in handle2; the scope of the array of bytes is limited to the time during which it is being deserialized. In handle1, however, we are passing the array of bytes as a method parameter to the method which eventually ends up calling handleInts. That means, in particular, that a reference to the array is living on the call stack, which is a GC root, so it cannot be garbage collected while the method is running.

In my mind, this example raises some interesting questions about how code should be designed and the impact of different design patterns on GC performance. Specifically, we know that having very deep call stacks, especially when methods do transformations on the data, can lead to larger retained sets of objects (those that cannot be collected) than intended. In an ideal world, though, a programmer using a language with built-in memory management should not have to be concerned with this, so it would be even better if the compiler or JVM could identify this type of situation and make the correct optimization.

Sunday, July 14, 2013

Multithreading is Hard

Over the last few years, I have repeatedly found myself working with complex, multithreaded Java applications that produce all sorts of bugs that are extremely difficult to identify, reproduce, and fix. And in doing so, I have learned a great deal about the Java memory model, the various synchronization primitives (e.g. locks, semaphores, atomic references), and useful concurrent abstractions (e.g. blocking queues, executor services). With enough knowledge and experience with multithreaded programming in Java, it is possible to design and implement complex systems that, after some time, you have modest confidence in. But not only does it take a very long time to get to that point (typically on the order of years for large projects), I would also guess that most programmers have less confidence in their programs than they would like. It is also not uncommon for mature projects to still be littered with minor race conditions and rare deadlocks that pop up over time when some poor engineer has to debug why their production system failed. To some degree we have just accepted that the systems we build and the libraries we write will inevitably suffer from bugs that we cannot preemptively identify and fix. I recently came across this technical report by Professor Lee from Berkeley that I feel very clearly identifies the problem, which is the concept of threads and how they are materialized in today's languages.

The main argument, which I have seen formulated in a number of ways, is that because the execution of a multithreaded program is nondeterministic, we cannot efficiently reason about correctness. In a single-threaded program, you only need to guarantee that the single, deterministic execution of the program produces the correct result; even then it can be difficult to convince yourself of correctness (since the space of possible inputs is often exponential in the size of the input), but people have become reasonably proficient at doing so. Once you introduce multithreading, the number of potential executions from interleaving operations is exponential in the size of the program, and correctness depends on all of those executions producing the correct result. So, without any structure to reduce the problem, it is exponentially harder to guarantee the correctness of a multithreaded program than a single-threaded program. The tools in Java that I mentioned are designed to "prune away [the] nondeterminism" introduced by multithreading, namely by reducing the space of potential executions and, specifically, preventing those that are known to be incorrect. But this is very difficult to do given that there are so many possible executions to reason about, not to mention the fact that only one such execution manifests itself each time the code is run. To quote Professor Lee, "A folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs."
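To make the interleaving argument concrete, consider the classic toy example (names here are my own): two threads each increment a shared counter, once without synchronization and once atomically. The unsynchronized result depends on which of the exponentially many interleavings happens to occur; the atomic one prunes that nondeterminism away.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Nondeterminism {
    static final int N = 100_000;
    static int plain;                         // unsynchronized shared state
    static final AtomicInteger atomic = new AtomicInteger();

    static void run() throws InterruptedException {
        plain = 0;
        atomic.set(0);
        Runnable work = () -> {
            for (int i = 0; i < N; i++) {
                plain++;                      // read-modify-write: not atomic
                atomic.incrementAndGet();     // atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }

    public static void main(String[] args) throws InterruptedException {
        run();
        // atomic is always 200000; plain depends on the interleaving,
        // because increments interleaved between its read and write are lost
        System.out.println("plain = " + plain + ", atomic = " + atomic.get());
    }
}
```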

At a high level, there are a number of different ways to get around the fundamental problem of nondeterminism in multithreaded programs. One is the use of immutable objects, which amounts to just never having mutable state. If there is no mutable state, then different interleavings will always read the same values and produce the same results. Functional programming emphasizes immutability and, in many cases, not having state altogether, which can make it a very natural paradigm for multithreaded programs (though it comes with its own set of concerns). Clojure is a good example of how the functional paradigm can present a distinct way of leveraging multiple processors without directly manipulating threads. Besides immutable objects, another option for managing multithreaded interactions is to assume all threads touch only local state and use a specific system for passing messages between threads. The actor model is an abstraction based on these principles which has been implemented in languages such as Erlang and Scala with some success. Unfortunately, these paradigms and abstractions have not seen the same popularity as threads, quite possibly because threads have the closest resemblance to sequential programming in popular languages. In the end, everything comes down to getting rid of shared, mutable state; if the state is not shared or is immutable, then the nondeterminism of executions becomes a non-issue, but as long as you are using shared, mutable state, multithreaded programming will be difficult.

Regarding the situation in Java, Professor Lee argues that "the mechanisms (such as semaphores) still require considerable sophistication to use, and very likely will still result in incomprehensible programs with subtle lurking bugs" and that "software engineering process improvements alone will not do the job." We can do the best we can to improve the design, review, implementation, and testing processes around building these applications, but it seems impossible to program in this way and confidently believe that there are no synchronization bugs. Given what we know about threads and the nondeterminism they introduce, it is worth considering whether it is the right programming model to build on and if there is an alternative to be found in functional programming, actors, or something completely different.

Thursday, July 4, 2013

Static Initialization Deadlock

Time for some more fun with static initialization in the JVM. I previously wrote about how static initializers are not HotSpot-optimized, which can lead to some interesting performance issues, but this time we see something scarier in seemingly innocent code. Take a look:
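A sketch of the program described below, with one deliberate twist: both class initializations run on daemon threads joined with a timeout, so that when the deadlock does strike, the demo reports it instead of hanging forever. The class and field names are illustrative.

```java
public class InitDeadlock {
    static class MyClass {
        // initializing MyClass constructs a MySubClass, which requires
        // MySubClass to be initialized
        static final MySubClass FIELD = new MySubClass();

        // initializing MySubClass requires its superclass MyClass to be
        // initialized first -- hence the cycle
        static class MySubClass extends MyClass {
        }
    }

    static String run() throws InterruptedException {
        // one thread triggers MySubClass's initialization first, the
        // other MyClass's; each class has its own initialization lock
        Thread t1 = new Thread(() -> new MyClass.MySubClass());
        Thread t2 = new Thread(() -> System.out.println(MyClass.FIELD != null));
        t1.setDaemon(true); // daemon threads so a deadlock cannot hang the JVM
        t2.setDaemon(true);
        t1.start();
        t2.start();
        t1.join(2000);
        t2.join(2000);
        return (t1.isAlive() || t2.isAlive()) ? "deadlocked" : "completed";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
    }
}
```

Run it a few times and it will sometimes print "completed" and sometimes "deadlocked", depending on how the two threads' class initializations interleave.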

Here we have a class MyClass (it is an inner class here for simplicity only) which has a static inner class MySubClass. MyClass also has a static field of type MySubClass; that field will be initialized when MyClass is. Then in the main method we have a thread that creates an instance of MySubClass, and we directly reference the static field on MyClass. We aren't really doing a whole lot here; in particular, there is no explicit synchronization, which is usually good enough to rule out things like race conditions and deadlocks. Well, if you run this a few times, you will in fact notice that this program sometimes fails to terminate. Putting jstack to work, we see the following stack traces:

Well, that's interesting; the thread we created is just stuck on the line trying to create the instance of MySubClass, while the main thread is trying to initialize MyClass. As it turns out, static initialization in the JVM is thread-safe, which is good, but the synchronization it uses can cause a deadlock. If you recall the necessary conditions for deadlock, one of them is circular wait; that is, threads acquiring locks in a potentially cyclic order. In this case, the two classes MyClass and MySubClass have individual locks guarding their loading and initialization, but initializing one in turn triggers the initialization of the other (hence the possibility of circular wait). Initializing MySubClass naturally has to initialize its superclass, while the presence of the static field in MyClass causes the reverse dependency, so both of our threads get stuck during the class initialization process.

This is quite unfortunate because deadlocks are extremely hard to reproduce and track down, especially when you do not suspect your code of having such problems. And while my example may seem a bit contrived (although less so than such examples typically are), it is based on a real bug we encountered in a production environment. If a bug exists in the JVM, there is enough Java code out there that someone will almost certainly run into it.