Tuesday, January 1, 2013

Message Queues and RPC

With the arrival of a new year, I'd like to bring up an old question in distributed systems research: what is the "best" way for applications running on separate machines to communicate with each other? Specifically, I'm interested in the case in which you have several different servers running individual pieces of your system architecture that need to communicate (rather than, for example, a client-server model). In my experience, I've come across two basic methods for doing this, namely message queues and remote procedure calls (RPC). Examples of software/libraries include RabbitMQ, HornetQ, and ActiveMQ for message queues, and Avro and Thrift for RPC.

What's the difference?

The key difference between message queues and RPC is the level of abstraction they provide the programmer. With message queues, it is generally left to the developer to serialize and deserialize messages to/from the queue and to decide how to process each type of message; you work with a basic send/receive paradigm and as a result have a lot of control over the communication. RPC, on the other hand, tries to hide this complexity by making a call to a remote machine look like a call to a local object (hence the name). RPC functionality is often packaged with serialization libraries (e.g. Avro and Thrift) because it relies on a serialization format to automatically translate a method call on one machine into a method call on another.
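To make the send/receive paradigm concrete, here is a minimal sketch in Python, with an in-process queue.Queue standing in for a real broker such as RabbitMQ; the message shape and handler names are made up for illustration and do not come from any real library's API.

```python
import json
import queue

# A local queue.Queue stands in for a real broker; with a message queue,
# serialization and dispatch are the application programmer's job.
broker = queue.Queue()

def send(msg_type, payload):
    # The sender chooses a wire format (JSON here) and serializes explicitly.
    broker.put(json.dumps({"type": msg_type, "payload": payload}))

def receive():
    # The receiver deserializes and decides how to process each message type.
    msg = json.loads(broker.get())
    handlers = {"greet": lambda p: f"hello, {p}"}
    return handlers[msg["type"]](msg["payload"])

send("greet", "world")
print(receive())  # -> hello, world
```

An RPC library hides exactly these two steps (serialization and dispatch) behind a generated method call.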

If that's where the discussion ended, it would be a pretty clear win for RPC. Abstraction is a programmer's greatest tool, and it can be argued that someone developing an application shouldn't need to know whether a method being called belongs to a local or remote object. The convenience of this approach is often cited as one of the reasons why RPC is so popular (it's used extensively in Google's infrastructure, and Thrift came out of Facebook) despite some serious criticisms from almost 20 years ago. My intention here is to make sense of the arguments people have made for and against RPC, and to discuss what I believe are important features of message queues that RPC lacks.

Theory vs practice

Many of the discussions about RPC focus on what I'll call "pure RPC." What I mean by that is the concept of making local and remote method calls completely indistinguishable. It's clear that this is impossible given the four concerns of latency, memory access, partial failure, and concurrency brought up in "A Note on Distributed Computing." Instead, I'd like to focus on the practical RPC libraries that people work with and what they offer. In most such cases, the programmer does know that they're making a method call to an external service, by nature of understanding the architecture and having to define the interfaces/data structures to be serialized. This already breaks the abstraction, which takes away some of the benefits of RPC, but there's still something useful about having your services represented by interfaces and calling methods on them.
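The interface-driven style that practical RPC libraries offer can be sketched as a client-side stub that turns a method call into a serialized request. The transport below is a plain function standing in for the network, and none of the names come from an actual RPC framework like Thrift or Avro.

```python
import json

class GreeterStub:
    """Client-side proxy: calling a method here builds and sends a
    serialized request. A real RPC library would generate this class
    from an interface definition and send the bytes over a socket."""

    def __init__(self, transport):
        self._transport = transport

    def greet(self, name):
        request = json.dumps({"method": "greet", "args": [name]})
        return json.loads(self._transport(request))["result"]

def fake_server(raw):
    # Stands in for the remote service implementation.
    req = json.loads(raw)
    if req["method"] == "greet":
        return json.dumps({"result": f"hello, {req['args'][0]}"})

stub = GreeterStub(fake_server)
print(stub.greet("world"))  # reads like a local method call
```

The caller's code reads like ordinary object-oriented code, which is the residual benefit even after the "pure RPC" illusion is abandoned.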

What RPC can do

Scalability is one concern that has been raised about RPC. A message queue is usually an isolated service that decouples senders from receivers, so it's easy for multiple senders and receivers to operate against the same queue and scale out horizontally. With RPC, scaling isn't quite a "built-in" feature, but those who claim you can't scale it should have a word with Google and Facebook. Since method calls over RPC are typically directed at a single endpoint, it might seem like it won't scale horizontally, but putting a load balancer in front of your endpoints goes a long way. If you make any assumptions about state it gets trickier, but that's a problem both message queues and RPC have to deal with.
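As a sketch of the load-balancing idea, assuming stateless endpoints represented here as plain callables rather than real network stubs, a round-robin dispatcher spreads RPC-style calls across replicas:

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin load balancer over stateless RPC endpoints.
    Endpoints are plain callables here; real ones would be RPC stubs
    pointed at different servers."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def call(self, *args):
        # Each call goes to the next replica in rotation.
        return next(self._cycle)(*args)

# Three "servers" that tag their replies with a server id.
servers = [lambda x, i=i: (i, x * 2) for i in range(3)]
lb = RoundRobinBalancer(servers)
print([lb.call(n)[0] for n in range(6)])  # -> [0, 1, 2, 0, 1, 2]
```

Because the endpoints hold no per-client state, adding capacity is just a matter of appending more replicas to the list.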

Robustness is a second problem people like to talk about with RPC; by robustness I mean what semantics we can guarantee on the execution of a method, e.g. exactly once, at most once, or at least once. Generally, you have two possibilities: if you try a method call once only, you get "at most once" semantics (which is typically very hard to reason about); if you retry method calls until you hear a success, you get "at least once" semantics (which is also hard to reason about, but less so). That might seem problematic, but again message queues suffer from the same issue even with acknowledgements and confirmation of delivery.
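A small sketch of at-least-once semantics, with a simulated endpoint that "times out" twice before answering; all names are illustrative. The point is that the caller cannot distinguish a lost reply from a call that never ran, so retrying can execute the handler more than once:

```python
state = {"attempts": 0}

def flaky_remote(x):
    # Simulated remote endpoint: the first two calls time out, so the
    # caller can't tell whether they took effect on the server side.
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise TimeoutError("no response")
    return x + 1

def call_at_least_once(x, max_tries=10):
    # Retrying until success yields at-least-once semantics: the remote
    # handler may run more than once, so it should be made idempotent.
    for _ in range(max_tries):
        try:
            return flaky_remote(x)
        except TimeoutError:
            continue
    raise RuntimeError("gave up")

print(call_at_least_once(41))  # -> 42, after two retried timeouts
```

Making the remote operation idempotent (so repeated execution is harmless) is the standard way to live with at-least-once delivery, whether the transport is RPC or a queue.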

Message queues and asynchronous communication

This blog post provides a list of ten important features of message queues that make them a good choice. I will highlight three in particular that differentiate message queues from RPC. The first is durability: many message queues support message persistence. When you send a message to the queue, the broker writes it to disk before acknowledging it, so that the message is not lost in the event of a failure. This is difficult to do with RPC because there is no intermediary, and it doesn't make much sense to replay a method call when you no longer have the context (e.g. the stack) in which it was made.
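A toy sketch of broker-side durability, assuming a single append-only log file; real brokers like RabbitMQ do this with considerably more care (message framing, acknowledgement protocols, log compaction):

```python
import json
import os
import tempfile

class DurableQueue:
    """Toy durable queue: every message is flushed to an on-disk log
    before the send is acknowledged, so a crash after the ack cannot
    lose it."""

    def __init__(self, path):
        self.path = path

    def send(self, msg):
        with open(self.path, "a") as log:
            log.write(json.dumps(msg) + "\n")
            log.flush()
            os.fsync(log.fileno())  # durable before we acknowledge
        return "ack"

    def replay(self):
        # After a crash or restart, pending messages come back from the log.
        with open(self.path) as log:
            return [json.loads(line) for line in log]

path = os.path.join(tempfile.mkdtemp(), "queue.log")
q = DurableQueue(path)
q.send({"task": "resize", "id": 1})
q.send({"task": "resize", "id": 2})
print(q.replay())  # both messages survive a restart
```

Note that the ack only goes out after the fsync; that ordering is what lets the sender forget about the message safely.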

Another property of message queues that makes them compelling is that messaging is inherently asynchronous. After a message is sent to the queue, the process need not block waiting for a response; instead it can provide a callback or use continuations (e.g. Jetty). This frees up resources while waiting on expensive I/O operations, an important aspect of building systems that perform well under load. Nothing inherent in the idea of RPC requires it to be synchronous, but that's typically how it's used, because object-oriented programming has taught us all to write procedural code with "blocking" method calls.
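The callback style described above might look like the following sketch, using a Python thread and an in-process queue in place of a real broker and consumer process:

```python
import queue
import threading

broker = queue.Queue()
results = []

def worker():
    # Consumer side: process each message and invoke the sender's callback.
    while True:
        msg, callback = broker.get()
        if msg is None:  # sentinel to shut the worker down
            break
        callback(msg.upper())

def send_async(msg, callback):
    # The sender enqueues and returns immediately; no thread sits
    # blocked waiting for the reply.
    broker.put((msg, callback))

t = threading.Thread(target=worker)
t.start()
send_async("hello", results.append)
send_async("world", results.append)
broker.put((None, None))
t.join()
print(results)  # -> ['HELLO', 'WORLD']
```

The sender's thread is free as soon as `send_async` returns; the reply arrives later through the callback, which is exactly the shape that keeps threads from piling up behind slow I/O.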

The last important feature that I want to discuss is a corollary of asynchronous communication, which is elasticity. Suppose that a system is suddenly subjected to a hundred times the typical load, but only for a brief period. With an RPC-based architecture, there would be many threads all blocked on method calls to remote services as they slowly processed the increased volume of requests; as many know, this is an easy way to kill a system or at the very least make it non-responsive. A message queue helps absorb sudden load spikes because it allows senders to asynchronously send a message and then free their resources. The messages accumulate in the queue (queues are designed to handle this) while the recipients process them, and every component in the system behaves normally. Again, a continuation system would let you return control to the original sender once its message has been processed.
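The buffering effect can be illustrated with a deterministic sketch: a burst of messages is enqueued instantly while the consumer drains the backlog at its own pace (the numbers here are illustrative):

```python
import queue

# The queue sits between producers and consumers and absorbs the spike.
buffer = queue.Queue()

# Sudden spike: 100 messages arrive at once; each put() returns
# immediately, so the senders' threads are never tied up.
for i in range(100):
    buffer.put(f"request-{i}")

depth_at_peak = buffer.qsize()

# The consumer works through the backlog at its own steady rate.
processed = []
while not buffer.empty():
    processed.append(buffer.get())

print(depth_at_peak, len(processed))  # -> 100 100
```

In an RPC design, those 100 in-flight requests would instead be 100 blocked caller threads; here they are just 100 small entries sitting in a buffer.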


Final thoughts

Both message queues and RPC solve the problem of communicating between applications running on different machines. The discussion about which method is "better" is a good way to evaluate what is the right choice in specific situations. Empirical evidence suggests that you can build robust, scalable systems with either one, so it's hard to say if there's necessarily a wrong choice here. Message queues provide some benefits that RPC lacks, though, and I would argue that those are important enough that they should be addressed in some other way if an RPC approach is taken.
