Analytics

Wednesday, March 11, 2026

A Tale of Two Production Bugs (Part 2)

Last time, I wrote about my experience debugging an issue where engineers commuting to and from the office caused production network blips. Ten years later, I encountered a bug at Amplitude that reminded me of that one, and of the fact that systems are often more intertwined than they first appear.

At Amplitude, we used consul as our service discovery mechanism. We ran consul agents alongside our containers on EC2 instances, and those agents were responsible for reporting service status to the centralized consul servers. In particular, for our in-house columnar query system, we used service discovery with some statefulness (node ID) in order to distribute work in a cache-aware way. So each node, by nature of its ID, would consistently receive work that processed the same data files, and it would cache those files on its local SSDs to get good performance.
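The post doesn't spell out the exact assignment scheme, but one common way to get this kind of sticky, cache-aware routing is rendezvous (highest-random-weight) hashing over the node IDs. The sketch below is purely illustrative (the node and file names are hypothetical); its key property is that each file deterministically maps to one node, and removing a node only reassigns that node's files:

```python
import hashlib

def owner(file_id: str, node_ids: list[str]) -> str:
    """Pick the node that should process (and cache) a given file.

    Rendezvous hashing: every node scores the file, highest score
    wins. The mapping is stable, so the same file keeps landing on
    the same node's SSD cache.
    """
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{file_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(node_ids, key=score)

nodes = [f"query-{i}" for i in range(16)]

# If query-12 drops out of service discovery, every file it owned must
# be re-processed and re-cached elsewhere -- but files owned by the
# surviving nodes keep their owner (removing a non-winning candidate
# never changes the winner).
survivors = [n for n in nodes if n != "query-12"]
stable = all(owner(f, nodes) == owner(f, survivors)
             for f in (f"file-{i}" for i in range(1000))
             if owner(f, nodes) != "query-12")
```

This stability is exactly why a crashed agent was costly: the files that had lived on the lost node's SSDs all shifted to other nodes, which then had to fetch and cache them cold.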

The bug occurred seemingly out of the blue, and the symptom was strange. A consul agent running on one of these query nodes would crash itself and then stop reporting that service's health. It happened with varying frequency -- sometimes a few at once, sometimes days without issues. The system was designed for redundancy, so we could handle a handful of these, but losing a node would immediately start shifting load away, forcing other nodes to process and cache its data files. This degraded performance in the system, and if left unchecked, sufficient node failures would cause queries to stop being processed completely.

Looking into the consul agent logs, the cause was that the consul network was reporting the same hostname with a different IP address than the actual node. For example, if query-12 was running at 10.1.2.3, the consul network was broadcasting that query-12 was actually at 10.1.2.4. When the agent running on query-12 received this message, essentially stating that it was not the consensus agent for that host, it would simply crash itself.

Naturally, we went looking for the instance with IP address 10.1.2.4 to see why it was broadcasting the same hostname as the other instance. But it didn't exist. It turned out that 10.1.2.4 was the IP address of the previous instance with the same hostname, before it had been rotated out (usually as part of regular security updates). That seemed promising, but that old instance had usually been gone for a long time already -- hours, if not days -- so it didn't make sense for it to start showing up again. Surely, we didn't have a rogue consul agent somehow still running on a terminated EC2 instance?

At this point, we investigated that possibility while also digging through the consul source code for hypotheses that could explain these symptoms. We happened to have upgraded consul in our infrastructure a few weeks before these errors started occurring, but the timing wasn't quite right. There was no evidence in any other metrics or logging that the old instances were coming back. Meanwhile, the frequency of the crashes was enough to be annoying, but not to cause real outages or customer impact. A few months passed without any plausible ideas for the root cause -- we kept restarting consul when it happened (painful!).

Finally, I got frustrated with this unexplained behavior and sat down one weekend to try to get to the bottom of it. After staring at hundreds of Datadog charts and all of the consul metrics for long enough, I found an inexplicable correlation. At the same time that these outdated IP broadcasts were happening, there were big spikes in the consul communication queues from the totally unrelated "inbound" cluster. The inbound cluster was an auto scaling group that grew and shrank with our incoming traffic. There was no obvious relationship between these communication queues and the agent crashes, but the correlation was so strong that I felt like I must be onto something.

As I dug into the inbound cluster, I stumbled across something curious. Some instances would be active, "disappear" from the metrics for a few hours or days, and then return again. That's not the typical behavior of an auto scaling group (terminated instances go away forever), and it was also suspiciously similar to the intervals at which we were seeing the outdated IP broadcasts. Upon further examination, I found that we had enabled warm pools for the inbound auto scaling group, a feature that lets you keep a set of instances "warm" for faster auto scaling. In this particular case, warm meant that the instance was hibernated, i.e. its RAM contents saved for future use. When the instance came back, existing processes continued running as if they had never stopped. The warm pools had been enabled right before we started observing the consul agent crashes (!).

At last, there was a root cause. The consul agents running in the inbound cluster were broadcasting outdated state after they came back from hibernation. The communication queues were growing because the consul agent briefly lost network connectivity and started buffering, only to empty those buffered messages (which could be very old) onto the network after the instance resumed. It's still not obvious to me why the consul network would allow these outdated messages to essentially overwrite known current state, but at that point I was ready to move past this issue. We disabled the warm pool at first, then eventually added a hook to shut down consul before hibernation, and the issue went away completely.
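The failure mode can be illustrated with a toy membership registry. This is not consul's actual protocol -- just a sketch of why replaying old buffered messages without a freshness check clobbers live state, and how a monotonic version (gossip protocols often call this an incarnation number) guards against it. All names and message shapes here are hypothetical:

```python
def apply_no_versioning(registry: dict, msg: dict) -> None:
    # Last write wins by arrival order: a pre-hibernation message
    # that was buffered and replayed late overwrites current state.
    registry[msg["host"]] = msg["ip"]

def apply_with_incarnation(registry: dict, msg: dict) -> None:
    # Guarded by a monotonic incarnation number: stale replays lose.
    current = registry.get(msg["host"])
    if current is None or msg["incarnation"] > current["incarnation"]:
        registry[msg["host"]] = {"ip": msg["ip"],
                                 "incarnation": msg["incarnation"]}

fresh = {"host": "query-12", "ip": "10.1.2.3", "incarnation": 2}
stale = {"host": "query-12", "ip": "10.1.2.4", "incarnation": 1}

naive = {}
apply_no_versioning(naive, fresh)
apply_no_versioning(naive, stale)   # replayed after hibernation resume
# naive now maps query-12 to the stale 10.1.2.4 -- the conflict that
# crashed the live agent.

guarded = {}
apply_with_incarnation(guarded, fresh)
apply_with_incarnation(guarded, stale)
# guarded still maps query-12 to 10.1.2.3: the stale replay is ignored.
```

Shutting consul down before hibernation sidesteps the question entirely: an agent that has left the cluster has no buffered messages to replay on resume.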

These two bugs really highlight the nature of debugging complex systems to me. If you have a bug that you can reproduce in a local environment, it's almost always straightforward to figure out. But in real production systems, the interactions are sometimes so wild that you could never have imagined the root cause at the beginning. In the decade between the last bug and this one, I became much more aware of this fact, and that open-mindedness helped me follow the seemingly nonsensical breadcrumbs that ultimately led to the answer. It's often said that debugging is significantly harder than writing code, and I'd say that my experiences corroborate that!
