
Wednesday, March 11, 2026

A Tale of Two Production Bugs (Part 2)

Last time, I wrote about my experience debugging an issue where engineers commuting to and from the office caused production network blips. Ten years later, I encountered a bug at Amplitude that reminded me of that one, and of the fact that systems are often more intertwined than they first appear.

At Amplitude, we used consul as our service discovery mechanism. We ran consul agents alongside our containers on EC2 instances, and those agents were responsible for reporting service status to the centralized consul servers. In particular, for our in-house columnar query system, we used service discovery with some statefulness (node ID) in order to distribute work in a cache-aware way. So each node, by nature of its ID, would consistently receive work that processed the same data files, and it would cache those files on its local SSDs to get good performance.
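As a rough sketch of that cache-aware scheme, a node can be chosen by hashing the file identifier. The function and file names below are made up for illustration, not Amplitude's actual implementation:

```python
import hashlib

# Hypothetical sketch of cache-aware work assignment: hash each data file's
# ID to a node index, so the same node always processes (and caches) the
# same files.
def node_for_file(file_id: str, num_nodes: int) -> int:
    digest = hashlib.sha256(file_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# The same file always routes to the same node across calls and processes.
print(node_for_file("events-0001.dat", 8))
```

Because the mapping is deterministic, each node's local SSD cache stays hot for the files it is responsible for.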

The bug occurred seemingly out of the blue, and the symptom was strange. A consul agent running on one of these query nodes would crash itself and then stop reporting that service's health. It would happen with varying frequency -- sometimes a few at once, sometimes days without issues. The system is designed for redundancy, so we could handle a handful of these, but losing a node would immediately start shifting load away and cause other nodes to begin processing and caching the data files. This would degrade performance in the system, and if left unchecked, sufficient node failures would cause queries to stop being processed completely.

Looking into the consul agent logs, we found that the consul network was reporting the same hostname with a different IP address than the actual node. For example, if query-12 was running at 10.1.2.3, the consul network was broadcasting that query-12 was actually at 10.1.2.4. When the agent running on query-12 received this message, essentially stating that it was not the consensus agent for that host, it would simply crash itself.

Naturally, we tried to look for the instance with IP address 10.1.2.4 to see why it was broadcasting the same hostname as the other instance. But it didn't exist. It turns out that 10.1.2.4 was the IP address of the previous instance with the same hostname before it had been rotated out (usually as part of regular security updates). That seemed promising, but that old instance had usually been gone for a long time already -- hours, if not days -- so it didn't make sense for it to start showing up again. Surely, we didn't have a rogue consul agent somehow still running on a terminated EC2 instance?

At this point, we simultaneously investigated that possibility and dug through the consul source code, trying to come up with hypotheses for how we could observe these symptoms. We happened to have upgraded consul in our infrastructure a few weeks before these errors started occurring, but the timing wasn't quite right. There was no evidence in any of our other metrics or logs that the old instances were coming back. Meanwhile, the frequency of the crashes was enough to be annoying, but not enough to cause real outages or customer impact. A few months passed without any plausible ideas for the root cause -- we kept restarting consul whenever it happened (painful!).

Finally, I got frustrated with this unexplained behavior and sat down one weekend to try and get to the bottom of it. After staring at hundreds of Datadog charts and all of the consul metrics for long enough, I found a correlation I couldn't explain. At the same time that these outdated IP broadcasts were happening, there were big spikes in the consul communication queues from the totally unrelated "inbound" cluster. The inbound cluster was an auto scaling group that grew and shrank with our incoming traffic. There was no obvious relationship between these communication queues and the agent crashes, but the correlation was so strong that I felt like I must be onto something.

As I dug into the inbound cluster, I stumbled across something curious. Some instances would be active, "disappear" from the metrics for a few hours or days, and then return again. That's not the typical behavior of an auto scaling group (terminated instances go away forever), and it was also suspiciously similar to the intervals at which we were seeing the outdated IP broadcasts. Upon further examination, I found that we had enabled warm pools for the inbound auto scaling group, which is a feature that lets you keep a set of instances "warm" for faster auto scaling. In this particular case, warm meant that the instance was hibernated, i.e. its RAM contents saved for future use. When the instance came back, existing processes continued running as if they had never stopped. The warm pools were enabled right before we started observing the consul agent crashes (!).

At last, there was a root cause. The consul agents running in the inbound cluster were broadcasting outdated state after they came back from hibernation. The communication queues were growing because the consul agent briefly lost network connectivity and started buffering, only to empty those buffered messages (which could be very old) onto the network after the instance resumed. It's still not obvious to me why the consul network would allow these outdated messages to essentially overwrite known current state, but at that point I was ready to move past this issue. We initially disabled the warm pool and eventually added a hook to shut down consul before hibernating, and the issue went away completely.

These two bugs really highlight the nature of debugging complex systems to me. If you have a bug that you can reproduce in a local environment, it's almost always straightforward to figure out. But in real production systems, the interactions are sometimes so wild that you could have never imagined the root cause at the beginning. In the decade between the last bug and this one, I became much more aware of this fact, and that open-mindedness helped me follow the seemingly nonsensical breadcrumbs that ultimately led to the answer. It's often said that debugging is significantly harder than writing code, and I'd say that my experiences corroborate that!

Sunday, March 8, 2026

A Tale of Two Production Bugs (Part 1)

This is a recollection of a particularly tricky bug I encountered in a production system during my time at Sumo Logic in 2013, when I was just a few years out of school and still new to building software. Sumo Logic ran a very large AWS infrastructure in those days, spanning thousands of EC2 instances, as part of building the first log management system in the cloud. Because it was quite early in AWS's history (Sumo Logic was founded in 2010), there were a number of homegrown tools, including configuration management (à la Salt or Ansible) and a "VPN." Although I had used AWS previously, it was my first exposure to data processing and operations at scale.

I was about six months into the job when this bug started popping up. It was the worst type of symptom: seemingly random network connectivity drops across many different production services. Some services saw it more than others, and sometimes the drops would occur in bursts, but it happened pretty much every day. It was just innocuous enough (retries usually hid the problem) that we let it persist for about a month, with a few engineers taking a stab at debugging it here and there to no avail. The issue slowly became worse over time, and finally the team decided that we needed to track it down once and for all. I don't recall why (maybe nobody else wanted it?), but somehow the task got assigned to me.

Given how little I knew about networking and AWS at the time, it was quite a stretch for me. I spent a whole week playing with two EC2 instances and testing what could cause network disruptions between them. Eventually, I came across a plausible candidate: security group changes occasionally caused my test connection to drop in a non-deterministic way that resembled the bug we were seeing. But there was no obvious reason why the security groups would be changing in production so frequently -- after all, security groups are typically assigned when an instance starts up.

With this hypothesis in mind, I started to look more carefully at the production logs of the network errors. The distribution across services was mostly random, with the only correlation being that the services with the most traffic showed up the most often (it still took me a while to rule out service-specific causes). The distribution across time, however, was unusual. The errors occurred most frequently in two bursts during the day: one in the morning (8-9am), and one in the early evening (5-6pm). This didn't correlate with our actual traffic. And there were far fewer errors over the weekend, although traffic was also noticeably lower then. The obvious correlation that these time windows brought to mind was... people going to and leaving work?

At this point, I felt like I was so close to figuring it out, but I was also at a loss for how to even find a next hypothesis to test. So I sent a message out to our internal chat (Campfire) asking if anyone could hazard a guess as to why engineers going to and leaving work would cause security group changes. Frankly, I felt a little silly asking such a strange question. To my surprise, our chief architect immediately responded, saying that he now knew the root cause of the bug.

It turned out that the homegrown "VPN" that I mentioned above wasn't a traditional VPN at all. Instead, the client running on each of our laptops would observe changes to our IP address and notify the server. The server would then modify a security group attached to all instances, which contained a list of developers' IP addresses, allowing production access. In fact, we had recently exceeded 50 engineers, and there was a limit of 50 rules per security group, meaning a second security group had been added, worsening the problem. So when each of us commuted to and from the office every day (as everyone did back then), we would unwittingly cause a small blip in the production network!

Everything fell into place after that, and we quickly switched to a real VPN. I still look back on this bug as a great learning experience as well as a satisfying "project," despite its nature being very different from writing code. It serves as a reminder to me to be open-minded when debugging, and to follow the patterns in the data, wherever they may lead. And little did I know that this episode would be the precursor to another one a decade down the road -- this time, with me as the chief architect seeing it all the way through.

Sunday, March 1, 2026

Structured Outputs for LLMs

Back in the early days of LLMs, it was a struggle to get structured output out of models, which made it hard to use the results in a programmatic way -- there was no well-defined interface. That problem was solved a while back, and now most inference systems support JSON schema outputs. As part of building my educational inference runtime, I implemented support for this as well. My code leverages the outlines-core library, which does a lot of the heavy lifting. I'll go over what the library does and how it integrates into the inference runtime.

Let's start with an example JSON schema:

{
  "type": "object",
  "properties": {
    "foo": {
      "type": "string"
    },
    "bar": {
      "type": "integer"
    },
    "baz": {
      "enum": ["a", "b", "c"]
    }
  },
  "required": ["foo"]
}

There are two key observations that allow us to make JSON schemas usable in the LLM sampling process.

  1. Most JSON schemas can be turned into regular expressions (see supported features here).
  2. A regular expression can be turned into a deterministic finite automaton (DFA), i.e. a state machine that transitions based on bytes.

Using outlines-core, we can see that the above JSON schema turns into the following regex:

>>> from outlines_core.json_schema import build_regex_from_schema
>>> schema = '{"type":"object","properties":{"foo":{"type":"string"},"bar":{"type":"integer"},"baz":{"enum":["a","b","c"]}},"required":["foo"]}'
>>> print(build_regex_from_schema(schema))
\{[ ]?"foo"[ ]?:[ ]?"([^"\\\x00-\x1F\x7F-\x9F]|\\["\\/bfnrt])*"([ ]?,[ ]?"bar"[ ]?:[ ]?(-)?(0|[1-9][0-9]*))?([ ]?,[ ]?"baz"[ ]?:[ ]?("a"|"b"|"c"))?[ ]?\}

Admittedly, not the prettiest thing in the world, but a regex nonetheless. A lot of the ugliness comes from handling optional whitespace, so if you strip that away it's fairly straightforward and what you would expect, i.e.

\{"foo":"<string>"(,"bar":<int>)?(,"baz":("a"|"b"|"c"))?\}

It's worth noting that, by design, this regex fixes the order of the fields in the JSON object, so it's a strict subset of valid outputs, which is okay.
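To make the simplified pattern concrete, here is a runnable version with the `<string>` and `<int>` placeholders replaced by rough approximations (assuming no escaped quotes inside strings and no optional whitespace), checked with Python's `re` module:

```python
import re

# Hand-simplified version of the generated regex: a quote-free string for
# "foo", an integer for "bar", and the enum for "baz"; only "foo" is required.
pattern = re.compile(
    r'\{"foo":"[^"]*"(,"bar":-?(0|[1-9][0-9]*))?(,"baz":("a"|"b"|"c"))?\}'
)

print(bool(pattern.fullmatch('{"foo":"hi","bar":42}')))   # True
print(bool(pattern.fullmatch('{"foo":"hi","baz":"b"}')))  # True
print(bool(pattern.fullmatch('{"bar":42}')))              # False: "foo" is required
print(bool(pattern.fullmatch('{"foo":"hi","baz":"d"}')))  # False: "d" not in the enum
```

Note that, as described above, the fixed field order means `{"bar":42,"foo":"hi"}` would also be rejected, even though it satisfies the schema.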

Next, outlines-core uses the regex-automata crate to convert the regex to a DFA. This DFA encodes the valid per-byte transitions from any state as well as the final states that match the regex. But recall that LLMs do not output single bytes at a time; they output tokens. So the outlines-core library goes through a process to convert this DFA into an alternate DFA with the same states that has per-token transitions instead of per-byte transitions (saved in the Index object).

Converting a regex byte-level DFA to a token-level DFA (simplified).

To do this, it performs a breadth-first search over the DFA states using the vocabulary (the set of all tokens) as the possible transitions. For each token, it runs the token's bytes through the DFA and checks whether the resulting state is a dead state. If not, it records the token as a valid transition and continues the search from the resulting state (if that state hasn't been seen before). It builds this entire DFA upfront, which can be expensive, but once it's available, checking the transitions at a given state is constant time. That lets us easily answer the question: at any point in the LLM's output, what is the set of valid tokens that can be produced next?
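That search can be sketched with a toy byte-level DFA (here, for the regex `ab+`) and a made-up five-token vocabulary. The real outlines-core implementation is in Rust and heavily optimized; this is just the idea:

```python
from collections import deque

# Toy byte-level DFA for the regex "ab+": (state, byte) -> next state.
# Missing entries are dead states. State 2 is the final (matching) state.
byte_dfa = {
    (0, ord("a")): 1,
    (1, ord("b")): 2,
    (2, ord("b")): 2,
}

# Made-up vocabulary: token ID -> token bytes.
vocab = {0: b"a", 1: b"b", 2: b"ab", 3: b"bb", 4: b"c"}

def run_bytes(state, data):
    """Run a token's bytes through the byte DFA; None means a dead state."""
    for byte in data:
        state = byte_dfa.get((state, byte))
        if state is None:
            return None
    return state

def build_token_dfa(start=0):
    """BFS over DFA states, trying every token as a possible transition."""
    token_dfa = {}
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for token_id, token_bytes in vocab.items():
            nxt = run_bytes(state, token_bytes)
            if nxt is not None:
                token_dfa[(state, token_id)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return token_dfa

token_dfa = build_token_dfa()
# Valid first tokens from the start state: "a" and "ab".
print(sorted(t for (s, t) in token_dfa if s == 0))  # [0, 2]
```

The resulting `token_dfa` dictionary plays the role of the Index: given the current state, a lookup yields exactly the tokens that keep the output inside the regex language.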

LLM logits are masked based on valid token transitions provided by the DFA.

Now that we have the valid token transitions, we're ready to integrate into the inference runtime. The integration happens at the logits layer, right before we apply softmax and sample the token output. We keep track of the current state within the DFA and advance it for each generated token. And then, based on that state, it's a simple masking process where, for any token that isn't in the valid set of transitions, we set the logits to negative infinity so they won't be sampled. That's it!
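A minimal sketch of that masking step, with made-up logits and an allowed-token set standing in for the DFA's valid transitions at the current state:

```python
import math

def mask_logits(logits, allowed_tokens):
    """Set logits of disallowed tokens to -inf so softmax assigns them
    zero probability; allowed tokens pass through unchanged."""
    return [
        logit if token_id in allowed_tokens else -math.inf
        for token_id, logit in enumerate(logits)
    ]

logits = [1.2, -0.3, 0.7, 2.1]   # hypothetical model outputs
allowed = {0, 2}                 # from the DFA's current state
masked = mask_logits(logits, allowed)
print(masked)  # [1.2, -inf, 0.7, -inf]
```

After masking, softmax and sampling proceed as usual; only tokens 0 and 2 can ever be drawn, so the output is guaranteed to stay within the schema.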

What I've described is just one way to handle structured outputs for an LLM. There are various approaches that support different types of structures with different performance characteristics. One particularly interesting approach is llguidance, which supports context-free grammars (in addition to JSON schema and regex) and is also super fast thanks to using token tries and sparse masks. Perhaps an interesting topic for a future post.