This is a recollection of a particularly tricky bug I encountered in a production system during my time at Sumo Logic in 2013, when I was just a few years out of school and still new to building software. Sumo Logic ran a very large AWS infrastructure in those days, spanning thousands of EC2 instances, as part of building the first log management system in the cloud. Because this was quite early in AWS's history (Sumo Logic was founded in 2010), there were a number of homegrown tools, including configuration management (à la Salt or Ansible) and a "VPN." Although I had used AWS previously, this was my first exposure to data processing and operations at scale.
I was about six months into the job when this bug started popping up. It was the worst type of symptom: seemingly random network connectivity drops across many different production services. Some services saw it more than others, and sometimes the drops would occur in bursts, but it happened pretty much every day. It was just innocuous enough (retries usually hid the problem) that we let it persist for about a month, with a few engineers taking a stab at debugging it here and there to no avail. The issue slowly became worse over time, and finally the team decided that we needed to track it down once and for all. I don't recall why (maybe nobody else wanted it?), but somehow the task got assigned to me.
Given how little I knew about networking and AWS at the time, it was quite a stretch for me. I spent a whole week playing with two EC2 instances and testing what could cause network disruptions between them. Eventually, I came across a plausible candidate: security group changes occasionally caused my test connection to drop in a non-deterministic way that resembled the bug we were seeing. But there was no obvious reason why the security groups would be changing in production so frequently -- after all, security groups are typically assigned when an instance starts up.
With this hypothesis in mind, I started to look more carefully at the production logs of the network errors. The distribution across services was mostly random, with the only correlation being that the services with the most traffic showed up most often (it still took me a while to rule out service-specific causes). The distribution across time, however, was unusual. The errors occurred most frequently in two bursts during the day: one in the morning (8-9am) and one in the early evening (5-6pm). This didn't correlate with our actual traffic. There were also many fewer errors over the weekend, though traffic was noticeably lower then as well, so that alone wasn't conclusive. The obvious correlation that these time windows brought to mind was... people going to and leaving work?
At this point, I felt like I was so close to figuring it out, but I was also at a loss for how to even find a next hypothesis to test. So I sent a message out to our internal chat (Campfire) asking if anyone could hazard a guess as to why engineers going to and leaving work would cause security group changes. Frankly, I felt a little silly asking such a strange question. To my surprise, our chief architect immediately responded, saying that he now knew the root cause of the bug.
It turned out that the homegrown "VPN" that I mentioned above wasn't a traditional VPN at all. Instead, a client running on each of our laptops would observe changes to the laptop's IP address and notify a server. The server would then modify a security group attached to all instances, which contained a list of developers' IP addresses granting production access. To make matters worse, we had recently exceeded 50 engineers, and there was a limit of 50 rules per security group, so a second security group had been added, worsening the problem. So when each of us commuted to and from the office every day (as everyone did back then), we would unwittingly cause a small blip in the production network!
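The bookkeeping behind this is easy to simulate. Here's a minimal sketch (the function names and data structures are my own illustration, not the actual Sumo Logic tooling) of how a changing set of developer IPs maps onto security group rules under a 50-rule cap, and how crossing 50 engineers forces a second group:

```python
# Sketch of the homegrown "VPN" bookkeeping: each developer's current
# public IP becomes one /32 ingress rule, and rules are packed into
# security groups holding at most MAX_RULES_PER_GROUP entries (the
# historical AWS limit). All names here are illustrative.

MAX_RULES_PER_GROUP = 50

def pack_rules(dev_ips):
    """Partition one /32 ingress rule per developer IP into
    security groups of at most MAX_RULES_PER_GROUP rules each."""
    rules = [f"{ip}/32" for ip in dev_ips]
    return [rules[i:i + MAX_RULES_PER_GROUP]
            for i in range(0, len(rules), MAX_RULES_PER_GROUP)]

def on_ip_change(dev_ips, developer_index, new_ip):
    """Called whenever a client notices its IP changed (e.g. a laptop
    moving between home and office). Every call rewrites the rule
    set -- exactly the churn that disrupted production connections."""
    dev_ips = list(dev_ips)
    dev_ips[developer_index] = new_ip
    return dev_ips, pack_rules(dev_ips)

# With 51 engineers, the rules no longer fit in one group, so every
# commute now touches two security groups on every instance.
ips = [f"10.0.0.{i}" for i in range(51)]
ips, groups = on_ip_change(ips, 7, "203.0.113.9")
print(len(groups))                      # 2 groups once the team exceeds 50
print(len(groups[0]), len(groups[1]))   # 50 1
```

With everyone commuting in the same 8-9am and 5-6pm windows, these rewrites clustered into exactly the two daily bursts the error logs showed.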
Everything fell into place after that, and we quickly switched to a real VPN. I still look back on this bug as a great learning experience as well as a satisfying "project," despite its nature being very different from writing code. It serves as a reminder to me to be open-minded when debugging, and to follow the patterns in the data, wherever they may lead. And little did I know that this episode would be the precursor to another one a decade down the road -- this time, with me as the chief architect seeing it all the way through.