What's the weirdest production bug you've discovered and/or helped debug?

There comes a time in our careers when we experience that production bug.

The one where the source of the problem doesn’t immediately reveal itself. The one that takes teams of people to dig into. You chase red herrings, you struggle, and you come close to giving up.

Yet eventually something you never imagined reveals itself as the source of the bug. You deploy a fix, and everyone sighs with relief.

I enjoyed this story from the team at Gusto.

“Even though the priority of this bug changed over time as we found workarounds, relentless curiosity won out in the end. No single one of us had all of the necessary knowledge to solve this bug on their own, but with persistence and collaboration, we were able to figure it out together.”

What’s the weirdest production bug you’ve discovered and/or helped debug?


Problem:

  • The bug was seen in production 2-3 times/month.
  • About 1000 API calls/minute were repeated for 2-15 minutes until the client closed the browser tab.

Journey:

  • The whole team spent a couple of hours trying to figure it out and couldn’t.
  • A few weeks later, while talking with someone from business operations about a separate topic, I learned of a benefit that some senior clients receive when purchasing a product.
    This made me curious about whether the app supported it, so I went to check it out (only some products, known to the business side, had this benefit).
    On top of this, I knew from the production logs that the clients hitting the error were seniors. So I thought: what if I combine the two pieces of information? Several minutes later I had reproduced the issue. Senior citizens trying to purchase a particular product that had a discount rule in the db/backend were hit with an error at purchase time due to a price-check mismatch, because the discount wasn’t recognized by the middle layer or frontend. A sketch of that mismatch follows after this list.
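
A minimal sketch of that mismatch, under stated assumptions: the product id, the age threshold, and the 10% discount are all invented for illustration, since the post doesn’t share the real rule.

```typescript
// Hypothetical sketch of the mismatch described above; all names,
// ages, and rates are illustrative, not the real system's.

interface Order {
  productId: string;
  customerAge: number;
  submittedPrice: number; // price the frontend/middle layer computed
}

// The discount rule lived only in the db/backend.
const SENIOR_DISCOUNT: Record<string, number> = { "product-42": 0.1 };

function backendPrice(productId: string, basePrice: number, age: number): number {
  const discount = age >= 65 ? SENIOR_DISCOUNT[productId] ?? 0 : 0;
  return basePrice * (1 - discount);
}

function checkout(order: Order, basePrice: number): void {
  const expected = backendPrice(order.productId, basePrice, order.customerAge);
  // The frontend/middle layer never applied the discount, so seniors
  // buying a discounted product always fail this check.
  if (Math.abs(order.submittedPrice - expected) > 0.001) {
    throw new Error("price check mismatch"); // the purchase-time error
  }
}

// A 70-year-old buying product-42 at the undiscounted price of 100:
// the backend expects 90, the client sends 100 -> error every time.
try {
  checkout({ productId: "product-42", customerAge: 70, submittedPrice: 100 }, 100);
} catch (e) {
  console.log((e as Error).message); // "price check mismatch"
}
```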

Debugging/pinpointing the bug:

  • I noticed a repeated API call that always got the same response code. I searched for that code on the Chrome/Google support forums and learned that Chrome automatically retries a request when it receives this HTTP response code (no other browser was doing it).

Solution:

  • The fix took 10 minutes: we switched to a response code the browser doesn’t retry, and it was released to prod the next day. A sketch of the idea is below.
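
The post doesn’t name the status codes involved, so treat these as assumptions: 408 is used below because Chrome is known to retry requests that receive it automatically, and 400 as a plain client error that no browser retries. A hypothetical Express-style sketch of that kind of one-line fix:

```typescript
// Hypothetical handler; the actual endpoint and codes weren't shared.
import express from "express";

const app = express();

app.post("/api/purchase", (_req, res) => {
  const priceCheckPassed = false; // stand-in for the check sketched above
  if (!priceCheckPassed) {
    // Before: res.status(408) -> Chrome silently re-sent the request in
    // a loop (~1000 calls/minute) until the tab was closed.
    // After: a plain client error, which no browser retries.
    res.status(400).json({ error: "price check failed" });
    return;
  }
  res.json({ status: "ok" });
});

app.listen(3000);
```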

I really enjoy this story: We can’t send email more than 500 miles
https://web.mit.edu/jemorris/humor/500-miles
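
For anyone who hasn’t read it, the punchline is a units coincidence: a config mishap left sendmail with an effective connect timeout of roughly 3 milliseconds, and light travels just over 500 miles in 3 ms, so hosts any further away could never connect. The arithmetic checks out:

```typescript
// Back-of-the-envelope check of the story's punchline: how far does
// light travel in the ~3 ms effective connect timeout?
const c = 299_792.458;         // speed of light in km/s
const timeout = 0.003;         // ~3 ms
const km = c * timeout;        // ≈ 899.4 km
const miles = km / 1.609344;   // ≈ 558.8 miles, just over 500
console.log(`${miles.toFixed(1)} miles`);
```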

I can’t think of any good ones from my own experience. I may have just blacked them out in my memory :stuck_out_tongue:


Problem:
While fixing a UI issue in a billing system, I noticed that the VAT calculation was incorrect and we were overcharging customers.

Journey:
In 2008 I noticed the issue. I spent about an hour looking at a sample of accounts to make sure it wasn’t just an edge case. Once I found that some cases were not impacted, I spent another hour looking across accounts. I then noticed a pattern: all of the impacted accounts had registered for self-serve on the new website in a certain month. An issue in the timestamp meant the back-end calculation in the financial system was applying an old VAT flag just for those users, triggered by that criterion. A sketch of the failure mode follows below.
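
To make the mechanism concrete, here is a deliberately hypothetical sketch; the rates, the change date, and the exact way the timestamp got mangled are all invented, since the post doesn’t share them:

```typescript
// Hypothetical sketch of the failure mode described above; rates,
// dates, and the corruption details are illustrative only.

const OLD_VAT = 0.175; // stale, higher rate -> the overcharge
const NEW_VAT = 0.15;
const RATE_CHANGE = Date.parse("2008-12-01T00:00:00Z");

// Set once at sign-up. A malformed timestamp from the new self-serve
// flow (say, day and month swapped for one month's cohort) lands
// before the rate change, so the legacy flag sticks to those accounts.
function legacyVatFlag(registeredAt: string): boolean {
  return Date.parse(registeredAt) < RATE_CHANGE;
}

// Every subsequent invoice then bills the old, higher rate.
function vatRate(hasLegacyFlag: boolean): number {
  return hasLegacyFlag ? OLD_VAT : NEW_VAT;
}

// "2008-12-05" mis-written as "2008-05-12" flips the flag:
console.log(vatRate(legacyVatFlag("2008-05-12T00:00:00Z"))); // 0.175, not 0.15
console.log(vatRate(legacyVatFlag("2008-12-05T00:00:00Z"))); // 0.15
```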

I flagged it to my team lead and was told to ignore it, as no customers had complained. I went above their head to compliance and legal, showed the evidence, how much we had overcharged, and that it was still happening. Legal drove the escalation to make it a P2 so we could fix it straight away. The lead was suspended. If we had not found this, it would have cost millions in fines, let alone the tarnished reputation.

Debugging/pinpointing the bug:
The whole team spent a couple of hours trying to fix it and couldn’t.
We were not looking at the correct logs in production (this was 2008).
Once we were aligned, I reproduced the issue within minutes, and a senior engineer was then able to fix it.


I’m trying to remember the details now, but this one did involve a lot of heat, though no fire. I worked at a company that acquired another company, one that added VOIP calls over internet networks to their custom wireless and DECT solution, and so we integrated their app into our device. The device is basically a telephone exchange that can connect to anything that can carry audio, mixing the sound with the kind of quality a mixing desk at a concert might give, unlike MS Teams, which only lets two people talk or sing at once, and badly.

Anyway, we all thought that the cause of the box’s memory faults was in the app code the newly acquired company had added. We eventually narrowed our suspicion to the SSL encryption library, a library that runs on millions of devices, and spent about 6 months not finding the cause of the random crash. The hardware team was convinced it was heat related. We bought thermal loggers in addition to the other debugging kit we already had; it never helped. Finally we set up a 2-week soak test, and one Sunday I went into the office to discover that the aircon turns off on weekends… On Monday none of our tests had crashed, and we quickly decided it was neither heat nor SSL related. Within days, one of the guys found that a power-line part near the CPU was to blame. We had rebuilt the SSL library dozens of times in vain, but it was fun. (Our electricity bill was phenomenal as well, it really was. Some bugs are not cheap, so don’t be shy when buying tools.)
