What's the weirdest production bug you've discovered and/or helped debug?

There comes a time in our careers when we experience that production bug.

The one where the source of the problem doesn’t immediately reveal itself. The one that takes teams of people to dig into. You chase red herrings, you struggle, and you come close to giving up.

Yet eventually something you never imagined reveals itself as the source of the bug. You deploy a fix, and everyone sighs with relief.

I enjoyed this story from the team at Gusto.

“Even though the priority of this bug changed over time as we found workarounds, relentless curiosity won out in the end. No single one of us had all of the necessary knowledge to solve this bug on their own, but with persistence and collaboration, we were able to figure it out together.”

What’s the weirdest production bug you’ve discovered and/or helped debug?


Problem:

  • The bug was seen in production 2-3 times/month.
  • About 1000 API calls/minute were repeated for 2-15 minutes until the client closed the browser tab.

Journey:

  • The whole team spent a couple of hours trying to figure it out and couldn’t.
  • A few weeks later, while talking with someone from business operations about a separate topic, I learned of a benefit that some senior clients receive when purchasing a product.
    This made me curious about whether the app supported it, so I went to check it out (only some products, known to the business side, had this benefit).
    On top of this, I knew from the production logs that the clients hitting the error were seniors. So I thought: what if I combine the two pieces of information? Several minutes later I had reproduced the issue. Senior citizens trying to purchase a particular product that had a discount rule in the db/backend were hit with an error at purchase time due to a price-check mismatch, because the discount wasn’t recognized by the middle layer or frontend. A sketch of that mismatch follows after this list.
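
A minimal sketch of that mismatch, under stated assumptions: the product id, the age threshold, and the 10% discount are all invented for illustration, since the post doesn’t share the real rule.

```typescript
// Hypothetical sketch of the mismatch described above; all names,
// ages, and rates are illustrative, not the real system's.

interface Order {
  productId: string;
  customerAge: number;
  submittedPrice: number; // price the frontend/middle layer computed
}

// The discount rule lived only in the db/backend.
const SENIOR_DISCOUNT: Record<string, number> = { "product-42": 0.1 };

function backendPrice(productId: string, basePrice: number, age: number): number {
  const discount = age >= 65 ? SENIOR_DISCOUNT[productId] ?? 0 : 0;
  return basePrice * (1 - discount);
}

function checkout(order: Order, basePrice: number): void {
  const expected = backendPrice(order.productId, basePrice, order.customerAge);
  // The frontend/middle layer never applied the discount, so seniors
  // buying a discounted product always fail this check.
  if (Math.abs(order.submittedPrice - expected) > 0.001) {
    throw new Error("price check mismatch"); // the purchase-time error
  }
}

// A 70-year-old buying product-42 at the undiscounted price of 100:
// the backend expects 90, the client sends 100 -> error every time.
try {
  checkout({ productId: "product-42", customerAge: 70, submittedPrice: 100 }, 100);
} catch (e) {
  console.log((e as Error).message); // "price check mismatch"
}
```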

Debugging/pinpointing the bug:

  • I noticed a repeated API call that always got the same response code. I searched for that code on the Chrome/Google support forums and learned that Chrome automatically retries a request when it receives this HTTP response code (no other browser was doing it).

Solution:

  • The fix took 10 minutes: we switched to a response code the browser doesn’t retry, and it was released to prod the next day. A sketch of the idea is below.
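
The post doesn’t name the status codes involved, so treat these as assumptions: 408 is used below because Chrome is known to retry requests that receive it automatically, and 400 as a plain client error that no browser retries. A hypothetical Express-style sketch of that kind of one-line fix:

```typescript
// Hypothetical handler; the actual endpoint and codes weren't shared.
import express from "express";

const app = express();

app.post("/api/purchase", (_req, res) => {
  const priceCheckPassed = false; // stand-in for the check sketched above
  if (!priceCheckPassed) {
    // Before: res.status(408) -> Chrome silently re-sent the request in
    // a loop (~1000 calls/minute) until the tab was closed.
    // After: a plain client error, which no browser retries.
    res.status(400).json({ error: "price check failed" });
    return;
  }
  res.json({ status: "ok" });
});

app.listen(3000);
```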

I really enjoy this story: We can’t send email more than 500 miles
https://web.mit.edu/jemorris/humor/500-miles
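
For anyone who hasn’t read it, the punchline is a units coincidence: a config mishap left sendmail with an effective connect timeout of roughly 3 milliseconds, and light travels just over 500 miles in 3 ms, so hosts any further away could never connect. The arithmetic checks out:

```typescript
// Back-of-the-envelope check of the story's punchline: how far does
// light travel in the ~3 ms effective connect timeout?
const c = 299_792.458;         // speed of light in km/s
const timeout = 0.003;         // ~3 ms
const km = c * timeout;        // ≈ 899.4 km
const miles = km / 1.609344;   // ≈ 558.8 miles, just over 500
console.log(`${miles.toFixed(1)} miles`);
```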

I can’t think of any good ones from my own experience. I may have just blacked them out in my memory :stuck_out_tongue:


Problem:
While fixing a UI issue in a billing system, I noticed that the VAT calculation was incorrect and we were overcharging customers.

Journey:
In 2008 I noticed the issue. I spent about an hour looking at a sample of accounts to make sure it wasn’t just an edge case. Once I found that some cases were not impacted, I spent another hour looking across accounts. I then noticed a pattern: all of the impacted accounts had registered for self-serve on the new website in a certain month. An issue in the timestamp meant the back-end calculation in the financial system was applying an old VAT flag just for those users, triggered by that criterion. A sketch of the failure mode follows below.
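
To make the mechanism concrete, here is a deliberately hypothetical sketch; the rates, the change date, and the exact way the timestamp got mangled are all invented, since the post doesn’t share them:

```typescript
// Hypothetical sketch of the failure mode described above; rates,
// dates, and the corruption details are illustrative only.

const OLD_VAT = 0.175; // stale, higher rate -> the overcharge
const NEW_VAT = 0.15;
const RATE_CHANGE = Date.parse("2008-12-01T00:00:00Z");

// Set once at sign-up. A malformed timestamp from the new self-serve
// flow (say, day and month swapped for one month's cohort) lands
// before the rate change, so the legacy flag sticks to those accounts.
function legacyVatFlag(registeredAt: string): boolean {
  return Date.parse(registeredAt) < RATE_CHANGE;
}

// Every subsequent invoice then bills the old, higher rate.
function vatRate(hasLegacyFlag: boolean): number {
  return hasLegacyFlag ? OLD_VAT : NEW_VAT;
}

// "2008-12-05" mis-written as "2008-05-12" flips the flag:
console.log(vatRate(legacyVatFlag("2008-05-12T00:00:00Z"))); // 0.175, not 0.15
console.log(vatRate(legacyVatFlag("2008-12-05T00:00:00Z"))); // 0.15
```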

I flagged it to my team lead and was told to ignore it, as no customers had complained. I went above their head to compliance and legal, showed the evidence, how much we had overcharged, and that it was still happening. Legal drove the escalation to make it a P2 so we could fix it straight away. The lead was suspended. If we had not found this, it would have cost millions in fines, let alone the tarnished reputation.

Debugging/pinpointing the bug:
The whole team spent a couple of hours trying to fix it and couldn’t.
We were not looking at the correct logs in production (this was 2008).
Once we were aligned, I reproduced the issue within minutes, and a senior engineer was then able to fix it.


I’m trying to remember the details now, but this one did involve a lot of heat, though no fire. I worked at a company that acquired another company, one that added VOIP calls over internet networks to their custom wireless and DECT solution, and so we integrated their app into our device. The device is basically a telephone exchange that can connect to anything that can carry audio, mixing the sound with the kind of quality a mixing desk at a concert might give, unlike MS Teams, which only lets two people talk or sing at once, and badly.

Anyway, we all thought that the cause of the box’s memory faults was in the app code the newly acquired company had added. We eventually narrowed our suspicion to the SSL encryption library, a library that runs on millions of devices, and spent about 6 months not finding the cause of the random crash. The hardware team was convinced it was heat related. We bought thermal loggers in addition to the other debugging kit we already had; it never helped. Finally we set up a 2-week soak test, and one Sunday I went into the office to discover that the aircon turns off on weekends… On Monday none of our tests had crashed, and we quickly decided it was neither heat nor SSL related. Within days, one of the guys found that a power-line part near the CPU was to blame. We had rebuilt the SSL library dozens of times in vain, but it was fun. (Our electricity bill was phenomenal as well, it really was. Some bugs are not cheap, so don’t be shy when buying tools.)
