Tester Talk Tuesday: Debugging Dilemmas!

What’s the trickiest bug you’ve ever encountered, and how did you conquer it?

2 Likes

Most of these seem to have faded from memory.
One I had to figure out wasn't a defect in the product code, but a configuration issue in production.

Our architecture used Kubernetes to run worker pods. This was a global application with hundreds of B2B customers, each with multiple projects, and each project caused dozens to hundreds of these worker pods to launch, work, and then quit. So we could have thousands of worker pods running in various global regions, and some of them could be working for days or more. Our system sent commands to those pods via Redis Cache: it would queue a command in Redis, which would then be picked up and delivered to the correct pod.
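(For anyone unfamiliar with the pattern, here's a rough sketch of what that queueing looks like with the Python redis client. The key names, connection details, and command payload are made up for illustration; they're not our actual schema.)

```python
import json
import redis

# Connect to the Redis Cache instance (host and credentials are placeholders).
r = redis.Redis(host="example.redis.cache.windows.net", port=6380,
                ssl=True, password="<access-key>")

# Control-plane side: queue a command destined for a specific worker pod.
command = {"pod_id": "worker-pod-42", "action": "start_task", "task_id": "abc123"}
r.lpush("pod-commands", json.dumps(command))

# Worker side: block until a command arrives, then act on it.
_key, raw = r.brpop("pod-commands")
cmd = json.loads(raw)
print(f"received {cmd['action']} for task {cmd['task_id']} on {cmd['pod_id']}")
```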

We were having an issue where the system would frequently signal "hey, start a pod task" but execution was delayed, sometimes so delayed it might as well not have happened at all.

I couldn't find any issues in Datadog, the databases, etc., and I couldn't reproduce the issue in testing. I just kept going over each and every thing in Azure trying to find something out of place. It took me a couple of days, but eventually I noticed that the Redis usage graphs looked like an old-fashioned memory leak: more and more resource usage, never less, until it hit an upper limit and stuck there. That's when the downstream issues would start showing. I scaled the service up and things worked properly again, which got us temporary relief.

I then ran some queries against the Redis instance and noted there were many very old semaphores. After some internetting, I discovered that a default configuration of Redis in Azure doesn't expire and terminate semaphores (except under duress, IIRC), so our system was waiting and retrying adding new semaphores while Redis reluctantly removed old ones.
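(If you ever want to run a similar check against your own instance, this is roughly the kind of query I mean, again using the Python redis client. The "semaphore:*" key pattern is just an illustration, not our real naming.)

```python
import redis

r = redis.Redis(host="example.redis.cache.windows.net", port=6380,
                ssl=True, password="<access-key>")

# Walk the keyspace with SCAN (non-blocking, unlike KEYS) and flag
# semaphore keys that have no expiry set.
stale = []
for key in r.scan_iter(match="semaphore:*", count=1000):
    if r.ttl(key) == -1:  # -1 means the key exists but has no TTL
        stale.append(key)

print(f"found {len(stale)} semaphore keys with no expiry")
```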

I updated the Redis config with a flag I had found in my research that tells Redis how to expire and terminate semaphores that are old or unused. Bingo! Smooth as butter.
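(The exact config flag isn't the interesting part; the same idea can also be approximated client-side by giving every semaphore key an explicit TTL when it's created, so stale ones age out on their own even if the server-side config is wrong. A rough sketch with the Python redis client, using made-up key names:)

```python
import redis

r = redis.Redis(host="example.redis.cache.windows.net", port=6380,
                ssl=True, password="<access-key>")

# Acquire a semaphore-style key with an explicit expiry:
#   nx=True -> only set it if it doesn't already exist
#   ex=3600 -> auto-expire after an hour, even if it's never released
acquired = r.set("semaphore:project-42:slot-1", "worker-pod-42", nx=True, ex=3600)
if acquired:
    print("semaphore acquired; it will expire on its own if abandoned")
else:
    print("semaphore already held by another worker")
```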

The defect wasn't a code defect, but it was a configuration defect resulting from a number of other mistakes (this is the result of Root Cause Analysis via the "Five Whys"):

  • Architecture Defect: no one investigated Redis enough to understand this behavior ahead of time.
  • Observability Defect: there was no monitoring in place to show that worker pods were failing to start, that Redis had hit capacity and stayed there, or that our system was retrying semaphores.
  • Testing Defect: there was no dedicated performance testing (in my defense, this was an organizational decision; the performance story had been "throw more cloud at it, figure out the efficiency afterwards," and I had identified this as a known risk in our evaluation of quality).

The result was that we didn't know about the issue until we started getting support tickets about it.

3 Likes

The trickiest bug I ever encountered was a bug in a compiler that caused debuginfo to be corrupted when the executable became large enough.

It had zero impact outside the organization because debuginfo was only compiled into the builds used by the test team for automation - but it rendered our massive collection of test automation (which had been under continuous development for over 10 years) useless, and nobody in the team had any idea why.

I wound up working closely with one of the senior developers to track down the problem, and we experimented with ways of splitting things so that our automation didn’t need to be completely rebuilt. That project was partly done when I was laid off so I don’t know how it ended, nor do I know when/if the organization that created the IDE fixed the problem.

I do know that it was insanely hard to track down. The symptoms were that the test automation would crash out at a random point. None of the system info snapshots indicated anything out of the ordinary. Rerunning the automation would produce a different crash at a different point. If I tried to debug with the automation tooling IDE, I’d get yet another different crash.

It took the developer using the code IDE to inspect the debuginfo to realize it was corrupted, and to work out what had happened. I think the two of us haunted each other’s cubes for weeks before he figured it out.

The debuginfo was necessary for the automation: without it, the tooling wasn’t able to inspect the various components used in the software (this was a desktop application) and was limited to point-and-click type locators (this was more than 10 years ago).

I had to refactor some of the core parts of the automation to handle the changes that were made to wrap the debuginfo into a separate DLL and was most of the way through that when I was laid off. I have no idea how it ended.

1 Like

@msh WOW!!! That's quite a journey to uncover the bug.
Your perseverance and detailed investigation are commendable. It's a valuable lesson in the importance of exploring configurations and system behaviors.

Thanks for sharing this insightful experience and the comprehensive RCA.
Kudos for turning challenges into opportunities for improvement! :clap: :clap:

Extra thanks for putting this much effort into the write-up. :smiling_face_with_three_hearts:

1 Like

@katepaulk Thank you for sharing your incredible journey tackling that compiler bug! The complexity of the debugging, the collaboration with a senior developer, and the dedication to find a workaround truly showcase your resilience.
It's unfortunate the project had to pause, but your efforts and insights undoubtedly left a lasting impact on the team's problem-solving approach.

Wishing you continued success in the future! :star2:

It's a somewhat common problem for debug packages to be corrupted by the huge memory budget they require on some platforms. Like you, I have built up an aversion to debug binaries after hitting this at an expensive time as well.

@ansha_batra , @conrad.connected - thank you both! It was immensely frustrating when it happened, and a huge relief when we found the cause and a viable workaround.

Thankfully, I’ve been working with web applications and the occasional API since then, which involves a lot less hassle.

Conrad, I sympathize. Compiler issues aren't common, but when they happen they're a real pain to deal with. And I agree: if I'm working with a standalone application in the future, I definitely want to avoid compiled-in debug info and massive binaries.

1 Like

Also, unfortunately, for mobile apps we have to enable the "debuggable" attribute to allow automation, so anyone who tests Apple apps has to have a chat with their developers about all the build artefacts. Nowadays I actually revel in the debug nightmares, because they prove that the problem was hard and that each of us is skilled enough to crack that nut together if we just step back far enough from the hot pain at the time.

1 Like