Toughest Bug Hunt

As I was recounting one of the WEIRDEST bugs I have ever dealt with to a team, I wanted to pose the question to the MoT community:

  • What’s the most elusive bug you’ve ever tracked down?
  • How did you find it, and what did you learn from the experience?
4 Likes

While working for an A/V company that made surveillance equipment for law enforcement, I tracked down a bug in the processing software that knocked things out of sync when the back camera was triggered, and ONLY if the radio was being used!

4 Likes

I’ll have a longer write up on my blog later this month, but the short version is:

We had a pipeline running a test suite a few times a day. It started to fail intermittently. I realized that when it failed, it was always the first run of the day that failed. Eventually we discovered that some of the machines had drifting clocks and showed a time about 15 minutes behind. The first run of the day was around midnight UTC, and if the pipeline happened to run one of the few tests prone to this specific issue within those 15 minutes, the test would fail.

So basically you had to run these tests on the same hardware that Jenkins was running them on, and there were only about 15 minutes each day when the issue could be reproduced. As you can imagine, learning all that was a pretty slow process - I think it took us about a week.
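To make the 15-minute window concrete, here is a minimal sketch. The thread does not say what the affected tests actually checked, so this assumes a hypothetical test that compares "today's date" as seen by the drifting machine with the date on a correct clock. All names (`SKEW`, `skewed_now`, `dates_agree`, `failing_window_minutes`) and the sample date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the clock runs 15 minutes behind, as described in the post.
SKEW = timedelta(minutes=15)


def skewed_now(true_now: datetime, skew: timedelta = SKEW) -> datetime:
    """Time reported by a machine whose clock runs `skew` behind."""
    return true_now - skew


def dates_agree(true_now: datetime, skew: timedelta = SKEW) -> bool:
    """True if the drifting machine and a correct clock see the same UTC date."""
    return skewed_now(true_now, skew).date() == true_now.date()


def failing_window_minutes(skew: timedelta = SKEW) -> int:
    """Count the minutes of one UTC day in which the two dates disagree."""
    midnight = datetime(2024, 1, 2, tzinfo=timezone.utc)  # arbitrary sample day
    return sum(
        not dates_agree(midnight + timedelta(minutes=m), skew)
        for m in range(24 * 60)
    )


if __name__ == "__main__":
    print(failing_window_minutes())
```

Under these assumptions the dates disagree only from 00:00 to 00:14 UTC, which is why a run that starts around midnight is the only one that can hit the problem.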

Now, you might wonder why these machines had drifting clocks. They had an NTP client set up, but the data center admins filtered out all NTP traffic to outside servers - you had to use the NTP server internal to the data center, and we didn’t know about it. The NTP client was also super unhelpful, because it appeared to work - none of the commands we ran told us that it could not connect to the specified server :person_facepalming:
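One way to avoid a silent failure like this is a check that fails loudly when the local clock strays too far from a trusted reference. The sketch below is only an illustration, not what the team did: `check_clock`, `ClockDriftError`, and the 5-second tolerance are hypothetical, and how the reference time is obtained (for example, from the data center's internal NTP server) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


class ClockDriftError(RuntimeError):
    """Raised when the local clock is too far from the reference clock."""


def check_clock(
    local_now: datetime,
    reference_now: datetime,
    tolerance: timedelta = timedelta(seconds=5),  # assumed tolerance
) -> timedelta:
    """Return the local clock's offset, or raise if it exceeds `tolerance`."""
    offset = local_now - reference_now
    if abs(offset) > tolerance:
        raise ClockDriftError(
            f"local clock is off by {offset.total_seconds():.0f} s"
        )
    return offset


if __name__ == "__main__":
    # Simulate the situation from the post: a clock 15 minutes behind.
    reference = datetime(2024, 1, 2, 0, 5, tzinfo=timezone.utc)
    drifted = reference - timedelta(minutes=15)
    try:
        check_clock(drifted, reference)
    except ClockDriftError as exc:
        print(f"refusing to run tests: {exc}")
```

Run as a first pipeline step, a check along these lines would turn a week of guessing into an immediate, explicit error.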

2 Likes

Wow!!! That would have driven me crazy :laughing:

We had to merge the similar-but-diverged codebases for two slightly different centrifuges at the start of the project. You’ve heard of pair programming? This was quadruple-programming—four of us sitting in a conference room with one driving and projecting and the rest of us helping figure out the merge and catching mistakes.

But we missed one.

Months later we were running into a subtle bug in the rotor identification code (let’s just say the instrument had sensors that allowed it to detect which kind of rotor was installed once it was spinning fast enough). I spent days trying to track it down, and I finally printed off like six pages of code, laid them out on a tabletop, climbed up on the table and walked through them line by line, crossing off lines as I checked them.

I finally found an incorrect level of nesting that we had introduced in the original merge.

2 Likes

Oh… My … Word! Nesting will be our defense against AI I tell you. That is crazy!