Epic software bugs - major business losses and heavy user impact

I’ve been reading the article 37 Epic Software Failures that Mandate the Need for Adequate Software Testing, and I assume many of you have read it already :slight_smile: but I think we, as professionals from the MoT community, could create our own list of epic bugs :thinking: :sweat_smile: The examples in the article are famous, but probably many of us have had some significant bugs too. Personally, I’ve missed some major and critical bugs, but I can’t say they cost the companies much, certainly nothing like millions of dollars, even though a lot of users were affected :smiling_face:

2 Likes

I’ve been pondering this ever since a thread here on the same topic, epic bugs, about a month back, mainly because I know some bugs have slipped past me. But I’m part of a team, and if at any point I do knowingly let errors through, I cover myself mainly by using my test plan to describe the assumptions we made around dependencies and timelines. For example, my test plan, which sits just behind the release notes review page, says: “We plan to only test the security flows because the release contains crypto updates, anything else just gets a smoke test.”

And this is what defines “adequate”: it’s only ever about spending the right amount of time, because time is money. So my test plan uses the time estimates from the last iteration to justify spending the same amount of time again. It’s a bit like hand washing before surgery: you cannot wash your hands any faster just because you are a neurosurgeon. That does not mean mistakes don’t happen, it just means certain classes of mistake don’t happen. Environmental factors will always change, and the bigger the company, the more pain any change causes. For example, we support 3 OS releases, which means that Android 14, 13, 12 and possibly 11 will get tested. However, I know that our app runs just fine even on Android 8. But any security-minded enterprise that is not mandating Android 10 is brain-dead, so my test plan says: “Android 10 is the lowest version and will be smoke-test only.” We state clearly in the test plan that we really don’t test anything below 10; doing so would just mean pulling out a really slow device. Of course, if someone wants us to test more, we will, but they will have to wait for us to do that every time.

Clear communication of business constraints.

3 Likes

In the 1990s I used to work for one of the UK’s largest intruder alarm system manufacturers. We sold our products to distributors, who sold them to individual installers, sometimes through other intermediaries. The result was that we had pretty much no idea who our end clients were.

One of our control panel designs was 8 years old and very reliable, but a few months after a software revision we noticed an increase in returns. Investigation showed that the non-volatile memory chips were burnt out due to excessive write cycles. It turned out the new software was writing to the NVM every minute instead of once a day.

Unfortunately for us, the NVM chips were far more robust than their specification stated. They should have failed within a week or so, in which case we would have found the issue during design and testing. But even the worst ones lasted months.
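To see why a failure "within a week or so" was the expectation, here is a back-of-the-envelope calculation. The rated endurance of 10,000 write cycles is purely an assumption for illustration (the post does not give the chip's specification); only the two write rates, once a day and once a minute, come from the story.

```python
MINUTES_PER_DAY = 24 * 60  # 1440 writes per day at one write per minute

# ASSUMPTION: illustrative endurance figure, not the real chip's datasheet value.
RATED_WRITE_CYCLES = 10_000


def days_until_wearout(rated_cycles: int, writes_per_day: int) -> float:
    """Days until the rated number of write cycles is used up."""
    return rated_cycles / writes_per_day


once_a_day = days_until_wearout(RATED_WRITE_CYCLES, 1)
every_minute = days_until_wearout(RATED_WRITE_CYCLES, MINUTES_PER_DAY)

print(f"one write per day:    {once_a_day:.0f} days")
print(f"one write per minute: {every_minute:.1f} days")
```

Under that assumed rating, the old write rate would take decades to wear the chip out, while the new rate uses up the rated cycles in about a week. Chips that outlast their rating by a wide margin stretch that week into months, which is what hid the problem.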

What started as a trickle of returns turned into a deluge over the next year, and we knew that every product we had shipped would fail soon. And we had shipped about 30,000 before we knew there was a problem. But we didn’t know who had bought them or where they had been installed. We notified the distributors, but many had walk-in sales outlets and had no idea who they had sold to.

Every product had to be fixed by going to site and replacing the NVM and the ROM chip - there were no over-the-air updates back then. Since the NVM had contained the system configuration parameters, the alarm system then needed to be reprogrammed and tested. It cost the company a staggering amount, financially and reputationally.

We were left to consider how we could have found that bug or bugs of that type, and concluded that there was no realistic scenario in which we could have done so. And then lightning struck again, but that’s another story.

2 Likes

Having barely recovered from the NVM chip failure debacle I described earlier, lightning struck us again within a year. At the time, the new failure was kept very quiet for reasons that will become obvious, but that was nearly 30 years ago so it’s safe to tell the tale now.

We manufactured a highly secure (for the time) intruder alarm communication system known as Red Care - you may have seen the stickers on high value premises such as jewellers. If an intruder alarm was triggered, Red Care ensured that the Police would be notified. And no, you couldn’t just cut the wires like they show in the films.

The design was stable, with just the occasional small software update. But suddenly we were inundated with returns. All the control boards were failing exactly 45 days after being powered up. I’m a bit hazy on the details, but my recollection is that a counter overflow was occurring, which led to a coil or transformer being over-driven and burning out.

Again, every product we had shipped was guaranteed to fail very soon and it would take weeks or months to replace them. Burglars would have had a field day if this had become public knowledge.

Again, we concluded that there was no realistic scenario in which we could have found that bug. We used to turn off the equipment at night, so the counter would reset. But even if we left it on, our test window was only about a week so we still wouldn’t have seen the issue.
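As an illustration of how a counter overflow can land on a figure like 45 days: a 16-bit counter incremented once a minute wraps after 65,536 minutes, which is about 45.5 days. That is only a hypothetical that happens to match the number; the actual counter width and tick rate in the product are not known. The sketch below steps such a counter through simulated minutes and shows that a one-week run never reaches the wrap. The names `MinuteCounter` and `first_wrap_after` are invented for the example.

```python
from typing import Optional

COUNTER_BITS = 16                 # ASSUMPTION: hypothetical counter width
WRAP_AT = 1 << COUNTER_BITS       # 65,536 ticks before the counter wraps
MINUTES_PER_DAY = 24 * 60


class MinuteCounter:
    """Hypothetical unsigned counter that is incremented once per minute."""

    def __init__(self) -> None:
        self.value = 0

    def tick(self) -> bool:
        """Advance one minute; return True if the counter wrapped to zero."""
        self.value = (self.value + 1) % WRAP_AT
        return self.value == 0


def first_wrap_after(max_minutes: int) -> Optional[int]:
    """Simulate up to max_minutes and return the minute of the first wrap."""
    counter = MinuteCounter()
    for minute in range(1, max_minutes + 1):
        if counter.tick():
            return minute
    return None


one_week = first_wrap_after(7 * MINUTES_PER_DAY)
sixty_days = first_wrap_after(60 * MINUTES_PER_DAY)

print("wrap within one week:", one_week)
print("wrap within sixty days: minute", sixty_days,
      f"(day {sixty_days / MINUTES_PER_DAY:.1f})")
```

A simulated clock like this only helps if the counter logic can be run separately from the hardware and somebody thinks to ask what happens at the wrap, so it is not a claim about what was practical on that product at the time.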