What does Reliability Testing mean to you?

Hi! I’m interested to hear your thoughts on reliability testing and what it means to you. Can you describe how you would do reliability testing in your context?

The reason I’m asking is that where I work, we run a specific test called the “reliability test”, which means that we run and use the building management system we’re developing for 21 days in a customer-like setup and monitor for unexpected failures. It’s important that the system can run for a very long time without crashing, and to simulate the usage of multiple years in 21 days, we say that we “age” the system by logging on more frequently and doing things in it more often than a normal user would do in a normal day.
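As a back-of-the-envelope sketch of the arithmetic behind this kind of “aging” (the numbers here are hypothetical, not from the actual test plan), the acceleration factor is just the ratio of simulated lifetime to wall-clock test time:

```python
# Hypothetical figures for an accelerated "aging" run.
# None of these numbers come from a real test plan; adjust to your context.

SIMULATED_YEARS = 3          # lifetime we want to approximate
TEST_DAYS = 21               # wall-clock length of the reliability test
NORMAL_LOGINS_PER_DAY = 5    # assumed rate for a typical user

# How much faster than real life the test must run
acceleration = (SIMULATED_YEARS * 365) / TEST_DAYS

# Resulting login rate the test harness would need to drive
logins_per_test_day = NORMAL_LOGINS_PER_DAY * acceleration

print(f"Acceleration factor: {acceleration:.1f}x")
print(f"Logins needed per test day: {logins_per_test_day:.0f}")
```

One thing this makes visible: the factor only compresses *usage volume*, not *elapsed time*, which is exactly the distinction the rest of this thread debates.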

We are debating whether this approach is reliability testing or more of a stress test. On one hand, stressing the system really hard for a short period of time would surface issues related to that, and fixing them would make the system more reliable under stress. But what about the long-term, less stressful situation? This boils down to: are we doing the right thing?

It would be very interesting to hear from others what meaning you give to reliability, to gain some inspiration for upcoming discussions :slight_smile:


To me “reliability” means something like “Is it adequate, even in potential failure scenarios, at all times?”

So: does the product work well over a long time under reasonable conditions, does it handle errors gracefully, will it recover from failure properly, and if it fails, does it do so with minimum damage and interruption (to data, uptime, hardware, human life, whatever we think is important)?

I don’t know if you’re doing the right thing or not. Simulating the usage of multiple years is not the same as running it for multiple years - for example the system date probably never ticks over to the next year and leap years are not accounted for. But the important thing is… does it matter? Take every way that you can think of where “run it for 21 days” differs from “run it for years” (because that is the minimum coverage you are missing from this simulation) and decide whether it matters or not. If it matters then you have found an untested risk.
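To make the year-boundary point concrete, here is a minimal Python sketch (the scheduling function is hypothetical, purely to illustrate the class of bug): naive “same date next year” logic breaks on leap days, and a 21-day run that never crosses that boundary cannot catch it.

```python
from datetime import date

def next_yearly_run(d: date) -> date:
    """Naive yearly scheduler: same calendar date, one year later.
    Illustrative code only, not from the system under discussion."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Feb 29 has no equivalent in a non-leap year; fall back to Feb 28.
        return d.replace(year=d.year + 1, day=28)

# A 21-day test run in, say, March will never exercise the leap-day path:
print(next_yearly_run(date(2024, 2, 29)))  # leap day rolls to 2025-02-28
print(next_yearly_run(date(2024, 6, 15)))  # ordinary date rolls to 2025-06-15
```

Without the `except` branch, the first call simply raises a `ValueError`: exactly the kind of failure that only appears when simulated time actually crosses a calendar boundary.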

Remember what your mission is when you’re testing. Here it sounds like you’re testing specifically for use over very long time periods - so think about what that means. Maybe the hardware dies, so what happens when the hard disk fizzles or the memory chip burns out? Data corrupts on disk over time, so what about data corruption? What if it needs to be taken down for maintenance? What if some third party stops supporting your technology (like Chrome did with Flash, or Watir did with Firefox, for example)? What if you want to upgrade the hardware or software - how would you do that? Things that happen rarely ALWAYS happen over a long enough time period - a lightning power surge, a power cut, a flood, a hacker gets in, a disgruntled employee trashes a server, someone spills lemonade in the cooling vent, you lose a database. Think about what could happen, then, if it matters to you and your clients, think about whether it’s worth investigating.

And overall, don’t forget to think about cost. Maybe the system falling over in certain ways doesn’t matter to you - you could call a hardware failure “not our problem” in certain contexts - not that I’d advise fully committing to that.


I totally agree with what @kinofrost says there. Reliability testing is a combination of robustness, fault tolerance, error handling, and supporting users in unexpected situations. In addition to what @kinofrost mentioned, think about external factors that affect a typical software transaction. Say a payment is being made online, and due to a page refresh or an internet connection blip the transaction fails with an unknown error. How does the system behave in scenarios like these?

Remember major product launches and their failures, like the Windows 98 launch demo with its infamous BSOD. I am sure no one wants to see that: https://www.youtube.com/watch?v=IW7Rqwwth84

A reliable system can handle the weirdest of inputs (both internal and external) and still deliver consistent outcomes.

To me, the stated approach (correct me if I am wrong here) - “It’s important that the system can run for a very long time without crashing, and to simulate the usage of multiple years in 21 days, we say that we ‘age’ the system by logging on more frequently, doing stuff more often in it, than a normal user would do in a normal day” - seems flawed, because over a span of months the underlying libraries change, the hardware changes, and the users are not the same. So even if you automate the application and execute multiple runs of the same test scenarios day in and day out, it isn’t going to find anything new.

I’d suggest doing an actual beta of the application to assess its reliability; that will give more accurate results.