Performance Stress Testing - Exit Criteria and what % Contingency is allowed between Peak Load and Stress numbers

Hi all,
I’m currently looking to tighten up our Exit Criteria for Load and Stress testing. Currently our vendors run Load testing to the Peak Load defined in the system requirements, and the Exit Criteria are clear on this element. But our Stress Testing numbers are showing some systems with a contingency of only 20-30% over Peak Load, which is a concern if there is ever a surge.

e.g. Peak Load would be 50 concurrent users, and the Stress Test numbers show errors (the system does recover gracefully) at 70 users.
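
For clarity, the contingency in that example works out like this (a quick sketch using only the 50/70 figures from the example above):

```python
# Contingency headroom between peak load and the stress-failure point,
# using the figures from the example above.
peak_load = 50        # concurrent users defined in the requirements
stress_failure = 70   # load at which errors first appear

contingency_pct = (stress_failure - peak_load) / peak_load * 100
print(f"Contingency over peak load: {contingency_pct:.0f}%")  # 40%
```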

  1. What is the norm when it comes to Stress Test Numbers in terms of that contingency % of Peak Load?
  2. What do you use as the Exit criteria of a Stress Test?

I imagine this differs from system to system. As a company we have protective measures in front of public-facing systems, such as Queue-it and Cloudflare.


Hi Dave,

What you have here is the difference between stress and capacity tests:
Stress tests tend to have an open-ended load profile, ramping until the system breaks or the response times reach a pre-defined fail time.
Capacity tests tend to look at the maximum limits at which the system still meets the requirements.
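
As a rough illustration of the difference, a stress test keeps ramping until a failure condition is hit, while a capacity test stops at the required load. This is only a sketch: the step size, thresholds and the `measure_response_time` stub are assumptions, not anything from the post.

```python
def measure_response_time(users: int) -> float:
    """Stand-in for a real load-test measurement (hypothetical model:
    response time degrades sharply past ~70 concurrent users)."""
    base = 0.2 + users * 0.005
    return base if users <= 70 else base * (users - 68)

def stress_test(start: int, step: int, fail_time_s: float) -> int:
    """Open-ended ramp: keep adding load until the pre-defined fail
    time is reached; return the load at which the system broke."""
    users = start
    while measure_response_time(users) < fail_time_s:
        users += step
    return users

def capacity_test(required_users: int, max_time_s: float) -> bool:
    """Fixed target: does the system meet the response-time
    requirement at the required load?"""
    return measure_response_time(required_users) <= max_time_s

print("breaks at:", stress_test(start=10, step=10, fail_time_s=2.0))
print("meets 50-user requirement:", capacity_test(50, max_time_s=1.0))
```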

This looks more like a capacity-test definition issue. A question you should ask is ‘what are the requirements we have to meet from a capacity point of view?’

If you don’t have a defined set of exit criteria / requirements, a stress test that finds the limits of the system may well define the requirements for the capacity test, once the analysis has been done.

Without knowing the system set-up it’s hard to say what these would be, i.e. an auto-scaling cloud-based solution would be different from a fixed-resource data-centre solution.

I’ve been in places where the system needed to take 3 times peak load but recover, as long as the response times were still acceptable.

As with everything, until you get the requirements sorted out, you’ll never know what to test.

Hope this helps


Hi David,
Thanks for the detailed response. Currently the majority of our systems are not auto-scaling, cloud-based or container-based (I would love it if they were!).

As you say, it might be best that at the requirements stage we define what the peak load is (which is already happening), but add a further requirement that the system is capable of withstanding a stress of X times that peak load and recovering gracefully at that point.

The object of the stress test, obviously, is to find the breaking point of a system and to ensure it recovers. Not having guidelines or exit criteria in place when the system is only capable of a stress not much higher than the peak load is something our vendors are looking to exploit when it comes to requirements sign-off.



lol, this again comes back down to what your expected traffic profile is, what the risk is that you are way off on the numbers, and what kind of system you have. If it is a closed system where you can manage the user base, then you are in a better position than with an open system, where the traffic is uncontrolled.

Also, a spike test might be relevant if you have a predicted large number of requests in a short space of time to consider; this would probably reach the peak load, or definitely push the system beyond it. Again, it all depends on the SUT. A Black Friday deal website, for example, would need this tested to death nearer the Black Friday event, especially if there are good bargains to be had and the marketing campaign brings people from far and wide.

Another worry may be brought to light when the data grows to, say, 3-5 years’ worth: if the response times slow down then, would that bring the contingency level down below the peak load?

Then again, if the peak load will never be breached and the system performs OK up to 30% beyond the peak level for a defined period of time, then all might be well and the system has been designed to meet those limits.

It sadly all depends on what is acceptable. It sounds like you may be getting close to the vendor’s limits and they won’t want you going higher; it might take you saying ‘we need to be able to sustain 100% on top of peak for x minutes to pass a stress test’ to see if they start to sweat/cry :wink:

With all of this, it’s about managing the risk, and not sitting there on go-live day watching a peak load kill the system and give you all a headache that could have been avoided.


Nowadays servers can be upgraded. It can even be done automatically. A major limitation is money. Here is my personal view.

Suppose I have a cinema website.
If I upgrade the server for 1 Euro a day and sell 1 more movie ticket at 8 Euro a day, then I make an extra profit of 7 Euro a day. I could keep upgrading the server until it no longer sells extra movie tickets.

On the other hand, there are website visitors who only watch movie previews, which costs me money. In this case I might limit the number of movie previews, but that might also hurt ticket sales.

It is basically a question of how to maximise profit without losing too many customers.

In general this leads to the following stress load profile:

  • determine the peak load.
  • determine how the peak load is built up. What kinds of users are using the services?
  • determine how the stress load is built up. Are there any differences in the user mix between the peak load and the stress test?
  • determine the costs and profits of these services. This should be done with the figures from contracts.
  • determine the optimal configuration of the server.
  • determine the stress load based on that configuration, while taking graceful recovery into account.
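
Under the assumptions in the cinema example (1 Euro/day per upgrade step, 8 Euro per ticket), the ‘upgrade until extra tickets stop paying for it’ rule could be sketched like this. The demand curve is entirely hypothetical:

```python
def extra_tickets_per_day(capacity_units: int) -> int:
    """Hypothetical demand: each upgrade step sells a few extra
    tickets until demand saturates."""
    demand_curve = [5, 4, 2, 1, 0, 0]  # extra tickets per further upgrade
    return demand_curve[min(capacity_units, len(demand_curve) - 1)]

UPGRADE_COST = 1.0   # Euro per day per upgrade step (from the post)
TICKET_PRICE = 8.0   # Euro per ticket (from the post)

capacity = 0
profit = 0.0
# Keep upgrading while an extra step still pays for itself.
while extra_tickets_per_day(capacity) * TICKET_PRICE > UPGRADE_COST:
    profit += extra_tickets_per_day(capacity) * TICKET_PRICE - UPGRADE_COST
    capacity += 1

print(f"stop at {capacity} upgrade steps, extra profit {profit:.0f} Euro/day")
```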

NB: this approach might not be applicable to start-ups which want to attract lots of customers at a high cost.

This is an area where you do yourself a disservice by trying to put in rigid, easy-to-acquire metrics. At its core, this is all about what the business is willing to accept. Do you want to pay an additional $1000 per hour to avoid a total service outage of 1 hour once per month? There is no ultimate right answer to that question; it is a conversation that the people responsible for the business have to have. So the job of the tester is not to hold a set bar to see if you pass or fail, but to supply the best information possible to the business so they can make better decisions.

As already covered, exploring different load profiles and measuring the level of the outage and the recovery time is typically good information to produce, and it is even better if those measures are well understood by stakeholders, developers and testers. The other thing we did, more as a service to developers, was to measure the delta: before this change we could serve 50 users without any congestion in the system, now we can only serve 49. The reason for that might be in line with what the business already knows and has accepted, or it might be an unintentional mistake made along the way. Again, information.
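
That delta measurement (50 congestion-free users before a change, 49 after) could be automated as a simple regression check; the function name and report shape here are illustrative, not from the post:

```python
def capacity_delta(baseline_users: int, current_users: int) -> dict:
    """Compare the congestion-free capacity before and after a change
    and report the delta as information for the business."""
    delta = current_users - baseline_users
    return {
        "baseline": baseline_users,
        "current": current_users,
        "delta": delta,
        "regression": delta < 0,
    }

# Figures from the post: 50 users before the change, 49 after.
report = capacity_delta(baseline_users=50, current_users=49)
print(report)  # {'baseline': 50, 'current': 49, 'delta': -1, 'regression': True}
```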

Hope it helps.


Sorry if this is stuff you’ve already thought of, but my loophole-detector started twitching. To echo what others have said, I think it’s worth grounding this in clear business objectives, which could then guide the details.

Is the point of the testing to cope with certain times of the day / month / year that are known to be busy? Or is it trying to build in a more general safety margin, to protect the user experience day-to-day? Or something else?

The reason why I mention this is that it helps to define some details you’re probably interested in. A friend of mine used to work on IT systems for schools. They had big spikes very regularly - e.g. submitting the attendance register at the start of the school day, another spike at the end of the school day etc. The spikes were mostly users doing particular operations, which taxed the system in a particular way. So, a test that checked that N users could access the system simultaneously, split equally over e.g. 4 operations wouldn’t be enough in their case.
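
To illustrate that last point, a load profile weighted by what users actually do at the spike (rather than an equal split across operations) might look like this; the operations and weights are made up for the school example:

```python
import random

# Hypothetical spike profile for the school-system example: at the start
# of the day, most load is attendance submission, not an equal 25% split.
SPIKE_PROFILE = {
    "submit_attendance": 0.70,
    "view_timetable":    0.15,
    "login":             0.10,
    "admin_reports":     0.05,
}

def pick_operation(rng: random.Random) -> str:
    """Choose the next simulated user's operation according to the
    weighted spike profile."""
    ops, weights = zip(*SPIKE_PROFILE.items())
    return rng.choices(ops, weights=weights, k=1)[0]

rng = random.Random(42)
sample = [pick_operation(rng) for _ in range(1000)]
print("attendance share:", sample.count("submit_attendance") / len(sample))
```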

Another detail relates to the duration of user sessions. Is the test for 50 users arriving per minute, or 50 users in the system? If user sessions take e.g. 2 minutes, then 50 users arriving per minute means 100 users in the system. This also ties back to the ‘what operations are the users doing?’ question: how many people are logging in / out per minute is affected by both arrival rate and session duration.
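
That arrival-rate vs. users-in-system distinction is just Little’s Law (L = λ × W): average users in the system equals arrival rate times average session duration. A quick check with the figures above:

```python
def concurrent_users(arrival_rate_per_min: float, session_minutes: float) -> float:
    """Little's Law: average number of users in the system equals
    arrival rate times average time spent in the system."""
    return arrival_rate_per_min * session_minutes

# 50 users arriving per minute, each staying ~2 minutes:
print(concurrent_users(50.0, 2.0))  # 100.0 users in the system on average
```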
