Load testing on Production: is it the right thing to do?

Hi All,

I’m currently working with a team where we need to solve our load testing problems.

Problem

Until now we have run our load tests against a staging environment. However, we have found some issues with this, and have now decided to run them on production when customers are not on the platform.

My thoughts

To me this does not feel like the best approach; my approach would be to get our staging environment into a production-like state.

Question
It would be great to hear from anyone who has experience with this. What did you do? Were there benefits/drawbacks to testing on Production?

Thanks

2 Likes

No, for the same high-level reasons you don’t test security in production.
It’s furthermore a false economy, and it seriously impacts your ability to shift left if you cannot load test in dev or in staging. Not to mention that it is guaranteed to annoy your customers.

1 Like

What are the issues you have with testing in PreProd?

If you were to test in Prod what would happen if your load testing killed it?

2 Likes

We did this once at my last job. Running any type of automation in a prod environment makes me nervous. We were lucky that we didn’t break anything. Or corrupt company data. Or corrupt customer data. Better to create an environment with the same setup as prod and run there.

3 Likes

When I worked with performance testing, we mostly divided it into two parts with different goals in mind. The first part was for dimensioning the system: we tested each service individually to see its limits, then tested a few integrated services to see the interference between them, to build a better understanding of where the current capacity is and where the problems are. As with all performance or optimisation work, you want to control your variables; if you have a fully integrated system with traffic on it that you do not control, it is very hard to learn much from the results.
The other part of the performance testing used a more integrated system, where the variable was the traffic model (what the users are doing). This was aimed more at seeing whether changes in different services changed the performance of the system. This product was running on several different setups across the world, which is why the idea of a staging environment was never a thing, and I think that is key here.
My takeaway from that setup is that what you care about is the delta and not the absolutes. I.e. if you have a scaled-down, controlled performance environment that can handle 1,000 simultaneous users with a response time within your target range, and your production system can handle 1,000,000, you can still see whether a change in the system had a negative impact on performance, which with high probability means that production will have it too. I think we only had 2 exceptions to that during the 3 years I worked with it. And now you can also play around with various dimensions, which you cannot do in production.
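To make the "delta, not absolutes" idea concrete, here is a minimal sketch of how two runs in the same scaled-down environment could be compared. The function names, the choice of the 95th percentile and the 10% tolerance are all assumptions for illustration, not something from the setup described above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def relative_delta(baseline_ms, candidate_ms, pct=95):
    """Relative change in the chosen percentile between two runs.

    Both runs must come from the same controlled environment,
    otherwise the delta says nothing about the change under test.
    """
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    return (cand - base) / base


def is_regression(baseline_ms, candidate_ms, tolerance=0.10, pct=95):
    """True when the candidate is slower than the (assumed) tolerance allows."""
    return relative_delta(baseline_ms, candidate_ms, pct) > tolerance
```

The absolute numbers from such an environment are not meant to predict production capacity; only the direction and size of the change between the two runs is used.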

Across my career, a staging environment has never been a good substitute for a production environment. Since staging environments are typically driven by a “blueprint” copy of production, with neither the same data in the system nor the same traffic model, they are in fact so different from production that for performance purposes they are virtually useless. So I would push for having a smaller, single-purpose environment just for performance testing.

As for testing performance in production, I would never connect a load-generating tool to production. But what we did have, when we had a product that was a single hosted system, was performance agents running in the system that could alert if performance changed. That uses the fact that production has the correct data and the live traffic model. We then had the same agents running against our performance environment too, so we had comparable reports and could get early notifications when things were happening in production that might need to be addressed before all users felt it. And with the performance testing we also had good knowledge of what was probably happening, so the troubleshooting part was faster.
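As a rough illustration of what such an agent might do, the sketch below keeps a rolling window of observed latencies and flags when the window's median drifts well above a known baseline. The class name, the window size and the 1.5x threshold are hypothetical; the agents described above may have worked quite differently.

```python
from collections import deque
from statistics import median


class LatencyWatch:
    """Passive check: observes real traffic, never generates load."""

    def __init__(self, baseline_ms, window=50, threshold=1.5):
        self.baseline_ms = baseline_ms      # expected latency, e.g. from the performance environment
        self.threshold = threshold          # assumed alert factor
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms):
        """Record one sample; return True when an alert should fire."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to judge
        return median(self.samples) > self.baseline_ms * self.threshold
```

Running the same check against both production and the performance environment is what makes the two sets of reports comparable.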

@greetomosquito has pointed out the biggest reason never to do this in prod. It’s unsafe to load test with a test that has write permission in the system, for so many reasons that the mind just boggles. For that reason, at the end of the day, your load test in prod can only test 50% of the load scenarios, which is not a great safety blanket at all really. I have to say, though @ola.sundin said it already, that performance monitoring is super critical to any prod system, and it’s worth having the knowledge and ability to profile more of your live system at certain times, to find bottlenecks without having to add artificial load.

1 Like

I don’t disagree with anything that’s been said above but I recently had cause to load test a new site which was in public beta.

Genuine user traffic was low at that point, but to avoid any disruption to users we configured our web server proxies to point real users to one data centre and JMeter load traffic (coming from a specific internal environment) to our other data centre. This meant that one half of our production stack was being tested under load.

That approach worked well for us at the time but granted it might not suit everyone.
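The routing rule itself lived in the web server proxy configuration, which is not shown here. Purely to illustrate the decision being made, here is a small sketch in Python; the network range and the data centre names are made up for the example.

```python
import ipaddress

# Hypothetical range used by the internal load-generation environment.
LOAD_TEST_NETWORK = ipaddress.ip_network("10.20.0.0/16")


def choose_data_centre(client_ip):
    """Send load-test traffic to one data centre and everyone else to the other."""
    if ipaddress.ip_address(client_ip) in LOAD_TEST_NETWORK:
        return "dc-load"   # hypothetical name: the half of the stack under test
    return "dc-live"       # hypothetical name: the half serving real users


print(choose_data_centre("10.20.5.7"))
print(choose_data_centre("203.0.113.9"))
```

Keying on the source network means real users can never land on the half being tested, as long as the load generators are the only hosts in that range.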

1 Like

I guess your company accepts the risk of following this practice. If it is a one-off exercise, then understood, but if it is a regular one there are many associated risks. One I can think of immediately is someone accidentally running a test when they are not supposed to.

Testing what? Multiplayer games, nuclear power monitoring software, airport traffic control management, water testing results management, e-commerce for a tiling company… they all give different answers.

Depends on what your product is, who uses it and when, if you can batch your inputs to run later, what the risks are if it all goes down, if you have proven effective backup recovery when you interfere with customer data, what the risks are to security (e.g. performance can be used as a hacking bottleneck to trigger failsafes, or introducing overprovisioning gives a larger attack surface), how big your team is, how long your performance tests take and if they might need to be fixed and/or re-run, how frequently automatic checks change, how frequently and where the product changes, what schedules you run at, if the product is used internationally, how many servers you have and where they are, how important the results are, if user data is used in other testing and must remain untainted, and so on.

So you have to look at your situation and do a decent risk analysis. General heuristic: the more important your product is the less of a good idea testing in production becomes.

4 Likes