Testing / QA in Production

I saw this article on QA in Production and it made me wonder if there were any other resources out there on the topic.

There are a couple of TestBash talks here:

Some other article:

Anyone else have useful resources or stories on testing in production to add to this?


Netflix's Chaos Monkey has always intrigued me. Imagine having bots kill elements in your production environment, on purpose, to test the resilience… “in the wild”. Related story:

Usually testing in production is labelled “shift right”.

On https://www.linkedin.com/pulse/8-reasons-shift-testing-right-lanette-creamer/ Lanette writes about Uptime Testing, Release Dress Rehearsal, End to End Workflow Testing, Collecting, Translating, and Isolating Customer Issues, Actionable Analytics, Security Bug Bounty Program, More Automation, and Customer Visits.


Katrina Clokie’s book “A Practical Guide to Testing in DevOps” contains great insights about this subject: see https://leanpub.com/testingindevops


…And “Testing in Production - Quality Software Faster” by Michael Bryzek: https://www.infoq.com/presentations/testing-software-production


A big +1 for Chaos Monkey. I discovered a bug in the load balancers at Rackspace (our hosting provider at the time) using it, and we ended up switching to another provider because it was so serious.

You already linked to a blog post by Charity Majors, but she gave a great presentation at Strange Loop this past year:

She’s a pretty entertaining speaker, and has an interesting focus on observability over metrics and dashboards. She also stresses that she’s not saying to replace testing in lower environments, just that at large scale, you can’t test for the situations you’re likely to encounter in prod.


This bliki entry on Synthetic Monitoring from Martin Fowler’s blog is also great :smiley:


I remember when I worked for a credit card company, and the sheer joy of “testing in production”, because it meant going around shops with several of our cards, looking at their payment terminal, and going “what’s the cheapest thing in the shop … I WILL BUY IT”.

For me, a huge part of it has always been monitoring - looking at what our users are doing, and understanding the scope of what they do. We use Piwik to help with this. We’ll perform biscuit factory testing (random sampling) to make sure things are as we expect, as well as build up general models of what people do.
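To make the random sampling idea concrete, here is a minimal sketch in Python. The `sample_for_review` helper and the shape of the `events` records are assumptions for illustration only; real data would come from whatever your analytics tool exports.

```python
import random


def sample_for_review(records, k, seed=None):
    """Pick up to k records uniformly at random for a manual spot check."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    records = list(records)
    return rng.sample(records, min(k, len(records)))


# Hypothetical production events (made up for this example).
events = [{"id": i, "page": f"/page/{i % 5}"} for i in range(100)]
picked = sample_for_review(events, k=10, seed=42)
```

Each sampled record is then checked by hand against what you expect to see, much like pulling biscuits off the line.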

But we’d also develop ways and rules about doing our own thing in production. As you can imagine, when spending with credit cards, I’d come back with a lot of receipts which needed to be filed, and we also had a way of monitoring “cards for testing” for misuse.

BTW - best damn test I’ve ever done in production. I created credit cards for an 18-year-old and a 17-year-old. One would be able to purchase alcohol; the other would not. Sadly I was not allowed to keep the purchased alcohol.

For a client we built a “beta testers program” where users could sign up to access new features before they became generally available. They were also offered the option to be tracked during their visit to the web application, as a way to improve the service and discover flaws.

With only 8% of all users signing up for the program, we had detailed end-to-end flows through the whole web application, giving us repeatable usage data that we could use in our automated test tools (like Selenium). Issues that these users generated (404 and 500 errors) were added to the priority issue tracker with full details about which pages they visited, which button or image they clicked, and which components were affected.
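As a rough illustration of that last step, the sketch below turns one tracked session into issue reports. It is not the actual implementation from that project; the event fields (`page`, `clicked`, `status`) and the `error_reports` helper are hypothetical.

```python
def error_reports(session_events):
    """Build one report per 404/500 response, including the trail that led to it."""
    reports = []
    trail = []
    for event in session_events:
        # Remember every page visited and what was clicked there.
        trail.append((event["page"], event.get("clicked")))
        if event["status"] in (404, 500):
            reports.append({
                "status": event["status"],
                "page": event["page"],
                "trail": list(trail),  # copy, so later events do not alter it
            })
    return reports


# Hypothetical tracked session of an opted-in beta user.
session = [
    {"page": "/home", "clicked": "login-button", "status": 200},
    {"page": "/account", "clicked": "export-link", "status": 200},
    {"page": "/export", "clicked": None, "status": 500},
]
reports = error_reports(session)
```

The same trail could, in principle, be replayed by an automated browser test to reproduce the failure.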

We discovered many small but high-impact issues that we would normally not find in our pre-production tests. We also learned that users don’t use an application the way developers, testers, business owners or project managers think they will. This kind of insight into user behaviour on the production web application taught us many things and made the overall application better for the end user.

Netflix’s Chaos Monkey was also mentioned in this thread. We consider that to be part of our resilience tests, which find flaws in our design and dependencies when parts of our application architecture are unavailable or only partially available. It’s a necessary step to ensure that we fail safely, or that we have mechanisms in place for when a dependency is not available. This has taught us to implement heartbeats and deferred execution solutions in our application architecture.
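For readers who have not met these two patterns, here is a minimal sketch of how a heartbeat check and deferred execution can fit together. It is an illustration under assumptions, not the architecture described above: the `Dependency` and `DeferredExecutor` classes and the timeout value are made up for the example.

```python
import time
from collections import deque


class Dependency:
    """Tracks the last heartbeat received from a dependency (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = None

    def beat(self):
        """Record that the dependency just reported in."""
        self.last_beat = self.clock()

    def is_alive(self):
        """Alive means a heartbeat arrived within the last timeout_s seconds."""
        if self.last_beat is None:
            return False
        return self.clock() - self.last_beat <= self.timeout_s


class DeferredExecutor:
    """Runs work immediately when the dependency is alive, otherwise queues it."""

    def __init__(self, dependency):
        self.dependency = dependency
        self.pending = deque()

    def submit(self, fn, *args):
        if self.dependency.is_alive():
            return fn(*args)
        self.pending.append((fn, args))  # fail safely: keep the work for later
        return None

    def flush(self):
        """Run queued work in order while the dependency is alive; return the results."""
        results = []
        while self.pending and self.dependency.is_alive():
            fn, args = self.pending.popleft()
            results.append(fn(*args))
        return results
```

A chaos experiment would then kill the dependency on purpose and check that work piles up in `pending` instead of being lost, and that `flush()` drains it once heartbeats resume.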

One important piece of advice: when you want to implement such a “beta program”, especially with the upcoming GDPR and other privacy regulations, make sure you have explicit approval from the user, as you will gather a lot of information, often sensitive PII. And have people re-confirm their interest every 6 months or every year, as most people forget they signed up for the program.
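The re-confirmation rule can be enforced with a simple check before any tracking happens. The sketch below assumes a six-month window (182 days) and a stored consent timestamp; the `tracking_allowed` function and `RECONFIRM_AFTER` constant are hypothetical names, and this is not legal advice on what GDPR requires.

```python
from datetime import datetime, timedelta, timezone

# Assumption: consent must be re-confirmed after roughly six months.
RECONFIRM_AFTER = timedelta(days=182)


def tracking_allowed(consented_at, now=None):
    """Return True only if the user opted in and the consent is still fresh."""
    if consented_at is None:
        return False  # never opted in, so never track
    now = now or datetime.now(timezone.utc)
    return now - consented_at <= RECONFIRM_AFTER
```

When this returns False for a user who did opt in earlier, the application would ask them to confirm again rather than silently continuing to track.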


We at Springer Nature test a lot in production. We tried to write something up. It's pretty high level. Let us know if it's useful :slight_smile:
