Live Blog: Testing in Production: Antipattern or Future? – Lukasz Raslonek

Welcome Lukasz! It's his first time in Munich.

His talk is about something that's still quite controversial… testing in production is still seen as an antipattern. He's going to take us through the whys and then the hows. I feel good about where this is going :slight_smile:

Oh I've just realised his name has a special character in it that I can't find quickly. Sorry Lukasz.

Testing in production still conjures up the image of a cowboy engineer. Someone who only checks whether anything works once it's released. That's not what Lukasz means. Obviously we have to test beforehand as well. So that's cleared up.

Let's look at why we'd want to, or need to, test in production. Lukasz's first reason is distributed systems. This is the way of the world now. To put it in my own words: there's stuff everywhere and it's talking to everything. Creating a test system for that is difficult and expensive. And if you look at the architecture diagrams for Amazon and Netflix, they look like an image of a virus, or a giant ball of wool, where every point is a microservice that might be on a completely different system. The more complex and distributed your architecture is, the harder it is to create and maintain your test environment. And in that kind of situation, you might not be able to trust your test results any more.

An example: if team A wants to push something – which versions of which teams does it need to be tested against? We have multiple branches, multiple teams – and it gets more and more complex. You can use contract-driven development and testing, but at some point even that can't help with the complexity. At some point, the only true versions are in the production environment.
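
An aside from me, to make the contract idea concrete: Lukasz didn't show code, but a consumer-driven contract check could be sketched roughly like this. The pytest-style approach, the service URL, and the payload shape are all my own illustrative assumptions, not from the talk.

```python
# A minimal consumer-driven contract check, sketched with requests
# and jsonschema. The URL and payload shape are hypothetical.
import requests
from jsonschema import validate

# The contract: what team A's consumer expects from team B's provider.
ORDER_CONTRACT = {
    "type": "object",
    "required": ["orderId", "status"],
    "properties": {
        "orderId": {"type": "string"},
        "status": {"type": "string", "enum": ["OPEN", "SHIPPED", "CANCELLED"]},
    },
}

def test_provider_honours_order_contract():
    # Run against whichever provider version is actually deployed.
    response = requests.get("https://orders.example.com/orders/42", timeout=5)
    assert response.status_code == 200
    validate(instance=response.json(), schema=ORDER_CONTRACT)  # raises on mismatch
```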

When it comes to tests, we might have plenty of scenarios. That feels good. But – these are only scenarios that we have thought of. I'm gonna use my own caps again for this: WE CANNOT KNOW IN ADVANCE ALL THE SCENARIOS THAT WILL HAPPEN IN PRODUCTION. Unknown unknowns, people!

Lukasz's example for this is the driving theory exam. There are a lot of scenarios we go through. And yet, when we drive, other things happen. Unpredictable things happen (and, as an aside from me, this is a good metaphor, because even experienced drivers end up being surprised).

The final reason for testing in production is real user feedback. We need to know whether what weā€™ve delivered is valuable for our users.

Now that we know why, Lukasz is going to tell us how…

The first approach is canary testing. (This sounds very animal friendly…!) When we deploy a new version, we keep the stable version in production too and only divert a small amount of traffic to the new version. We minimise the risk of problems being wide-reaching this way. As an example, Facebook uses New Zealand as a canary testing location. The much-debated dislike button was actually implemented and pushed to New Zealand. Once the logs and metrics had been analysed, they decided not to keep the feature.
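
An aside from me: the routing for a canary usually lives in a load balancer or service mesh, but a minimal sketch of sticky, weighted canary assignment could look like the following. The 5% weight and the keying on user id are my own assumptions; Facebook's New Zealand example would key on region instead.

```python
# A minimal sketch of weighted canary routing: send a small, sticky
# fraction of users to the new version.
import hashlib

CANARY_WEIGHT = 0.05  # divert ~5% of traffic to the canary

def route_version(user_id: str) -> str:
    """Deterministically map a user to 'canary' or 'stable'.

    Hashing the user id keeps the assignment sticky, so the same user
    doesn't flip between versions on every request.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "canary" if bucket < CANARY_WEIGHT else "stable"
```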

Another example is the EA example from before – it's the Sim City story with the discount banner. The percentages are different, but the story is the same: it let the development team discover the surprising result that the discount banner was less successful.

The next aspect is chaos engineering. This lets us test the high availability of the production environment by deliberately introducing failures. Netflix is a proponent of this and has even released tools for it. The Netflix infrastructure runs in the cloud. Their tool, Chaos Monkey, shuts down or reboots random production servers. This is pretty darn crazy, but it does mean that teams are encouraged to make sure their services are resilient.
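
To give a feel for the idea – this is not Netflix's actual tool, just my own minimal sketch with boto3, and the `chaos:enabled` opt-in tag is a hypothetical safety convention:

```python
# A Chaos-Monkey-style sketch: terminate one random, explicitly
# opted-in EC2 instance. Region and tag are illustrative assumptions.
import random
import boto3

def terminate_random_instance(region: str = "eu-central-1") -> None:
    ec2 = boto3.client("ec2", region_name=region)
    # Only consider running instances that have opted in to chaos testing.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:chaos:enabled", "Values": ["true"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        print(f"Chaos: terminated {victim}")
```

The opt-in tag is the important design choice here: teams explicitly accept that their instances can be killed, which is exactly what pushes them towards resilient services.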

As if that weren't enough, they also have Chaos Kong. This tool simulates the unavailability of a whole data centre. That's an extremely rare occurrence, but it did happen once: on Christmas Eve in December 2012, a data centre went down, with large consequences. Chaos Kong checks whether the traffic is correctly rerouted to the next data centre.

Exercises like this are important because they provoke the kind of problems that can only happen in the wild.

The third topic is test automation. This is usually something we associate entirely with the pre-production environment. But it also makes sense to execute a smoke test suite in the production environment, to ensure that the core features are still operational. On top of that, synthetic monitoring is a way of generating the metrics we need from automated tests. A concrete example from his team was to execute WebDriver scenarios and REST API assertions and collect the results as metrics. They used it to check whether the login system was always working.
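
Here's my rough sketch of such a synthetic login check; the endpoint, the test account, and the metric plumbing are hypothetical stand-ins, not what Lukasz's team actually used.

```python
# A minimal production smoke check that doubles as synthetic
# monitoring: probe the login endpoint and emit a pass/fail metric.
import time
import requests

def check_login() -> bool:
    """Synthetic login: a known test account signs in via the REST API."""
    response = requests.post(
        "https://app.example.com/api/login",
        json={"user": "synthetic-monitor", "password": "***"},
        timeout=10,
    )
    return response.status_code == 200 and "token" in response.json()

def record_metric(name: str, value: int) -> None:
    # Stand-in for pushing to a real metrics backend (Graphite, Prometheus, ...).
    print(f"{name} {value} {int(time.time())}")

if __name__ == "__main__":
    record_metric("synthetic.login.success", 1 if check_login() else 0)
```

Run on a schedule (cron, CI, or a monitoring agent), this produces the synthetic metric the team can alert on.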

Lukasz's summary is that testing in production does not mean you don't have a test stage. It means that you are also testing in production. It gives us the best way of getting real user feedback. We need to understand that distributed systems mean distributed architecture. It's too risky not to test. And you can only test the production architecture in the production environment.

Finally, he says that production is always unexpected. We can't predict everything, which is why we need testing and monitoring in production.

A small addition and question on that:
I remember reading somewhere that Amazon(?) had a problem with its recommendation algorithm because automated checks of the production system kept putting one specific book into the shopping cart of test users, making it seem very popular.
Did Lukasz mention anything about testing in production affecting the production system - and how to prevent this from happening?

He didn't mention it, I don't think… At least, I can't remember :slight_smile:
