Power Hour - Resilience Testing

I (Geoffrey van der Tas) started at Ordina on the same day as Mark Abrahams, and we have worked together ever since. Where Mark focused on the technical side, I tried to focus more on the process side, so together we have a broad view of QA. We’re here to answer all your questions on any of the following topics:

  • Resilience Testing

  • Monolith VS Microservice resiliency

  • Chaos Monkey & Gremlin: test tools for the brave

Get all your questions in by 26th June before 7pm GMT and we’ll do our best to answer them during our Power Hour!

BTW. If you want the full experience, come to TestBash Munich on Thursday, 12th September 2019 and join our workshop!

4 Likes

How often should you perform resilience testing?

2 Likes

What caveats would you say there are for people who are trying to build resilience testing into their team/product activities? Where should these people start?

1 Like

Are you aware of any stories of Chaos Testing going very wrong?

1 Like

I work for a software consulting company. Here are a couple of questions I had:

What would be the best place to start for resilience testing (especially for possible short term engagements)?

How do you pitch resilience testing to potential clients (or even your own management team)?

You mention process… can you describe how you integrated resilience testing (and the various tools) into your current process, and what headaches there were?

1 Like

What is your methodology for resilience testing?

1 Like

Can you elaborate on where chaos testing stands with regards to resiliency testing?

1 Like

What test scenarios would you consider for resilience testing? Let’s say, for example, a JVM microservice app deployed to Kubernetes. What metrics would you look at?

1 Like

It depends on the risk and the context of the situation. Resilience tests are a must if you are adjusting your architecture, resiliency patterns, or API communication. But we would recommend running them for every release (so automate them), because an unintentional change might have occurred.

1 Like

I am aware of situations where teams did not test something, like adjusting a resilience pattern called the retry pattern. Because of a small amount of network latency, the retries effectively caused a DDoS on their own environment, and the entire back-end landscape went down. A lot of teams were called in to fix a situation that was caused by a minor change (which turned out to be not so minor).
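To make that failure mode concrete: a retry that fires again immediately, without a cap, multiplies the load on a dependency that is already struggling. Below is a minimal sketch of a bounded retry with exponential backoff and jitter; the callBackend() operation and all the numbers are placeholders, not the setup from the story above.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class BoundedRetry {

    // Hypothetical remote call; replace with your real client call.
    static String callBackend() throws Exception {
        throw new Exception("backend unavailable");
    }

    public static void main(String[] args) throws Exception {
        int maxAttempts = 3;                        // hard cap: never retry forever
        Duration baseDelay = Duration.ofMillis(200);

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                System.out.println("Response: " + callBackend());
                return;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    // Give up and let the caller degrade gracefully
                    // instead of hammering the backend further.
                    System.err.println("Giving up after " + attempt + " attempts");
                    throw e;
                }
                // Exponential backoff with jitter spreads retries out,
                // so a blip in latency does not turn into a self-inflicted DDoS.
                long backoffMs = baseDelay.toMillis() * (1L << (attempt - 1));
                long jitterMs = ThreadLocalRandom.current().nextLong(0, backoffMs + 1);
                Thread.sleep(backoffMs + jitterMs);
            }
        }
    }
}
```

A resilience test for this kind of change would deliberately add latency or errors in front of the dependency and check that the number of outgoing requests stays bounded instead of exploding.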

Netflix is the company furthest ahead with this and even does it in production. In the beginning it caused some issues and pain for them, but through clever engineering they are now at a level where it doesn’t cause them any issues.

I remember that at one of my customers we tried something and discovered that the VM layer we used was linked to production systems, so by introducing resilience testing we caused a performance issue in production for some other application.

So there are a lot of areas and situations where you can really break stuff.

1 Like

Hi Heather,

I would say start with our workshop and get inspired. After that, it is good to hold a brainstorming session with the team to create resilience test scenarios that apply to your application.
These scenarios can then first be explored and executed manually to get acquainted with this form of testing. Once you get the hang of it, you should see what you can automate and build into your pipeline, or maybe run constantly. As for the caveats: there are a lot of examples of companies doing their resilience tests in production. I believe this can be a good practice, but I would not recommend starting in production. Start in an earlier stage to gather knowledge of how your application behaves under these circumstances. Only once you feel confident enough should you try to execute some scenarios in production. But make sure everyone is informed and people are ready to jump in, in case something is found.
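To illustrate the "automate and build into your pipeline" step, here is a sketch of one scenario written as a plain JUnit 5 test. It assumes the pipeline has already started the application and stopped one of its dependencies before the test runs; the http://localhost:8080/orders endpoint, the two-second budget, and the fallback marker in the body are made-up examples, not part of our workshop material.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

import org.junit.jupiter.api.Test;

class DependencyDownScenarioTest {

    private final HttpClient client = HttpClient.newHttpClient();

    @Test
    void respondsWithFallbackWhileDependencyIsDown() throws Exception {
        // Precondition (arranged by the pipeline): the downstream
        // service this endpoint depends on has been stopped.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/orders"))
                .timeout(Duration.ofSeconds(2))   // the scenario includes a response-time budget
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // The application should degrade gracefully: still answer,
        // within budget, with a usable (cached or default) response.
        assertEquals(200, response.statusCode());
        assertTrue(elapsedMs < 2000, "fallback took too long: " + elapsedMs + " ms");
        assertTrue(response.body().contains("fallback"), "expected a fallback response body");
    }
}
```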

1 Like

Chaos testing is not a term we prefer to use. Chaos Engineering, or Chaos Test Engineering, is the term Netflix uses. Because you are testing resiliency, I prefer to call it Resilience Testing.

But by creating chaos you can test the resiliency of your application, so in that sense the two are connected.

You can call it Chaos Testing; I just think Resilience Testing is a bit broader as a term. As long as your tests are structured and well thought out, you can use either name.

This is a complex question. Context is really part of how you would approach something like this.

In a way we use elements of exploratory testing: you are always trying to find out new things about your software, learn from it and improve it, and make sure no nasty surprises end up in production.

Besides that, every resiliency pattern or improvement to your infrastructure comes from a (new) requirement. It could be as simple as adding some examples to your requirements, thereby clarifying the requirement and adding a test, so taking a BDD approach.
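As an example of what "adding examples to your requirements" could look like in practice, here is a sketch using Cucumber for Java. The scenario wording, the endpoint, and the "default-recommendations" marker are all invented for illustration; the Given-precondition is arranged outside the test.

```java
// Assumed requirement example, written as a Cucumber scenario in a
// src/test/resources .feature file (wording invented for illustration):
//
//   Scenario: Recommendations service is down
//     Given the recommendations service does not respond
//     When a customer opens the product page
//     Then the page still loads within 2 seconds
//     And a default set of recommendations is shown

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

public class ResilienceRequirementSteps {

    private final HttpClient client = HttpClient.newHttpClient();
    private long loadTimeMs;
    private String pageBody;

    @Given("the recommendations service does not respond")
    public void recommendationsServiceIsDown() {
        // Precondition arranged before the scenario runs, for example by
        // scaling the recommendations deployment to zero in the test environment.
    }

    @When("a customer opens the product page")
    public void openProductPage() throws Exception {
        long start = System.nanoTime();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/products/42")).GET().build();
        pageBody = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        loadTimeMs = (System.nanoTime() - start) / 1_000_000;
    }

    @Then("the page still loads within {int} seconds")
    public void pageLoadsInTime(int seconds) {
        assertTrue(loadTimeMs < seconds * 1000L, "page took " + loadTimeMs + " ms");
    }

    @Then("a default set of recommendations is shown")
    public void defaultRecommendationsShown() {
        assertTrue(pageBody.contains("default-recommendations"));
    }
}
```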

This is the way we approach it, rather than one specific methodology, because there could be more… Is this answering your question of how we approach it?

Hi Jarek,

In a microservice architecture, it is important that one microservice failing does not affect the rest of your application. Assuming you work with replicas in your Kubernetes cluster, examples of scenarios that would be important are:

  • “In case one pod fails, will new traffic be routed to the healthy pods (and what happens to currently processed requests?)”

  • “In case one pod is very slow, will traffic be routed to other pods until speed becomes decent again, or will it slow down the whole chain?”

  • “In case one pod responds very fast but with useless data (for example, very quickly reporting that it could not access the database), will traffic be routed to the slower pod that is responding with useful data?”

  • etc.

I think many scenarios can be thought of depending on what the purpose of your application is.
From what issues should it recover and how? I think that would be the main question.
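A rough sketch of how the first scenario above could be made measurable (our own assumption, not workshop material): keep a steady trickle of requests going to the service while you delete one pod, for example with kubectl delete pod, and count how many requests fail or slow down. The URL and thresholds are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PodFailureProbe {

    public static void main(String[] args) throws Exception {
        // Placeholder: the externally reachable URL of the service under test.
        URI target = URI.create("http://my-service.example.test/health");

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1))
                .build();
        HttpRequest request = HttpRequest.newBuilder(target)
                .timeout(Duration.ofSeconds(1))
                .GET()
                .build();

        int total = 0, failed = 0, slow = 0;
        long end = System.currentTimeMillis() + 60_000; // probe for one minute

        // While this runs, delete one pod (e.g. `kubectl delete pod <name>`)
        // and watch whether the error rate stays at zero.
        while (System.currentTimeMillis() < end) {
            total++;
            long start = System.nanoTime();
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() >= 500) failed++;
            } catch (Exception e) {
                failed++; // timeouts and connection errors count as failures
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 500) slow++;

            Thread.sleep(100); // roughly 10 requests per second
        }

        System.out.printf("requests=%d failed=%d slow(>500ms)=%d%n", total, failed, slow);
    }
}
```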

1 Like

In the workshop we start simple by introducing some infrastructure problems into an environment where our application is running, to see how it reacts. I would suggest starting with something as small and simple as that, combined with a performance or stress test.

Look at the examples in Richard’s question. I guess it’s always a question of how big the risk is. Sometimes by adding some resilience tests you can save on downtime or customer loss, or even prevent a bankruptcy if your data gets corrupted. If the risk is high, test it!

I worked with a team that implemented Chaos Monkey on acceptance. The trouble we had was that one dedicated team was implementing this for everyone, instead of each team doing it themselves. Your team should want to do it; then you get the result you want. Making teams aware is the main goal: give them clear requirements, offer a simple way to start (like you can learn at our workshop), and I think teams will start doing it where it is applicable. Even more so if you can make it simple.

The team changed from running resilience tests and forcing them on other teams to helping those teams become resilient and making them aware. This made them far more successful.

1 Like

Thank you for all your questions! If you have more questions, keep posting them and we will keep answering. But if you really want to get all the answers, come to our workshop :slight_smile: It is great fun!

Cheers, Geoffrey and Mark!

2 Likes