Chaos Engineering (Simon Aronsson)

On the 12th of November, Simon Aronsson (https://twitter.com/0x12b) presented a talk on Chaos Engineering at the MoT Cork and Edinburgh Meetup.

These are the questions we didn’t get to answer at the Meetup:

  1. Is there a checklist / toolkit that can be applied as a PoC? Getting Netflix’s Chaos Monkey integrated into an existing product could be a hard sell.
  2. Can/does the chaos you inject vary by tech stack?
  3. What’s the difference between Netflix’s Chaos Monkey/Gorilla and this toolkit? Are there different toolkits for different appetites/tech?
  4. How do you balance your tests between giving yourself an easy time (a box-ticking exercise) and being overly zealous?
  5. Is CE a new name for OAT?
  6. What level of access do you need to configure/execute these tools?
  1. Is there a checklist / toolkit that can be applied as a PoC? Getting Netflix’s Chaos Monkey integrated into an existing product could be a hard sell.
  3. What’s the difference between Netflix’s Chaos Monkey/Gorilla and this toolkit? Are there different toolkits for different appetites/tech?

I think that depends a lot on what your setup is and what you’re looking to build confidence about. To me, it’s less about the tools and more about the approach you take to building confidence about the reliability of your systems. I know we discussed this at the meetup, but doing manual chaos engineering, or disaster recovery simulations, is a fully viable option as well.

That said, I would probably avoid Chaos Monkey initially, as it’s quite an undertaking to get up and running, mainly because it also requires Spinnaker.

are all viable alternatives with a less steep onboarding threshold. Out of the three, chaos-toolkit is probably the most viable framework for teams that don’t use Kubernetes.

If you’d rather get the hang of the idea by doing labs on a tool that needs no setup, Gremlin provides a great getting-started experience. However, they lack an OSS offering.

  2. Can/does the chaos you inject vary by tech stack?

Yes! In the demo I showed how to test the degradation of web servers when one of the replicas crashes unexpectedly. This is a common example as it’s easy to grasp, and the consequences are more obvious than for other scenarios.

The key here is that the chaos you inject should help you build confidence in the reliability of some part of your systems.
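To make the replica-crash idea concrete, here is a minimal in-memory sketch of the usual experiment shape: check the steady state, inject the fault, check again, roll back. It is not the demo from the talk and does not use any real chaos tool; `Replica`, `LoadBalancer`, `run_experiment`, and the 99% threshold are all hypothetical names and values chosen for illustration.

```python
import itertools


class Replica:
    """Stand-in for one web server replica (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    """Round-robin balancer that falls through to the next replica on failure."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._cycle = itertools.cycle(replicas)

    def handle(self, request):
        for _ in range(len(self.replicas)):
            replica = next(self._cycle)
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise ConnectionError("no healthy replicas")


def success_rate(balancer, n=100):
    """Fraction of n requests that the balancer manages to serve."""
    ok = 0
    for i in range(n):
        try:
            balancer.handle(f"req-{i}")
            ok += 1
        except ConnectionError:
            pass
    return ok / n


def run_experiment(balancer, victim, threshold=0.99):
    """Verify steady state, crash one replica, re-verify, then roll back."""
    before = success_rate(balancer)
    if before < threshold:
        # Never experiment on a system that is already unhealthy.
        return {"status": "aborted", "before": before}
    victim.alive = False  # inject the fault
    try:
        during = success_rate(balancer)
    finally:
        victim.alive = True  # rollback, even if the check itself blows up
    status = "passed" if during >= threshold else "failed"
    return {"status": status, "before": before, "during": during}


replicas = [Replica(f"web-{i}") for i in range(3)]
result = run_experiment(LoadBalancer(replicas), victim=replicas[0])
print(result)
```

The hypothesis here is "losing one of three replicas does not push the success rate below the threshold". A real experiment would measure that against real traffic rather than a loop of fake requests.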

If you rely heavily on messaging queues, for instance: what would happen if the queue suddenly saw a spike in depth? Or if a network partition separated your MQ nodes?
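Before running the queue-spike question against a real broker, it can help to write the hypothesis down as a toy model. The sketch below is only that: a tick-based simulation with a bounded queue, where `simulate_queue`, the arrival numbers, the consume rate, and the depth limit are all made-up assumptions rather than the behaviour of any particular MQ product.

```python
def simulate_queue(arrivals, consume_rate, max_depth):
    """Tick-based model: messages arrive, overflow is dropped, consumers drain."""
    depth = 0
    peak = 0
    dropped = 0
    for arriving in arrivals:
        depth += arriving
        if depth > max_depth:
            # A bounded queue rejects whatever does not fit.
            dropped += depth - max_depth
            depth = max_depth
        peak = max(peak, depth)
        depth = max(0, depth - consume_rate)
    return {"peak": peak, "dropped": dropped, "final_depth": depth}


steady_traffic = [10] * 20
spiky_traffic = [10] * 5 + [200] + [10] * 14  # one sudden burst at tick 5

normal = simulate_queue(steady_traffic, consume_rate=15, max_depth=100)
spiked = simulate_queue(spiky_traffic, consume_rate=15, max_depth=100)

print("steady:", normal)
print("spike: ", spiked)
```

In this model the spike causes dropped messages and leaves a backlog that is still draining at the end of the run. Those are exactly the two things you would want to observe, and have alerts for, in the real system.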

If you rely heavily on relational databases: what would happen if a deadlock occurred in one of your database tables? Would it self-heal? Would the consumers just wait indefinitely, or until the server itself chooses a deadlock victim and kills the query?
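The "wait indefinitely or pick a victim" question can be illustrated with an in-process analogy. The sketch below uses plain thread locks, not a real database, and the victim is chosen by a client-side lock timeout rather than by a server-side deadlock detector. The `worker` function and the timeout values are assumptions made for the example.

```python
import threading
import time


def worker(name, first, second, timeout, barrier, log):
    """Take two locks in order; on timeout, release everything and retry."""
    attempt = 0
    while True:
        attempt += 1
        with first:
            if attempt == 1:
                # Make sure both workers hold their first lock before going
                # for the second one, so the deadlock is guaranteed.
                barrier.wait()
            if second.acquire(timeout=timeout):
                try:
                    log.append((name, attempt))
                    return
                finally:
                    second.release()
        # Timed out: this worker is the "victim". Its first lock was released
        # when the with-block exited; back off briefly before retrying.
        time.sleep(0.05)


lock_a, lock_b = threading.Lock(), threading.Lock()
barrier = threading.Barrier(2)
log = []

threads = [
    # A gives up quickly; B is patient. They take the locks in opposite order.
    threading.Thread(target=worker, args=("A", lock_a, lock_b, 0.1, barrier, log)),
    threading.Thread(target=worker, args=("B", lock_b, lock_a, 5.0, barrier, log)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(log)  # (worker, attempt on which it succeeded)
```

Without the timeout both workers would wait forever. With it, one backs off, the other completes, and the victim succeeds on a later attempt. The experiment on a real system is to find out which of those behaviours your consumers actually have.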

This is one of the reasons I stress that planning is key. What are the components that make up your system? How can they break? Working with operations or DevOps and analysing past incident reports or post-mortems can be a good source of inspiration for this.

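  4. How do you balance your tests between giving yourself an easy time (a box-ticking exercise) and being overly zealous?

I would not necessarily recommend looking at it from that angle. It’s a practice that takes quite some time to execute and involves a lot of resources.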

Because of this, I would opt for quality over quantity. Running a few experiments that each actually add to your confidence in your systems is, to me, far more valuable than executing hundreds of experiments just for the sake of doing it.

  5. Is CE a new name for OAT?

To some degree, the practices overlap, especially while learning. However, the end goal of chaos engineering is to experiment on real production resources that your users would use, rather than to serve as a pre-release assessment.

We want to use chaos engineering to build confidence in the actual systems, rather than in a staging or pre-production environment. That said, starting out in a staging environment, and combining chaos experiments with tests there, definitely makes sense as well. Many teams consider a successful execution of the experiment in staging a prerequisite to attempting it in production.

  6. What level of access do you need to configure/execute these tools?

The boring, but realistic, answer is that it depends. 😅

Is your intention to kill pods in a Kubernetes cluster? Then you need enough access to be able to do that. Introduce latency in the network? Then you might need to be able to deploy Toxiproxy and redirect pods to use it. As a generic answer: you will most likely need fairly high permissions.
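For the pod-killing case, "enough access" can be written down explicitly. The sketch below builds a narrowly scoped Kubernetes RBAC `Role` manifest as a Python dict and prints it as JSON. The role name `chaos-pod-killer` and the namespace `chaos-sandbox` are hypothetical, and whether these three verbs are sufficient depends on the tool you use, so treat it as a starting point rather than a recipe.

```python
import json


def pod_killer_role(namespace, name="chaos-pod-killer"):
    """Build a namespaced Role that can find and delete pods, nothing else."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group, where pods live
                "resources": ["pods"],
                "verbs": ["get", "list", "delete"],
            }
        ],
    }


# "chaos-sandbox" is a made-up namespace for illustration.
role = pod_killer_role("chaos-sandbox")
print(json.dumps(role, indent=2))
```

The role still has to be bound to whatever identity runs the experiment. Network-level faults, such as the Toxiproxy case, typically need broader rights than this, for example permission to deploy workloads and change how pods are routed.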