Has anyone used AWS Fault Injection Simulator?

Curious if anyone has experience with this tool https://aws.amazon.com/fis/ for triggering faults in AWS. In particular I’m interested in:

  • General reviews of its usefulness, pros and cons
  • Have you tried writing custom policies with AWS Systems Manager in conjunction with FIS?
  • How observable, controllable and maintainable has FIS been?

The context I’m interested in using it for is to try to encourage chaos engineering: to automate OAT scenarios and ensure they occur on a regular basis, so that engineers are encouraged to build resilience in. But that requires a high degree of observability (to measure and identify what chaos we’re causing) and control (to allow for responsible use and a degree of trust that we are not carelessly damaging or slowing ourselves down).
The reason I’m curious about Systems Manager is that I’m interested in scenarios such as blocking specific URLs or IP ranges across an AWS account - rather than having to modify inbound or outbound rules on specific hosts or services - to simulate the loss of potential dependencies.
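Something along these lines is what I have in mind - only a rough boto3 sketch, where the role, alarm, tag and SSM document ARNs are placeholders I’ve made up, and the custom SSM document that actually blocks the IP range would still need to be written and tested:

```python
import boto3

fis = boto3.client("fis")

# Sketch of an FIS experiment template whose single action hands off to a
# custom Systems Manager document (placeholder name "BlockIpRange") via Run Command.
response = fis.create_experiment_template(
    clientToken="block-dependency-demo-001",
    description="Simulate loss of an external dependency by blocking its IP range",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    # Guardrail: stop the experiment automatically if this CloudWatch alarm fires.
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:ServiceErrorRate",  # placeholder
        }
    ],
    # Only instances carrying an explicit opt-in tag are eligible, and only a
    # quarter of them get picked, to keep the blast radius small.
    targets={
        "AppInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-opt-in": "true"},
            "selectionMode": "PERCENT(25)",
        }
    },
    actions={
        "block-dependency": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:eu-west-1:123456789012:document/BlockIpRange",  # placeholder custom document
                "documentParameters": '{"cidr": "203.0.113.0/24"}',
                "duration": "PT10M",
            },
            "targets": {"Instances": "AppInstances"},
        }
    },
    tags={"purpose": "chaos-engineering"},
)
print(response["experimentTemplate"]["id"])
```

The appeal is that a template like this could be versioned and run on a schedule, so the OAT scenarios happen regularly rather than as one-off exercises.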

Not with AWS FIS, but it’s the same principle as the rest of Chaos Engineering: for example Chaos Monkey, ChaosHub, ChaosMesh, Gremlin, … allow you to “kill” or put pressure at random on your environments (for example pods, microservices, etc.). You don’t really need to use AWS FIS, since it’s more limited I believe. But if you use AWS, why not…

The biggest myth here is that Chaos Engineering HAS to be done in production.
You can totally do this in a test environment or a pre-prod environment.

The high degree of observability can be achieved with good alerting and dashboarding. Something we sucked at but slowly built up when using Chaos Monkey.
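To make that concrete, this is roughly the kind of guardrail alarm I mean (the namespace, metric and threshold are made up for illustration); with FIS specifically you could also reference the alarm ARN as a stop condition so experiments halt themselves when it fires:

```python
import boto3

cw = boto3.client("cloudwatch")

# Illustrative guardrail alarm on a made-up custom error-rate metric.
cw.put_metric_alarm(
    AlarmName="ServiceErrorRate",       # placeholder name
    Namespace="MyApp",                  # placeholder custom namespace
    MetricName="5xxErrorRate",          # placeholder metric
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=5.0,                      # e.g. alarm when error rate stays above 5% for 3 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Guardrail while fault-injection experiments are running",
)
```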

This is the shift-right that people are talking about: you’ll get way better logging and monitoring from using chaos engineering. So it’s not just destroying stuff and seeing if it automatically starts up again; there are a lot more benefits.

It’s going to take a while to get used to. You’ll whitelist things it can kill, but in the end you’ll always have to deep-dive into something.

But yeah, I haven’t personally used AWS FIS, but the concept is the same and it’s a really nice learning experience. We can go down and I’m pretty confident our stuff will automatically come back up again within X time. It’s like an extra quality aspect that has been unlocked in your SDLC.

So we can’t easily use Chaos Monkey in this context, but even if we could, I’d still like to review FIS as another tool in the toolbox.
My intention is to start away from Production to begin with, but the goal personally is Production, because in the end it’s what we truly care about. However, it is simply a goal; I can accept making some progress towards general good practice in this area even if it’s not in Prod.

But my concern around control and observability is relevant to any environment other than Prod too - we don’t want (ironically) chaos: lots of failures all the time, breaking our path to live or the environments we depend upon for other development/testing. To me, the “chaos” is the real world, and the engineering is us taking that theoretical real-world chaos and making it happen regularly but with design, control and observability. We don’t want to target everything, because not everything has the same resilience requirements, for example.

One of the cultural challenges I want to influence is the scariness of Prod and the complete faith in change control. That is not going to be an overnight change, but it would be a big deal to convince people that we are confident enough to run experiments in Production. However, that is aspirational, and there are plenty of short-term benefits to trying to make use of tooling like this.
Introducing some regular, automated analysis of resilience alone is a huge win, even if it’s not in Production, because currently I believe much of that work is very prone to regression and inconsistent testing.
