Outages in the wild: Cloudflare, Amazon and beyond

Cloudflare had an outage yesterday and has since published a post-mortem.

2025 feels like it’s been a year of outages.

What have you learned about them?

Have you read any useful articles or blog posts on the topic?

I’ve started creating a collection of outages in the wild to help us learn, and I’d love to add more to it.

Thanks for the link to the post-mortem, @rosie! Looks like testing the newly implemented permissions tool, which caused all the trouble, would have been a good idea! 🙂

We’re not learning from recent outages. The internet now depends on just a handful of providers (AWS, Cloudflare, etc.), yet many companies overlook the associated risk.
Investing in a multi-cloud strategy, deploying workloads across AWS, Google Cloud, and Azure, reduces reliance on any single vendor, lowers exposure to service disruptions, and strengthens our negotiating position in contracts. By diversifying our infrastructure, we protect our services and improve resilience. Multi-vendor should be the way to go.

If an incident doesn’t create new probes, we paid tuition and learnt nothing. @martin.hynie

I think we’ll be seeing more of these, especially with companies that decide to let AI write their code.

I found that Genichi Taguchi’s book “Introduction to Quality Engineering” helped me understand an earlier outage, and I think it is relevant to other outages too: Learning from CrowdStrike with Taguchi – TestAndAnalysis
