30 Days of DevOps Day 8: Production Alerts

On Day 8 of 30 Days of DevOps we’re asked to:

Find out how your team/company are alerted for production problems, and how they respond. Do they have a runbook or have “game days” to practice responding to prod failures?

What did you learn about your alert process?

The front line for dealing with prod outages at our company is customer support, and the people there are the most knowledgeable in the whole company about logging, monitoring and observability. People on the R&D side join in to help, so it’s collaborative. They don’t really use runbooks; they rely on their own exploratory and problem-solving skills. They don’t have “game days” to practice. The top priority for R&D (with help from the ops people in support) is to improve logging and monitoring and to start building observability so they can quickly diagnose customer problems. I’m excited to see proofs of concept and different initiatives using new industry standards like OpenTelemetry and OpenTracing.

From Twitter we have