This article from Mohammad Faisal Khatri has a lot of good stuff in it: “Actions to be taken by a QA on finding a Critical Issue in Production”
It sparked various questions about the process behind what to do with critical production bugs. I’m curious to know:
- What steps do you take when you discover a critical bug in production?
- Who owns what and who does what?
- Who is kept in the loop?
- What announcements are made?
- What official documentation supports the process?
- What’s your experience contributing to disaster recovery processes and documents?
- What did you learn from the critical bug in production? What did you do differently for the next one? (if indeed there was “a next one”)
I’m a Testing Expert Lead at a software development company with around 200 employees. In our company, we make a clear distinction between bugs and incidents. A bug becomes an incident when it occurs in production and has a significant impact on revenue or user experience.
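As a rough illustration of that bug-versus-incident distinction, here is a minimal Python sketch. It is not our actual tooling: the `BugReport` fields, the `is_incident` function, and the 5% threshold are all hypothetical stand-ins for whatever "significant impact" means in a given company.

```python
from dataclasses import dataclass

# Hypothetical threshold: the share of users affected that counts as
# "significant" user-experience impact. Every team would pick its own value.
USER_IMPACT_THRESHOLD_PCT = 5.0


@dataclass
class BugReport:
    in_production: bool        # was the bug observed in production?
    affected_users_pct: float  # hypothetical estimate, 0 to 100
    revenue_impact: bool       # hypothetical flag, e.g. checkout is broken


def is_incident(report: BugReport) -> bool:
    """Return True only for production bugs with significant impact."""
    if not report.in_production:
        # Bugs found before production are never incidents.
        return False
    significant_ux = report.affected_users_pct >= USER_IMPACT_THRESHOLD_PCT
    return report.revenue_impact or significant_ux
```

For example, a production bug that breaks payments for a handful of users would count as an incident here, while a cosmetic production bug seen by 1% of users would stay an ordinary bug.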
We have on-call developers and incident managers who are paged automatically by our alerting and logging systems. These first responders assess the situation, communicate with the appropriate stakeholders (such as developers from the infrastructure, web, Android, and iOS teams), and, if necessary, form a dedicated task force to address the incident. Communication primarily takes place over Slack, with video calls when essential.
After the incident is resolved, the incident manager arranges a postmortem meeting to identify ways to prevent similar incidents in the future. We emphasize a preventive approach so that on-call engineers have a smooth experience, and we all take turns in the on-call rotation. Fortunately, we can disable certain functionality in production, which usually shortens the duration of active incidents.
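The ability to switch functionality off in production is often implemented with feature flags. The sketch below shows the general idea only; the `FeatureFlags` class, the `render_home` function, and the `recommendations` flag name are hypothetical, and a real setup would typically read flags from a shared store or a flag service rather than from memory.

```python
class FeatureFlags:
    """In-memory flag store; a stand-in for a real flag service."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as disabled, which is the safer default.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        # The "kill switch": turn a feature off without a new deployment.
        self._flags[name] = False


def render_home(flags: FeatureFlags) -> list[str]:
    """Build the list of page sections, skipping any disabled feature."""
    sections = ["header", "catalog"]
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections
```

During an incident, an on-call engineer would call something like `flags.disable("recommendations")`, and the page would keep working without the faulty section while the fix is prepared.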
For onboarding new engineers, we have a detailed internal document explaining the protocol for handling incidents during an on-call shift. It covers whom to contact and how to reach them, and it emphasizes assessing alerts or complaints before deciding whether they truly constitute an incident with significant user or revenue impact.
From my four years as an Incident Manager, I’ve learned that not every production bug qualifies as an incident.