What actions do you and your team take when you find a critical bug in production?

simon_tomes · 10 October 2024 11:48

This article from Mohammad Faisal Khatri has a lot of good stuff in it: "Actions to be taken by a QA on finding a Critical Issue in Production".

It sparked various questions about the process behind what to do with critical production bugs. I’m curious to know:

What steps do you take when you discover a critical bug in production?
Who owns what and who does what?
Who is kept in the loop?
What announcements are made?
What official documentation supports the process?
What’s your experience contributing to disaster recovery processes and documents?
What did you learn from the critical bug in production? What did you do differently for the next one? (if indeed there was “a next one”)

fmachedon · 11 October 2024 11:02

I’m a Testing Expert Lead at a software development company with around 200 employees. In our company, we make a clear distinction between bugs and incidents. A bug becomes an incident when it occurs in production and has a significant impact on revenue or user experience.
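
To make the bug-versus-incident distinction concrete, here is a minimal sketch in Python. The `Report` fields and both threshold values are hypothetical, chosen only for illustration; the real criteria would be whatever a team agrees counts as "significant".

```python
from dataclasses import dataclass


@dataclass
class Report:
    """A bug report or alert under assessment (all fields are hypothetical)."""
    in_production: bool
    affected_users: int           # number of users hit by the problem
    revenue_loss_per_hour: float  # estimated loss, in a single currency


# Hypothetical thresholds for "significant" impact; tune these per product.
USER_IMPACT_THRESHOLD = 1000
REVENUE_IMPACT_THRESHOLD = 500.0


def is_incident(report: Report) -> bool:
    """Return True only for production bugs with significant user or revenue impact."""
    if not report.in_production:
        return False
    significant_users = report.affected_users >= USER_IMPACT_THRESHOLD
    significant_revenue = report.revenue_loss_per_hour >= REVENUE_IMPACT_THRESHOLD
    return significant_users or significant_revenue
```

With these assumed thresholds, a production bug affecting three users and no revenue stays an ordinary bug, while the same bug affecting five thousand users would be escalated as an incident.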

We have on-call developers and incident managers who are alerted through tools triggered by our alerting and logging systems. These first responders assess the situation, communicate with the appropriate stakeholders—such as developers from various teams (infrastructure, web, Android, iOS)—and, if necessary, form a dedicated task force to address the incident. Communication primarily takes place over Slack, with video calls when essential.

After resolving the incident, the incident manager arranges a postmortem meeting to identify ways to prevent similar incidents in the future. We emphasize a preventive approach to ensure that on-call engineers have a smooth experience, and we all take turns in the on-call rotation. Fortunately, we can disable certain functionalities in production, which usually minimizes the duration of active incidents.
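
Disabling functionality in production is often done with a feature flag used as a kill switch. The sketch below is one possible shape for that, not a description of any particular tool: the `FeatureFlags` class, the `new_checkout` flag name, and the `checkout` function are all hypothetical, and a real setup would usually read flags from a shared config service instead of memory.

```python
class FeatureFlags:
    """In-memory flag store (hypothetical); real flags would live in a config service."""

    def __init__(self, flags: dict[str, bool]) -> None:
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo cannot switch a feature on.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        # The kill switch: turn a feature off without deploying new code.
        self._flags[name] = False


def checkout(flags: FeatureFlags, cart_total: float) -> str:
    """Hypothetical feature guarded by the 'new_checkout' flag."""
    if not flags.is_enabled("new_checkout"):
        return "Checkout is temporarily unavailable."
    return f"Charged {cart_total:.2f}"


flags = FeatureFlags({"new_checkout": True})
before = checkout(flags, 10.0)   # feature is live
flags.disable("new_checkout")    # on-call engineer flips the switch
after = checkout(flags, 10.0)    # feature now degrades gracefully
```

The point of the design is that the on-call engineer can stop the damage first and leave the actual fix for working hours.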

For onboarding new engineers, we have a detailed internal document explaining the protocol for handling incidents during an on-call shift. It includes whom to contact, how to reach them, and emphasizes the importance of assessing alerts or complaints before determining if they truly constitute an incident with significant user or revenue impact.

From my four years as an Incident Manager, I’ve learned that not every production bug qualifies as an incident.
