Root cause analysis field on bugs?

Hi all,

What are your feelings about a root cause analysis field on your bugs? We have a bug tracker and someone suggested adding “Root Cause Analysis” as a field to fill out while you are handling the bug (not a required field to submit a bug).

The few times I have done RCA it has been done in bulk, with me defining 5-7 categories of interest and then going through the list of bugs.
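To make the bulk approach concrete, here is a minimal Python sketch of tallying bugs against a small, fixed set of categories. The category names and bug records are hypothetical, invented purely for illustration; in practice the list would come from an export of your own tracker.

```python
from collections import Counter

# Hypothetical categories, chosen to describe areas rather than people.
CATEGORIES = {"requirements", "design", "coding", "environment", "third-party", "test-gap"}

# Hypothetical bug records, e.g. from a tracker export after manual review.
bugs = [
    {"id": "BUG-101", "category": "coding"},
    {"id": "BUG-102", "category": "requirements"},
    {"id": "BUG-103", "category": "coding"},
    {"id": "BUG-104", "category": "third-party"},
]

def tally(bug_list):
    """Count bugs per category, rejecting categories outside the agreed set."""
    unknown = [b["id"] for b in bug_list if b["category"] not in CATEGORIES]
    if unknown:
        raise ValueError(f"uncategorised bugs: {unknown}")
    return Counter(b["category"] for b in bug_list)

counts = tally(bugs)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Doing the tally in one pass, after the fact, keeps the categorisation out of the day-to-day ticket workflow.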

I’m concerned about how the field would add blame to the team dynamics. Even simple options like “missing requirement” assign blame to someone or a group of people. Is that where we want to go?

/Jesper

For starters I would name the field “blameless root cause analysis”. Your concerns are valid, but it’s still worthwhile documenting this.

Renaming the field is probably not sufficient to achieve a healthy culture though - some teams require something akin to a cultural makeover to get into this space. It’s worth researching psychological safety if you feel that your team isn’t quite there.

Amazon has done some good stuff on blameless root cause analysis. It’s more focused on ops but still highly relevant: e.g. https://oliver-dev.s3.amazonaws.com/2020/04/23/16/49/40/187/Blameless%20Post%20Mortems%20-%20Code%20&%20Create%20S1E1%20Orbis%20Connect.pdf

I am 100% with @hitchdev .

Hey, welcome to the MOT club @hitchdev. Good research there Colm. It pretty much backs up my engineering experience. I was a coder for many years before “Root Cause Analysis” (RCA) even became a useful or actionable thing. I have subsequently worked in so many places where actions coming out of an RCA, you guessed it, never get actioned. If a defect requires an action, people need to feel empowered to take that action immediately. If they have to wait for an RCA to happen, it’s probably too late. I’ve yet to see any meaningful RCA actions that cannot be prioritized in the monthly retro meeting if the team is functioning well.

I would push the entire RCA process, if we really want it, back onto someone who cares about it instead of burdening the team with it by having it take up “brain-space” on every single ticket. For me the RCA is not about blame, but more about the bike-shedding that the RCA creates because it points to people having to get group agreement on who takes the remediating actions.

Thanks @conrad.connected !

I agree that it’s very important to make it non-mandatory and maybe push this task on to a single person. The bureaucratic overhead of creating tickets often inhibits the reporting of bugs and the slow accretion of fields like this can be part of that weight.

Thank you both for confirming my hunches on this! Sure, a bit of data mining on the causes of bugs is interesting - but it’s not worth the overhead of doing this for every bug cycle (in a broad sense). Oh, and blameless post mortems - very good topic!

I once worked in a team that did root cause analysis on all bugs that were fixed. Over time we learned a lot and substantially reduced the number of bugs that needed to be fixed, so yes do root cause analysis on all bugs!

Are you talking along the lines of brief, pre-listed statements, or something more thought out? E.g.

Programmer Error
Ambiguity in specification
Missing requirement
etc

Recording root cause analysis for a bug is an EXCELLENT means of enabling a team to learn to improve quality over time. Doing this will (after a few cycles) help team members discover how their actions may be negatively impacting the quality of others’ work.

I have used Root Cause Analysis and found it very useful. However, I have recently found that DevOps people are not keen on Root Cause Analysis. I think that we need to explore why this is.

To clarify: RCA is very useful in DevOps. But if you are (and more of us increasingly are) in a situation where the goalposts are moving very fast and the dev team has a load of tech debt, an RCA discussion merely gives you a thing to shoot at, when often there are elephants in the room that people are ignoring. You need a certain environment for an RCA discussion to work, and my personal opinion is that the key ingredient is trust.
When people are learning to do a job right, they do it as a team, and everyone in the team will have opinions as to what the team or an individual did wrong. It’s human nature to ignore good advice on a change when it comes from someone you don’t trust. If that trust is not there, then the RCA quickly becomes the right tool wielded wrongly.

Hi all,

Thank you for your good input. I was just listening to this podcast with @bangser.a

She mentions at the end that SREs (Site Reliability Engineers) discuss RCA a lot, but have realized there is rarely only one root cause for incidents. This is true for bugs too - remember the old Swiss cheese model. While it seems easy to add a drop-down list of options, a single choice is too simple to capture the events leading to the bug being found.
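As a rough illustration of the difference, here is a small Python sketch in which a bug carries a list of contributing factors instead of one drop-down value. The `BugRecord` class, its field names and the example factors are all hypothetical and not taken from any particular tracker.

```python
from dataclasses import dataclass, field

@dataclass
class BugRecord:
    """Hypothetical bug record holding several contributing factors."""
    bug_id: str
    contributing_factors: list[str] = field(default_factory=list)

    def add_factor(self, description: str) -> None:
        # Each factor is one "hole in the cheese" that let the bug through.
        self.contributing_factors.append(description)

bug = BugRecord("BUG-205")
bug.add_factor("Third-party API changed its response format without notice")
bug.add_factor("No contract test covered that response")
bug.add_factor("Error handling assumed the old format")

for number, factor in enumerate(bug.contributing_factors, start=1):
    print(f"{number}. {factor}")
```

A free-text list like this is harder to chart than a drop-down, which is exactly the trade-off being discussed.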

Brilliant podcast episode there, with simple questions in it that give you the power to ask and answer good questions, all of which can help you find the multiple causes of a root fault. That’s probably the big reason I am no fan of RCAs on each bug: all they will do is generate waste addressing non-root causes, never getting to the process or tech debt that we really want to find and that is often not cheap or easy to uncover. Being able to repeatedly know whether a bug is really worth tracking deeper takes a skill that cannot be gotten for free by just adding a field to a Jira ticket.

I was speaking to John Willis recently and he did not sound enthusiastic about Root Cause Analysis. So maybe it is only some DevOps people who are no longer supportive of RCA.

I just spoke to John Willis and asked him why he is critical of Root Cause Analysis, and he replied:
In short, it collapses learning. A deeper discussion involves a discussion about certainty. Dr. Deming says, “Life is variation; variation there will always be.” Diane Vaughan refers to the idea of probable cause. Here’s a podcast I did on this subject.

I think we should see how we can learn from his comments

Oh, I like that idea!

I’ve seen many bugs where the same symptoms had different causes, and others where what looked like the cause actually wasn’t.

Root Cause Analysis has too much potential to turn into a blame game for my liking: it’s one thing to track down problematic areas of the code, but blaming the people working in that code for the bugs is the easy (and bad) way to do RCA. In my experience problematic code gets that way because of shifting needs and requirements mixed with the eternal time crunch. Blaming the poor unfortunate soul who got unlucky when changes were assigned is just cruel.

For those of you worried that RCA may turn into a blame game, I’d highly encourage you to investigate and advocate for using some tools for developing psychological safety on your teams:

We did this on a team I worked on and it made an enormous difference, not only to our productivity but also to how happy we felt working there.

That is a truly awesome link. I’ve been in a toxic workplace in the past, and I know all too well what it can do to people.

I wish this list could be forced down the throats of assorted C-Suite people - but alas I’ve seen a few too many who will tell you it’s all about whatever the current buzzword of political correctness is but when it comes down to it, they don’t change what they do so nothing else changes either.

Thanks so much for sharing.

I sometimes wonder if RCA is often pushed by someone who believes that silver bullets exist. Sticking with any one technique for RCA, for example, is itself a laziness problem: a failure to study and do your homework. Never ever find yourself having to apologize for learning [Deming paraphrased.] Not to throw the baby out, but stepping back becomes harder if you are wedded to a one-trick pony.

I’m with you Mike. Been trying this out in my team too. Our “Root Causes” tend to be brief, and more a gist of what the problem was, e.g. unhandled JSON, or a surprise update from a third party. So in truth they aren’t “proper RCA”.

But the gist should be easy to add since the problem has just been fixed. (Hopefully)

Maybe Root Cause is too loaded a term, and part of the blame game.

I’ve put variations of Orthogonal Defect Classification schemes into bug trackers (HP ALM at a bank and a set of conditional Excel drop downs at a FinTech).

I observed a couple of things.

Having this built into the tracker and mandating it on bug creation gives an initial assessment only. There may be other emergent causes which need to be captured later in the bug fix process. Having an “Initial RCA” as well as a “Final RCA” may add value in helping people
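To sketch what an “Initial RCA” plus “Final RCA” pair could look like as data, here is a minimal Python example. The `BugRca` class, its field names and the classification values are hypothetical placeholders, not the actual HP ALM or Excel setup described above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BugRca:
    """Hypothetical record keeping the first and the final assessment apart."""
    bug_id: str
    initial_rca: Optional[str] = None  # first guess, made when the bug is raised
    final_rca: Optional[str] = None    # recorded once the fix is understood

    def assessment_changed(self) -> bool:
        # True only when both assessments exist and they disagree.
        return (
            self.initial_rca is not None
            and self.final_rca is not None
            and self.initial_rca != self.final_rca
        )

bug = BugRca("BUG-310", initial_rca="interface")
bug.final_rca = "timing"
print(bug.assessment_changed())  # True
```

Keeping both values, rather than overwriting the first, makes it possible to see how often the initial assessment turned out to be wrong.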

Even with an objective scheme that boils down to some binary choices, individuals or teams can receive blame if the communication of the issue is handled that way. I know that I am far from perfect so how can I expect perfection in others?

Suppose the bug is presented as “the system doesn’t deliver the outcome”. For that to happen, any number of individuals collectively missed the opportunity to identify and fill the gap.

  • The product person creating the user story/problem statement … however captured
  • The engineer(s) who built the solution to solve for that problem statement
    Any number of possible questions as to why something may not have been spotted
  • The QA/tester who wrote the tests
    Static testing of the problem statement before development starts may have found the bug

“Blame” here is not down to an individual but a team. Unfortunately, we like to use the ‘U’ in bug as a shortcut for ‘You’ rather than the initial for ‘Us’.