Feedback about QA metrics. Which ones would you add/remove?

Hey everyone, I hope you can help me out with some feedback about QA metrics.

I’m new to the QA industry, and my task is to create and promote useful content. Since I’m new to the role, I usually interview an expert and write a post based on their input, but it’s still hard for me to judge whether the result is actually useful for a QA audience.

I’m writing now about QA metrics, and would love to get your input / feedback. Here’s what I put together from an interview with a QA leader:

":bar_chart: Hard & Soft Metrics to Measure the Value of QA:

:one: Hard Metrics:
:arrow_right: Defect Backlog: This metric keeps track of the number of unresolved bugs.
Key questions to ask:

  • What is the trend? A consistent increase could indicate potential gaps in the testing processes or areas that need more focus.
  • How can we ensure that the defect backlog remains manageable and doesn’t grow excessively?
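
As a rough illustration only (the field names and dates below are invented placeholders, not something from the interview), a backlog trend like this could be counted from exported bug reports:

```python
# Minimal sketch: assumes bug reports exported as dicts with ISO "created"
# and optional "resolved" dates. Field names and data are invented.
from datetime import date

reports = [
    {"created": "2024-01-10", "resolved": "2024-02-02"},
    {"created": "2024-01-20", "resolved": None},
    {"created": "2024-03-05", "resolved": None},
]

def open_backlog_on(day: date, reports) -> int:
    """Count reports created on or before `day` and not yet resolved by then."""
    count = 0
    for r in reports:
        created = date.fromisoformat(r["created"])
        resolved = date.fromisoformat(r["resolved"]) if r["resolved"] else None
        if created <= day and (resolved is None or resolved > day):
            count += 1
    return count

# Month-end snapshots give the trend line that the questions above interrogate.
for snapshot in [date(2024, 1, 31), date(2024, 2, 29), date(2024, 3, 31)]:
    print(snapshot, open_backlog_on(snapshot, reports))
```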

:arrow_right: Delivery Cadence: This measures the rate at which software is released.
Key questions to ask:

  • Are the teams hitting their sprint and release targets?
  • If the team is going fast, is it doing so at the cost of quality?
  • If there are delays, is it due to quality issues?
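
Again purely as a sketch (the release dates are made up; in practice they might come from git tags or a deployment log), the cadence and how steady it is could be derived like this:

```python
# Minimal sketch: compute gaps between consecutive releases. Dates are invented.
from datetime import date
from statistics import mean

release_dates = [date(2024, 1, 5), date(2024, 1, 19), date(2024, 2, 16), date(2024, 3, 1)]

# Days between consecutive releases; the average is one crude "cadence" number,
# while the spread shows whether the rhythm is steady or lumpy.
gaps = [(b - a).days for a, b in zip(release_dates, release_dates[1:])]
print("gaps between releases (days):", gaps)
print("average gap:", mean(gaps))
```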

:arrow_right: Hot Fixes: This measures the frequency of urgent fixes post-release.
Key questions to ask:

  • If the frequency is high, what scenarios did we overlook or cover inadequately during testing?
  • What is the response time for critical issues?
  • What other strategies can be employed to reduce the number of hot fixes post-release?
  • How can QA leaders ensure that their teams are proactive rather than reactive in addressing issues?
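
As a small hypothetical example (the flag and dates are invented), hot-fix frequency per month could be counted like this:

```python
# Minimal sketch: assumes each release record notes whether it was an urgent
# post-release fix. The "hotfix" flag and dates are invented.
from collections import Counter
from datetime import date

releases = [
    {"date": date(2024, 1, 5),  "hotfix": False},
    {"date": date(2024, 1, 7),  "hotfix": True},
    {"date": date(2024, 2, 16), "hotfix": False},
    {"date": date(2024, 2, 17), "hotfix": True},
    {"date": date(2024, 2, 20), "hotfix": True},
]

# Hotfixes per month: a rising count is a prompt to ask what testing missed.
per_month = Counter(r["date"].strftime("%Y-%m") for r in releases if r["hotfix"])
print(per_month)  # e.g. Counter({'2024-02': 2, '2024-01': 1})
```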

:two: Soft Metrics:
:arrow_right: User Satisfaction: This is at the heart of any product or service. It gauges if the team is delivering features that not only adhere to requirements but also align with user expectations. Tools like surveys, user feedback sessions, and beta testing can provide invaluable insights.

:arrow_right: Team Morale: A motivated and positive QA team is more likely to be productive and thorough. Regular check-ins and surveys can gauge team sentiment and highlight any areas of concern.

:arrow_right: Communication Effectiveness: The efficiency of communication between QA and other teams, such as Development or Product Management, can be a significant indicator of potential bottlenecks or misunderstandings that could impact quality."

Let me know what you think, please.

5 Likes

Hello!

Some random ideas for you, hopefully the theme is useful in some way.

One thing I love about the metrics you provided is that they are inquiry metrics, that is to say, metrics that help us guide our questions rather than ones that drive a decision. Creating and refining models of a system through inquiry lets us use even poor metrics in a much more useful and less dangerous way.

You can read about inquiry metrics and their collection (for bug reports) here: Bug Metrics Tutorial - Satisfice, Inc.

Many people will model a system based on a surrogate metric, like a defect backlog, in a simplistic way and then draw simplistically incorrect conclusions because they cannot identify a heuristic failure. Or they don’t have the humility to accept one.

I think that if we’re going to use metrics to do something useful we should treat them very carefully and with humility - and I think that requires doing enough learning to know what we don’t know. Statistics and metrology are hard, and huge subjects, and it’s worth taking the time to consider that we’re measuring the wrong thing or in the wrong way or deriving the wrong conclusions from inaccurate or insignificant data.

Recording, collecting and reviewing the information is also costly, and the information should justify the cost.

As an example of questioning and investigation I’ll cover defect backlog, below:

This metric keeps track of the number of unresolved bugs.

Actually this keeps track of the number of unresolved bug reports. The difference is important, because one of the easiest ways to fix a bug is to do so during development with a developer so that there’s no need for a long, expensive reporting process. This means that using the number of unresolved bug reports doesn’t include unreported bugs due to such fixes, or in fact anything that bypassed the bug report system. In fact good communication in teams reduces bug reports because there is a better shared understanding of purpose - better oracles lead to fewer spurious reports.

A consistent increase could indicate potential gaps in the testing processes or areas that need more focus.

Or it could indicate that we are working on a more complex system, or a higher-risk system that requires deeper testing or more precision so generates more bug reports, or we have a new client that tends to report more bugs, or we made a change to the teams and the nature of what gets reported has changed, or there is simply more software so more test surface to generate reports, or a manager has become annoyed that there aren’t enough reports being made so testers report any old thing, and so on. Have there been any changes to testability, for example?

While it could indicate gaps it could also indicate better coverage that’s finding more problems, or areas that need less focus because they’re generating a large number of bugs for something we don’t care about.

I think that a consistent increase of bug reports indicates a consistent increase of bug reports. Figuring out why is going to be a very context-sensitive matter.

It’s important to note at this point that bugs are not fungible. They can be huge or tiny, impactful or ignorable, hurt people a lot or just a little. They can stand alone, cause other bugs, cover up other existing bugs or cause bugs as part of a fix. Counting them is going to be difficult, especially in smaller numbers where they don’t average out in a statistically healthy way.

Having some categorisation of bugs, such as those that come from specific teams, or that are concerned with a particular area, or are of a particular “priority”, can be an indicator that something else is going on, but my experience tells me that for smaller numbers it’s usually just another spike in the graph of a dynamic software development project. It’s worth considering what kinds of categories are being examined, and how accurate and useful those numbers might be at indicating some sort of problem that wouldn’t otherwise be discovered.

It’s worth now reflecting on the phrase “Metrics to Measure the Value of QA”, because it’s fairly trivial to come up with examples of how any given metric is actually measuring something else. Knowing that metrics are so often tricky and fallible measures of complex ideas, we can treat them as measures of what we imagine the state of something to be, use any patterns to indicate trends and changes, and then investigate why that might be. At the very least, this shouldn’t be what they are called in front of anyone, because if people believe they are being measured they will change your data to create a more beneficial outcome for themselves.

What to measure?
I think that depends on what we’re looking at and why. What risks are we trying to learn more about? Life doesn’t fit neatly into metrics; we have to do what we can with what we have, and I feel the same about measurement in software. I can imagine a context where collecting one metric would be very helpful and another where it would be counterproductive. Perhaps instead of starting with what metrics to collect we need to understand why we’re collecting them - what is the need that is generating this cost - so that we can ask sensible questions and monitor the right data to determine if we need to ask more.

5 Likes

Wow - now this is what I call a response. With you on the metrics driving behaviors (sometimes unwanted).

3 Likes

Welcome to the MOT community @jca. Who do you write for? As in, what is your context, and what does your target audience usually read? I mean, who reads stuff these days anyway; nobody reads when the page is 50% adverts, so I’m guessing a closed audience. Who is that audience?

As Chris has pointed out, it’s a topic area that has been beaten to death, mainly because there is no silver bullet and no metric that works in all contexts; the only context we should have is to never stop moving. Experiment often, actually work hard at the right thing, and do the wrong thing less often. I read tweets by a mathematician who contracts his number services out to top clients, and he keeps telling us that companies collect far too much data for what they actually use in decisions. Ask any QA person and they will tell you that the metrics tools do not work, because the tools are not only used wrongly but are often just a way to earn subscription fees.

As @kinofrost points out, if you keep looking at things the same way with the same tool or approach, you are going to find what you are looking for. Which stupidly answers the question in one line, doh!

3 Likes

We learned in university that it’s important to look at how a metric can influence behaviour in undesirable ways.
People will try to look good in their metrics. Some will try to cut corners and manipulate data to get there. Say you’re tracking production bugs: lots of production bugs look bad, so now people are more likely to raise them as pre-production bugs instead.

1 Like

This was so clarifying and illuminating @kinofrost !

This makes total sense. Metrics are absolutely context-dependent; otherwise we risk Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

2 Likes

Chris’s point about fungibility is crucial when considering the validity of metrics. I struggle to think of any testing-related parameters that are fungible. Bugs aren’t, test cases aren’t, requirements aren’t, lines of code aren’t, people aren’t, even a man-day’s work isn’t.
So, if none of the things we can measure are fungible, why does anyone think there is value in measuring them? It’s usually interesting if the value of a metric is zero - that should probably prompt a conversation. But non-zero values are meaningless both because of non-fungibility and also because they can be gamed so easily.

2 Likes

I’m still reflecting on all of this.

@kinofrost your link to Satisfice’s website also led me to the article about Assessing Quality instead of measuring it. Assess Quality, Don’t Measure It - Satisfice, Inc.

And knowing full well this is not a Reddit “Explain Like I’m Five” channel… could you please do so anyway?

Assessment vs. Measurement: Bach emphasizes that the true goal is a useful assessment of the product’s status, involving subjective evaluations, discussion, and systematic review of relevant data.

How does that look in reality?
In a context of a sprint or project? Or cross-team collaboration dynamics.

Because I’m still stuck, perhaps in an old way of thinking, on a “reporting dashboard” as the language of leadership and as a platform to display the value of the team/quality.

P.S.: I feel like I’m travelling in a horse-drawn carriage, knowing that there should be a better way, but I do not know what a car is.

1 Like

Quality is something of value to someone. Dashboards are not a “someone”, metrics are not a “someone”, and none of these are “value” - as @steve.green points out, they are not fungible. The value lies in the stakeholder conversation: is this product something they can sell and pay the bills with? Anything else is just fudd. We know this because many apps we know of have medium-priority bugs, yet people still use the apps; there is a lot to be said for letting things feed back instead of trying to measure the wrong thing. Defect priorities are subjective and can change based on many internal political factors. Not having a reference point makes many metrics work much like a stick in quicksand.

Much like how the amount of gas in the tank of a car is a bad metric: how far the car actually gets you is what matters, because a full tank is a dumb measure if the handbrake is still up. Which is a problem that @jca won’t ever have with a horse-drawn cart I guess; the horse just waits for you to figure it out.

How does that look in reality?

I suppose I could say that as quality is subjective it is a matter of relationship, not one of objective truth, and therefore has to be expressed in terms of opinion and not numbers, even if we use numbers to help form that opinion. So in reality it looks like a bunch of opinions, hopefully well formed, hopefully put in front of people who need those estimations to inform decisions.

I’ll take it from the feeling that there should be “a better way”, that you mean a better way to do something. I just need to understand what that something is.

If you’re looking for a status of the product you might be interested in this: How is the testing going? – DevelopSense

If you’re looking for a reporting on activity rather than artifacts you might get a lot out of this: Session-Based Test Management - Satisfice, Inc.

But without knowing what you feel you were getting from a system I don’t know that I can provide the alternative. Some measurements are just bad or impossible or the measuring itself can backfire, and the answer to that is - don’t do that. So if your dashboard is full of lies then it looks the same but without the dashboard.

What is your dashboard for? Who is it for? And what would happen to the world and all its beautiful inhabitants if you put it into a car crusher? When we have a handle on that I think everything will become much clearer.

Some reporting systems are really there to make people feel better or abdicate responsibility for understanding or make the suits happier or give someone a job. Some of them are used to inspire questions about what is happening so that we can investigate and find out more. Some are used in lawsuits. The reality of measurement, assessment and reporting is pretty complicated.

3 Likes

I love this topic, thanks for bringing this up!

A few thoughts of mine:

Delivery Cadence: This measures the rate at which software is released.

I believe this metric can be misleading due to several factors:

Firstly, the complexity and scope of tasks in software development vary greatly. Some tasks might take a week, others a few days, and some even a month. Delivery cadence is focused solely on the frequency of releases, so it fails to account for the depth or difficulty of the tasks being completed. A period during which several small, simple tasks are completed might appear more productive than one where a single, complex task is being worked on, even though the latter might contribute more substantial value to the project.

Furthermore, emphasizing delivery cadence can negatively influence team behavior and decision-making. There’s a risk that it encourages a preference for quantity over quality. Teams will prioritize a higher number of releases over the significance or thoroughness of what is being released, potentially neglecting more complex, high-value work.

Another concern is the potential compromise on quality to maintain a consistent or accelerated delivery pace. Such compromises can lead to increased technical debt, bugs, and unstable releases, which are detrimental in the long term.

Additionally, delivery cadence doesn’t accurately reflect the effort and skill involved in software engineering. Tasks that involve complex problem-solving, research, and implementation of sophisticated features or dealing with technical debt often take more time but are crucial for the project’s long-term health.

Lastly, a focus on rapid delivery can foster a culture of overwork and burnout. It might pressure engineers to work at an unsustainable pace, leading to decreased job satisfaction, reduced quality of life, and, eventually, a higher turnover rate.

Team Morale: A motivated and positive QA team is more likely to be productive and thorough. Regular check-ins and surveys can gauge team sentiment and highlight any areas of concern.

When team morale is measured too frequently through regular check-ins and surveys, there’s a risk that these activities become more of a ritual than a meaningful engagement. Over time, team members might start to view these surveys and check-ins as routine or even bureaucratic exercises, diminishing their original intent and value. This ritualization can lead to less thoughtful responses, as team members may complete them out of obligation rather than genuine reflection on their sentiments and experiences.

Also, the actionability of the data collected from such frequent check-ins might be limited. If morale is already high, frequent surveys won’t add much value, and if there are deep-seated issues, simply measuring morale more often won’t necessarily lead to solutions.

Communication Effectiveness: The efficiency of communication between QA and other teams, such as Development or Product Management, can be a significant indicator of potential bottlenecks or misunderstandings that could impact quality."

I believe there are two key components of effective communication: the initiative team members show in proactively sharing information, and the goal of minimal information loss. However, neither of these aspects can be properly tracked.

Measuring the level of initiative in communication is a complex task. The willingness of team members to share information is influenced by a variety of factors, including the culture of the team, individual personality traits, and the perceived value of the information. Traditional metrics can’t capture the reasons why some team members communicate more than others or why communication levels may vary over time.

Assessing the extent of information loss during communication is also very challenging. Information can be lost or distorted due to factors such as the complexity of the subject matter, the communication channels used, and the diverse interpretations of recipients. This loss is hard to quantify, as it often requires subjective judgment and a retrospective analysis, which might not always be accurate.

The absence of established models to accurately track these elements further complicates the use of “Communication Effectiveness” as a reliable metric. While tools and methodologies exist to analyze communication patterns, they often focus more on the quantity and structure of interactions rather than on the quality and effectiveness of those interactions.

3 Likes

When I worked at a large corp, one of the ways the release manager would decide whether we were ready to release was by looking at how many release branch builds we had recently had, whether the rate was settling down, and how many of the builds had failed (unit tests or other). Today I still remember that in the context of “mean time to recover”, and delivery cadence is a good indicator of mean time to recover. I hope that when people look at delivery cadence, they are looking at what it tells them about how quickly we can completely and fully hotfix. Any kid with imagination knows that not all tasks are equal in effort. A greased, fully automated delivery pipeline with less human stress is my take on release cadence - not how much was in the release.
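
To make the “mean time to recover” idea concrete, here is a rough sketch only (the timestamps are invented; it assumes you log when a problem was detected and when the fix was fully released):

```python
# Minimal sketch: MTTR from detection to fully released fix. Data is invented.
from datetime import datetime
from statistics import mean

incidents = [
    {"detected": datetime(2024, 3, 1, 9, 0),  "fix_released": datetime(2024, 3, 1, 15, 30)},
    {"detected": datetime(2024, 3, 8, 11, 0), "fix_released": datetime(2024, 3, 9, 10, 0)},
]

# Mean time to recover, in hours: how quickly a complete hotfix actually lands.
hours = [(i["fix_released"] - i["detected"]).total_seconds() / 3600 for i in incidents]
print(f"MTTR: {mean(hours):.1f} hours")
```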

One upside of higher release cadence is that we get to actually see the stakeholders briefly, so there is that. Which is why I agree with @vitalysharovatov on working to understand our communication gaps with external teams and ourselves.

2 Likes

To make this great explanation more concrete and give it a visual aspect, I’d add the Low-Tech Testing Dashboard from James.
I see it as a great general alternative (or at least an add-on) to number-based metrics, and a good answer (or starting point and inspiration) to the question of what to use in testing.

This doesn’t sound like a testing process metric to me. Testing finds bugs, it doesn’t fix them, so a consistent increase to me would point to a problem with development prioritizing new features over fixing what’s already been delivered. I’ve been on projects where this happened and the result is a huge mountain of technical debt, and more bugs due to layering more and more features on an increasingly buggy foundation. The testing itself can bog down because basic things don’t work, and block even getting to the point of testing whatever the new feature under test was, but again, I think this would point to a more systemic issue than the testing process itself.

@mkutz has a nice session about what not to measure :slight_smile: and he has some nice things you could measure…

4 Likes