DORA: any experiences?

DORA metrics are coming to my team. I’m a bit concerned. I like metrics, but I’m not sure whether the people who want them are doing this just as part of the lightbulb effect (the Hawthorne effect), or have a genuine interest in optimizing. It’s far too easy to be lazy or just plain wrong when it comes to metrics. I’ve seen too many great ideas never carried through in my time.

I would rather not take actions to drive a set of metrics if all they do is end up on a report that never gets actioned. Please tell me I am wrong and that DORA can be made to work for every team. I have seen far too many companies decide to build dashboards or tools and make process changes, only to find that teams go back to what works for them in the end. And I begin to wonder whether most teams merely lack discipline, or whether management are too lazy to just ask us directly: how is our devops doing, and what can we do to deliver more reliably? I would much prefer, for example, that a Wardley Maps exercise overlay the DORA metrics to show which teams are where in maturity, speed and value. But I’m not convinced it’s worth pointing out that, if a thing is really worth doing, we would be doing it already anyway and wouldn’t need a simplistic metric. DORA must be deeper than just counting releases, surely?

3 Likes

The scientific process is slow and costly. We fight to overcome our disconnect with objective reality and the faulty (but necessary and useful to our species’ survival) way we interpret the world. We employ double blind trials because if a patient knows if they’re receiving a placebo or not it affects the outcome, and if the patient isn’t told but the person giving them the treatment knows even that affects the outcome. It is incredibly hard to measure things, know how wrong the measurements are and draw statistically satisfactory answers without accidentally or deliberately lying.

If the people who come up with an idea for a metric, or the people who implement it, don’t know what a p-value is then I start from the position that I have no faith in their ability to produce useful results and try to work towards something that provides more than it costs without pointless damage to people (their reputation, job, happiness, schedules, attention, etc).

Why does the ISTQB exist? Companies want assurance about something they don’t understand, testers want to give assurance about something they purport to understand, and the ISTQB likes money. It’s an appeal to authority based entirely in fear and superstition, much like human sacrifice to protect the crops. A Skinner box that humanity goes round and round in, hoping for a food pellet.

The very first thing you wrote, to me, sums up the problem:

DORA metrics are coming to my team.

Not from your team, but to it. Why? I see value in external ideas, and I use them all the time. I’m glad I learned about the HTSM and SBTM but nobody forces me to use them. Imposed standards are usually very weird to me. I get it if it’s a legal issue or some contractual foible, but it’s odd to me to hire intelligent people and then tell them you know better, almost as if they don’t believe you are capable or have some paranoia about how much good team they’re getting for their money.

I really did mean the “why”, though. If you don’t know then that’s weird too. Whenever these things are handed out I feel like someone’s been talking about me behind my back. You know when you see an email for “everyone” with a very specific request, like “due to blockages in the drains we now prohibit the use of French presses in the kitchen area” and you’re like “oh, that’s for Sandra’s coffee grounds” because not one other person has one.

Y’know maybe the people in charge really do know what they’re doing and have evaluated the project and the product environment and the needs of the teams and the companies and are introducing an idea that will bring a lot of use, I’m not to know. All I know is that nothing improves morale more than feeling judged.

2 Likes

So I did some shallow searches and here’s what I can find about DORA metrics. I’ve listed each one, how it’s measured, and how it’s inferred that it’s used, then some quick thoughts on each one.

Deployment Frequency

How many releases per unit time
More = better.

I’m not sure. If a release takes more time it might have good reason to, and I don’t know to what degree we could compare one release with another. The suggestion seems to be that it’s better to release value to customers more frequently, but of course if we use toggles to release partial code and functionality then we’re not actually delivering something they can use, just adding change risk to the code base and another column and row to the matrix every time the toggle combination changes.

I think it heavily depends on how your company releases, and what kind of product you’re making among other things.
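To make the toggle point concrete, here is a minimal sketch of counting releases per week two ways: every release, and only those a customer can actually use. The release log, the `user_visible` flag and the function name are hypothetical, invented for illustration. They are not part of any DORA tooling.

```python
from collections import Counter
from datetime import date

# Hypothetical release log: (release date, whether customers can use the change yet).
RELEASES = [
    (date(2024, 1, 1), True),
    (date(2024, 1, 3), False),  # shipped behind a toggle that is still off
    (date(2024, 1, 10), True),
]

def releases_per_week(releases, only_user_visible=False):
    """Count releases per ISO (year, week), optionally skipping toggled-off ones."""
    counts = Counter()
    for released_on, user_visible in releases:
        if only_user_visible and not user_visible:
            continue
        iso = released_on.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)

print(releases_per_week(RELEASES))
print(releases_per_week(RELEASES, only_user_visible=True))
```

The same log gives a different "frequency" depending on what counts as a release, which is a decision each team would have to make and write down before the number means anything.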

Mean Lead Time for Changes

Time between a commit and that change reaching production
Shorter = better.

I feel like this is more useful if you use it to show what’s holding up one team’s flow. You cannot use this to reliably compare across teams. Even for one team the idea should be more that a deviation away from normality should be investigated, rather than that shorter is better. The writing I found seems to sell it as the “efficiency” of the devops chain, which I think is irresponsible wording. A reasonable problem-finding measurement, though.
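One way to read "investigate deviations from normal" is as a simple outlier check on a single team's own lead times. The sketch below assumes lead times are already available in hours and uses a two-standard-deviation rule. Both the data and the threshold are illustrative assumptions, not a recommendation.

```python
from statistics import mean, stdev

# Hypothetical lead times (hours from commit to production) for one team.
LEAD_TIMES_HOURS = [20, 22, 19, 21, 23, 20, 22, 21, 20, 70]

def unusual_lead_times(lead_times_hours, threshold=2.0):
    """Return (index, value) pairs more than `threshold` sample standard
    deviations away from this team's own mean."""
    if len(lead_times_hours) < 2:
        return []  # not enough data to talk about "normal"
    mu = mean(lead_times_hours)
    sigma = stdev(lead_times_hours)
    if sigma == 0:
        return []  # every change took the same time
    return [
        (i, x)
        for i, x in enumerate(lead_times_hours)
        if abs(x - mu) > threshold * sigma
    ]

print(unusual_lead_times(LEAD_TIMES_HOURS))
```

A flagged change is a prompt to go and ask what happened, not a score. Note that a large outlier inflates the standard deviation it is measured against, so a median-based rule may behave better on small samples.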

Change Failure Rate

Number of failures in production divided by total number of deployments

This is heavily context-dependent. What do we consider a failure? The suggestion seems to be anything that needs a rollback or hotfix. Now I can see the value in detecting changes in that number, because you can see if it gets higher when you’re not expecting it to and try to identify why that could be, but one writer puts that “for elite engineering teams, no more than 15% of their deployments result in degraded services” and another writes “Most DevOps teams can achieve a failure rate between 0% and 15%”.

Each failure is different. Also each failure now has to be assigned blame, or how do we know whose change failure rate we are counting? If someone changes the database and someone else changes how they access the database and they break who is at fault? It feels like a lot of process and procedure to assign a numeric value to how crap a team is, when the actual outcome really should be about process improvement. See how each thing can have value but you don’t have to walk very far for it to become stupid? It’s a good idea to keep an eye on failures and their cost and adjust with appropriate amounts of resources, but “elite teams” keeping it under that 15% is contextually deaf and doesn’t survive contact with reality when it comes to implementation.
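The "what do we consider a failure?" question can be shown with a small sketch: the same deployment history produces three different change failure rates under three definitions. The records, field names and definitions are hypothetical.

```python
def deployment(rolled_back=False, hotfixed=False, degraded=False):
    """Build one hypothetical deployment record."""
    return {"rolled_back": rolled_back, "hotfixed": hotfixed, "degraded": degraded}

# Ten hypothetical deployments: seven clean, three with some kind of trouble.
DEPLOYMENTS = [deployment() for _ in range(7)] + [
    deployment(rolled_back=True, degraded=True),
    deployment(hotfixed=True),
    deployment(degraded=True),
]

def change_failure_rate(deployments, counts_as_failure):
    """Failed deployments divided by total deployments, for a given definition."""
    if not deployments:
        raise ValueError("no deployments to measure")
    failures = sum(1 for d in deployments if counts_as_failure(d))
    return failures / len(deployments)

rollbacks_only = change_failure_rate(DEPLOYMENTS, lambda d: d["rolled_back"])
rollback_or_hotfix = change_failure_rate(
    DEPLOYMENTS, lambda d: d["rolled_back"] or d["hotfixed"]
)
any_trouble = change_failure_rate(
    DEPLOYMENTS, lambda d: d["rolled_back"] or d["hotfixed"] or d["degraded"]
)
print(rollbacks_only, rollback_or_hotfix, any_trouble)
```

Under these assumptions the rate is 10%, 20% or 30% for exactly the same ten deployments, which is why a bare "under 15%" benchmark says little without the definition attached.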

Mean Time to Recovery

Time to restore a system to its usual functionality
Less = better

I don’t have much of a problem with measuring this. It helps to evaluate ideas like rollback systems and backups. I think it’s also important that we’re essentially talking about downtime, so we need to examine the cost and to whom, because one outage does not cost the same as another. Also a system with a backup parallel system can swap over, and we have to count the downtime for the user and the downtime of the borked system separately. There will be plenty more concerns and limitations.

Then, it turns out that “elite teams can recover in under an hour” which, again, is a waste of letters. If the team takes longer one day because someone is ill, is that taken into account? Whose fault is it really if Vodafone put a shovel through a cable? Does it actually matter that it’s under an hour, or is it actually fine? Would it cost more to reduce the number than to let it sit in downtime? Maybe it goes down overnight and nobody’s using it. There’s probably not even a night-time ops team for that, so because it takes 12 hours they’re not elite enough?
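A small sketch of why a single "time to recovery" number hides the concerns above. It tracks system downtime and user-facing downtime separately and reports both mean and median. The incident data is invented, including one overnight outage where a parallel system is assumed to have covered users.

```python
from statistics import mean, median

# Hypothetical incidents, durations in minutes.
INCIDENTS = [
    {"system_down_min": 30, "user_down_min": 30},
    {"system_down_min": 45, "user_down_min": 45},
    {"system_down_min": 720, "user_down_min": 0},  # overnight; failover covered users
    {"system_down_min": 20, "user_down_min": 20},
]

def recovery_summary(incidents, key):
    """Mean and median recovery time for one way of measuring downtime."""
    durations = [incident[key] for incident in incidents]
    return {"mean": mean(durations), "median": median(durations)}

print("system:", recovery_summary(INCIDENTS, "system_down_min"))
print("users: ", recovery_summary(INCIDENTS, "user_down_min"))
```

With this data the mean system recovery time is over three hours while the mean user-facing downtime is under 25 minutes, and one 12-hour overnight incident is responsible for almost all of the difference. Which of those numbers goes on the dashboard is a judgment call, not a property of the team.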

Final Notes

The DORA stuff is littered with sales points like “With live DORA dashboards in place, engineering organizations can start to see where they stand relative to other engineering organizations, and what the scope for improvement is in their software delivery processes”, which to me reads “harder, faster, prove your love for me” which displaces the reality of development - a team of people working together and for each other toward common goals - with older ideas of factory floors or newer ideas in call centre hell boxes.

Secondly there’s a common thread through all formalised metrics that appears just the same here. Fungibility. Something is fungible if it’s interchangeable and can be pragmatically treated like any other - money is fungible because a dollar bill is worth any other dollar bill. Houses are non-fungible because they have differing sizes, number of rooms, facilities, neighbours, access, materials and so on. Mathematics relies heavily on the fungibility of numerical values, and benefits from the reliability and consistency that provides. It feels like it has integrity and certainty. I believe that formalised, cookie-cutter metrics systems use the comfort of that certainty like a security blanket or teddy bear to fight the fear of the complexities of reality. The quantity of money, and effort spent and the cost and misery inflicted in this world in an attempt to make things seem simpler than they are is arrogant and obscene.

Thirdly there’s a theme of blame in the writing about these metrics that disturbs me. For the numbers to be used in a comparative, competitive or class-sorted way, like getting a gold star in primary school, we have to assign blame. I can imagine it feeling like working for an insurance company. No, ma’am, we’re not liable for acts of God. Do you have the police report?

I believe there is a way to use measurements in a sane, humane and logical way that also shows humility to the messy nature of reality, but I’ve never seen one of these systems come with information on how each one can be misused or abused or poorly implemented or poorly interpreted, which suggests to me that the aim isn’t the betterment of a business or the people in it, much like the psychic medium, either through ignorance or deception.

2 Likes

Ha ha, Chris. I have a similar but not quite parallel skepticism. However, I still need to research the DORA system myself, and I really want to know how many people already have an opinion on it. DORA is not old; it’s reasonably new as a formal measurement, and dang, it makes sense. So input from people here who are actually using it is needed. I tried reading blogs about it, but most of them are from companies selling me a toolstack anyway, so I could not use their fiction to grow my technical knowledge.

So, are we talking HTSM? Well, no. It’s not a test-centric metric. All of our teams are dev or functional teams. Testers belong to a “chapter”, or community of practice if you like, are embedded in each dev team, and meet twice weekly. I’m talking about all engineering teams using release cadence as a metric. It makes perfect sense to use release cadence as a performance metric, and as someone who has been coding for 30 years now, I like this metric. I hated it as a junior in the ’90s because release is costly, but today, not releasing is even more costly.

1 Like

Thanks for drafting in a precis here by the way Chris. I cannot hope to read and type that fast.

Deployment Frequency: This one is easy to measure, and our product is suited to frequent release for a lot of reasons.

Mean Lead Time for Changes: This one requires more stats work than I would like to spend time doing. I think this metric was added to allow people to sell you a toolstack. Nice idea, but on paper only, as it’s really hard to measure usefully. It does not suit our branch strategy, so we are not doing this one anyway.

Change Failure Rate: You only get this one if you implement release frequency, so yeah, this we are doing, and it’s the big thing that I see as a stress reliever. Also, I don’t see failure as negative. Every time you fail, you learn.

Mean Time to Recovery: I do hope we are going to implement this one. I just have no idea how we will do so. When you fail you learn to recover from failure more quickly too. Meaning, next time you fail, it need not be a panic. Anything that reduces my stress levels is a plus these days.

I think just looking at your reflection I have a few better questions about implementing and viewing the metrics as an observer once we start this exercise. I just hope it does not take ages to set this up. I also hope that all the engineering teams (we are spread out in terms of our functional and user interactions/value) can all buy into the metric and make at least part of this exercise useful.

2 Likes

You do not need to read any of the following book I wrote. I have literally nothing else to do that I’m currently capable of doing.

I have a similar but not quite parallel skepticism

We are professional skeptics, after all

it makes sense

As responsible, professional skeptics we should consider the occasions and environments in which it does not.

I could not use their fiction to grow my technical knowledge.

I understand. My concern certainly isn’t with your agency and responsibility, which I would put trust in, but other decision makers with more control who are the fearful target audience for this sort of thing, which I would not.

Talking about those risks and claims is the right thing to do, and complaining about everything else is something I wanted to do. Everything I read about the system did not include any contextual risks and plenty of insane claims, but I followed the money, and the people responsible for DORA I think got swallowed by Google, which led me eventually to here. This is actually about Four Keys, a system to implement the idea, but does explain the metrics. I take two lessons from this - I was wrong to assume all the resources were awful in the same way, and people will ruin an idea by the time it’s a secondary source. Blessed are the cheesemakers, I suppose. The blog post includes the idea that you should adapt to what your business needs are, but still calls them performance indicators which I still think is irresponsible language.

I don’t think that DORA metrics are innovative, they describe things people already measure. Some companies use them, some abuse them, some abandon them just like every other system that’s come before. I hope my ranting about testing in metrics and operations catches on so that more people get more value out of these things.

My most hopeful find was actually a Reddit thread. I put a full stop there because I want that sentence to stand alone because it’s so delicate and rare. This one has quotes like “They can be useful provided that improving the metric doesn’t become the goal” and “Wondering how folks collect the metrics. From my understanding, people often underestimate the difficulty of doing this well”, which speak deeply to my experiences of unnecessary failures in businesses. It does also contain “made teams more data-driven” which underlines the importance of the data being correct every day.

Not a test-centric metric. All of our teams are dev or functional teams.

Okay, but everything I mention can be applied to teams of any kind and makeup, I wasn’t talking specifically about testers. The example of HTSM wasn’t really important and can be replaced with any external model or process you bring in. Generally speaking teams should craft their process to work for the mission of the company, not the company imposing process to force workers to adhere to the mission, and that goes 1000% more if they don’t ask first. If we want to be the cross-functional, highly synchronised Voltron robots of Team Business Alpha that we seem to want to be then the effects of any and all imposed processes become magnified and they affect everyone more deeply. My problem is one of epistemology, metrology, statistical analysis and ethics that underpins the use of all measurement in business, and that so many people responsible for so much process and so many people’s lives in a company have a combination of confidence and ignorance that frightens and saddens me. If someone says “we want to investigate DORA and use what is useful to us” I have a different reaction to “DORA is coming” (which has a “Winter is coming” sound to it).

I feel like metrics can be collected properly and used ethically but they need to be taken seriously and people need to be more critical about automatic results, both from automation suites and metric system dashboards. I can also simultaneously think release cadence is a great metric, and I can think of contexts in which it wouldn’t work, be used for evil, not fit specific situations, etc. That information is valuable in making decisions about if/how/when to implement the idea. My concern is that investigation and feedback doesn’t happen. I don’t understand when the terror of production errors means we insist on testing in a certain place in a certain way, and others are comfortably ignored with a policy of misplaced faith. Our goal is to find important problems for people and be really good at it, and a broken thermometer is worse than no thermometer.

I know you’ll work towards a system that’s useful to you and try to minimise any negative impact and pointless cost with the agency you have available to you. Not everyone has the same situation or opportunity and I care deeply about software and people and personal responsibility and hate avoidable injustice, so I hope that anything I have to say will cause people to pause and consider the difficulties and the changes they make to the social contract so they put value and love into the world and don’t take it out.

2 Likes

To take a much less scientifically-based approach and expand on an awesome post, my view on these (and any other metric, system, methodology, management approach, whatever) is that every metric and tool can be abused, and every metric and tool will be abused.

It’s like corruption in politics. It will happen no matter what because we’re dealing with fallible humans. The art lies in engineering the system and constantly monitoring the culture so that what does occur is limited in scope and damage. If in the case of politics you’ve got media ready to tear into any and every corrupt politician regardless of party or anything else, you’ll get a lot less of it than if the media is owned by the government or clearly supports one faction.

In the case of software development, there’s a few rules that apply: if people are penalized for something, they’ll do less of it, and if they’re rewarded for something they’ll do more of it. The problem is that software metrics by their nature are proxies for software quality that may represent some aspect of a particular application well. Or not.

As far as the DORA metrics go:

Deployment Frequency - monitoring this over time can be used to track long-term improvement in development practices. It can also be completely meaningless: at one place where I worked, in addition to the normal releases there were patch releases (which were still the full product, just delivering a specific fix to a single customer) on average 3 times per week. Including those in a deployment frequency metric would make the company look very good, but it wouldn’t say anything about the quality of the software. This particular organization was known among its customers for having “safe” releases and “unsafe” releases. I have no idea what the situation is today.

Mean Lead Time for Changes
Honestly, this one reads to me like a feelgood metric. Every place I’ve worked has a way to prioritize defect reports, and a set of functionality that is considered essential. So which changes are we looking at? The critical bugs that make the software unusable for most customers that have a fix in production within hours, or the low priority fixes that wait until there’s time to move them? Even the most efficient devops system has humans deciding what needs to be done when, so low priority bugs can languish for years.

Change Failure Rate
Which changes and which failures? As @Kinofrost has said, every failure is different. Do you include the database server crash which takes down the entire system for several hours? That’s certainly a failure, but it’s one that’s not related in any way to the team that wrote and deployed the software. Does the embarrassing but otherwise harmless typo count?

Honestly, I see this type of metric as one that’s there to beat the teams with. It’s a bit like penalizing programmers for bugs reported against code they wrote: team members will do all they can to avoid reporting bugs through the formal system because they know that bugs are usually the interaction of complex software, complex requirements, and the well known fact that software development is a wicked problem.

Mean Time to Recovery

It’s one thing to say that you can recover in an hour from a system outage, but that flies out the window when the system outage is caused by something utterly out of the team’s control. No amount of devops, backup generators, cloud providers and what not can cope with an event like the northern US power outage some 20 years back. If someone’s primary data center in Florida gets wiped out in the next hurricane season, there goes the statistic.

If you can reliably measure time to recover from outages in a way that excludes outages that are caused by external factors, great. I wouldn’t want to try: there are too many variables that could impact the end result.

Frankly, I’ve found over… nearly 20 years working in testing that every piece of software is different enough that there’s no real way to build metrics that say “this is high quality software”. There are metrics that can be used to judge a team’s progress, as long as they are used consistently for and by that team, and the team knows they will not be penalized for the inevitable issues they encounter.

All I can suggest to anyone who has metrics like DORA imposed on them is to approach with caution and do your best to make sure they don’t get used against you.

2 Likes

Hi @conrad.connected ,

I saw this recent post and it provides interesting insight into the topic.

It seems like they were created for a fairly high organizational level. At some point, the authors state explicitly that the DORA assessment tool was designed to target business leaders and executives.

1 Like

That linky was very helpful Simon.
DORA is always going to be interesting as a distraction for quality professionals even if it is a tool used only by senior leadership, because it can sometimes take attention away from where it is needed at the coal face. I also am not keen for managers to get a wrong picture when the dashboard they see is not representing the true picture. We have seen too many situations where data is inaccurate and the person the data is supposed to help does not get to see that data to be able to correct it. Hence the double points of interest here.

DORA is also a side-tool to RACI, or the responsibility assignment matrix, which is a plus when you are growing your organization and ownership of the chores becomes fluid. As a tester you want to stay on your toes for moments when managers might be re-organizing resourcing or priorities in ways that put your project at risk.

1 Like

@conrad.connected Have you read Accelerate? It contains a lot of info about DORA metrics.

2 Likes

We use DORA metrics very successfully. They provide a level of visibility across teams that we find useful. We believe them to be fairly objective. You can read more here.

The DORA metrics are key measures that evidence pace (frequency, lead time) and quality/stability (change failure, recovery). They shouldn’t be seen as being imposed on you, but as useful for continuous improvement. They need to be measured right, and that can be hard across an organisation. They start to provide some indication but shouldn’t be used in isolation, as there are lots of factors that can make it OK for things to be where they are. There is lots of good stuff to read up on here. Most of this was pioneered by Nicole Forsgren, who was involved with DORA, wrote Accelerate, then worked on the SPACE framework, and is now doing some work with a company called DX, who focus on developer experience.

Read Accelerate. The State of DevOps reports are based around DORA, and Google has just relaunched its DORA site, which has a lot more depth than the 4 measures. I’d also look at developer-experience-style surveys; the Spotify health check is an example, as is DX.

2 Likes

Thanks @fponknevets. I think what John Lewis did in their DORA deploy was missing in ours. Nobody really consulted teams on how to implement it; all we did was agree it was a good thing to do. From my QA back seat in this, the process never actually got a “process owner” in engineering, and as such it’s been less useful.
It has caused some good yet funny side effects. For example, patches and hotfixes that go out are no longer below the “radar”. Patches always got recorded in our system, but now they have started to stand out. In some ways this means hotfixes get more diligence, but I believe the communication fatigue losses outweigh what I personally thought would be the benefit: better communication that would lead to easier deployments. I think we need more time, and I will share as we refine. It has to be an ongoing “agile experiment” or else, in my experience, it will peter out.

1 Like