My company is starting to look at how we measure and record AI usage for development and testing work. They want us to record something tangible that they can report on each month. Is anyone measuring time saved or quality improvements from AI? How are you doing it? Any ideas welcome.
Quite simply, I stick to DORA metrics because, at the end of the day, success is measured in outcomes. We adapted them slightly to meet our needs:
- Planned deployment frequency
- Unplanned deployment frequency
- Change Failure Rate (for us, how often we are fixing production issues)
- Time to Restore Service
- Regression Testing Time taken (not a DORA metric, but one we want to measure)
- Rate of feature enhancement (not a DORA metric again, but tracking our frequency of deploying new features)
The bottom line: is your new way of working improving outcomes? Those are the metrics that matter.
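As a rough sketch of how a few of these numbers could be pulled together each month, assuming you can export one record per deployment. The `Deployment` shape, its field names, and the sample data are all hypothetical, not something from a specific tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to whatever your release tooling exports.
    deployed_at: datetime
    planned: bool                            # False for hotfixes / unplanned releases
    failed_at: Optional[datetime] = None     # set if the change caused a production issue
    restored_at: Optional[datetime] = None   # set once service was restored


def monthly_summary(deployments):
    """Summarise one month of deployments into outcome-style metrics."""
    total = len(deployments)
    planned = sum(1 for d in deployments if d.planned)
    failed = [d for d in deployments if d.failed_at is not None]
    restore_times = [
        d.restored_at - d.failed_at for d in failed if d.restored_at is not None
    ]
    mean_restore = (
        sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    )
    return {
        "planned_deployments": planned,
        "unplanned_deployments": total - planned,
        "change_failure_rate": len(failed) / total if total else 0.0,
        "mean_time_to_restore": mean_restore,
    }


# Invented sample data, for illustration only.
deployments = [
    Deployment(datetime(2024, 5, 2, 10), planned=True),
    Deployment(
        datetime(2024, 5, 9, 10),
        planned=True,
        failed_at=datetime(2024, 5, 9, 11),
        restored_at=datetime(2024, 5, 9, 13),
    ),
    Deployment(datetime(2024, 5, 10, 9), planned=False),
    Deployment(datetime(2024, 5, 16, 10), planned=True),
]

summary = monthly_summary(deployments)
print(summary)
```

Regression testing time and rate of feature enhancement would need their own inputs (test run durations, a flag for feature releases), so they are left out here.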
In our case, the clearest impact has been QA capacity.
We use AI to read developer-provided specs and code changes, then help design and execute the relevant tests. The useful output is not just generated test cases, but a go/no-go style report: what was checked, what risks were covered, what was not covered, and whether the change looks safe to proceed.
So I would not measure AI by prompts used or tests generated. I’d measure whether the QA team can assess more changes with the same people, while still producing useful evidence for release decisions.
The key metrics for us are: time from dev handoff to QA feedback, number of changes assessed per cycle, quality of the evidence, issues caught before release, and how much human review is still needed.
First I’d use the metrics I already have, such as bugs per KLOC, time to deliver, and recurrence of bugs, and then, depending on the results, I’d introduce new metrics. I think using existing metrics first is a good approach because you can compare before and after AI. With brand-new metrics you don’t have a baseline.
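To make the before/after idea concrete, here is a minimal sketch for bugs per KLOC. The function names and all the numbers are made up for illustration; plug in whatever your tracker and repository actually report:

```python
def defects_per_kloc(defects, lines_of_code):
    """Defect density: defects per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def relative_change(before, after):
    """Fractional change from the baseline value (negative means a drop)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


# Invented figures: one period before AI adoption, one after.
before = defects_per_kloc(defects=18, lines_of_code=12_000)
after = defects_per_kloc(defects=12, lines_of_code=10_000)
change = relative_change(before, after)

print(f"before: {before:.2f} defects/KLOC")
print(f"after:  {after:.2f} defects/KLOC")
print(f"change: {change:+.0%}")
```

The same `relative_change` helper works for time to deliver or bug recurrence, as long as the metric was defined the same way in both periods.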
The ROI of AI-assisted Software Development report released recently (by the DORA people) is also useful for how to tackle this, as it explicitly includes the cost dimension.
Thanks for the reply, Gary, that really helps. I wasn’t familiar with DORA metrics before, but they make a lot of sense. It makes sense to measure whether AI is actually improving outcomes rather than just tracking “AI usage” for the sake of it.
I guess stakeholders will look at time saved in development and testing, because those are the easiest numbers to report. But I’d really like to shift the narrative toward quality improvements.
I really like your point about tracking the frequency of production fixes. It’s a simple, outcome‑focused way to show whether quality is improving over time.
I’m not sure how we can do this. If we want to compare AI‑assisted work with non‑AI‑assisted work, we can only do that by looking at historical baselines. Without that, it’s impossible to say whether AI is genuinely improving things or just changing the shape of the work.
Thank you for your reply. I’m building up a really useful list of metrics I hadn’t considered before. I’ve actually been recording the time from dev handoff to QA for the last six months, so it’ll be interesting to see whether that improves as we start introducing AI into the process over the next six months.
I haven’t been tracking many of the other metrics you mentioned, and that reinforces something I’m starting to realise: if we want to compare AI‑assisted vs non‑AI‑assisted work, the only fair way is to measure against our historical baselines. Otherwise we’re just guessing.
That said, it might also make sense to start capturing these metrics now and then see how future AI improvements shift them over time.
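For the handoff time I already have, the comparison could look something like the sketch below. The hours are invented, and I’m assuming the median is a reasonable summary because a few slow tickets can skew the mean:

```python
from statistics import median

# Invented samples: hours from dev handoff to QA feedback, one value per change.
baseline_hours = [30, 26, 41, 28, 35, 33]      # six months before AI
ai_period_hours = [22, 27, 19, 30, 24, 25]     # first months with AI in the process


def compare_to_baseline(baseline, current):
    """Return the median of each period and the difference (current - baseline)."""
    base_median = median(baseline)
    current_median = median(current)
    return {
        "baseline_median": base_median,
        "current_median": current_median,
        "difference": current_median - base_median,
    }


result = compare_to_baseline(baseline_hours, ai_period_hours)
print(result)
```

A shift in the median only says the number moved, not why, so it is worth noting anything else that changed in the same period (team size, release scope) alongside the figures.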
It’s a good point. It’s difficult to find out anything meaningful about what AI is giving you without the pre-AI data.
Interesting, thank you for that, I’ll take a look at that report.