I’ve done all sorts of test automation: integration, unit, and end-to-end. At one point or another I have generated code coverage reports for all of them.
With unit tests it is often quite easy: one switch can give you detailed reports and a coverage percentage. That percentage often functions as some sort of target. You’re not supposed to hit 100%; you’re supposed to hit 80%. Why 80? Finger in the air.
On teams where we did this, somebody often had the bright idea of putting in a quality gate: if your pull request slid below the magic “80%” mark, it would fail until you checked in a test. This is not uncommon, I think.
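A minimal sketch of such a gate, assuming a Python project using coverage.py (the threshold and project layout are illustrative; most coverage tools have an equivalent knob):

```toml
# pyproject.toml -- hypothetical project; fails CI if coverage dips below 80%
[tool.coverage.report]
fail_under = 80
```

With the pytest-cov plugin the same gate can be expressed on the command line as `pytest --cov=myapp --cov-fail-under=80` (where `myapp` is a placeholder package name).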
This had a curious effect. The devs who had just completed their task would start writing two new types of test:
1. A test that does the bare minimum to run the new code without asserting anything.
2. A test that asserts that the result of the calculation is 13443234.821. Why 13443234.821? It’s the number that came out of the code when it was run with a bunch of arbitrary numbers. Is it correct? No clue. Was the code coverage threshold met? Absolutely. Was it better than no test at all? Eh, probably not.
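In pytest terms, those two types might look something like this (the function, names, and numbers are all made up for illustration):

```python
# Hypothetical code sitting under a coverage gate; names and values
# are illustrative, not from any real code base.

def blended_rate(principal, rate, fee):
    """Some business calculation nobody fully remembers."""
    return principal * rate + fee / 12


# Type 1: runs the code, asserts nothing. Every line is "covered";
# correctness is never checked.
def test_blended_rate_runs():
    blended_rate(250_000, 0.0625, 1_200)


# Type 2: asserts whatever number fell out the first time the code
# was run. It pins the current behaviour, right or wrong.
def test_blended_rate_value():
    assert blended_rate(250_000, 0.0625, 1_200) == 15_725.0
```

Both tests satisfy the gate; neither tells you whether the calculation is correct.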
The % number also didn’t really tell us anything about how fragile the code base was. Worse, once used as a target, its value as a measurement sank even lower, thanks to Goodhart’s law and these “mimetic” tests.
In another situation I was writing and running a mountain of end-to-end tests on an app that was riddled with technical debt. Somebody had the idea of generating code coverage reports for it.
I thought that this was not a bad idea because it could tell us which areas were missing test coverage and hence where the bugs might be. We could then use it to write test scenarios.
The results were interesting. It told us that about 50% of the code base was not covered. When I looked at the incoming bugs and the bugs that were recently fixed, though, and which areas of the code base they were in, I discovered something odd. About 90% were in the half of the code that was already covered by tests.
These were, incidentally, tests that were doing really well. They caught tons of bugs.
This was, as far as I could tell, down to a Pareto rule of code importance: the criticality of the code wasn’t evenly distributed. 10% of the code base needed an incredibly high density of tests to prevent bugs slipping through; even 100% coverage was not enough there. Meanwhile, 50% of the code base apparently didn’t necessarily need even one test.
Back to code coverage.
What do we measure it for?
Is it to drive good practices? Because the behaviour I’ve seen it drive is all bad.
Is it to help find bugs? Because from what I can see, code coverage reports can give a very misleading view of where they are.
Or (and I really hope this isn’t the answer) is it so we can give upper management a number they can put in a spreadsheet to measure our “performance”?