New here 👋 Question about flaky tests and trust

Hi everyone 👋
I’m Raja, and I’ve been working in test automation and quality engineering for the past 18 years.

One topic I keep running into — across teams and projects — is flaky tests.

Not just the technical pain, but the impact on teams:

  • CI failures that no one trusts

  • Tests getting re-run instead of investigated

  • Automation slowly becoming background noise

I used to think flakiness was just “part of the job” as systems scale. Lately, I’ve been questioning that.

👉 At what point do flaky tests become more harmful than helpful?
How do you decide what to fix, refactor, or delete?

Looking forward to learning from all of you here!


For me the tipping point is trust. The moment engineers start auto-rerunning jobs without looking, the suite has already lost value. At that stage, flaky failures are not just noise; they actively hide real regressions because people mentally discount red builds. That is more harmful than having fewer tests.

How we handled it was by forcing explicit classification. Every flaky test had to be tagged within a sprint as fix, quarantine, or delete. No indefinite limbo. If a test covered a high risk flow, we invested in stabilizing data, environments, and waits. If it was low signal, we removed it. Coverage that no one trusts is not coverage.
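The "no indefinite limbo" rule above can be enforced mechanically. Here is a minimal sketch of that idea in Python; the registry structure, test names, and dates are all illustrative assumptions, not from the original post:

```python
from datetime import date

# Illustrative flaky-test registry: every known-flaky test must carry an
# explicit decision ("fix", "quarantine", or "delete") plus a deadline,
# so nothing sits unclassified indefinitely.
FLAKY_REGISTRY = {
    "test_checkout_flow":  {"decision": "fix",        "deadline": date(2024, 6, 1)},
    "test_legacy_report":  {"decision": "delete",     "deadline": date(2024, 5, 20)},
    "test_search_filters": {"decision": "quarantine", "deadline": date(2024, 5, 25)},
}

VALID_DECISIONS = {"fix", "quarantine", "delete"}

def audit_registry(registry, today):
    """Return names of entries with an invalid decision or a passed deadline."""
    overdue = []
    for name, entry in registry.items():
        if entry["decision"] not in VALID_DECISIONS or entry["deadline"] < today:
            overdue.append(name)
    return sorted(overdue)
```

Running `audit_registry` as a CI gate would fail the build whenever a flaky test has slipped past its sprint deadline, which is one way to make the classification stick.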

One practice that helped culturally was logging flaky behavior alongside release risk, not just in CI dashboards. We tracked it in Tuskr so product and engineering could see which critical paths had weak automation backing them. That visibility made it easier to justify stabilization work instead of endlessly rerunning builds.

Flakiness scales unless you treat it like production debt. The suites that stay healthy are the ones where instability is surfaced, owned, and pruned continuously rather than tolerated.


Welcome Raja, and thanks for your question.

In my view, flaky tests are harmful by definition, for several reasons (some of which you also refer to):

  • Few people enjoy investigating failures when there is a good chance the issue is in (the coding of) the test rather than in the SUT. After doing it for a while, it becomes very annoying. This can take a good deal of pleasure out of the job for several team members. This is bad.
  • As more flaky tests accumulate, team members (unconsciously?) start protecting their sanity and start ignoring red test runs. This defeats the purpose of having an automated regression test. Also bad.
  • Re-runs can hide bugs in the SUT. Perhaps even worse.

In my experience, flaky tests are usually the result of poor craftsmanship. If it happens once, fixing it can help people learn to prevent it from happening again. Mistakes are a normal way of learning, no problem. But if they are not fixed, or the lesson is not learned and they keep popping up, it quickly becomes frustrating.

So: unless you are in the very unusual situation where flakiness is impossible to prevent, flaky tests signify a poor job and should not be allowed. Let the person who made them fix them. Mentor the person on the technical side of things, but especially foster a culture where committing flaky tests is not okay. Deletion should be a last resort: if a test is valuable when it works, deleting it because someone lacks the skill or inclination to do a better job means losing that value (and more flaky tests will follow …).

Does this correspond with your thinking?


Really appreciate the depth you brought here. You're absolutely right: once teams start auto-rerunning jobs, trust is already gone. I like your point about explicit classification; flakiness only improves when it's treated as real debt, not background noise.

Thanks for the thoughtful response. I agree that flakiness is harmful by definition and that ignoring it only makes things worse. Your focus on accountability and learning resonates with me; the teams that treat flaky tests as real debt tend to build much healthier automation suites.


I once worked at a large company with a TCMS and all the management trappings, and we used to let flaky tests remain until they reached a problem level; most people just added auto-retry code. Like @martingijsen, I found simply auto-retrying to be a really bad move, and I worked to stop people subconsciously reaching for it as a stop-gap. I often failed, because all people would do is make a change to "fix" the flakiness, run the test literally 50 times, and commit the "fix" if it passed, without, as Martin puts it, always understanding the cause of the flakiness.
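A quick bit of arithmetic shows why 50 green reruns is weak evidence that a flaky test is actually fixed. This is a minimal sketch; the failure rates are illustrative assumptions:

```python
# A test that still fails 1% of the time will pass 50 consecutive runs
# roughly 60% of the time, so a green streak proves very little.

def pass_streak_probability(failure_rate: float, runs: int) -> float:
    """Probability that a test with the given per-run failure rate
    passes `runs` consecutive executions (assuming independent runs)."""
    return (1.0 - failure_rate) ** runs

for rate in (0.01, 0.05, 0.10):
    print(f"failure rate {rate:.0%}: "
          f"P(50 green runs) = {pass_streak_probability(rate, 50):.2f}")
```

In other words, a 50-run streak only meaningfully rules out failure rates well above a few percent; a rarely-flaky test sails through it more often than not.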

As Jose points out, flaky tests are there because they are testing something we want to know more about and build trust around. My solution has often been to completely rewrite a flaky test using a different path, tool, or strategy, then show the team and ask whether my new test still tests the same thing. If it does, I can delete the old code and commit mine. Very often you can get agreement on changing what a test does, subtly or even drastically, as long as it still improves your knowledge of and confidence in product behaviour, at low cost to the team.