AI systems are inherently non-deterministic, which makes me wonder how we should think about “flaky tests” in this context.
In traditional automation, a flaky test is usually caused by things like:
timing issues
unstable environments
race conditions
But with LLM-based systems, the same prompt can legitimately produce different responses each time. So where do we draw the line?
A few questions I’ve been thinking about:
At what point does response variation become a bug rather than expected behaviour?
Should AI systems be tested with semantic similarity instead of exact assertions?
How do teams prevent AI test suites from becoming unreliable over time?
Are people introducing confidence thresholds or evaluation datasets to handle this?
Curious how others are approaching this in real systems.
How are you defining and dealing with “flaky tests” when testing AI-powered features?
Firstly I think it depends on the AI system. AI is a sprawling mess of differing technologies, with different scopes and outputs and specialisations.
I think that “flaky test” is often a short-hand for “there’s a problem but I don’t want to understand it”. A “flaky test” is the result of an experiment and can be treated as such, if we choose to. We choose not to because it’s frustrating and expensive, but that’s the problem with black boxing your work into a big tool.
So I don’t really define flaky tests at all. I think it’s a nice way to blame the tool, and a way to shift the goal to “passing tests” rather than getting important information into the hands of the people who matter. AI simply compounds the existing problem.
The testability of AI systems is absolutely dire. The lack of observability, the algorithmic complexity, the lack of transparency - it’s a nightmare. I don’t know what a sensible coded check would look like when testing stochastic results. If we are looking at chatbot-like output where there’s no way to predict the response, then traditional automatic checking tools (automation) would not be the direction I’d go in. At some point that will take the opinion of a human, just to say whether it looks right or not. You can check technical elements of a response, like correct spelling, vaguely okay grammar and minimum/maximum response sizes, but you can’t say if it’s correct. This speaks to your question about semantic similarities, but doesn’t really fulfil it in a satisfying way, I don’t think.
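To make that concrete, the sketch below is roughly the ceiling of what I’d expect a coded check to assert against a stochastic response - purely technical properties, nothing about correctness. It’s only an illustration; get_chatbot_response is a stand-in for whatever actually produces the output in your system.

```python
# Rough sketch of the "technical elements" idea: deterministic property checks
# on a non-deterministic response. get_chatbot_response() is a placeholder for
# whatever produces the output in your system.
def check_response_properties(response: str,
                              min_length: int = 20,
                              max_length: int = 2000) -> list[str]:
    """Return a list of property violations; an empty list means the checks passed."""
    problems = []
    if not (min_length <= len(response) <= max_length):
        problems.append(f"length {len(response)} outside [{min_length}, {max_length}]")
    if response != response.strip():
        problems.append("leading/trailing whitespace")
    if "TODO" in response or "lorem ipsum" in response.lower():
        problems.append("placeholder text leaked into output")
    return problems


response = get_chatbot_response("What time does the restaurant open?")
assert check_response_properties(response) == [], "technical checks failed"
# Note: passing these checks says nothing about whether the answer is *correct*.
```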
Finding what to automate often means looking at what will change in a system, and at what is predictable and stable. Generative AI output is neither of those things. It will, of course, depend on the system and our expectations. But if you trace a flaky test back and all you find is “AI was ere” with no further explanation… well, yes. It does come with risks.
That’s a really interesting perspective, especially the point about flaky tests sometimes being a shortcut for “there’s a problem but we don’t fully understand it yet.”
Your comment about observability in AI systems resonated with me. In traditional systems we can usually trace failures through logs, metrics, or deterministic outputs, but with AI the reasoning path is often hidden inside the model.
It also raises an interesting challenge around automation. If outputs are stochastic, then strict assertions don’t make much sense anymore. That’s where I’ve seen teams experimenting with things like:
evaluation datasets
semantic similarity scoring
rubric-based evaluations
human review loops
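As a rough illustration of the semantic similarity idea in that list, here’s the kind of check I’ve seen sketched out. This assumes the sentence-transformers library is available, and the 0.75 threshold is an arbitrary number that would need calibrating per use case:

```python
# Minimal sketch of semantic-similarity scoring instead of an exact assertion.
# Assumes the sentence-transformers library; the threshold is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantically_similar(actual: str, expected: str, threshold: float = 0.75) -> bool:
    embeddings = model.encode([actual, expected], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

# Wording differs from the reference answer, but the meaning is close.
ok = semantically_similar(
    "We're open from 9am until 5pm on weekdays.",
    "Opening hours are 09:00-17:00, Monday to Friday.",
)
print(ok)  # expected to pass for paraphrases like these, but the threshold
           # is exactly the part that needs tuning against real data
```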
But even those approaches seem to introduce their own uncertainties.
I’m curious about your point regarding human judgement. Do you think effective testing of AI systems will always require a human-in-the-loop, or do you see a path where automated evaluation frameworks could become reliable enough for CI/CD pipelines?
I tend to agree with Chris: if it’s flaky, it’s either a variable you are not fully on top of or something you have decided to accept. Often it’s the tool itself that has the issues, and it’s normal to reject such tools as not worth the hassle.
Determinism has been slightly abused in testing; human-centric apps have never been fully deterministic, as that’s not in users’ nature, and we use guardrails to give an impression of determinism.
I’m just having some discussions on the GenUI SDK for Flutter, which is new to me this week, but my take is that it can generate the UI in real time based on user actions, preferences, etc., using a catalogue of widgets. So here your coverage could be a focused subset with a level of probabilistic outcomes.
The other option, which we bounced around today, is having the automated coverage run LLMs in real time: rather than a check that something happens, a second reviewing LLM reviews the output and uses a probabilistic take on things to gauge the level of correctness. It’s something I suspect needs more research, and there may even be early tools that do this.
Example: “hey user, it’s lunch time, are you hungry” - “yes”. Now if it brings up a widget selling t-shirts that say “have a good lunch”, the reviewing LLM should pick up on that being a bit off. If it brings up the lunch offer of the day and some widgets showing nearby restaurants, the review will likely flag that as a reasonable response.
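Very roughly, I imagine the reviewing LLM working something like the sketch below. call_reviewing_llm is just a placeholder for whichever judge model/API you’d actually use, and the rubric and verdict format are made up for illustration:

```python
# Hedged sketch of the "reviewing LLM" idea from the lunch example above.
# call_reviewing_llm() is a placeholder for the judge model; the rubric and
# JSON verdict format are illustrative, not a standard.
import json

RUBRIC = """You are reviewing a generated UI response.
Conversation so far: {conversation}
Generated widgets: {widgets}
Question: is the response contextually appropriate for the conversation?
Answer as JSON: {{"appropriate": true/false, "reason": "..."}}"""

def review_response(conversation: str, widgets: str) -> dict:
    verdict = call_reviewing_llm(
        RUBRIC.format(conversation=conversation, widgets=widgets)
    )
    return json.loads(verdict)

result = review_response(
    conversation='Assistant: "Hey, it\'s lunch time, are you hungry?" User: "yes"',
    widgets="t-shirt promotion: 'have a good lunch'",
)
# A well-calibrated judge should flag this as inappropriate; nearby-restaurant
# widgets for the same conversation should be accepted instead.
assert result["appropriate"] is False
```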
The guardrails likely become more important, along with some level of product-specific training. So here we may also need to train the tool that is doing the automated coverage, with more humans at the helm to get it to be consistent.
I suspect, though, it is likely to be LLMs testing LLMs. Part of our role may switch to training those reviewing LLMs.
To be clear, this comment comes after only a short lunchtime discussion, so it could be way off. I can see some sense in it, alongside a load of unknowns and therefore risk at this point.
The LLM-testing-LLM idea maps nicely to what Anthropic calls “model-based graders” in their agent eval framework (https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents): essentially a rubric-guided LLM judge assessing whether the response was reasonable. Andrew’s t-shirt vs restaurant example is actually a perfect rubric case.
Here are a few key ideas from their recent article that are worth including:
They distinguish pass@k (at least one success in k attempts) vs pass^k (all k attempts succeed). A task passing 90% of the time is fundamentally different from one passing 50% of the time, and the product requirement decides which metric matters - not “is it flaky or not”. This is about how to think about non-determinism.
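As a quick worked example of that distinction (assuming, as a simplification, that each attempt succeeds independently with probability p):

```python
# Worked example of pass@k vs pass^k under an independence assumption.
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k attempts succeed."""
    return p ** k

for p in (0.9, 0.5):
    print(f"p={p}: pass@3={pass_at_k(p, 3):.2f}, pass^3={pass_hat_k(p, 3):.2f}")
# p=0.9: pass@3 ≈ 1.00, pass^3 ≈ 0.73
# p=0.5: pass@3 ≈ 0.88, pass^3 ≈ 0.12
```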
Not all evals serve the same purpose. Capability evals start low and improve as the agent gets better. Regression evals should stay near 100% (though we in QA know that a 100% test suite pass rate is a challenge) - any drop signals something broke. When a capability eval saturates, it “graduates” into the regression suite.
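A tiny sketch of how that graduation rule could look in practice - the thresholds here are my own invention, not taken from the article:

```python
# Illustrative split between capability and regression evals based on
# recent pass-rate history; numbers are assumptions for the sketch.
def classify_eval(pass_rate_history: list[float],
                  graduation_threshold: float = 0.95) -> str:
    """A capability eval 'graduates' to the regression suite once it saturates."""
    recent = pass_rate_history[-3:]
    if all(rate >= graduation_threshold for rate in recent):
        return "regression"   # should stay near 100%; any drop is a signal
    return "capability"       # expected to start low and climb

print(classify_eval([0.40, 0.62, 0.81]))   # capability
print(classify_eval([0.96, 0.97, 0.98]))   # regression
```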
They emphasize reading the full trajectory of a run (the trace of all interactions) - every tool call, intermediate result, reasoning step - not just the final outcome. Often what looks like a flaky failure is a broken task spec or a misconfigured grader, not a broken model.
The human-in-the-loop role Andrew described - training the reviewing LLM - aligns with what they call calibrating model-based graders against human judgment. That calibration work is ongoing and iterative, not a one-time setup.
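One simple way to picture that calibration loop is tracking how often the grader agrees with a human label on the same cases - the 0.9 target below is my own assumption, not something from the article:

```python
# Simple agreement check between human labels and grader verdicts,
# as one possible calibration signal for a model-based grader.
def grader_agreement(human_labels: list[bool], grader_labels: list[bool]) -> float:
    matches = sum(h == g for h, g in zip(human_labels, grader_labels))
    return matches / len(human_labels)

humans = [True, True, False, True, False]
grader = [True, True, True, True, False]
print(grader_agreement(humans, grader))  # 0.8 - below a 0.9 target, so the
                                         # rubric/prompt needs another iteration
```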
Anthropic’s article also made me think about the bigger picture: QA in the era of AI evals might actually gain new opportunities. Building good E2E tests has always required a QA mindset - developers rarely create user-facing scenarios well. “Model-based graders” feel like the AI-native equivalent of that skill. If they become an industry standard, QAs are well positioned to own them.
That’s an interesting direction, especially the idea of using one LLM to evaluate the behaviour of another.
I’ve seen a few teams experimenting with something similar where instead of asserting exact outputs, they use an evaluation model to score responses against certain criteria like:
• relevance to the prompt
• factual consistency
• safety or policy compliance
• alignment with expected user intent
Your example with the lunch recommendation is a good illustration of this. The response doesn’t have to be deterministic, but it should still be contextually appropriate.
The challenge I keep wondering about is how much trust we can place in the reviewing LLM itself. If one probabilistic system is evaluating another, we may still need strong guardrails like:
curated evaluation datasets
rubric-based scoring prompts
periodic human review
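For the curated evaluation dataset piece, I picture something as simple as the sketch below - the field names are invented for illustration, there’s no standard schema implied:

```python
# Hedged sketch of a curated evaluation case; field names are hypothetical.
import json

EVAL_CASES = [
    {
        "prompt": "Hey, it's lunch time, are you hungry?",
        "user_reply": "yes",
        "rubric": "Response should relate to food/lunch, not unrelated products.",
        "example_verdict": "nearby restaurant widgets = appropriate",
    },
]

# Persist as JSONL so cases can be replayed periodically, scored by the
# rubric-based grader, and sampled for human review.
with open("eval_cases.jsonl", "w") as f:
    for case in EVAL_CASES:
        f.write(json.dumps(case) + "\n")
```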
Your point about testers potentially “training the reviewing LLMs” is interesting as well. That almost shifts part of the testing role toward defining evaluation criteria and feedback loops rather than writing traditional assertions.
Do you think this approach could realistically fit into CI/CD pipelines, or would it remain more of an exploratory or offline evaluation process?
That’s a great reference, thanks for sharing the article. The distinction between pass@k and pass^k is particularly interesting because it reframes the whole “flaky test” conversation.
In traditional testing we often assume determinism, so a test either passes or fails. But with probabilistic systems the question becomes more about acceptable reliability thresholds rather than binary correctness.
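As a rough sketch of what an acceptable reliability threshold could look like as a CI gate (run_case is a placeholder for whatever executes one attempt, and the numbers are illustrative):

```python
# Gate a pipeline on an observed pass rate rather than a single binary result.
# run_case() is a placeholder returning True/False for one attempt.
def reliability_gate(case_id: str, k: int = 10, threshold: float = 0.8) -> None:
    passes = sum(run_case(case_id) for _ in range(k))
    pass_rate = passes / k
    assert pass_rate >= threshold, (
        f"{case_id}: pass rate {pass_rate:.0%} below threshold {threshold:.0%}"
    )

reliability_gate("lunch-recommendation-appropriate", k=10, threshold=0.8)
```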
The capability vs regression eval idea also resonates. It almost feels like an evolution of how we already think about test suites:
capability evals → exploratory or evolving coverage
regression evals → stable guardrails that should rarely fail
The point about reading the full trajectory is also important. In many AI systems the failure isn’t the final answer but something earlier in the chain, e.g. a wrong retrieval, a tool misuse, or an incorrect intermediate reasoning step.
Your observation about QA potentially owning model-based graders is interesting as well. Designing good evaluation criteria feels very similar to designing good test oracles, something testers already spend a lot of time thinking about.
I’m curious how others see this evolving. Do you think “model-based graders” will eventually become a standard part of CI pipelines, or will they remain more of an offline evaluation layer for AI systems?