Are you noticing AI bias between “create tests” and “find bugs” goals?

I have been doing a few learning experiments with Playwright, and I thought I’d share something that stood out, as it could be of interest and value to others.

Early experiments were the basic “plan and generate tests” approach. This often returned a suite of fully passing tests, raising the question: are the tests of any value?

When the planning agent is given different instructions on the same site, though (explore and focus on finding bugs rather than on creating tests), it would tend to find a number of bugs.

When asked to give quality scores, the first approach often returned 10/10 and the latter, in some cases, zero.

Are they not the same goal? Is the creation of tests not about finding bugs? Has AI picked up this bias from two different testing models, testing to verify versus testing to learn? Lots of questions, but maybe too deep a rabbit hole for here.

My next experiment was building an accessibility agent, using the same tools, that combined test creation with a focus on finding problems, producing a report with a score against WCAG 2.2. The same pattern was evident: all the created tests passed, but it also found things that could be a problem for almost every WCAG 2.2 criterion. Scores went from 10 to 0 again depending on how it was viewed. I will not go into the iterations of that experiment here, but it was about asking those earlier questions.
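To give a flavour, the generated checks boiled down to something like this minimal sketch (assuming @playwright/test with @axe-core/playwright as stand-ins; the URL is a placeholder and the agent’s actual output varied):

```typescript
// Minimal sketch of a generated-style accessibility check. Assumes
// @playwright/test and @axe-core/playwright; the URL is a placeholder.
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('page has no detectable WCAG 2.2 AA violations', async ({ page }) => {
  await page.goto('https://example.com'); // placeholder

  // Scan the rendered page against WCAG 2.2 AA rules.
  const results = await new AxeBuilder({ page })
    .withTags(['wcag22aa'])
    .analyze();

  // The "create tests" framing stops here: assert, pass, score 10/10.
  expect(results.violations).toEqual([]);

  // The "find problems" framing would instead report results.violations
  // and results.incomplete for human review, which is where the 0/10
  // scores came from.
});
```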

It also turned out, when I went through the bug findings with a developer, that a lot of the issues were false positives in our context, or so minor that had the agent generated tickets directly it would have put an unnecessary burden on developers’ time. When the goal is finding bugs, where is the “value to some person” balance?

I do not have clear conclusions on this yet, and I continue to experiment with that “value to some stakeholder” angle, but I found this initial pattern interesting and felt others may also see some value in it.

If anyone else is seeing this pattern, I would appreciate your thoughts and views.

Note that for discovery I am not so keen on just letting an agent run, but I continue to experiment with AI in this area. A recent switch to Goose and playwright-cli initially seems promising, but in a much more interactive, guided, experiment-by-experiment way that likely counters the false-positive problem quite a bit.

I’m seeing a similar pattern.

I don’t think “create tests” and “find bugs” lead the model to the same goal in practice. When asked to create tests, the model often seems to optimise for executable, stable, passing checks. When asked to find bugs, it optimises for suspicion and observations. Both can be useful, but both can also create false confidence in different directions.

The part I’m most interested in is a third mode: collect evidence from a real user flow, then let a person decide what matters. For example, a UI flow may still pass while the browser-side API behaviour underneath it changes. That is not always a bug, but it is useful review material because it shows where the system drifted.
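To make that concrete, here is a rough sketch of what I mean by collecting that evidence, assuming plain Playwright (the URL, selector, and /api/ filter are placeholders):

```typescript
// Sketch: observe browser-side API behaviour while driving a UI flow.
// Plain Playwright; URL, selector, and the /api/ filter are placeholders.
import { chromium } from 'playwright';

async function observeFlow() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Record every API response seen while the flow runs.
  const observed: { url: string; status: number; body: unknown }[] = [];
  page.on('response', async (response) => {
    if (response.url().includes('/api/')) {
      observed.push({
        url: response.url(),
        status: response.status(),
        body: await response.json().catch(() => null), // ignore non-JSON bodies
      });
    }
  });

  // The UI flow itself may still pass while this traffic drifts.
  await page.goto('https://example.com');
  await page.click('#submit-order');

  await browser.close();
  return observed; // review material for a person, not a verdict
}
```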

So for me the useful output is less “AI generated these tests and they passed” or “AI found these bugs”, and more the kind of record sketched after this list:

  • what behaviour was observed
  • where in the flow it happened
  • what changed from baseline to target
  • what confidence or uncertainty the tool has
  • what a human should review next
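In code terms, that record might look something like this (the field names are my own invention for discussion, not any existing tool’s schema):

```typescript
// Hypothetical shape for one piece of review evidence.
interface EvidenceRecord {
  behaviour: string;                      // what was observed, in plain language
  flowStep: string;                       // where in the user flow it happened
  baseline: unknown;                      // what the baseline version did
  target: unknown;                        // what the version under test did
  confidence: 'high' | 'medium' | 'low';  // the tool's own certainty
  suggestedReview: string;                // what a human should look at next
}
```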

That also helps with the false-positive / low-value ticket problem you mentioned. The agent should not directly create work for developers unless the evidence is strong enough and the stakeholder value is clear.

It might be worth you having a look at the playwright-cli and Goose combo. I only started looking at it yesterday, but I made these notes for discussion.

"A bit like command line or even voice activated testing, open website, navigate to form page, generate pairwise data and run experiments to identify potential failure situations, keep a trace of the experiments at traffic level, analyse that traffic for anomolies.

  2. Also open website, monitor my experiments, track and flag any anomalies that I may miss as I test.

Keeps the tester very close to the testing, but with machine strengths added in. Would take a bit of getting used to, but maybe a step closer to that sort of Burp Suite user model for testing: tester guides and machine executes."
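For the “generate pairwise data” part of those notes, this greedy sketch shows the idea; the parameter names and values would come from the form under test, and dedicated tools (PICT, for example) do this properly:

```typescript
// Sketch: greedy pairwise (all-pairs) test data generation for form inputs.
// Illustrative only; values are assumed not to contain '=' or '|'.
type Case = Record<string, string>;

// Canonical key for a value pair, so lookups are order-independent.
const key = (n1: string, v1: string, n2: string, v2: string) =>
  n1 < n2 ? `${n1}=${v1}|${n2}=${v2}` : `${n2}=${v2}|${n1}=${v1}`;

function pairwiseCases(params: Record<string, string[]>): Case[] {
  const names = Object.keys(params);

  // Enumerate every value pair that needs covering.
  const uncovered = new Set<string>();
  for (let i = 0; i < names.length; i++)
    for (let j = i + 1; j < names.length; j++)
      for (const a of params[names[i]])
        for (const b of params[names[j]])
          uncovered.add(key(names[i], a, names[j], b));

  const cases: Case[] = [];
  while (uncovered.size > 0) {
    // Seed a case from one still-uncovered pair, then fill the remaining
    // parameters with whichever value covers the most new pairs.
    const c: Case = {};
    for (const part of uncovered.values().next().value!.split('|')) {
      const [n, v] = part.split('=');
      c[n] = v;
    }
    for (const name of names) {
      if (c[name] !== undefined) continue;
      let best = params[name][0];
      let bestScore = -1;
      for (const v of params[name]) {
        let score = 0;
        for (const [other, w] of Object.entries(c))
          if (uncovered.has(key(name, v, other, w))) score++;
        if (score > bestScore) { bestScore = score; best = v; }
      }
      c[name] = best;
    }
    // Retire every pair this case covers (always at least the seed pair).
    for (let i = 0; i < names.length; i++)
      for (let j = i + 1; j < names.length; j++)
        uncovered.delete(key(names[i], c[names[i]], names[j], c[names[j]]));
    cases.push(c);
  }
  return cases;
}

// e.g. pairwiseCases({ browser: ['chrome', 'firefox'], role: ['admin', 'guest'],
//   locale: ['en', 'fr'] }) covers all 12 value pairs in 4 cases, not 8 combinations.
```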

The second point in those notes, I think, came from one of our earlier discussions where we were aligned. I have not done experiments with that part yet, but it says it has the capability, provided I use the agent to open the website so it has the connection directly and I then continue testing from there.
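For the traffic-level trace part, plain Playwright can already do the recording side, something like this sketch (the paths and URL are placeholders, and this is separate from whatever playwright-cli itself adds):

```typescript
// Sketch: record all traffic from a guided session as a HAR file for later
// anomaly analysis. Plain Playwright; paths and URL are placeholders.
import { chromium } from 'playwright';

async function tracedSession() {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    recordHar: { path: 'session.har' }, // full request/response trace
  });
  const page = await context.newPage();

  await page.goto('https://example.com'); // then test interactively from here

  // Closing the context flushes the HAR, which can then be analysed
  // for anomalies: status spikes, payload drift, unexpected endpoints.
  await context.close();
  await browser.close();
}
```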

Really like the Burp Suite analogy: AI as the intermediary that monitors, intercepts, and flags. That mental model clicks.

Goose as an orchestration layer makes a lot of sense to me. It’s essentially an AI workbench for the entire testing workflow — the tester provides intent, the agent handles execution and observation. That’s a powerful paradigm for exploratory testing.

Coming at this from a developer’s perspective, I’ve been focused on a more specific problem: UI-driven API regression testing. The challenge I keep hitting is determinism — when a UI flow matters enough to become a regression check, I want the execution to be repeatable and the results to be structurally comparable across versions, not dependent on an LLM producing consistent behavior each run.

So the tool I’ve been building sits at that layer: capture API behavior triggered by UI flows, diff it across versions, filter out noise (timestamps, IDs), and surface real structural changes. It’s less about orchestrating the test and more about producing reliable evidence of what changed.
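A stripped-down sketch of that core idea (illustrative only, not the tool’s actual code; the noise-field list is just an example):

```typescript
// Normalize captured API responses by masking volatile fields, then compare
// the structural remainder across versions. Illustrative sketch only.
const NOISE_KEYS = new Set(['id', 'createdAt', 'updatedAt', 'timestamp']); // example noise fields

function normalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalize);
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const k of Object.keys(obj).sort()) { // sorted: key order never counts as a change
      out[k] = NOISE_KEYS.has(k) ? '<masked>' : normalize(obj[k]);
    }
    return out;
  }
  return value;
}

// False only when the denoised structures differ, i.e. a real, reviewable
// change between baseline and target rather than timestamp/ID churn.
function structurallyEqual(baseline: unknown, target: unknown): boolean {
  return JSON.stringify(normalize(baseline)) === JSON.stringify(normalize(target));
}
```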

In the bigger picture you’re describing — AI-driven testing with human guidance — this kind of tool would slot in as one of the specialized instruments the orchestration layer calls on. The AI workbench decides what to test; the regression tool makes sure the results are precise and auditable.

The bias question from your original post is interesting in this context. An AI orchestrator might be biased toward happy paths, but a regression instrument downstream doesn’t care — it just faithfully diffs whatever flow it’s pointed at.