Playwright introduced some test agents too, but in my experience, if you ditch Playwright’s agents and Copilot too, Cursor on its own performs better. Cursor also has a built-in planner mode.
We’re looking into Atlassian’s agent, Rovo. The UI is pretty good for setting context, and when prompted it will analyse Confluence pages as well, with pretty good results. I’ve used it in this way to generate regression test cases for user stories.
Mabl and Testomat.io… Also, there are different types of AI agents, such as simple reflex, model-based, goal-based, learning-based and utility-based agents. The problem with every agent I have seen so far is that these agents don’t understand your product; they only understand patterns.
From what I can see it is designed to be used with triggers in JIRA/Confluence. At the moment it is ‘manually’ triggered to generate test ideas.
More importantly, I’ve built an agent to do a quality check on new user stories against related items and existing documentation. It looks for gaps and inconsistencies and provides suggested actions for different personas to take. My aim is for this agent to be triggered automatically when a story reaches a certain status, but I’m still trying to get the rest of the team to buy into it.
Automation: I’ve had a look at the Playwright ones: planner, generator and healer. These are based on the dynamic behavior of the web app. It’s debatable whether planning and generating should be done from behavior rather than using an agent that generates these directly from code. Healer has a “try to please” bias and risks swapping false negatives for a smaller number of false positives, but it can be managed if used with care. There are no requirement oracles of truth at this point, but it’s evolving very fast.
Test cases: Agents generating test cases for manual execution is, I think, absurd, but these are likely to pop up. I’ll continue to avoid them for now, though there may be something from tinyideas that I will look at. I remain wary, as early hints suggest it has a bias towards a “humans testing to machine strength” premise.
GitHub Copilot agent mode. This is an interesting one for me. Under instruction it can likely do all of the above, maybe better. UI automation driven directly from code may offer more benefits than behavior-based routes. Building a POM framework, adding test IDs to your code, and restructuring both product and test code to optimise is interesting and would definitely fast-track initial automation coverage.
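To make that concrete, here’s a minimal sketch of the kind of page object plus test-ID setup an agent could be asked to scaffold. The page name, route and data-testid values are placeholders I’ve invented, not from any real product.

```ts
// Hypothetical page object: LoginPage, the /login route and the data-testid
// values are illustrative placeholders, not an actual product's code.
import { Page, expect } from '@playwright/test';

export class LoginPage {
  constructor(private readonly page: Page) {}

  async goto(): Promise<void> {
    await this.page.goto('/login');
  }

  async signIn(email: string, password: string): Promise<void> {
    // Stable data-testid hooks in the product code keep these locators
    // resilient when layout or copy changes.
    await this.page.getByTestId('email-input').fill(email);
    await this.page.getByTestId('password-input').fill(password);
    await this.page.getByTestId('sign-in-button').click();
  }

  async expectSignedIn(): Promise<void> {
    await expect(this.page.getByTestId('account-menu')).toBeVisible();
  }
}
```

A test then reads as intent rather than selectors: `await new LoginPage(page).signIn(user, pass)`.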
What interests me, though, is whether I can get it to be more of a discovery-focused agent. I used it to vibe code simple apps so I could test the Playwright agents, for example, so I suspect I can also vibe code investigative tools to explore risk as needed, generate suitable data, add logging to code where needed, and evaluate code for risks. For example, if I find a double-tap issue via the UI, can the agent identify that area of code, spot the root cause and fix it? Tester access to code will likely be the norm to leverage this technical buddy: a toolsmith and discovery-focused muse.
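As a hedged illustration of that double-tap example (the endpoint and variable names here are mine, not from any real codebase): the root cause is often a submit handler with no in-flight guard, and the fix an agent might propose looks something like this.

```ts
// Sketch only: a common root cause for a double-tap defect is a submit handler
// that can fire twice before the first async call resolves. The fix is an
// in-flight guard (or disabling the button while the request is pending).
let submitting = false;

async function onSubmit(payload: unknown): Promise<void> {
  if (submitting) return; // ignore the second tap while a request is in flight
  submitting = true;
  try {
    await fetch('/api/orders', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
  } finally {
    submitting = false; // re-enable once the request settles
  }
}
```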
The bit I need more time with is whether I could use an agent to run experiments, and what I would lose out on observation-wise by doing so, given that my 100 trillion synapses are still about 100 times what the best agents run. Take an exploratory agent, for example: in effect giving the agent test charters, risk lists, heuristics and oracles of truth the same way I would use them in an exploratory session. Whether it’s a sensible path to go down I do not know, but I do want to understand its capability to experiment and explore under guidance before making that call.
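For what it’s worth, here is a rough sketch of how I imagine briefing such an agent; the field names are my own guess at encoding charters, risks, heuristics and oracles, not any tool’s actual API.

```ts
// Hypothetical structure for briefing an exploratory agent; not a real API.
interface ExplorationCharter {
  mission: string;        // what to explore and why
  risks: string[];        // risk list to bias the session towards
  heuristics: string[];   // e.g. boundaries, interruptions, back-button replay
  oracles: string[];      // sources of truth to judge behavior against
  timeboxMinutes: number; // keep the session bounded, like a human charter
}

const checkoutCharter: ExplorationCharter = {
  mission: 'Explore the checkout flow for state-handling problems',
  risks: ['double submission', 'stale totals after coupon removal'],
  heuristics: ['interrupt mid-flow', 'vary input boundaries', 'replay via back button'],
  oracles: ['pricing rules in the spec', 'behavior of the previous release'],
  timeboxMinutes: 60,
};
```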
Even I am currently ignoring (avoiding) these for many reasons… the top being focus on “quality” over “quantity”.
These are interesting areas. I am going to explore them more. I explored a “generate suitable data” utility once.
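For the “generate suitable data” idea, a sketch of the sort of utility I mean (the fields, ranges and rules are invented for illustration):

```ts
// Illustrative data generator: deterministic so test runs are repeatable.
// The TestUser shape and value ranges are made up for the example.
interface TestUser {
  email: string;
  age: number;
  locale: string;
}

function generateTestUsers(count: number, seed = 42): TestUser[] {
  // Minimal linear congruential generator so the same seed gives the same data.
  let state = seed;
  const next = () => (state = (state * 48271) % 2147483647) / 2147483647;

  const locales = ['en-GB', 'de-DE', 'ja-JP'];
  return Array.from({ length: count }, (_, i) => ({
    email: `user${i}@example.test`,
    age: 18 + Math.floor(next() * 60),        // spread across adult age ranges
    locale: locales[Math.floor(next() * locales.length)],
  }));
}

console.log(generateTestUsers(3));
```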
I see agents at 4 levels:
Level 1: Rule-Based Systems (no LLM) - mostly pre-LLM era stuff
Level 2: Workflow Agents (LLM-fused)
Level 3: Semi-Autonomous Agents
Level 4: Autonomous Agents
Note: Not all testing problems need agents. Sometimes you just need an assistant.
Most teams need Level 2 and Level 3.
Level 4 looks shiny, but it bites unless your product and pipeline are rock solid.
Level 4 is powerful but unforgiving, like early self-driving cars.
Level 1 was always there, so if someone implements it using Level 2, 3 or 4 tools, then in my understanding it is over-engineering and throwing money out of the window.
I agree: Even I am currently ignoring (avoiding) these for many reasons… the top being focus on “quality” over “quantity”.
The company I work at seems to be going down this path, though. From what I have seen so far, the agents do a copy-paste and convert it into a test case with a bit of extra wording added. They seem to be going for quantity and speed.
We’re about to investigate QMetry’s agent, so we’ll see how that goes.
Interesting to read the views here. As a test management tool vendor (with AI capabilities introduced during 2025 in the areas of reports, dashboards and test generation), we’re always looking for feedback, ideas and relevant views/reviews on this. Keep it coming…
One of the biggest questions for these sorts of tools is “Are the tests any good?”
Will they discover a specific risk should that specific risk exist?
What is the depth of coverage? Is it, for example, closer to shallow coverage, which would lean it towards easy-to-find user issues, rather than the deep, hidden risk coverage that a pro tester would be aiming at?
A mutation agent working independently of the creation and healing agents would likely be a good hedge against that question of whether the tests are any good.
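To illustrate what that mutation check buys you (the function and threshold below are invented), the point is simply whether an injected fault survives the existing tests:

```ts
// Conceptual sketch: a mutation agent injects a deliberate fault and checks
// whether the existing tests notice. Function and threshold are illustrative.
export function isEligibleForDiscount(total: number): boolean {
  return total >= 100; // original business rule
}

// A mutant the agent might generate: ">=" changed to ">".
export function isEligibleForDiscountMutant(total: number): boolean {
  return total > 100;
}

// A suite that only checks totals of 50 and 200 passes against both versions,
// so the mutant survives and flags shallow coverage. Only a boundary check at
// exactly 100 tells them apart and "kills" the mutant.
console.log(isEligibleForDiscount(100));       // true
console.log(isEligibleForDiscountMutant(100)); // false -> a boundary test would catch this
```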
From a reporting perspective, I still prefer the old approach of reporting risks and the depth of coverage of those risks; this is still rare with vendor tools, apart from maybe the obvious, easy-to-find risks.
It may be correct that the tools have a strong bias towards known risks and leverage mechanical strength. I’m wary that when a tool attempts to mimic the strengths of biological brains, it inadvertently inherits the weaknesses, and we also lose out on so much learning at the same time. Generating a test plan in a couple of prompts, for example: what did we lose? That may have been a small team in discussion for a couple of hours, with loads of learning value, just lost. Exchanging that for speed is something I’m skeptical of.
Well, I would like to add our tool AutoExplore to the list.
It’s an autonomous software testing tool. We have just updated our website and are planning to launch it to the public in January 2026! https://www.autoexplore.ai/