After publishing my recent article on using AI as a partner in Quality Engineering, I realised many teams are moving beyond “one-off prompts” and starting to build more structured GPT agents for testing tasks.
I’m curious:
– Have you built or experimented with GPT agents for testing or QA work?
– What problems are they actually helping you solve (analysis, exploration, automation support, triage, etc.)?
– What didn’t work as expected?
I’d love to hear real examples, especially from teams using agents in day-to-day testing, not just experiments.
I tried the Playwright test agent for generating test plans, script generation, and self-healing; based on my experience it was okay. I tried it on a website with a mandatory login for accessing the feature, and initially the agent got a bit stuck on the login.
After multiple attempts it was able to access the website.
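One thing I might try next is handling the login outside the agent altogether: authenticate once in a Playwright setup project and save the storage state, so the generated tests start already logged in. A minimal sketch of that idea (the URL, field labels and env vars below are placeholders, not the real site):

```typescript
// auth.setup.ts – log in once and persist the session for later tests.
import { test as setup, expect } from '@playwright/test';

setup('authenticate', async ({ page }) => {
  await page.goto('https://example.com/login');                     // placeholder URL
  await page.getByLabel('Username').fill(process.env.APP_USER ?? '');
  await page.getByLabel('Password').fill(process.env.APP_PASS ?? '');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();

  // Save cookies/localStorage so other tests can reuse the logged-in session.
  await page.context().storageState({ path: 'playwright/.auth/user.json' });
});
```

With `storageState: 'playwright/.auth/user.json'` set in the config, the generated specs would never have to fight the login page at all.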
I didn’t try script generation for any complex features; as I was just exploring, I tried it with a simple one and it worked.
Self-healing was good and worked as expected.
However, like other GPT output, the test agent’s results need manual review before being used formally.
Hey, great topic—yes, teams are building structured GPT agents for QA beyond one-off prompts. We’ve used them daily for generating test cases from requirements (cuts time 50-70%, catches edge cases), triaging bugs/logs, suggesting exploratory ideas, and light automation (e.g., Playwright code snippets or self-healing locators). They excel at speed and coverage but struggle with hallucinations (need human review), lack deep domain context without RAG/fine-tuning, and flake on dynamic UIs or big app changes. Real wins in fintech/e-commerce for transaction flows, but they’re strong assistants—not replacements. What’s your experience so far?
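To illustrate the locator side: this is roughly the resilient, role-first style we nudge the agents toward, rather than brittle CSS/XPath chains. The page, names and test ids here are made up for illustration, not from a real suite.

```typescript
// Sketch of the locator style we ask agents to generate.
import { test, expect } from '@playwright/test';

test('submit order', async ({ page }) => {
  await page.goto('/checkout');

  // Match by accessible role/name, or by a stable test id if the visible label changes.
  const submit = page
    .getByRole('button', { name: 'Place order' })
    .or(page.getByTestId('checkout-submit'));

  await submit.click();
  await expect(page.getByText('Order confirmed')).toBeVisible();
});
```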
I have been using the Playwright agents; straightforward enough.
Should I build my own agent? I’m still questioning this; here is an experiment I was running today.
Just using prompts: “Create an accessibility risk plan only; remember to dynamically run the app while creating the plan so it has a behavioural aspect; include some scans and keyboard-only tests; I would like it to be possible for the plan to eventually lead to the creation of a small report on how the app meets WCAG 2.2 guidelines.”
I initially switched agents from the Playwright planner to the Playwright generator at this point to create tests, but then ran into a few things it did not have rights to do, so I switched to standard agent mode. With a few refining prompts I was able to get a starting report.
To be fair, the report was not bad, correctly identifying some issues and recommending corrective actions; notably, it was also aware of its limitations. Two small extracts are attached.
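For anyone curious what I meant by “scans and keyboard-only tests”, the shape I had in mind is roughly the following (a reconstruction for illustration, not the agent’s actual output; routes, test ids and tags are placeholders, and the axe tags depend on your axe-core version):

```typescript
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('axe scan against WCAG rules', async ({ page }) => {
  await page.goto('/');
  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa', 'wcag22aa'])
    .analyze();
  // Fail on any violation; the raw results are what feed the report.
  expect(results.violations).toEqual([]);
});

test('keyboard-only user can reach the primary action', async ({ page }) => {
  await page.goto('/');
  // Tab through the page; the main call-to-action should receive focus.
  for (let i = 0; i < 15; i++) {
    await page.keyboard.press('Tab');
    const focused = await page.evaluate(
      () => document.activeElement?.getAttribute('data-testid'),
    );
    if (focused === 'primary-cta') return;
  }
  throw new Error('Primary call-to-action was never reached via keyboard');
});
```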
I am reasonably comfortable with my level of involvement: I spotted a couple of false positives that a standalone agent without a reviewing agent may have missed, and I still need to refine that report and expand some of the risk coverage. There was really not much to it all. I need to think a bit more about whether it’s worth building a specific agent for creating this sort of report, and to review its value with others; it might save me a coffee break of effort, perhaps. Now, if I were to chain agents for different risks, I wonder what I would lose in the distance I’d have from the agent. Even knowing its limitations, could I get lazy, decide at some point that it’s sort of good enough, and let the testing get shallower?
The reporting part is where I can likely find use. It might be worth a test reporting agent in its own right that I could feed my day’s exploratory testing notes to.
I’ve noticed a lot of teams are exactly at this “observing” stage right now, especially once the conversation moves from prompts to agents. I’ll share some concrete examples from our side in the thread; maybe it’ll be useful when you decide to experiment.
This matches our experience quite closely, especially around authentication and the need for manual review. Login flows are still one of the weakest points unless the agent is very tightly guided or operating in a controlled environment.
We’ve seen the most value not in full script generation, but in:
test scenario drafting
exploratory coverage suggestions
self-healing or maintenance support
And yes, anything generated still needs human validation before being used formally. For us, agents work best as accelerators, not decision-makers.
We’re seeing the same spot: agents work best when they’re narrowly scoped and embedded into existing QA workflows, not when they try to be “general QA brains”.
For example, we use GPT and Jira-integrated agents for:
generating test cases from requirements
bug analysis and clustering
refinement preparation
a feature updates agent (a local thing, but I really love it)
And a separate agent for a weekly QA bug summary that posts directly to Slack; that one is already in day-to-day use.
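To make the Slack one concrete, the overall shape is roughly this (a simplified sketch, not our production code; the Jira JQL, env vars and the summarise() step are placeholders you’d swap for your own):

```typescript
// Weekly bug-summary agent sketch: pull recent bugs from Jira, summarise, post to Slack.
const JIRA_URL = process.env.JIRA_URL!;           // e.g. https://yourcompany.atlassian.net
const SLACK_WEBHOOK = process.env.SLACK_WEBHOOK!; // Slack incoming-webhook URL

async function fetchRecentBugs(): Promise<{ key: string; summary: string }[]> {
  const jql = encodeURIComponent('issuetype = Bug AND created >= -7d ORDER BY priority DESC');
  const res = await fetch(`${JIRA_URL}/rest/api/2/search?jql=${jql}&fields=summary`, {
    headers: { Authorization: `Basic ${process.env.JIRA_AUTH}` }, // base64 "email:api-token"
  });
  const data = (await res.json()) as any;
  return data.issues.map((i: any) => ({ key: i.key, summary: i.fields.summary }));
}

// Placeholder for the LLM step – swap in whatever model/prompt you actually use.
async function summarise(bugs: { key: string; summary: string }[]): Promise<string> {
  return `QA weekly: ${bugs.length} new bugs\n` + bugs.map(b => `• ${b.key} ${b.summary}`).join('\n');
}

async function postToSlack(text: string): Promise<void> {
  await fetch(SLACK_WEBHOOK, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
}

fetchRecentBugs().then(summarise).then(postToSlack);
```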
I also use GPT Projects a lot, and Claude Code with the Testomat MCP.
RAG helps a lot, but domain context and boundary definition still make or break the result. That’s why I totally agree: definitely assistants, not replacements.
Thanks for sharing such a detailed experiment; this is exactly the kind of real-world feedback that’s missing in many AI testing discussions.
What you described resonates a lot:
solid high-level reporting
useful directional insights
but still requiring QA judgment to validate findings and assess risk
We’ve had similar results with reporting-style agents and they’re great at:
structuring results
summarising findings
highlighting potential risk areas
But not at reliably determining severity or business impact without human input.
I really like your thought about separating this into a dedicated test reporting agent.
In my experience, agents become much more reliable when their responsibility is very clearly defined: “generate insights”, not “decide if we ship”.
Curious how this evolves for you if you keep iterating on it.
Fantastic thread, and you’ve got me started on an agent this afternoon!
I’ve created one to do a pass through the Must-have requirements and generate the positive and negative scenarios needed at a functional level, in a BDD format as they are easy to read.
We have a good set of requirements and acceptance criteria, and some tests already written which cover the happy paths, so the interest for me here is more on the negative scenarios generated by the agent and how accurate we feel they are.
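Purely to illustrate what I mean by a positive/negative pair: something like the sketch below, here written as Playwright tests with the Given/When/Then kept as comments (the password-reset feature is invented for the example; our real scenarios live in feature files and come from our own requirements).

```typescript
import { test, expect } from '@playwright/test';

test.describe('Password reset (Must have)', () => {
  test('positive: registered user receives a reset link', async ({ page }) => {
    // Given a registered user on the reset page
    await page.goto('/forgot-password');
    // When they submit their registered email address
    await page.getByLabel('Email').fill('user@example.com');
    await page.getByRole('button', { name: 'Send reset link' }).click();
    // Then a confirmation message is shown
    await expect(page.getByText('Check your inbox')).toBeVisible();
  });

  test('negative: unregistered email does not reveal account existence', async ({ page }) => {
    // Given a visitor on the reset page
    await page.goto('/forgot-password');
    // When they submit an email that has no account
    await page.getByLabel('Email').fill('nobody@example.com');
    await page.getByRole('button', { name: 'Send reset link' }).click();
    // Then the same neutral confirmation is shown (no "account not found" hint)
    await expect(page.getByText('Check your inbox')).toBeVisible();
  });
});
```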
The next stage is focusing on integration scenarios.
It’s early stages in the workflow and execution will start soon, so it is a good time to see whether the agent adds value. If it creates a similar set of tests to the ones we have already created, then we can have confidence that it hasn’t missed anything, but we will also see whether there are scenarios we had missed that it has created. That’s the next step for us.
As with anything, the output will only be as good as the reference data, and as we have a good requirements set, this is a good first agent test.
We’re using Atlassian’s Rovo agent. It’s pretty good at analyzing multiple sources. As per other comments, using it for test case generation has to be done with care. I created an agent that helped generate regression test ideas based on related documents and previous stories. Some were valid, others not. The main use case at the moment is trying to improve the quality of the story, again based on linked items and related documentation. It looks for gaps and inconsistencies and provides recommendations for humans to review and act upon using their judgement.