Have you built any GPT Agents for testing? If so, could you share in comments or recommend any?

After publishing my recent article on using AI as a partner in Quality Engineering, I realised many teams are moving beyond “one-off prompts” and starting to build more structured GPT agents for testing tasks.

I’m curious :thinking: :
– Have you built or experimented with GPT agents for testing or QA work?
– What problems are they actually helping you solve (analysis, exploration, automation support, triage, etc.)?
– What didn’t work as expected?

I’d love to hear real examples, especially from teams using agents in day-to-day testing, not just experiments.

4 Likes

I haven’t personally, but I will be watching this thread with interest.

1 Like

I tried the Playwright test agent for generating test plans, script generation, and self-healing; based on my experience it was okay. I tried it on a website with a mandatory login for accessing the feature, and initially the agent was a bit stuck on the login.

After multiple attempts it was able to access the website.

I didn’t try script generation for any complex features; as I was just exploring, I tried it with a simple one and it worked.

Self healing was good and it worked as expected.

However, like other GPT output, the test agent’s results need manual review before being used formally.

1 Like

Hey, great topic—yes, teams are building structured GPT agents for QA beyond one-off prompts. We’ve used them daily for generating test cases from requirements (cuts time 50-70%, catches edge cases), triaging bugs/logs, suggesting exploratory ideas, and light automation (e.g., Playwright code snippets or self-healing locators). They excel at speed and coverage but struggle with hallucinations (need human review), lack deep domain context without RAG/fine-tuning, and flake on dynamic UIs or big app changes. Real wins in fintech/e-commerce for transaction flows, but they’re strong assistants—not replacements. What’s your experience so far?
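The self-healing locator idea mentioned above can be sketched as a simple fallback strategy: try the primary selector, then alternatives recorded for the same element. This is a minimal illustration only; `find_element` and the dict-based `page` are stand-ins, not Playwright’s actual API.

```python
# Sketch of a "self-healing" locator strategy: try the primary selector,
# then fall back to alternative selectors recorded for the same element.
# `page` is modelled as a plain dict of selector -> element for illustration;
# a real implementation would call a browser automation API instead.

def find_element(page, selectors):
    """Return the first selector that resolves, plus the element it found."""
    for sel in selectors:
        element = page.get(sel)
        if element is not None:
            return sel, element
    raise LookupError(f"No selector matched: {selectors}")

# The primary id changed after a redesign, but the data-testid fallback holds.
page = {"[data-testid=submit]": "<button>"}
matched, element = find_element(page, ["#submit-btn", "[data-testid=submit]"])
```

The real value (and risk) is in how the fallback list is maintained; the agent proposing new fallbacks is exactly the part that needs human review.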

1 Like

I have been using Playwright agents; straightforward enough.

Should I build my own agent? This I am still questioning. Here is an experiment I was running today.

Just using prompts: “Create an accessibility risk plan only; remember to dynamically run the app while creating the plan so it has a behavioural aspect. Include some scans and keyboard-only tests. I would like it to be possible for the plan to eventually lead to the creation of a small report on how it meets WCAG 2.2 guidelines.”

I initially switched agents from the Playwright planner to the Playwright generator at this point to create tests, but then ran into a few things it did not have rights to do, so I switched to standard agent mode. With a few refining prompts I was able to get a starting report.

To be fair, the report was not bad, correctly identifying some issues and recommending corrective actions; note that it was also aware of its limitations. Two small extracts attached.

My level of involvement I am reasonably comfortable with. I spotted a couple of false positives that a standalone agent, without a reviewing agent, may have missed, and I also need to refine that report and expand some of the risk coverage. There was really not much to it at all; I need to think a bit more about whether it’s worth building a specific agent for creating this sort of report, and also review the value with others. It might save me a coffee break of effort, perhaps. Now, if I were to chain agents for different risks, I wonder what I would lose in the distance I’d have from the agent. Even knowing its limitations, could I get lazy and at some point say it’s sort of good enough, so the testing gets shallower?

The reporting part, though, I can likely find use in. It might be worth a test reporting agent in its own right that I could give my day’s exploratory testing notes to.
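A cheap first step towards that reporting agent is deterministic pre-processing: group tagged exploratory notes into sections before any LLM summarises them. A rough sketch, where the `BUG:`/`RISK:`/`NOTE:` tagging convention is entirely made up for illustration:

```python
# Sketch of pre-processing for a test-reporting agent: group tagged
# exploratory-testing notes into sections before an LLM summarises them.
# The BUG:/RISK:/NOTE: tags are a hypothetical note-taking convention.

def group_notes(raw_notes):
    sections = {"bugs": [], "risks": [], "notes": []}
    for line in raw_notes.splitlines():
        line = line.strip()
        if line.startswith("BUG:"):
            sections["bugs"].append(line[4:].strip())
        elif line.startswith("RISK:"):
            sections["risks"].append(line[5:].strip())
        elif line:
            sections["notes"].append(line.removeprefix("NOTE:").strip())
    return sections

notes = """
BUG: focus trap in the cookie banner
RISK: colour contrast on secondary buttons
NOTE: keyboard navigation fine on checkout
"""
report = group_notes(notes)
```

Keeping this step deterministic means the agent only ever summarises what was actually observed, which limits how shallow the reporting can quietly become.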

1 Like

Totally fair :smiling_face:

I’ve noticed a lot of teams are exactly at this “observing” stage right now, especially once the conversation moves from prompts to agents. I’ll share some concrete examples from our side in the thread; maybe it’ll be useful when you decide to experiment.

This matches our experience quite closely, especially around authentication and the need for manual review. Login flows are still one of the weakest points unless the agent is very tightly guided or operating in a controlled environment.

We’ve seen the most value not in full script generation, but in:

  • test scenario drafting

  • exploratory coverage suggestions

  • self-healing or maintenance support

And yes, anything generated still needs human validation before being used formally :unamused_face: . For us, agents work best as accelerators, not decision-makers :smiling_face_with_sunglasses:

We’re in the same spot: agents work best when they’re narrowly scoped and embedded into existing QA workflows, not when they try to be “general QA brains”.

For example, we use GPT and Jira-integrated agents for:

  • generating test cases from requirements

  • bug analysis and clustering

  • refinement preparation

  • a feature-updates agent (a local thing, but we really love it)

And a separate agent for a weekly QA bug summary that posts directly to Slack, that one is already in day-to-day use.
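The Slack summary step boils down to building a message payload from bug counts and posting it to a webhook. A minimal sketch, where the week label, severity buckets, and the commented-out webhook call are all illustrative placeholders:

```python
# Sketch of the weekly bug-summary step: build a Slack message payload
# from bug counts. The webhook URL and posting call are placeholders;
# a real agent would POST the payload with an HTTP client.

def build_summary_payload(week, bugs_by_severity):
    lines = [f"*QA bug summary, week {week}*"]
    for severity in ("critical", "major", "minor"):
        lines.append(f"- {severity}: {bugs_by_severity.get(severity, 0)}")
    return {"text": "\n".join(lines)}

payload = build_summary_payload("47", {"critical": 1, "major": 4})
# e.g. requests.post(SLACK_WEBHOOK_URL, json=payload)  # hypothetical URL
```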

I also make heavy use of GPT Projects, and Claude Code with the Testomat MCP.

RAG helps a lot, but domain context and boundary definition still make or break the result. That’s why I totally agree: definitely assistants, not replacements.

Thanks for sharing such a detailed experiment. This is exactly the kind of real-world feedback that’s missing in many AI testing discussions.

What you described resonates a lot:

  • solid high-level reporting

  • useful directional insights

  • but still requiring QA judgment to validate findings and assess risk

We’ve had similar results with reporting-style agents and they’re great at:

  • structuring results

  • summarising findings

  • highlighting potential risk areas

But not at reliably determining severity or business impact without human input.

I really like your thought about separating this into a dedicated test reporting agent.
In my experience, agents become much more reliable when their responsibility is very clearly defined: “generate insights”, not “decide if we ship”.

Curious how this evolves for you if you keep iterating on it.

1 Like

Fantastic thread, and you’ve got me started on an agent this afternoon!

I’ve created one to do a pass through the Must-have requirements to generate the positive and negative scenarios needed at a functional level, in BDD format, as they are easy to read.
We have a good set of requirements and acceptance criteria, and some tests already written which cover the happy paths, so the interest for me here is more on the negative scenarios generated by the agent and how accurate we feel they are.
The next stage is focusing on integration scenarios.

It’s early stages in the workflow and execution will start soon, so it is a good time to see whether the agent adds value. If it creates a similar set of tests to the ones we have already created, then we can have confidence that it hasn’t missed anything; we will also see whether there are scenarios we had missed that it has created. That’s the next step for us.

As with anything, the output will only be as good as the reference data, and as we have a good requirements set, this is a good first agent test.
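A useful baseline when reviewing the agent’s negative scenarios is to scaffold the obvious ones deterministically and check which additional cases the agent finds beyond them. A rough sketch of that scaffolding in Gherkin form; the field name, form, and messages here are illustrative, not from any real requirements set:

```python
# Sketch of deterministic negative-scenario scaffolding in Gherkin form.
# A real agent would use an LLM; this shows the shape of output we review.
# The registration form and field names are hypothetical examples.

def negative_scenarios(field, invalid_values):
    drafts = []
    for value in invalid_values:
        drafts.append(
            f"Scenario: Reject invalid {field}\n"
            f"  Given the user is on the registration form\n"
            f'  When they enter "{value}" as the {field}\n'
            f"  Then a validation error is shown"
        )
    return drafts

drafts = negative_scenarios("email", ["", "not-an-email", "a@b"])
```

Anything the agent proposes beyond this mechanical baseline is where it is genuinely adding coverage, which makes the comparison you describe easier to judge.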

1 Like

We’re using Atlassian’s Rovo agent. It’s pretty good at analysing multiple sources. As per other comments, using it for test case generation has to be done with care. I created an agent that helped generate regression test ideas based on related documents and previous stories. Some were valid, others not. The main use case at the moment is trying to improve the quality of the story, again based on linked items and related documentation. It looks for gaps and inconsistencies and provides recommendations, for humans to review and act upon using their judgement.

1 Like

Hi Dasha,

Yes, we’ve built a few agents that we’re actually using day-to-day. We initially played with prompts, but that wasn’t giving results that met our expectations, so we built something structured that ties into our existing tools.

Here’s what we’re doing:

1) Starting with Jira (and Figma when needed)
Someone drops in a Jira ticket ID and asks for test scenarios. The agent pulls the ticket details (ACs, tech notes, whatever’s there).

If the ticket references a Figma file, it calls the Figma MCP to fetch all the design details for the nodes we’re working with. This design context gets fed back into the process, so the scenarios actually reflect what’s being built visually, not just what’s written in the ticket.

Then the Jira MCP goes up to the Epic level and grabs all the closed tickets to understand what’s already been built. This combined context (design specs, acceptance criteria, and historical feature work) makes a huge difference in the quality of the scenarios it generates.

We review the scenarios, tweak them, ask for changes. Once they look good, we move forward.

2) Creating test cases in TestRail
We push the approved scenarios to TestRail via MCP. The smart bit here is that before creating new test cases, it searches for existing ones (by feature name or Jira ID). If it finds similar steps, it just adds to the existing case instead of duplicating. This alone has saved us from drowning in redundant test cases.
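The “search before create” logic is simple to express. A minimal sketch against an in-memory stand-in for TestRail; `find_case` and `upsert_case` only mimic what the MCP calls would do, they are not real TestRail API names:

```python
# Sketch of the "search before create" step, using an in-memory list as a
# stand-in for TestRail. Matching here is by the Jira reference; a real
# agent would also match by feature name via the TestRail MCP.

def find_case(cases, jira_id):
    return next((c for c in cases if c["refs"] == jira_id), None)

def upsert_case(cases, jira_id, title, steps):
    existing = find_case(cases, jira_id)
    if existing:
        # Extend the existing case instead of creating a duplicate.
        existing["steps"].extend(s for s in steps if s not in existing["steps"])
        return existing
    case = {"refs": jira_id, "title": title, "steps": list(steps)}
    cases.append(case)
    return case

cases = []
upsert_case(cases, "PROJ-101", "Login", ["open page", "enter creds"])
upsert_case(cases, "PROJ-101", "Login", ["enter creds", "check error"])
# cases now holds a single merged case rather than two near-duplicates
```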

3) Automation with Playwright
After test cases are sorted, we use the Playwright MCP to generate page objects and helper methods. When we’re ready to test, it creates scripts based on the TestRail cases.

4) Branch creation with Bitbucket
We’ve also hooked in the Bitbucket MCP. When we’re ready to start automation work, it creates a branch based on the Jira ticket ID and title, then checks it out. One less manual step to worry about.
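The branch-naming part of that step is essentially slugifying the ticket title onto the ticket ID. A sketch of that convention; the actual checkout would be a `git checkout -b <name>` run via the Bitbucket MCP:

```python
import re

# Sketch of the branch-naming step: slugify the Jira ticket title and
# prefix it with the ticket ID. The naming convention is an example only.

def branch_name(ticket_id, title):
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{ticket_id}-{slug}"

name = branch_name("PROJ-123", "Add login validation!")
# -> "PROJ-123-add-login-validation"
```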

So we’ve got this chain: Jira ticket → Figma designs (if applicable) → scenarios → TestRail cases → automation code → version control, all connected.

What’s actually working:

  • Way faster to get scenarios with proper context from both requirements and designs

  • Less duplicate garbage in TestRail

  • Clear line from ticket to automated test

  • Saves time on the boring parts of setting up automation

  • Design details get baked into test scenarios without manual cross-referencing

What we had to figure out:

  • Early versions didn’t pull enough context; adding the Epic and Figma lookups fixed that.

  • Had to tune how it merges test cases so it doesn’t combine things that shouldn’t be combined

  • The automation code it generates is helpful but still needs a human eye.

Honestly, the biggest change is using AI to connect our tools rather than just generate text. It’s less about “write me a test case” and more about “keep everything in sync and don’t make me do it manually.”

Happy to chat more if others are doing something similar!

1 Like

I’ve been using the Rovo agent as well, but I can’t rely only on it because it can’t cover everything I need to analyse or find in Jira; that’s why I use it together with other GPT + Atlassian API integrations. If you’re interested, here are the differences and some use cases of Rovo vs GPT.

Atlassian Rovo + ChatGPT — Scope & Limitations Summary

Context

Atlassian Rovo is connected to ChatGPT.

Rovo is an Atlassian-native AI assistant that reads Jira and Confluence as a user, but it does not provide full Jira API or JQL access.

:white_check_mark: What Atlassian Rovo Covers Well

1. Quick Jira Context & Lookup

Rovo is suitable for fact-based and discovery questions.

Examples:

- What tickets are included in this epic?

- What is the current status of a ticket?

- Who is the assignee?

- Which tickets are currently in Ready for QA?

- Which tickets were recently updated?

:right_arrow: Best for: “show / find / explain”

2. Confluence Knowledge & Documentation

One of Rovo’s strongest areas.

Examples:

- Summarize the Nudgets feature specification

- Find documentation related to this feature

- Explain product decisions

- Summarize QA strategy pages or acceptance criteria

3. Fast Orientation (No Analysis)

Rovo is useful when working directly in Jira and needing quick answers without building reports.

:cross_mark: What Atlassian Rovo Does NOT Cover

1. QA Analytics & Aggregation

Rovo is not designed for metrics or calculations.

Not suitable for:

- How many tickets passed Ready for QA

- QA progress summary per epic

- Release readiness analysis

- Comparing *Ready for QA* vs *Done*
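This is the kind of counting a custom GPT with the Atlassian API handles instead: build a JQL query and read the total from Jira’s REST search endpoint. A sketch, with credentials and base URL as placeholders; the commented request shows the assumed Jira Cloud REST API v3 shape:

```python
# Sketch of the counting Rovo isn't designed for: compose JQL and query
# Jira's REST search endpoint for a total. Auth details are placeholders.

def build_jql(project, status):
    return f'project = "{project}" AND status = "{status}"'

jql = build_jql("PROJ", "Ready for QA")

# Assumed Jira Cloud REST API v3 call, shown for illustration only:
# resp = requests.get(f"{BASE_URL}/rest/api/3/search",
#                     params={"jql": jql, "maxResults": 0},
#                     auth=(EMAIL, API_TOKEN))
# total = resp.json()["total"]
```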

2. Status History & Transitions

Rovo works with current state, not historical transitions.

Cannot reliably answer:

- Which tickets ever passed Ready for QA

- Which tickets were reopened after QA

- Which tickets are stuck after QA
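These history questions need the changelog, which the Jira API exposes via `expand=changelog`. A simplified sketch of that analysis; the ticket data here is fabricated sample input, and a real agent would page through the API response instead:

```python
# Sketch of status-history analysis: given per-ticket status transitions
# (simplified from what Jira's expand=changelog returns), find tickets
# that ever passed through a given status. Sample data is made up.

def ever_in_status(changelog, status):
    return any(item["toString"] == status for item in changelog)

tickets = {
    "PROJ-1": [{"toString": "In Progress"}, {"toString": "Ready for QA"}],
    "PROJ-2": [{"toString": "In Progress"}, {"toString": "Done"}],
}
passed_qa = [key for key, log in tickets.items()
             if ever_in_status(log, "Ready for QA")]
# -> ["PROJ-1"]
```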

3. Automated & Repeatable QA Reporting

- No guaranteed completeness

- No stable output format

- Not suitable for recurring QA reports

:brain: What Should NOT Be Expected from Rovo

- JQL execution

- Full Jira REST API access

- Structured QA analytics

- Automated QA summaries

:counterclockwise_arrows_button: Rovo vs Custom GPT with Atlassian API

| Task Type | Rovo | Custom GPT + Atlassian API |
|---------|------|----------------------------|
| View current data | :white_check_mark: | :white_check_mark: |
| Explain context | :white_check_mark: | :white_check_mark: |
| Count tickets | :cross_mark: | :white_check_mark: |
| Ready for QA analysis | :cross_mark: | :white_check_mark: |
| QA progress summary | :cross_mark: | :white_check_mark: |
| Release readiness | :cross_mark: | :white_check_mark: |
| Automated reports | :cross_mark: | :white_check_mark: |

Key Takeaway

- Atlassian Rovo → quick context, discovery, Confluence knowledge

- Custom GPT with Atlassian API → QA analytics, progress tracking, reporting

- They operate in parallel, not as replacements for each other

Recommendation (QA Perspective)

- Use Rovo for:

  • quick checks

  • understanding context

  • documentation

- Use a dedicated QA Jira Agent for:

  • Ready for QA tracking

  • epic-level progress

  • stakeholder updates

  • release readiness reporting