Has Anyone Replaced Manual Testing with Agentic AI for New Feature Testing?

Hi everyone,

I’ve been exploring agentic AI solutions in QA and I’m curious about real-world experiences.

Has anyone here significantly replaced manual testing (especially for new feature testing, not regression) with an agentic solution?

I’d love to hear:

  • How much manual testing are you still doing today?

  • To what extent do you trust the agent to generate test cases and accurately determine pass/fail results?

  • What worked well, and what didn’t?

Trying to understand where these tools truly add value versus where human testing is still essential.

Thanks in advance for sharing your insights!

As a first step, it's worth creating some sort of matrix that lists your testing activities and their value, then trying to classify whether each leans towards mechanical or human strengths.
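A rough sketch of what that classification might look like (the activities and leanings here are purely illustrative, not a recommendation):

```python
# Illustrative only: a first-pass "matrix" mapping testing activities to the
# strength they currently lean towards. Your activities and judgements will differ.
activity_leaning = {
    "regression checks of known flows":             "mechanical",
    "scripted verification of acceptance criteria": "mechanical",
    "test data generation":                         "mechanical",
    "exploring a new feature for unknown risks":    "human",
    "judging usability and real-world context":     "human",
    "investigating an ambiguous bug report":        "human",
}

for activity, leaning in activity_leaning.items():
    print(f"{activity:48} -> {leaning}")
```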

A lot of teams have done this exercise before, but when you look across the market, plenty of teams still had their humans doing quite a few activities that leaned towards mechanical strengths.

What does your manual testing look like?

If it leans towards test cases, scripted testing and testing to verify, it's likely AI can both replace and accelerate a lot of this.

This is not a model I follow, which means my bias will often place things like test cases at lower value, so AI generating them is often good enough for me. Similarly, if I use something like Playwright agents to automate this, it counts as a bit of extra coverage, which again is good enough for me. Hard-core test case writers and deep automation engineers, though, will challenge AI, as they are often doing very domain-specific coverage, and that complexity increases the risk for AI.

If it leans more towards learning the product, discovery, investigation and exploration of risk, then for now this still very much sits in human hands, with AI potentially empowering deeper and broader risk investigation.

For the latter model, have a look here: How are you actually using AI in testing right now? - #2 by andrewkelly2555

There are now tools offering exploratory testing with multiple agents running. My take: useful for large teams on enterprise-level products, but the capability needs to be clearer. For now they remain, in my view, closer to crawlers and an expansion of known-risk coverage, not the human-strength work testers do.

There are a few interesting areas. Let's take two areas your testers may have limited experience in, security and accessibility for example. These AI tools will often outperform an average tester working alone in those areas, but again they become powerful tools in the hands of someone advanced in those risks.

If human testing drops on human-centric products, there is a high risk that this amounts to accepting lower quality and less innovative products, with that lower bar traded for faster, more autonomous coverage.

I really appreciate this detailed explanation. Most of our work today is exploratory testing, where we focus on understanding product behavior, identifying risks, and validating user flows rather than relying on predefined scripts.

Interesting discussion. I want to share a different angle we’ve validated in practice.

There’s an approach that sits between fully manual and fully agentic: AI pre-annotates UI elements, humans review and correct the annotations, then deterministic test scripts are generated from that. Once generated, execution costs zero AI tokens — it’s just standard Python hitting precise coordinates. No LLM in the loop at runtime, no flakiness from model variance.

This works. We’ve proven it on real applications.

But honestly, it’s not something every QA team can just pick up and do themselves.
The annotation pipeline needs specific tooling. The visual recognition models need
training data. Getting the precision right so tests survive minor UI shifts takes
real engineering effort. For most teams, the ROI only works if someone else has
already built and maintained that infrastructure.
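To make the shape of it concrete, here is a stripped-down sketch of what a generated script can look like once the annotations are reviewed. It is illustrative only: the element names and coordinates are made up, and pyautogui is just a stand-in for whatever coordinate-based driver you use.

```python
# Illustrative sketch: reviewed annotations drive a plain, deterministic replay.
# No model calls at runtime; element names, coordinates and actions are made up.
import json

import pyautogui  # stand-in for any coordinate-based UI driver

# Annotations as they might look after AI pre-labelling plus human review
annotations = json.loads("""
[
  {"element": "username_field", "x": 400, "y": 210, "action": "type", "text": "qa_user"},
  {"element": "password_field", "x": 400, "y": 260, "action": "type", "text": "secret"},
  {"element": "login_button",   "x": 412, "y": 318, "action": "click"}
]
""")

def run_scenario(steps):
    """Deterministic replay: same coordinates, same order, every run."""
    for step in steps:
        pyautogui.click(step["x"], step["y"])
        if step["action"] == "type":
            pyautogui.typewrite(step["text"], interval=0.05)

run_scenario(annotations)
```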

Which brings me to what I’m genuinely curious about from this community:

What does your team actually want — better tools to write tests, or reliable test
code that just works? If someone handed you deterministic, maintainable scenario
scripts covering your critical flows, would that solve the problem? Or is the process
of writing tests itself where the value lies?

And a practical question: once you’ve achieved coverage for your core user journeys,
how much churn do you typically see? Is maintenance a constant battle, or more like
occasional updates after major releases?

Would love to hear how others think about this tradeoff.

I think you highlight some of the core challenges here as a whole: what problems are you trying to solve with AI?

“What does your team actually want — better tools to write tests, or reliable test
code that just works? If someone handed you deterministic, maintainable scenario
scripts covering your critical flows, would that solve the problem? Or is the process
of writing tests itself where the value lies?”

For me, none of that would solve my testing challenges. Useful, yes, but it is focused on known risks, so maybe 10% of what a hands-on tester focuses on; their strengths lie in the unknowns. Most apps are, at this point, still human-centric and as such have never really been deterministic, even when guardrails are in place.

I added this description of testing to another thread recently which may help explain why my goals could be very different.

“Testing: that highly technical, tool-loving activity that emphasises learning, discovery, investigation and experimentation into product risk. The one that embraces the currently unknown, finds comfort in ambiguity, nuance, empathy and real-world context. That takes a holistic, whole-lifecycle view of testing and applies it from day one.”

@probe_runner What you may want from AI is likely very different from what I want. The tools I am looking for are data and information based: give me more visibility on what's happening as I test, maybe I miss some odd API responses as I test, is there something in the logs that gives me more insight, what experiments can I do next.

An automation engineer, for example, may have goals much closer to the things you are suggesting, but potentially a lot less so for a hands-on tester. It's really important and of high value that you raise this, and it's a great question: “what does your team actually want from AI?”. It's not going to be the same for all testers, and your points are very valid; in particular, token use in agentic tools will be interesting.

Good point — and actually what I described only tells half the story. The UI interaction is just the trigger. The real gate is underneath: every API call during a scenario gets recorded, and on subsequent runs the responses are compared automatically. So it’s less about “did the button appear” and more about “did the backend behave the same way.”

That might actually overlap with what you’re looking for — visibility into what’s happening under the hood as tests run. The difference is it happens passively: you run the scenario once, and from that point on any API drift gets flagged without extra effort.

Beyond exact response matching, things like 5xx spikes, latency drift, or unexpected status code changes across runs get surfaced automatically — the kind of odd API behavior that’s easy to miss during manual exploration.
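In simplified terms, the comparison side boils down to something like this (a rough sketch only; the field names, tolerances and file layout here are placeholders, not our actual implementation):

```python
# Simplified sketch of the "passive regression net": record API responses on a
# baseline run, then diff later runs against it. Names and thresholds are placeholders.
import json
from pathlib import Path

BASELINE = Path("baseline_responses.json")

def diff_run(current, baseline, latency_tolerance_ms=250.0):
    """Flag status changes, server errors, latency drift and body changes per endpoint."""
    findings = []
    for endpoint, cur in current.items():
        base = baseline.get(endpoint)
        if base is None:
            findings.append(f"new endpoint seen: {endpoint}")
            continue
        if cur["status"] != base["status"]:
            findings.append(f"{endpoint}: status {base['status']} -> {cur['status']}")
        if cur["status"] >= 500:
            findings.append(f"{endpoint}: server error {cur['status']}")
        if cur["latency_ms"] - base["latency_ms"] > latency_tolerance_ms:
            findings.append(f"{endpoint}: latency {base['latency_ms']}ms -> {cur['latency_ms']}ms")
        if cur["body"] != base["body"]:
            findings.append(f"{endpoint}: response body changed")
    return findings

def check(current_run):
    """First run writes the baseline; later runs are compared against it."""
    if not BASELINE.exists():
        BASELINE.write_text(json.dumps(current_run, indent=2))
        return []
    return diff_run(current_run, json.loads(BASELINE.read_text()))
```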

Your point about different testers wanting different things from AI is spot on. For hands-on testers, this kind of passive regression net might be more useful than generated test scripts.

This is the sort of thing I want an agent to pick up on as I explore and run my own experiments. As you suggest, it can be missed, but an agent should be able to pick it up and highlight it. It's on my list to look at something like the Playwright CLI to see if I can get it to do this in the background as I test.
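As a starting point, something along these lines might work using Playwright's Python API rather than the CLI (a rough sketch; the URL and the 4xx/5xx filter are just placeholders):

```python
# Rough sketch: a headed browser you drive by hand, with a listener logging any
# response that looks odd while you explore. Uses Playwright's Python API.
from playwright.sync_api import sync_playwright

def log_odd_responses(response):
    # Surface client and server errors that are easy to miss mid-exploration
    if response.status >= 400:
        print(f"[watch] {response.status} {response.request.method} {response.url}")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headed, so you can explore manually
    page = browser.new_page()
    page.on("response", log_odd_responses)
    page.goto("https://your-app.example")  # placeholder: start wherever your session starts
    page.pause()  # hands control to you; the listener keeps logging as you click around
    browser.close()
```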