AI-assisted exploratory testing, how are you doing it?

I’m curious how quality people are actually doing AI-assisted exploratory testing.

Do you have any stories or examples to share?

  • What type of testing are you using it for?

  • What types of risks are you mitigating?

  • Why is it so good?

  • What tools do you use?

  • What kind of prompts work well?

  • How is it helping you be more efficient?

And if anyone wants to jump on a short video conversation covering these questions, just let me know.

2 Likes

We’ve started using a tool called AutoExplore https://www.autoexplore.ai/ to test our web app. It’s pretty great because it runs 24/7 across different environments without us having to write a single line of code. All we have to do is configure the basics—things like the URLs we want to test, login credentials, and any pages we want it to stay away from.

The tool generates a bunch of different reports, catching everything from crashes and errors to accessibility and security gaps. Plus, you can actually look back at screenshots to see exactly what the AI agent was doing right before a failure happened.

It’s not meant to replace our main test automation, but it adds a lot of value. Our usual automated tests are predictable—they follow the exact same path every time. AutoExplore is different. It’s essentially “stochastic,” meaning it moves randomly and generates new test data every time. Since real users are unpredictable, this tool helps us find the weird bugs and edge cases that our standard scripts would never catch.

3 Likes

Hi Rosie, added a few points below.

  • What type of testing are you using it for? I would say all types, though I mostly use it to prep: as a knowledge-gathering tool to support the testing I want to do, rather than for one particular type of testing.

    • Using it to understand the code and what has changed, i.e. which other areas have been impacted by the changes made.
    • Surfacing risks and potential issues: memory leaks, areas that are difficult to test, areas needing improvement or refactoring, code which doesn’t match the structure of the existing codebase, etc.
    • Asking it to do code reviews and suggest areas for improvement, with snippets of code so I can check them myself.
    • Asking it to add logs, including timing logs, so I can see which areas of the changes take a while and explore those. Also logs that expose what data is being processed, so I can figure out where the data is coming from and what type of data it is: inputs, outputs, and any data transformations. When you know what data is coming in, it helps you figure out whether that data can be manipulated.
    • Flagging any scenarios which haven’t been considered, or edge cases where the logic could cause problems.
    • Testing theories: telling it “I think this might be a problem” and asking it to provide the snippets of code which affect that theory so I can check and test it out myself, or reviewing its findings and the code it used to reach them so I can make my own evaluation.
  • What types of risks are you mitigating?

    • Bad, partial, or incomplete code
    • Code which is completely different in style from the rest of the codebase
    • Code which takes a long time: potentially heavy processing, or small data processed very frequently, leading to timing issues or higher resource usage
    • Poor test coverage
    • Potential for malformed data, e.g. if you see logs where anything you input comes through with no guard rails, it’s a red flag
  • Why is it so good?

    • Better understanding of what is happening internally and externally. It can generate a .md file describing what the code is doing and the folder structure, so you know where things are and what each part does.
    • Helps provide more visibility by adding additional logs.
    • Helps with writing scripts or code to generate the data needed for tests. E.g. if the code reads data from a DB, it can use that to write a script that populates variations of that data, which you can then amend and expand on.
    • Reviewing unit tests and suggesting gaps or potential bugs. It does sometimes write terrible tests, and tests which don’t make sense, so its output needs a lot of review, but it is still useful, especially for mocking in tests and for ensuring tests are isolated.
  • What tools do you use?

    • Copilot + Claude
  • What kind of prompts work well?

    • My prompts aren’t great, not as structured as they should be. I just make the context clear, state what I know and what I want to know, and add a list of the questions I have. So my prompts are more about finding things out: generating sample code, feeding it what I know, and getting it to compare that to what it finds.
    • Ensure no sensitive or identifiable data is used. Always sanitise what is sent.
  • How is it helping you be more efficient?

    • Writing tools which help with the testing: data-seeding tools, or tools that analyse logs, identify differences, and show those differences in a clean, easy-to-read way.
    • Quick feedback on whether the code is good or problematic, plus ideas about areas worth checking out. Some code just works but fails badly because of its resource usage, or because it is processing a lot more data than was expected; it is useful to have that potential highlighted quickly when we might not spot it ourselves.
    • Asking it to generate curl commands for endpoints. Instead of reading the code to work out how to call certain APIs, it can give you curl commands to hit the endpoints directly. That is much quicker, as docs and code are not always obvious when they haven’t been written well or are out of date. I have used it to spot that the schema and docs didn’t match the actual implementation of an API.
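The timing-log idea above can be sketched as a small decorator. This is a hypothetical minimal example (the function names are made up for illustration), not output from any of the tools mentioned:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("timing")

def timed(func):
    """Log how long each call takes, so slow areas stand out for exploration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.1f ms", func.__name__, elapsed_ms)
    return wrapper

@timed
def transform_order(order):
    # Stand-in for a real data transformation under test.
    return {key: str(value).strip() for key, value in order.items()}
```

Decorating the suspect functions this way makes the slow spots visible in the logs without changing their behaviour, which is exactly the kind of change an assistant can apply across a changeset quickly.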
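The data-seeding idea (asking the assistant to write a script that populates variations of data the code reads from a DB) might look something like this sketch; the `orders` table and its fields are hypothetical:

```python
import itertools
import sqlite3

def seed_variations(conn, base_row):
    """Insert variations of an existing row to exercise different inputs."""
    # Mix the known-good values with empty, whitespace, unicode, and boundary cases.
    names = [base_row["name"], "", " " * 5, "ünïcode-name", "x" * 255]
    amounts = [base_row["amount"], 0, -1, 10**9]
    rows = list(itertools.product(names, amounts))
    conn.executemany("INSERT INTO orders (name, amount) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, amount INTEGER)")
inserted = seed_variations(conn, {"name": "widget", "amount": 10})
print(inserted)  # 5 names x 4 amounts = 20 rows
```

The value of having the assistant generate this is the starting point: once it exists, you can amend the variation lists to match the risks you actually care about.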
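And the log-comparison tooling could start from Python’s standard `difflib`; again a hypothetical minimal version, not the actual tool described above:

```python
import difflib

def diff_logs(baseline: str, current: str) -> list[str]:
    """Return only the changed lines between two log captures."""
    diff = difflib.unified_diff(
        baseline.splitlines(),
        current.splitlines(),
        fromfile="baseline.log",
        tofile="current.log",
        lineterm="",
    )
    # Keep added/removed lines, skip the file headers and hunk markers.
    return [line for line in diff
            if line[:1] in {"+", "-"} and not line.startswith(("+++", "---"))]

baseline = "GET /orders 200\nPOST /orders 201\n"
current = "GET /orders 200\nPOST /orders 500\n"
for line in diff_logs(baseline, current):
    print(line)
# prints:
# -POST /orders 201
# +POST /orders 500
```

Stripping the unchanged lines is what makes the output "clean and easy to read" when the logs are long and mostly identical between runs.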
2 Likes

Kira, I left some comments about AutoExplore on one of the creator’s threads. He gave some good answers, but perhaps you, as a tester using it, could also give me your thoughts on my comments?

1 Like

Thanks for this insight. I found a subtle RLS bug by asking Claude Code questions while stepping through the project code; it would have been a bit of a nightmare to discover through the UI.

Recently I used Claude to build a browser plugin to help with exploratory testing; currently it covers autofill detection, test data injection, and word scanning. I basically add a new feature when there’s a need for it.

I have a test case generator prompt but I only use it at the very end of an exploratory testing session to highlight anything I may have missed. Since gaining access to Claude Code, I’ve accidentally fallen into being more hands-on with creating and improving tickets ready for refinement, which has been a surprising outcome. I’m hoping higher-quality tickets will reduce the current rework rate and make the exploratory testing phase smoother, improving efficiency.

1 Like

Hi, Andrew, you raised excellent questions and concerns.

First, as a Quality Manager — not just a tester or test automation engineer — I need to think strategically. My key objective is: how do I minimize critical incidents in production? To answer that, I have to consider the resources I have, the budget, and the complexity of the system under test.

Our application has 1,000+ URLs. Even though I have people doing manual testing, test automation, and domain experts on the team, the sheer complexity makes thorough regression and exploratory testing a serious challenge. We have E2E test automation, unit testing, integration testing — yes, all of it. But you know as well as I do that these tests follow the same workflows every time. They verify known paths. They don’t surprise you.

AutoExplore gives us something different: a daily quality pulse of the system by exploring the app autonomously — paths we didn’t script, data we didn’t prepare, combinations we didn’t think of. Is it shallow compared to a skilled human exploratory tester? Sometimes, yes. But it runs 24/7 across environments without consuming my team’s capacity. That’s the trade-off I’m making consciously.

Your point about context awareness is valid and I see it as an opportunity for the product to develop further and generate more value. Right now it catches generic issues — crashes, errors, accessibility gaps, broken flows. Adding context-aware exploration (requirements, business rules, heuristics) would take it to another level.

I’ve also experimented with Playwright agents that dynamically navigate the browser. My take: the potential is real, but there’s a significant gap between a promising POC and a robust solution that works day after day without requiring maintenance from my QA team. For a small team with limited resources, a production-ready tool like AutoExplore is the practical choice today.

To your bigger question — what do we lose by having machines simulate human investigative testing? I don’t see it as replacement. I see it as coverage expansion. My human testers focus on complex scenarios, business logic validation, and the “does this actually make sense?” questions that require domain knowledge. AutoExplore covers the surface area that no human team could realistically touch every day on a 1,000+ URL application.

1 Like

What are the token usage and costs like?

The following came up in another discussion. I’m not sure how that small example compares to what you get from AutoExplore, but it’s sort of what I think of when AI and exploring are discussed. Extrapolating that sort of thing to all risks and a whole app (as you said, 1,000+ URLs, so likely a lot of risks to explore), wouldn’t that be fairly costly?

How does the example below compare, is it overkill and how would the costs relate?

“One factor though of agents doing investigative testing via the browser could be costs, running 50 experiments dynamically for a checkout page given a prompt containing 50 risks to look for without some sort of optimisation for example.”
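One way to reason about the cost question is simple arithmetic. Every number below is an assumption made up for illustration (real token counts per experiment and per-token prices vary widely by model and tool); it is not a figure from AutoExplore or any vendor:

```python
# Back-of-envelope cost estimate for agentic browser exploration.
# All figures are illustrative assumptions, not real pricing.
experiments_per_page = 50        # e.g. 50 risks to probe on a checkout page
tokens_per_experiment = 20_000   # prompt + page snapshots + model reasoning
pages = 1_000                    # the 1,000+ URLs mentioned above
cost_per_million_tokens = 5.00   # assumed blended $/1M tokens

total_tokens = experiments_per_page * tokens_per_experiment * pages
total_cost = total_tokens / 1_000_000 * cost_per_million_tokens
print(f"{total_tokens:,} tokens -> ${total_cost:,.2f} per full sweep")
# prints: 1,000,000,000 tokens -> $5,000.00 per full sweep
```

Even with generous error bars on the assumptions, the multiplication shows why some prioritisation or sampling of risks per page, rather than running every experiment on every URL, matters for keeping costs sane.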