What Should We Measure in Exploratory Testing?

At AutoExplore, we’re building a benchmark to evaluate and improve agentic exploratory testing capabilities. Our goal is:
Create a fair, transparent way to measure how well autonomous agents explore, detect issues, and learn from software systems.

This is a call for input from testing professionals and developers: what metrics should we measure? Measuring the wrong things can distort behavior. Measuring the right things can unlock real progress.

Here’s what we’re currently planning to measure:

:one: Exploration & Coverage Metrics

  • Number of discovered views / pages / URLs

  • Coverage of interactive elements
    Links tested (number / total)
    Buttons tested (number / total)
    Inputs tested (number / total)

  • Total number of operations performed
    Data entry variations
    Order-of-operation permutations

Why this matters:
Exploratory testing is about navigating unknown territory. We want to measure how broadly and deeply the agent explores, without telling it what to do or how to do it.
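As a sketch of how the element-coverage ratios above could be computed, assuming the agent emits per-type tested/total counts (the `ElementCoverage` and `coverage_summary` names are illustrative, not from any existing tool):

```python
from dataclasses import dataclass

@dataclass
class ElementCoverage:
    """Coverage counts for one interactive element type (e.g. links)."""
    tested: int
    total: int

    @property
    def ratio(self) -> float:
        # Guard against pages with zero elements of this type.
        return self.tested / self.total if self.total else 0.0

def coverage_summary(per_type: dict[str, ElementCoverage]) -> dict[str, float]:
    """Return the tested/total ratio per element type."""
    return {name: cov.ratio for name, cov in per_type.items()}

summary = coverage_summary({
    "links": ElementCoverage(tested=42, total=60),
    "buttons": ElementCoverage(tested=15, total=20),
    "inputs": ElementCoverage(tested=7, total=10),
})
print(summary)  # {'links': 0.7, 'buttons': 0.75, 'inputs': 0.7}
```

The same shape extends naturally to discovered views/pages/URLs if a ground-truth total is known for the benchmark application.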

:two: Defect Detection Quality

We’re thinking in terms of confusion-matrix-style evaluation:

  • True Positives
    Issues reported that are real defects

  • False Positives
    Issues reported that are not defects

  • False Negatives
    Real defects observed but not reported

  • True Negatives
    Observed situations that were correctly ignored

Why this matters:
Finding bugs isn’t enough. Reporting noise is costly. Missing critical issues is worse.

We want to measure:

  • Precision

  • Recall

  • Signal-to-noise ratio
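These three can be derived directly from the confusion-matrix counts above. A minimal sketch; note that the signal-to-noise definition here (true positives per false positive) is one possible interpretation, not something the benchmark has fixed:

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported issues that were real defects."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of real, observable defects that were reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def signal_to_noise(tp: int, fp: int) -> float:
    """Real defects reported per noise report (assumed definition)."""
    return tp / fp if fp else float("inf")

# Example: 8 real defects reported, 2 noise reports, 4 defects missed.
print(precision(8, 2))        # 0.8
print(recall(8, 4))           # ≈ 0.667
print(signal_to_noise(8, 2))  # 4.0
```

True negatives don’t enter these particular formulas, but counting them still matters: they show the agent observed a situation and correctly chose not to report it.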

:three: Efficiency Metrics

  • Test time

  • Time to first defect

  • Time to critical defect

Exploration isn’t just about coverage. It’s about discovering value quickly.
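If the agent produces a timestamped event log, these timings fall out of a single scan. A sketch under that assumption; the event-log shape and field names are invented for illustration:

```python
def time_to_first(events, predicate):
    """Seconds from session start to the first event matching predicate, or None."""
    start = events[0]["t"]
    for event in events:
        if predicate(event):
            return event["t"] - start
    return None  # never observed during the session

events = [
    {"t": 0.0,   "kind": "session_start"},
    {"t": 42.5,  "kind": "defect", "severity": "minor"},
    {"t": 310.0, "kind": "defect", "severity": "critical"},
]

print(time_to_first(events, lambda e: e["kind"] == "defect"))            # 42.5
print(time_to_first(events, lambda e: e.get("severity") == "critical"))  # 310.0
```

Returning `None` rather than a sentinel value keeps “never found a critical defect” distinct from “found one at time zero” when aggregating across runs.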

What Else Should We Measure? :thinking:

This is where we need the community.

We’re Also Looking for Benchmark Applications

We’re searching for:

Test applications with a known, documented list of issues.

Ideally:

  • Publicly available

  • Designed for testing practice

  • With known defect catalogs

If you know of any applications built for this purpose, we’d love to evaluate them for inclusion in the benchmark and give credit to you and the original author!

3 Likes

How are you defining exploratory testing?

Some will question whether agents can do this at all, and maybe what they do is something else: something new that fits between mechanical script execution and human discovery, the simultaneous design- and learning-focused testing that is often referred to as exploratory testing.

For me, at this point it’s still something different, but it is getting closer.

Measuring exploratory testing often comes down to “what of significant value did you discover and learn about during test execution”. There is also the flip side: you can track that you learned something did not have significant issues.

Tracking which risks were covered, how deeply they were covered, and the useful things discovered is likely where the most value lies.

To put that into context, consider exploratory test session charters. I looked up an example here.

“Mobile Connectivity: Explore the mobile app with a poor network connection to discover if it still works as expected or if it provides helpful error messages”

What metrics would you add for that charter, and can you then extrapolate those metrics to all charters? This sort of goes back to that idea of risk lists, coverage, and findings of value.

I suspect the agents could output reports in this format but what I am not sure about is whether they can simulate a pro tester executing test charters with that exploratory testing aspect of simultaneous design and learning.

Those counts you mention have not so much value in my view; they may be suitable for scripted testing, or for measuring the tool’s own stability. But you are asking specifically about measuring exploratory testing, which leans towards discovery and learning as its prominent value.

3 Likes

Some will question whether agents can do this at all, and maybe what they do is something else: something new that fits between mechanical script execution and human discovery, the simultaneous design- and learning-focused testing that is often referred to as exploratory testing.

I think the definition of exploratory testing here is that we don’t tell the agent what to test. The agents need to define for themselves what they test and what they report as issues. We only expect results from them, which are then identified / classified into different categories for metrics.

Measuring exploratory testing often comes down to “what of significant value did you discover and learn about during test execution”. There is also the flip side: you can track that you learned something did not have significant issues.

This sounds like a great idea for another full benchmark :slight_smile:

Those counts you mention have not so much value in my view; they may be suitable for scripted testing, or for measuring the tool’s own stability. But you are asking specifically about measuring exploratory testing, which leans towards discovery and learning as its prominent value.

As a concrete example, the agent can simulate unstable network traffic and then report the finding in English. Another agent, or a set of agents / humans, then decides whether the report is one of:

  • True Positives
    Issues reported that are real defects

  • False Positives
    Issues reported that are not defects

  • False Negatives
    Real defects observed but not reported

  • True Negatives
    Observed situations that were correctly ignored

Here is one example of what a defect report looks like for a registration / sign-in page.

Expected

After clicking “Luo ilmainen tili” (Create a free account), the user should be navigated to a registration or multi-step onboarding flow that exposes required input fields for name, email, phone, and address so the test step “Complete flow with valid data” can be carried out.

Actual

Despite the history indicating that “Luo ilmainen tili” was clicked, the page at xxx still shows only the login form with two inputs (email and password). No registration or multi-step form containing name, phone, or address fields is rendered, preventing completion of the required registration-flow step.

Reproduction Steps

  1. Open xxx in a browser.

  2. On the landing/login page, click the link or button labeled “Luo ilmainen tili” (Create a free account).

  3. Observe that the UI remains a simple login form with only email and password fields and no visible registration form with name, email, phone, and address inputs.

We can then classify these reports into categories to measure agent success.
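A sketch of how judges’ verdicts on such reports could be tallied before computing precision and recall; the `Verdict` enum and `tally` helper are hypothetical names, not from any existing harness:

```python
from collections import Counter
from enum import Enum

class Verdict(Enum):
    TRUE_POSITIVE = "TP"    # reported, and a real defect
    FALSE_POSITIVE = "FP"   # reported, but not a defect
    FALSE_NEGATIVE = "FN"   # real defect observed, but not reported
    TRUE_NEGATIVE = "TN"    # situation correctly ignored

def tally(verdicts: list[Verdict]) -> Counter:
    """Count judged reports per confusion-matrix category."""
    return Counter(v.value for v in verdicts)

judged = [
    Verdict.TRUE_POSITIVE,
    Verdict.TRUE_POSITIVE,
    Verdict.FALSE_POSITIVE,
    Verdict.FALSE_NEGATIVE,
]
counts = tally(judged)
print(counts["TP"], counts["FP"], counts["FN"])  # 2 1 1
```

Whether the judge is another agent or a human, the output lands in the same four buckets, so the downstream metrics stay comparable across judging setups.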

2 Likes

That’s a good example. You have a report that includes an expected result and an actual result, which puts it into a known-risk confirmatory test, much closer to a scripted test. Exploratory testing would rarely have that; it would be closer to “investigate registration utilising common risks associated with these features and look for anomalies or problems that occur”.

I’d recommend moving away from the term exploratory testing; it’s going to really confuse those who know it well. In a Playwright discussion the idea of risk-guided “behavioral simulated user tests” came up. It’s not perfect, but it’s much closer than a comparison with a professional tester doing exploratory testing.

Exploratory can be a good sales term, as it can convince some managers that they can drop their pro testers and get a lot of savings in effort from the tool. There is, though, a risk that testers themselves feel it’s pretending to do something it’s not, so they avoid it completely without giving it a chance to be useful at the thing it actually does. I think this does offer that value, but I’m not seeing it as exploratory.

1 Like

That report tried to demonstrate one outcome from the testing.

As input, the AI agent was given the instruction: “You are professional tester …. start testing this application at”. We do not tell it expected results. The agent needs to work on its own to figure out what to test, what to expect, and what to report.

Humans work as judges; this is the idea of the benchmark.

I think this would be useful for broader discussion. On the “You are a professional tester” part: part of me thinks this bit could be challenging. It’s a machine, so should it be “you are simulating a professional tester”, and does it know the difference?

Take Playwright agents: that model seems to be closer to “user behavior simulation”, but in your case you are going a step further, to a pro tester?

This is the interesting bit: is it capable of that, and to what extent? I do not know the answer, but understanding this may help the direction we are going in with AI in testing.

Here is AI’s own take. “AI agents can already do some exploratory testing well, but they do not yet match a strong human exploratory tester across the full spectrum.”

That leans towards your use of “exploratory” being okay, but it also picks up on my thinking that it’s not quite the same and maybe needs a different name for what it does.

The AI’s extra comment may also be key: “Dealing with ambiguity could be AI’s hardest wall”.

This seems a very important debate, likely with a level of self-preservation on my part, but it’s definitely an advancement on the usual automation, crawlers, and scanners.

Are you able to give your agent test charters? It would be interesting to see how it deals with those and may give good insight into that exploratory testing capability.

Yes, technically we can give it test charters. We thought it would be more realistic exploratory testing without them, as the idea is to minimize the user’s impact on the agent. The agent needs to create its own test charters, or as we call them, test plans. However, this might be something we should try out.

Here is a screenshot of the agent autonomously testing a vibe-coded application.

It has created its own test plan and is verifying the application.

1 Like

I’m currently experimenting with creating an accessibility agent that uses Playwright agents as sub-agents; even that single risk area can get quite complicated. The agent is really good on the reporting side, but I still regard it as an entry point for deeper hands-on testing at this point.

It could be worth a separate agent for charter creation similar to a test planning agent.

It would be interesting to have different agents for different risks and to chain them, but that’s likely overkill. This is where a tool designed for broad risk coverage may offer more than teams quickly building their own agents, which they can likely do very quickly while controlling the context more.

1 Like

One thing I would also consider.

When I do exploratory testing, particularly early in a products lifecycle my output is not only an evaluation of the product and highlighting things of interest often bugs but also opportunities I also tend to have an output of multiple questions for developers, this is likely linked to the ambiguity element.

Questions for designers and developers may also be good output from an exploratory agent.

1 Like

Cloning from another discussion here: how are the token usage and costs?

How does the example below compare? Is it overkill, and how would the costs relate?

I’m wondering if extrapolating the cost of something like the below to a whole application and a plethora of risks could end up costly. Any recommendations on this front: what would you not use it for, what areas could be high cost, or what target level keeps costs consistent? Are you running continuously, perhaps on each PR, overnight, or ad hoc?

What do the costs look like? This might be an interesting metric alongside risks covered and things of interest discovered.

“One factor though of agents doing investigative testing via the browser could be costs: running 50 experiments dynamically for a checkout page given a prompt containing 50 risks to look for, without some sort of optimisation, for example.”

1 Like

The cost discussion maybe goes a little bit off topic.

From AutoExplore’s point of view, we sell the service at a static price per month. It is in our interest to optimize the token usage and other fees. Customers typically run the agents 24/7.

We use a mix of heuristics-based and AI-based approaches to minimize costs.

1 Like

That’s fair enough; the static price makes sense. I don’t have enough data on token usage yet with other tools, just a sense that I would not put heavy usage into PR triggers, but I was hoping others had more knowledge on this.

I fear we may end up with ‘manual exploratory testing’, hah.

I remember @jmosley5, I think, mentioned once that people in their team tagged whether code was created with the help of AI, so that it is at least noted somewhere when it goes for review; it can then be reviewed with a lens on where risks are more likely to exist.

I wonder if we could have something similar for exploratory tests: tag them somehow on whether they are agentic or human. But what about ones that may be a bit of both?

The learning aspect of exploratory testing is interesting too, I know it’s (often) defined that we learn as we go when we do exploratory testing, but I think we do that as humans in everything we do. Would it be unreasonable to drop the ‘learning’ aspect from the definition? :hot_pepper:

1 Like

There’s a little distance between the original Cem Kaner definition and how it evolved in the RST namespace, but basically current thinking (in the RST namespace, which is where ET was worked on and evolved) is that ALL testing is exploratory, and there is something called “scripting” which is essentially the factors in testing that take control away from us. This might include instructions, automation and so on, but also includes biases, ignorance, culture, the detail that’s obfuscated by using tools.

To my mind there’s nothing in this OP about exploratory testing. I’d say that any tool that takes exploration away from a tester is antithetical to the idea of exploratory testing - it is scripting. It may be empowering, valuable, and so on, but, to me, it has not a thing to do with exploratory testing.

That’s fine; what I think of as ET and what someone else thinks of as ET can differ. But I think if we’re going to talk about alternative definitions (so that someone like me can translate between them), we should talk about what the new definition actually is, and whether it’s understandable and defensible. Otherwise I (and anyone using the term as taught by the people who worked on it) am likely just going to ignore it. I think that @andrewkelly2555 pretty much nailed it already, and asking for a definition was their first question. I think that was right to be first.

I have been seeing multiple posts talking about manual coding this year, “quietly walks away”.

On exploratory testing, let’s narrow it down to, say, session-based exploratory testing. I personally believe it would be unreasonable to drop the learning aspect, as learning is one of its primary goals. It is not a byproduct or a nice-to-have; it’s a specific goal: you want to learn something useful, usually with regard to a specific risk.

I did not learn much brushing my teeth today, but a few weeks ago I tried out a new toothbrush and my approach changed to almost an exploratory session, where I learned quite a bit about that toothbrush as I had a decision to make: do I like this new one more than my old one?

Whilst it’s a nice idea that all testing should be exploratory and as such all testing is about learning, I am not convinced that is always the focus. Some models of testing can lean more into learning than others: if you are looking at very well-known areas of the system and taking a confirmatory or verification approach to your testing, you do learn, but it’s not the focus. When you lean more towards discovery, investigation, and experiments, which are traits of an exploratory test session, the focus is learning.

Anyway, for me it is a key element of session-based exploratory testing. We have established, though, that it can mean different things to others, but I’ll still go to bat for that learning aspect, as it’s core to my own testing approach and I believe it helps me stand out compared to when I follow other models of testing.