At AutoExplore, we’re building a benchmark to evaluate and improve agentic exploratory testing capabilities. Our goal:
Create a fair, transparent way to measure how well autonomous agents explore, detect issues, and learn from software systems.
This is a call for input from testing professionals and developers: what metrics should we measure? Measuring the wrong things can distort behavior. Measuring the right things can unlock real progress.
Here’s what we’re currently planning to measure:
Exploration & Coverage Metrics
Number of discovered views / pages / URLs
Coverage of interactive elements
Links tested (number / total)
Buttons tested (number / total)
Inputs tested (number / total)
Total number of operations performed
Data entry variations
Order-of-operation permutations
Why this matters:
Exploratory testing is about navigating unknown territory. We want to measure how broadly and deeply the agent explores without telling it what to do or how to do it.
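To make these counts concrete, here is a minimal sketch (in Python) of how the coverage ratios could be computed from a per-session crawl summary. The `CrawlSummary` structure and its field names are our illustration, not a fixed schema of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class CrawlSummary:
    # Hypothetical per-session summary emitted by the exploring agent.
    pages_discovered: set[str]   # distinct views / pages / URLs reached
    links_tested: int
    links_total: int
    buttons_tested: int
    buttons_total: int
    inputs_tested: int
    inputs_total: int
    operations: int              # total number of operations performed


def ratio(tested: int, total: int) -> float:
    """Share of interactive elements exercised; 0.0 when the app exposes none."""
    return tested / total if total else 0.0


def coverage_metrics(run: CrawlSummary) -> dict[str, float]:
    return {
        "pages_discovered": float(len(run.pages_discovered)),
        "link_coverage": ratio(run.links_tested, run.links_total),
        "button_coverage": ratio(run.buttons_tested, run.buttons_total),
        "input_coverage": ratio(run.inputs_tested, run.inputs_total),
        "operations": float(run.operations),
    }
```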
Defect Detection Quality
We’re thinking in terms of confusion-matrix-style evaluation:
True Positives: issues reported that are real defects
False Positives: issues reported that are not defects
False Negatives: real defects observed but not reported
True Negatives: observed situations that were correctly ignored
Why this matters:
Finding bugs isn’t enough. Reporting noise is costly. Missing critical issues is worse.
We want to measure:
Precision
Recall
Signal-to-noise ratio
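Once judges have labelled each report, these quality metrics reduce to a few lines. A minimal sketch, assuming the TP/FP/FN counts come out of the judging step; the signal-to-noise definition here (confirmed defects per false alarm) is one possible reading, not a settled choice.

```python
def precision(tp: int, fp: int) -> float:
    # Of everything the agent reported, how much was a real defect?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    # Of the real defects it observed, how many did it actually report?
    return tp / (tp + fn) if (tp + fn) else 0.0


def signal_to_noise(tp: int, fp: int) -> float:
    # One possible reading: confirmed defects per false alarm.
    return tp / fp if fp else float("inf")
```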
Efficiency Metrics
Test time
Time to first defect
Time to critical defect
Exploration isn’t just about coverage. It’s about discovering value quickly.
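A sketch of how the timing metrics could be derived from judge-confirmed findings. The record fields (`elapsed_s`, `is_real_defect`, `severity`) are illustrative assumptions, not an agreed format.

```python
from typing import Optional, Sequence


def time_to_first_defect(findings: Sequence[dict]) -> Optional[float]:
    # Earliest judge-confirmed real defect, in seconds from session start.
    times = [f["elapsed_s"] for f in findings if f["is_real_defect"]]
    return min(times) if times else None


def time_to_critical_defect(findings: Sequence[dict]) -> Optional[float]:
    # Same, restricted to findings the judges rated as critical.
    times = [
        f["elapsed_s"]
        for f in findings
        if f["is_real_defect"] and f.get("severity") == "critical"
    ]
    return min(times) if times else None
```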
What Else Should We Measure?
This is where we need the community.
We’re Also Looking for Benchmark Applications
We’re searching for:
Test applications with a known, documented list of issues.
Ideally:
Publicly available
Designed for testing practice
With known defect catalogs
If you know of any applications built for this purpose, we’d love to evaluate them for inclusion in the benchmark and give credit to you and the original author!
Some will question whether agents can do this at all, and maybe what they do is something else, something new that sits between mechanical script execution and the human discovery, simultaneous design and learning focused testing that is often referred to as exploratory testing.
For me, at this point it is still something different, but it is getting closer.
Measuring exploratory testing often comes down to “what of significant value did you discover and learn about during test execution?” There is also the flip side: you can track that you learned something did not have significant issues.
Tracking which risks were covered, how deeply they were covered, and what useful things were discovered is likely where most of the value lies.
To put that into context, consider exploratory test session charters. I looked up an example here:
“Mobile Connectivity: Explore the mobile app with a poor network connection to discover if it still works as expected or if it provides helpful error messages”
What metrics would you add for that charter, and can you then extrapolate those metrics to all charters? This goes back to the idea of risk lists, coverage, and findings of value.
I suspect the agents could output reports in this format, but what I am not sure about is whether they can simulate a professional tester executing test charters with the exploratory testing aspect of simultaneous design and learning.
The counts you mention are not of much value in my view; they may be suitable for scripted testing or for measuring the tool’s stability itself, but you are asking specifically about measuring exploratory testing, which leans towards discovery and learning as its prominent value.
Some will question whether agents can do this at all, and maybe what they do is something else, something new that sits between mechanical script execution and the human discovery, simultaneous design and learning focused testing that is often referred to as exploratory testing.
I think the definition of exploratory testing here is that we don’t tell the agent what to test. The agents need to decide for themselves what they test and what they report as issues. We only expect results from them, which are then identified and classified into different categories for metrics.
Measuring exploratory testing often comes down to “what of significant value did you discover and learn about during test execution?” There is also the flip side: you can track that you learned something did not have significant issues.
This sounds like a great idea for another full benchmark.
The counts you mention are not of much value in my view; they may be suitable for scripted testing or for measuring the tool’s stability itself, but you are asking specifically about measuring exploratory testing, which leans towards discovery and learning as its prominent value.
As a concrete example, the agent can simulate unstable network traffic and then report the finding in English. Another agent, or a set of agents / humans, determines whether the report is one of:
True Positives: issues reported that are real defects
False Positives: issues reported that are not defects
False Negatives: real defects observed but not reported
True Negatives: observed situations that were correctly ignored
Here is an example of what a defect report looks like for a registration / sign-in page.
Expected
After clicking “Luo ilmainen tili” (Create a free account), the user should be navigated to a registration or multi-step onboarding flow that exposes required input fields for name, email, phone, and address so the test step “Complete flow with valid data” can be carried out.
Actual
Despite the history indicating that “Luo ilmainen tili” was clicked, the page at xxx still shows only the login form with two inputs (email and password). No registration or multi-step form containing name, phone, or address fields is rendered, preventing completion of the required registration-flow step.
Reproduction Steps
Open xxx in a browser.
On the landing/login page, click the link or button labeled “Luo ilmainen tili” (Create a free account).
Observe that the UI remains a simple login form with only email and password fields and no visible registration form with name, email, phone, and address inputs.
We can then classify these reports into categories to measure agent success.
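As a loose illustration of that classification step (not the benchmark’s actual schema), a judged report could be represented like this:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    TRUE_POSITIVE = "true_positive"    # reported issue is a real defect
    FALSE_POSITIVE = "false_positive"  # reported issue is not a defect
    FALSE_NEGATIVE = "false_negative"  # real defect observed but not reported
    TRUE_NEGATIVE = "true_negative"    # situation correctly left unreported


@dataclass
class JudgedReport:
    # Free-text fields written by the agent, mirroring the example above.
    expected: str
    actual: str
    reproduction_steps: list[str]
    # Assigned afterwards by human or agent judges.
    verdict: Verdict
```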
That’s a good example. You have a report that includes an expected result and an actual result, which puts it into known-risk confirmatory testing, much closer to a scripted test. Exploratory testing would rarely have that; it would be closer to “investigate registration utilising common risks associated with these features and look for anomalies or problems that occur”.
I’d recommend moving away from the term exploratory testing; it’s going to really confuse those who know it well. In a Playwright discussion the idea of risk-guided “behavioral simulated user tests” came up. It’s not perfect, but it’s much closer than a comparison with a professional tester doing exploratory testing.
Exploratory can be a good sales term, as it can convince some managers that they can drop their professional testers and get a lot of savings in effort from the tool. There is, though, a risk that testers themselves feel it is pretending to do something it is not, so they avoid it completely without giving it a chance to be useful at the thing it actually does. I think this does offer that value, but I’m not seeing it as exploratory.
That report tried to demonstrate one outcome from the testing.
As input, the AI agent was given the instruction: “You are professional tester …. start testing this application at”. We do not tell it the expected results. The agent needs to work on its own to figure out what to test, what to expect, and what to report.
Humans work as judges; this is the idea of the benchmark.