I think it's a really interesting area.
I’ve been experimenting with Playwright’s agents, which also dynamically navigate the browser, so I can share my early thoughts, which may be worth a comparison.
You can give these agents oracles and heuristics to look for specifically. For example, you could ask one to explore for accessibility issues, where it would use both scanners and navigation, including keyboard navigation flows, screen reader compatibility, zoom levels, contrast ratios, and so on.
So it becomes guided exploration of a website, to an extent. Once this is set up and you find a point of value, you could likely turn that into a specific accessibility agent and just change the target URL. I have not done that step yet; I want to get it to the level of a starting point for an AA compliance check, for example. This approach can be extended to other risks in the same way.
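As one concrete example of an oracle such an agent could be handed, here is a minimal sketch of the WCAG 2.x contrast-ratio check. The formula comes from the WCAG spec; how the agent scrapes colour pairs from the page and calls this is left out, and is an assumption.

```python
# Sketch of a WCAG 2.x contrast-ratio oracle an exploring agent could apply
# to (foreground, background) colour pairs it finds on a page.

def relative_luminance(rgb):
    """Relative luminance per WCAG 2.x, channels given in 0-255."""
    def linearise(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearise(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Contrast ratio between two colours; ranges from 1:1 to 21:1."""
    lighter = max(relative_luminance(fg), relative_luminance(bg))
    darker = min(relative_luminance(fg), relative_luminance(bg))
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(fg, bg, large_text=False):
    """AA thresholds: 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)

print(round(contrast_ratio((255, 255, 255), (0, 0, 0)), 1))  # 21.0
print(passes_aa((119, 119, 119), (255, 255, 255)))  # mid-grey on white fails
```

An oracle like this gives the agent a pass/fail judgement it can apply during navigation, rather than relying on a full scanner pass for every page state.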
It does find problems, though these tend to be fairly generic: things a scan would pick up, the basic level of testing. The navigation approach can pick up more, for example no confirmation message displayed to the user on an action where the standards say there should be one.
Are the issues found fairly shallow, like an automated e2e test would find, or can it, once it finds an issue, run its own experiments and do deeper testing? This I have not seen yet, so it would be interesting if this tool did.
Then you have context awareness: how are you feeding in context and requirements so it can explore with those as guides? Out of the box, a lot of tools will not have that context awareness, so they tend to get stuck at the generic, very well-known risk level. That is not entirely a bad thing, as there are often a lot of generic issues in apps.
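To make that concrete, the kind of context I mean is a short requirements brief fed to the agent alongside its generic heuristics, so domain rules become oracles it can check during exploration. Everything below is a hypothetical example, not any tool’s actual input format:

```
# Context brief for the exploring agent (hypothetical format)

Product: internal invoicing app, used by finance staff on desktop only.

Domain rules to treat as oracles:
- Deleting or voiding an invoice must always show a confirmation dialog.
- Totals are always displayed in GBP with two decimal places.
- Saved drafts must survive a page reload.

Out of scope: marketing pages, the legacy admin screens.
```

With a brief like this, the missing-confirmation-message example above stops being a lucky generic find and becomes something the agent is explicitly hunting for.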
You also have the consideration of what you lose out on. A lot of human learning happens when you explore an app; bio/wet brains have thousands of times more receptors running than the machines currently do. Is losing that learning worth the risk of letting a tool explore?
What is not there? This can be addressed by feeding in requirements, but it is not usually covered out of the box.
Unknowns and WTFs: how good is it at picking these up, the dancing gorilla in the background that should not be there, for example? Real-world issues in real-world context.
Cost and time is another angle. I found that in many cases I could find things quicker than the agents running; the cost is not something I have evaluated.
How good is it at testing, and are the tests any good? The usual measure is: if a risk of a specific type exists, does this give the best opportunity to find that risk, one that would not be found efficiently otherwise? This is likely where you look when measuring value.
So, a few points:

- Shallow or deep testing?
- Generic issues only, or context-aware issues?
- Cost and value vs other approaches?
- How well does it use the oracles and heuristics it’s given?
The big one for me remains: what do we lose by having machines try to simulate a professional human investigative tester? Note I also see waste in humans trying to simulate mechanical testing models, like following test cases and scripted testing, but this is potentially the opposite: using machines to inefficiently simulate human wet biological brain strengths, on products that are designed very specifically for a bio-brain environment.
Even if we can do this, should we?