Feedback for AutoExplore autonomous software testing tool

Hello!

I’m excited to share a first look at our new product, AutoExplore! We are trying to push the boundaries of automated exploratory software testing. Its goal is to be able to explore any software with a web UI, and we are continuously expanding the range of issues it can detect!

I’d love to get your honest feedback: what resonates, what is unclear, and what could be improved.

Your insights are really valuable as we continue refining this tool for testers like you.

Thanks in advance!


In terms of the tests it can do, it doesn’t look any different from the test monkeys we were using 25 years ago. Am I missing something? It’s certainly more polished than any test monkey I’ve seen, and it’s nice that events are triggered by mouse and keyboard actions rather than by DOM scripting.

How does it know what it can do?
It would be interesting to know how it recognises components and how it decides how to interact with them. Does it rely on the HTML being correctly coded as links, buttons etc.? What does it do, if anything, if it encounters a <div> element with a JavaScript event handler, but no programmatic role or state? How does it recognise if a component is draggable, such as a carousel? How can we tell if it didn’t interact with something because it didn’t know it was interactive?

How does it know what’s right and wrong?
The tool doesn’t explore in the same way a human does, and it doesn’t ask “what if…” questions. It doesn’t think “That looks odd” and decide to dig deeper into something. It’s not even apparent how it decides if the right thing happened or not. Of course it can find 404 and 500 errors, but what about the following:

  • Can it tell if a calculation has been done correctly?
  • Can it tell if the data validation that is applied matches the specified rules and error messages?
  • How does it know about, understand and verify the business rules?
  • Can it search for the existence of boundary values if it doesn’t know there even is a boundary?
  • If it finds a boundary, how does it know that the correct behaviour happens for each value?
  • Can it identify rendering errors that don’t affect the functionality?
  • Can it search for and find responsive breakpoints if it doesn’t know there are any?

Metrics
The provision of coverage metrics really troubles me. Managers who don’t understand metrics (which is all managers) will see a high coverage metric and believe the application has been thoroughly tested even though real test coverage may be extremely low.

Finally
We always viewed the use of a test monkey as an addition to whatever testing we were planning to do. We would never use one to replace any of our testing. I would view this tool the same way.

Test automation and test monkeys always look impressive, but the more you learn about them the more questions arise regarding what they can and can’t do.


@steve.green First of all, thank you so much for the review! I appreciate it!

Yeah, that’s right. In its current state that is pretty much what it does. However, within the next few weeks we will add additional capabilities, e.g. security scanning, LLM-based reasoning, accessibility scanning and HTML validation.

Yeah, it was implemented this way to try to minimize false positives.

It detects those elements using our own machine learning model. It builds an in-memory representation of the rendered HTML and CSS and then aggregates those HTML elements into “nodes”. We have also modified certain parts of the Chromium web browser source code so that we get a bit more information about the web page. For example, it knows every node that has a hover effect and what would happen if the cursor moved onto that node, without actually moving the cursor (except for the JS logic). That information is then passed to the model, which decides whether a given node is interactable or not.
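Very roughly, the idea is something like this. This is a purely illustrative TypeScript sketch, not our actual model or the Chromium patches; the feature names and the threshold are invented for the example:

```typescript
// Purely illustrative: a crude stand-in for the kind of signal gathering
// described above. In reality a trained ML model makes the decision; these
// feature names and the scoring threshold are invented.
interface NodeFeatures {
  tag: string;
  hasClickHandler: boolean;  // e.g. an onclick attribute or attached listener
  hasHoverEffect: boolean;   // a :hover rule would change the computed style
  hasSemanticHint: boolean;  // role, href, or a button/input/a element
  cursorIsPointer: boolean;  // computed cursor style is "pointer"
}

// Stand-in for the classifier.
function looksInteractable(f: NodeFeatures): boolean {
  let score = 0;
  if (f.hasSemanticHint) score += 2;
  if (f.hasClickHandler) score += 2;
  if (f.cursorIsPointer) score += 1;
  if (f.hasHoverEffect) score += 1;
  return score >= 2; // invented threshold
}

// In this toy version, a <div> with a click handler but no role or href
// would still be flagged, which is the kind of case asked about above.
console.log(looksInteractable({
  tag: "div",
  hasClickHandler: true,
  hasHoverEffect: true,
  hasSemanticHint: false,
  cursorIsPointer: true,
}));
```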

Dragging elements is not supported yet.

Yep, that’s right. It’s not aware of business rules or the application domain. We are trying to answer part of these questions with the “Reasoning” feature. It may still not be perfect, but it should be able to give a list of issues which a human can then go through to decide whether a given finding is worth looking into or should be discarded.

Thanks for the explanation. Since we are all testers, you’ve got to expect us to be sceptical. It’s not enough to know something works - we want to know how it works, so we can figure out what it can and can’t do. In my experience, tools never tell you what they didn’t do unless you specified what you wanted them to do in the first place. If you don’t define the tests yourself, tools never tell you “I wanted to do this test but I couldn’t or didn’t for some reason”.

We might have a record of what the tool did, but it doesn’t give us a gap analysis to show what it didn’t do and what the remaining risks are. It’s a concern that clueless managers (which is most of them) will see a coverage metric of perhaps 90% and decide that that’s enough testing. They won’t realise that at best, 90% of the nodes the tool found have been subject to some kind of test. Those tests might only represent 20% of the tests a good tester would consider sufficient.
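To put rough, invented numbers on that point:

```typescript
// Invented numbers, just to illustrate the gap between the reported metric
// and what a tester would call coverage.
const nodesTouched = 0.9; // "90% of the nodes the tool found got some interaction"
const testDepth = 0.2;    // fraction of the tests a good tester would consider sufficient

const effectiveCoverage = nodesTouched * testDepth;
console.log(`Effective coverage ≈ ${(effectiveCoverage * 100).toFixed(0)}%`); // ≈ 18%
```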

This is important because I suspect this tool will mostly be used by non-testers. Just as automation is often subsumed into the development function, this tool will too, further justifying (in the mind of managers and developers) the elimination of specialist tester roles. Then there won’t be anyone left with a sceptical mindset to ask the questions no one else does.

I would like to try the tool at some point, perhaps when it’s more developed. It would also be good to know the pricing model, to assess how viable it is.

Finally (for now)
How does it deal with branching logic, such as a multi-stage mortgage application form where the contents of each stage depend on the data entered at previous stages? To achieve full coverage, we have to explore the application and work out what data we need to enter to go through all the branches and see every variant of each page. Can the tool do that? Will it ever be able to?

Finally (really, but maybe not)
I don’t see how the tool can test applications where you need to view and/or enter data on more than one machine. An example would be an auction website where you need to test with multiple bidders. Another example would be a system where multiple client machines need to synchronise data with a central server - I love testing this type of system because data loss is so common (as anyone who had an early iPod might recall).

Yeah, it is not something meant for managers but for engineers, so they can understand where the agent has been. Without that, it raises the question of whether it has tested anything at all.

The pricing is based on the number of agents.

If the branching is based on, for example, dropdown selections, it should be able to do that, selecting each different option at least once. It currently prefers unseen options over already seen ones.
Combinations should be covered in time, depending on how long it has been running and how complex the application is.
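As a minimal sketch of that “prefer unseen options” idea (illustrative only, not the actual implementation; the names are invented):

```typescript
// Minimal sketch of a "prefer unseen options" selection policy, based only
// on the description above.
class ExplorationPolicy {
  private seen = new Set<string>();

  // Pick an option, favouring ones that have never been selected before.
  choose(options: string[]): string {
    const unseen = options.filter(o => !this.seen.has(o));
    const pool = unseen.length > 0 ? unseen : options;
    const pick = pool[Math.floor(Math.random() * pool.length)];
    this.seen.add(pick);
    return pick;
  }
}

// Example: over repeated visits to a dropdown, every option is selected at
// least once before any option is repeated.
const policy = new ExplorationPolicy();
const dropdown = ["fixed-rate", "variable-rate", "interest-only"];
for (let i = 0; i < 5; i++) {
  console.log(policy.choose(dropdown));
}
```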

There can be more than one agent, and login credentials can be defined in the settings. I think I need to make another video where I go more deeply into these topics :slight_smile:

I think smart organizations understand that testing and QA work creates a competitive advantage and is a sales driver rather than just a cost.

Sadly, I have not seen any evidence of this. Most organisations don’t understand testing, and testing practices have got much worse in recent years. Much of the testing in agile teams is insanely bad.

I can’t think of any organisation that regards quality as a competitive advantage. For instance, Facebook used to boast that they didn’t have any testers, although that may not be the case now. Google’s philosophy was to launch new products and features as fast as possible with minimal testing, and rapidly fix any issues users reported.

Both approaches are reasonable for products whose quality doesn’t matter. Who cares if there’s a bug in Facebook? How would you even tell if Google’s search results were wrong? However, vast numbers of teams blindly applied those philosophies to products for which quality does matter.

Branching is sometimes done using dropdown selections, but this is not viable where there is a wide range of permitted values, in which case a textbox works better. In the case of a mortgage application, all the inputs such as income and expenditure will be textboxes. There may only be a few possible outcomes, but the tool would have to try hundreds or thousands of permutations to find them if it is selecting values randomly.

This can also be the case with dropdowns if they contain a lot of options. You need to test a huge number of permutations if you don’t know at least roughly where to expect the boundaries.
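A rough, invented illustration of how quickly that search space blows up when values are chosen at random:

```typescript
// Back-of-the-envelope numbers (all invented) for random value selection in
// a textbox, to show why blind exploration rarely lands on a boundary.
const valueRange = 200_000;       // plausible income values, say 0..200,000
const boundaryNeighbourhood = 3;  // the boundary value plus the values either side

const pHitPerTry = boundaryNeighbourhood / valueRange;
console.log(`~1 in ${Math.round(1 / pHitPerTry)} random tries to touch the boundary`);

// With several independent textboxes (income, expenditure, deposit, ...) the
// permutations multiply, which is the same problem as very large dropdowns.
const fields = 3;
console.log(`Joint value permutations: ${(valueRange ** fields).toExponential(1)}`);
```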