What are the risks of trusting your tools too much?

Have you ever had a test tool tell you everything’s fine, only to find out later that something important was missed? I have. A performance test once gave me perfect-looking results: green ticks, fast response times, everything looked great. But it turned out the test was running against a tiny, unrealistic data set. When we looked closer, we realised the production environment had millions more records, and against that data the same operations would have run about 30 seconds slower.

It only looked good because the tool was checking what we told it to. It didn’t understand the real-world conditions, and it definitely didn’t know what questions to ask. Tools don’t think. Testers do.

So here’s a practical challenge to help you think more critically about tools!

Your Task

Think of a tool you use in testing. It could be a test case management tool, a performance test tool, an AI assistant like ChatGPT, or anything else that supports your testing.

1. Describe the tool

What the tool does and how it’s meant to help with testing.

2. Spot the risk

What’s one way it could give you a false sense of confidence or cause problems if you relied on it too much? Maybe it:

  • Hides bugs behind “green” results
  • Misses context that only a human would notice
  • Blocks collaboration
  • Becomes unavailable
  • Encourages copy/paste over understanding
  • Raises privacy or security concerns

3. Apply your judgement

What could you do to catch or avoid the risk?

  • Sanity-check the results manually
  • Ask someone else to take a look
  • Question what the tool isn’t checking
  • Talk to a developer or rubber duck
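One way to make the first of those mitigations concrete is to check, before trusting a “green” performance result, that the test data was anywhere near production scale. A minimal Python sketch; the function name and the 10% threshold are my own illustrative choices, not from any real tool:

```python
# Illustrative sketch: sanity-check that a "green" performance result
# was measured against realistic data volumes before trusting it.
# The threshold and all names here are invented for illustration.

def result_is_trustworthy(test_rows: int, prod_rows: int,
                          min_ratio: float = 0.1) -> bool:
    """Return True only if the test data set is within roughly an
    order of magnitude of production scale."""
    if prod_rows == 0:
        return True  # nothing in production to compare against
    return test_rows / prod_rows >= min_ratio

# The scenario from the opening story: perfect results, tiny data set.
print(result_is_trustworthy(test_rows=10_000, prod_rows=5_000_000))
print(result_is_trustworthy(test_rows=1_000_000, prod_rows=5_000_000))
```

A check like this won’t think for you, but it turns “the tool said it was fine” into “the tool said it was fine under conditions we verified”.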

4. Share your example below

What tool did you choose, what risks did you spot, and how could human thinking help? You’ll likely find that similar issues recur across different tools. And the more we talk about these risks, the better we all get at spotting them early!



When doing accessibility testing, we always start by doing as much manual testing as the budget allows. Then, if the website is suitable, we use an automated testing tool to extend our test coverage. Our tool of choice is SortSite because it has the best balance (for us) between cost, features, false positives and false negatives. It can test against any WCAG version and level and can predict some of the issues that screen readers will have.

We know that the tool will get things wrong - even tools costing upwards of a hundred times more get things wrong. So we analyse every single issue it reports and we categorise them as follows:

  • A false positive that can be ignored.
  • A false positive that can easily be fixed, making analysis of future test results simpler.
  • A genuine issue that must be fixed.
  • A genuine issue for which the tool has reported the wrong cause and/or fix. We identify the correct cause and fix.
  • A genuine issue that can be ignored because it has no impact on the user experience.

False negatives
Nothing can be done about the false negatives, i.e. the faults the tool does not find. If you want to find those, you need to do more manual testing, but by definition we are using the automation when there is no more budget for manual testing.

Almost every time we use SortSite, we find false positives and report them to the company. To their credit, they acknowledge and investigate each report and publish a fix in the next month or two.

What I find really scary is that some of our competitors just take the report from the tool and copy and paste it into their report format with no analysis. I know this because I have obtained numerous reports where this has been done.

Stuff that doesn’t even get tested
An important factor that I believe the vast majority of accessibility testers overlook is that most automated tools only test each page in one state. Any content that is hidden by means such as “display:none” or “visibility:hidden” will not be tested, which includes the contents of dropdown menus, accordions, carousels, lightboxes and other overlays.

Tools that test a single page, such as WAVE, Deque Axe and ARC Toolkit, are capable of testing a page in whatever state you manually put it in, so you can test all those content types. However, tools that crawl a website and test every page they find can’t do that - they typically test each page in its initial state.

There are some stupidly expensive tools (multiples of £10,000 with very restricted licensing) that you can script such that they can submit forms and test all the “hidden” content when they crawl a website, but that takes a lot of time and the scripts are probably at least as fragile as any other automated testing.
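A rough way to at least enumerate what a crawl-based tool never sees is to scan the markup for inline hiding styles, so a tester knows which states still need manual expansion. This stdlib-only Python sketch is purely illustrative; real sites mostly hide content via CSS classes and scripts, which it won’t catch:

```python
from html.parser import HTMLParser

# Minimal sketch: flag elements hidden via inline styles, so a tester
# knows which page states still need expanding before accessibility
# testing. Hiding done via CSS classes or JavaScript is NOT detected.
class HiddenContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.hidden.append(tag)

finder = HiddenContentFinder()
finder.feed('<div><ul style="display: none"><li>Menu</li></ul>'
            '<p style="visibility:hidden">Note</p></div>')
print(finder.hidden)  # ['ul', 'p']
```

Even a crude inventory like this helps answer the key question from the challenge: what isn’t the tool checking?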

Will all tools share the same issues?
That has not been our experience. Although all tools test for conformance with the same WCAG success criteria, the way they do it is unique. Some use more heuristics than others in order to find more issues, but this increases the false positives. Others, especially those used in the CI process, only use rules, so they should not report any false positives (although they do), but they miss more issues.

The W3C have a project designed to increase the consistency of automated testing, and many vendors take part in this. However, they don’t share everything because each company’s rules and heuristics are their intellectual property and are the basis of their competitive advantage. The tools will therefore always give different results.
https://www.w3.org/WAI/standards-guidelines/act/rules/


When leading our automation working group, we spent some time looking into using Copilot to write automated unit tests. This was on the back of using Copilot to write the code. Unsurprisingly, the tests all passed, and we talked about using it for more types of testing.

The risk here is that Copilot doesn’t fully know whether it has achieved what we wanted. If we over-rely on AI to implement, test and review its own work, maybe the code works fine, but does it solve the problem? The AI tool isn’t using the software and trying to see whether it does what the customer wants. It also doesn’t have the context to understand all the potential challenges: performing good testing involves a lot of investment in domain knowledge, something we might be wary of giving an AI.
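One cheap guard against AI-written tests that pass without proving anything is a crude mutation check: deliberately break the implementation and confirm the suite notices. A hypothetical Python sketch, with every name invented for illustration:

```python
# Hypothetical sketch of a crude "mutation" check. If a test still
# passes after we deliberately break the implementation, the test was
# never really exercising the behaviour. All names are illustrative.

def add(a, b):
    return a + b

def test_add():
    return add(2, 3) == 5

assert test_add()  # green tick - but does it mean anything?

# Mutate: swap the implementation for a broken one and re-run the test.
original = add
add = lambda a, b: a - b   # deliberate bug
survived = test_add()      # True would mean the test proves nothing
add = original

print("mutant killed" if not survived else "mutant survived - weak test")
```

Proper mutation-testing tools automate this idea at scale, but even the manual version forces the question: would these AI-generated tests actually fail if the code were wrong?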

Additionally, after the session we realised that some of the code needed refactoring because it wasn’t taking advantage of our utility classes. We actually ended up investing more time in getting the tool to speed up our work than it would have taken to do the work ourselves. Better prompt engineering and richer context may have helped.

Whilst this was focused largely on unit tests, the same can apply in other contexts.

As a side note, I’ve touched upon this topic in my blogs:


This week, I did a crazy thing: I installed a long-out-of-support version of the product and fired up the basic test app (after making sure the app re-compiled itself automatically). That proved the interface had not broken in the last 5 years, but then the test app just hung. It just hung. Incredulous, I started a new run in the debugger, assuming it was hanging because it was missing a file and the test was simply waiting. In the debugger I stepped through a loop manually 15 times, almost got bored, and then I spotted the problem: something that should have been caught in code review.

End users would never have seen this bug unless they did what I had done: installed an older version after a newer one and chosen a non-default install path. Only the test code and one SDK would go into a dead loop if you did that. I was so struck by the bug that it took me a few hours before I reported it! Not a biggie, but it had been hiding from code review for years.

  1. The tool: setup.exe
  2. The risk: users could install old versions and get unusual behaviour, if and only if they were unlucky enough to run the specific diagnostic app
  3. Apply judgement: initially I thought it was serious, then realised it affected almost zero users
  4. Share an example: as above; basically, try to run tests with an incomplete or invalid installation.

Result: I’m going to take installer upgrade and downgrade test coding more seriously. Always make sure that your tests all fail if you remove any single pre-requisite.
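That closing advice can be sketched as a fail-fast prerequisite check that runs before any test does, so an incomplete install produces a clear failure instead of a hang. A minimal Python sketch; the file names are made up:

```python
import tempfile
from pathlib import Path

# Sketch: fail fast if any installation prerequisite is missing,
# instead of letting a test loop forever. File names are invented.
def check_prerequisites(install_dir: Path, required: list[str]) -> list[str]:
    """Return the list of missing prerequisites (empty means all good)."""
    return [name for name in required if not (install_dir / name).exists()]

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "app.exe").touch()  # simulate a partial install
    missing = check_prerequisites(root, ["app.exe", "sdk.dll", "config.ini"])
    print(missing)  # ['sdk.dll', 'config.ini']
    assert missing, "tests must refuse to run on an incomplete install"
```

Run as a guard at the start of a test session, a check like this turns a mysterious hang into an immediate, explainable failure.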