Three walls I keep hitting on AI-in-QA rollouts. How are other teams getting past them?

Looking for honest, non-LinkedIn answers from people who’ve been through this.

Over the last few years I’ve led QA teams through several “now with AI” adoptions — including building an in-house AI-feedback plugin for our Playwright suite at one of my previous companies, using GPT-4o and Langdock. Each time, the starting pitch was great: faster test writing, more coverage, less drudgery. Each time, we hit the same three walls about a month in.

Let’s be honest about the baseline first. Most of what we actually use day-to-day is Copilot completing Playwright locators in the IDE. That is fine, and it saves keystrokes, but it’s closer to AI-assisted *typing* than to the “AI for testing” a PM or exec has in mind when they ask for it. The moment you try to push further (generating whole flows, maintaining a suite over time, catching coverage gaps), the real walls show up.

The review paradox. Review every AI-generated test and the speed-up disappears — you’ve just moved the work from writing to reviewing. Don’t review, and the suite quietly fills with tests nobody trusts. I haven’t found a middle path I’d confidently recommend to another lead.

Blindness to what it skipped. When the AI produces 80 tests for a feature, I have no clean way to see which parts of the flow it chose *not* to cover. Coverage metrics rise, but my confidence that the important behaviours are actually exercised doesn’t rise with them.
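To make that concrete, the crude check I keep coming back to is a set difference: a human writes down the behaviours that must be covered before generation, and anything no generated test claims gets flagged. This is only a sketch under assumptions. The behaviour IDs, the `@tag`-in-title convention, and the names `requiredBehaviours` and `findUncovered` are all hypothetical, and it only shows what a test *claims* to cover, not whether it really does.

```typescript
// Hypothetical checklist: behaviours a human listed before the AI generated anything.
const requiredBehaviours: string[] = [
  "@login-lockout",
  "@checkout-empty-cart",
  "@refund-partial",
];

// Pull "@tag" tokens out of a test title (assumed convention: tags live in the title).
function extractTags(title: string): string[] {
  return title.match(/@[\w-]+/g) ?? [];
}

// Return every required behaviour that no generated test title mentions.
function findUncovered(required: string[], testTitles: string[]): string[] {
  const covered = new Set<string>();
  for (const title of testTitles) {
    for (const tag of extractTags(title)) {
      covered.add(tag);
    }
  }
  return required.filter((id) => !covered.has(id));
}

// Made-up titles standing in for what the AI produced.
const generatedTitles: string[] = [
  "locks account after five failed logins @login-lockout",
  "shows empty-cart message at checkout @checkout-empty-cart",
];

const gaps = findUncovered(requiredBehaviours, generatedTitles);
console.log(gaps); // gaps contains only "@refund-partial"
```

The weak point is obvious: someone still has to write the checklist, and the AI can tag a test with a behaviour it doesn’t meaningfully exercise.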

The maintenance cliff. The first release after a real UI change turns half the suite red. The AI that wrote the tests can’t (or won’t) fix them the way a human would — it either can’t find the new selector or confidently writes the wrong one. The whole ROI argument can collapse in a single sprint.

What I can’t tell from my own vantage point is whether these are solved problems somewhere, and I’m just missing the practice that works, or whether the wider “AI testing” wins people post about are narrower than they sound.

For leads / senior QA folks who’ve run this in production for a year or more:

  • How does your team actually handle the review paradox? Sample audits? An AI reviewer on top of the AI writer? Trust-and-prune plus bug-escape tracking?

  • Do you have any clean way to see what the AI chose *not* to test, before you merge it?

  • Has AI-generated coverage survived its first major UI change on your product, or has your team ended up rewriting most of it by hand?
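For the “sample audits” option in the first bullet, this is roughly what I mean: review a fixed fraction of AI-generated tests, chosen deterministically so the same test always lands in or out of the audit set. It is a minimal sketch, not something I’d claim works in practice. The function names, the ID format, and the 20% rate are made up for illustration.

```typescript
// Simple deterministic string hash (illustrative, not cryptographic).
function hashId(id: string): number {
  let h = 0;
  for (let i = 0; i < id.length; i++) {
    h = (h * 31 + id.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return h;
}

// Select roughly `rate` (0..1) of the tests for human review.
// The same ID always gets the same decision, so reruns don't reshuffle the audit set.
function pickForAudit(testIds: string[], rate: number): string[] {
  const clamped = Math.min(1, Math.max(0, rate));
  const threshold = Math.floor(clamped * 100);
  return testIds.filter((id) => hashId(id) % 100 < threshold);
}

// Hypothetical IDs in a "file:title" shape.
const generatedTestIds: string[] = [
  "checkout.spec.ts:applies coupon",
  "checkout.spec.ts:rejects expired card",
  "login.spec.ts:locks account after failures",
  "profile.spec.ts:updates avatar",
];

const auditBatch = pickForAudit(generatedTestIds, 0.2);
console.log(auditBatch);
```

What this doesn’t answer is the part I actually care about: what audit rate is enough, and what you do with the unaudited 80% when a sampled test turns out to be wrong.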

Happy to hear “we tried this, it was a trap” just as much as success stories; those are probably more useful.
