Where would you trust an AI to write your tests — and where absolutely not?

Hi, I’m Ruslan. I live in Berlin and I’ve spent about two decades working in and around QA, Testing, CI/CD, and developer tooling.

For the last year or so I’ve been building a product called QualityMax. It tries to take on some of the repetitive work that sits around a QA function: generating Playwright, k6, Go, Rust, and Python tests, reviewing pull requests, running static security checks, flagging coverage gaps, and updating tests when selectors change. There’s also a separate CLI tool called qmax-code that does similar work from a terminal.

I’m posting here for two reasons, and I want to be straightforward about both.

First, I’d like honest feedback from people who’ve actually done this job. I’ve been too close to the product for too long and I no longer trust my own instincts about what’s good and what’s lazy. I’d rather hear that a direction is wrong now than keep building in it for another six months.

Some things I’m genuinely unsure about:

- When an AI self-heals a test after a UI change, is that actually useful, or is it hiding a change the team should be reviewing?

- What would a generated Playwright test need to look like before you’d consider merging it into a real suite?

- Coverage gap reports: do they help, or is most uncovered code uncovered on purpose?

- AI-assisted PR review: signal, or more noise on top of the noise you already get?

- Are there parts of testing where AI shouldn’t be involved at all? I’d like to hear where you’d draw that line.

Second, and I’ll put it plainly: if any of this sounds useful to you, I’d be glad if you tried it. There’s a free BYOLLM tier, but just today I’ve set aside 10 Starter-tier spots for people from this community. Sign up, select the Starter tier, and use the code mot_qualitymax_free_starter. That gives you a normally paid tier, with all of the latest AI capabilities, for free. No payment, no call, no follow-up unless you want one. If it’s not useful, I’d like to hear that too; that’s the more valuable outcome for me at this stage.

I’m not going to claim the product is further along than it is. It helps some teams, it frustrates others, and there are things I know it gets wrong. I’d like to learn from people who’d notice the things I can’t see anymore.

Thanks for reading.

-– Ruslan

Use AI to write/do… anything.

As someone who has used AI for two job interviews, one of which it helped me ace, and who only really uses AI as a way to research and to “scaffold”, I find this continuous stream of noisy AI use-case signals rather like a trip back in time. Back to when I was about 16 and got conned into paying far more for a 10-speed bicycle than I should have. Yesterday I got asked, no, forced, to choose between using Gemini or carrying on using Google Assistant. Both claim to have an AI component, but frankly neither of them feels that revolutionary to me. So I’m puzzled: exactly how does an AI “write” tests? I cannot even get it to send a WhatsApp message to my friends while driving my car to the office. My car uses AI too, apparently, but I have to turn some of it off; generally it’s an area of frustration. I do not believe that writing tests should be frustrating. The missing piece is always that the AI never knows “why” it suggested a particular test. And so for me, I welcome the use of AI to write tests, but…

Ultimately we all know that anywhere up to 50% of our tests exist only to increase code coverage, and thus duplicate many paths, all of which has a very real cost. AI is really good at seeing things I cannot even hope to see, though. And I do wonder if we just end up throwing more resources into the bucket. Which is kind of what you are implying, @bestofdesp_qamax? But since I’m not yet using AI to generate tests, I cannot tell how costly it is, sorry.

Conrad — thanks for this. Your line about the AI never knowing why it suggested a test is the cleanest articulation of the core problem in this space I’ve seen. And the cycling-at-16 story made me laugh because it’s exactly how I feel every time a vendor demos a magic “autonomous QA agent” on stage.

Let me try to answer your actual questions straight.

How does it actually “write” a test?

Two paths in practice, and they’re very different:

1. Crawl-driven — an agent loads the app in a real browser, walks the UI, extracts real selectors, records the flow, and emits a Playwright script. The AI is used for navigation decisions and for choosing what to assert. The selectors are observed, not invented. Output looks boring, like a smarter record-and-replay.

2. Intent-driven — a human writes “test checkout with an invalid coupon” and the AI generates the script directly. This is the demo path. It’s also where most of the bad tests come from.

Path 1 tends to produce tests that survive review. Path 2 tends to produce tests that look impressive in a screenshot and rot in six months. A lot of what I’ve spent the last year on is shifting effort from path 2 to path 1.

On “it doesn’t know why”:

You’re right, and I can’t fully solve that. What we do is force the human to write the “why” in plain English before any generation runs — a test case with an acceptance criterion, not a freeform prompt. The AI then has a stated intent to generate against and to check its work against. That’s mitigation, not a cure. An AI-generated test without a human-authored “why” behind it is, in my opinion, a liability.
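A rough sketch of what such a gate could look like, in Python. Everything here is hypothetical and for illustration only: `CaseSpec`, `MissingIntentError`, `build_generation_request`, and the five-word minimum are assumptions, not the product's real data model or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseSpec:
    """A human-authored test case (hypothetical structure)."""
    title: str
    why: str                    # plain-English reason the test exists
    acceptance_criterion: str   # observable outcome that must hold


class MissingIntentError(ValueError):
    """Raised when generation is requested without a usable 'why'."""


def build_generation_request(case: CaseSpec, min_words: int = 5) -> str:
    """Build the text handed to a generator, or refuse if intent is missing.

    The min_words threshold is an arbitrary assumption; it only filters
    out placeholder answers such as "coverage" or "n/a".
    """
    if len(case.why.split()) < min_words:
        raise MissingIntentError(
            f"{case.title!r}: write the 'why' before generating a test"
        )
    if not case.acceptance_criterion.strip():
        raise MissingIntentError(
            f"{case.title!r}: an acceptance criterion is required"
        )
    return (
        f"TITLE: {case.title}\n"
        f"WHY: {case.why}\n"
        f"ACCEPT WHEN: {case.acceptance_criterion}\n"
    )


request = build_generation_request(
    CaseSpec(
        title="Invalid coupon at checkout",
        why="Customers must not receive a discount from a mistyped code",
        acceptance_criterion="Order total is unchanged and an error is shown",
    )
)
print(request)
```

The same stated intent can later be stored next to the generated test, so a reviewer reads the "why" before reading the code.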

On the 50% coverage-padding observation:

That one landed. A “generate tests until coverage hits X%” product is worse than useless — it just multiplies maintenance cost. Half of what I’m trying to build is a gap analyser that separates uncovered code that carries real risk (user flows, error branches, money-touching logic) from code that’s uncovered for good reasons (dead, trivial, framework-owned). Whether that’s genuinely working yet, I can’t honestly tell you. Not enough users have pushed back on it for me to know.
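As a very simplified illustration of that split, here is a sketch in Python. The field names, path hints, and the two-line "trivial" cutoff are all assumptions made up for this example; a real analyser would need far better signals than substring matches, and this is not a description of the actual one.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RISKY = "risky"          # worth a test
    IGNORABLE = "ignorable"  # uncovered for a good reason
    UNKNOWN = "unknown"      # needs a human decision


@dataclass
class UncoveredRegion:
    """A span of uncovered code (hypothetical structure)."""
    path: str
    is_error_branch: bool = False
    is_reachable: bool = True   # False if analysis says the code is dead
    line_count: int = 1


# Assumed hints, for illustration only.
RISK_HINTS = ("checkout", "payment", "billing", "invoice")
FRAMEWORK_HINTS = ("/migrations/", "/vendor/", "/generated/")


def triage(region: UncoveredRegion) -> Verdict:
    """Sort an uncovered region into risky, ignorable, or unknown."""
    path = region.path.lower()
    # Dead or framework-owned code: uncovered on purpose.
    if not region.is_reachable or any(h in path for h in FRAMEWORK_HINTS):
        return Verdict.IGNORABLE
    # Error branches and money-touching logic carry real risk.
    if region.is_error_branch or any(h in path for h in RISK_HINTS):
        return Verdict.RISKY
    # Trivial code such as one-line accessors.
    if region.line_count <= 2:
        return Verdict.IGNORABLE
    # Everything else is left for a person to judge.
    return Verdict.UNKNOWN


print(triage(UncoveredRegion("src/billing/refund.py", is_error_branch=True)))
```

The `UNKNOWN` bucket matters as much as the other two: anything the heuristics cannot place goes to a person rather than being padded with tests.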

On cost:

For a typical crawl + full test-suite generation on a mid-size SaaS, the LLM bill is a few cents to a few euros per run. A PR review is cents. Not free, but small compared to the human-hours it saves when it works, and wasted when it doesn’t. I think the industry is wrong to hide this number; I’d rather publish it.

The framing I keep coming back to is the one you already used: scaffold. AI as scaffold, not AI as co-worker replacing the tester. If that framing doesn’t hold up in practice, the product doesn’t deserve to exist. That’s why I’m asking.

— Ruslan

This is a question I keep coming back to.

From my experience, the split isn’t really about what kind of test — it’s about when AI gets involved.

Generation: yes. AI is genuinely useful for producing test scaffolding, turning a spec into boilerplate, or suggesting edge cases I hadn’t considered. I treat the output like a junior engineer’s first draft — useful starting point, always needs review.

Execution: careful. This is where I’ve been burned. We had a self-healing tool that silently re-targeted a selector after a UI refactor. The test kept passing, but it was clicking the wrong button for three weeks. A failing test is annoying. A passing test that’s wrong is dangerous.

My current rule of thumb: AI can help write the test, but the test itself should run deterministically — no AI deciding what to click or what to assert at runtime.
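One way to picture that rule, as a small Python sketch: when a selector disappears, the test fails and a suggested replacement is attached for a human to review, instead of the run quietly continuing with the new target. All names here (`HealProposal`, `SelectorMissing`, `require_selector`) and the idea of a precomputed suggestions map are assumptions for illustration, not the behaviour of any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class HealProposal:
    """A suggested selector change that waits for human review."""
    test_name: str
    old_selector: str
    candidate_selector: str
    approved: bool = False   # only a reviewer should flip this


class SelectorMissing(AssertionError):
    """The test fails; the proposal rides along for the reviewer."""

    def __init__(self, selector: str, proposal: Optional[HealProposal]):
        super().__init__(f"selector {selector!r} not found on page")
        self.proposal = proposal


def require_selector(
    test_name: str,
    selector: str,
    present: Set[str],
    suggestions: Dict[str, str],
) -> str:
    """Return the selector if it is on the page, otherwise fail loudly.

    `suggestions` maps old selectors to candidates produced ahead of the
    run (for example by an AI at authoring time). A candidate is never
    used to keep the test going.
    """
    if selector in present:
        return selector
    candidate = suggestions.get(selector)
    proposal = (
        HealProposal(test_name, selector, candidate) if candidate else None
    )
    raise SelectorMissing(selector, proposal)


# After a hypothetical refactor, "#submit" is gone from the page.
present = {"#save-draft", "#publish"}
try:
    require_selector("test_post", "#submit", present, {"#submit": "#publish"})
except SelectorMissing as exc:
    print(exc, "->", exc.proposal)
```

With this shape, a case like the three weeks of clicking the wrong button would show up as a red build plus a one-line diff to approve or reject.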

Conrad’s point about “the AI never knows why” resonates. If I can’t read a test and understand what it’s verifying, it’s not a safety net — it’s noise.

Ruslan’s distinction between crawl-driven and intent-driven is useful. I’d add a third axis: does the generated test require AI to execute, or does it produce plain code that runs the same way every time? That’s the line I care about most.