We have started developing together with Claude (AI). We have some guidelines for using AI, but as QA, how can I check, apart from the function working, that the AI and the developer did a good quality job?
Please share your experience or direction. How do you handle this?
Greetings from the sunny Netherlands, Silvi
Make sure your guidelines for Claude include writing tests. What type, technology and levels depend on the app, but for example: an Electron app I was creating with Claude included Vitest unit, component and integration tests plus Playwright e2e tests. With the Vitest tests I asked it to include coverage metrics (Istanbul) so I could quickly see what code wasn’t touched by testing.
Make sure that the guidelines include running and updating tests for new features.
Then look at the tests to see if what is tested makes sense.
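To make the coverage request concrete, here is a minimal sketch of what that can look like in a Vitest config. The reporter choices, globs and threshold numbers are illustrative assumptions, not the setup from my actual project:

```ts
// vitest.config.ts — a minimal sketch; include patterns and thresholds are example values
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'istanbul',         // use the Istanbul instrumenter for coverage metrics
      reporter: ['text', 'html'],   // console summary plus a browsable HTML report
      include: ['src/**/*.ts'],     // only measure application code, not config or tests
      thresholds: {                 // fail the run if coverage drops below these (example numbers)
        lines: 80,
        branches: 70,
      },
    },
  },
});
```

The point of the thresholds isn’t the exact number; it’s that a gen. AI-written feature with untested branches gets flagged automatically instead of relying on someone remembering to look.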
Imagine you are the manager of a very enthusiastic but barely-out-of-school developer intern who yearns to please you. You must delegate all the coding work to them, but you are still 100% accountable for their output.
How would you check they did a good job?
Now, you get a second intern, very similar to the first one, but their only mission is to check that the first one did a good job (i.e. to do your job). What kind of tools, processes, guidelines, evaluations… would you ask them to follow to achieve this goal?
I know I’m not answering your question, I’m just reframing it, but I think it still helps answer it.
Once you have answered it, the only real question is: how does this translate to MY context (technology, business, product, work organisation…)?
Hi Christophe, I like your proposal! My question is: did you request this new responsibility, or was it assigned to you? And are you feeling ready for it?
Validating AI features is genuinely one of the harder QA problems right now because the outputs are non-deterministic. You can’t just assert an exact response like in a normal test; you’re validating behavior within acceptable boundaries, which requires a completely different mindset.
What’s actually worked for me is defining quality criteria before a single test is written: what does “good enough” look like for this AI feature, what’s a clear failure, and what’s acceptable variance? Without that upfront alignment with the PM and dev team, you end up with subjective bug reports that go nowhere.
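As a sketch of what “validating behavior within acceptable boundaries” can look like in practice, here is a Vitest-style test that asserts properties of the output rather than an exact string. The `summarise` wrapper and the limits are hypothetical examples agreed with the team, not a real API:

```ts
// summarise.boundaries.test.ts — a sketch; `summarise` is a hypothetical wrapper around the
// gen. AI call, and the limits below stand in for whatever "acceptable variance" you agreed on.
import { describe, it, expect } from 'vitest';
import { summarise } from './summarise';

describe('summarise (non-deterministic output)', () => {
  it('stays within the agreed quality boundaries', async () => {
    const input = 'A long support ticket describing a payment failure on checkout…';
    const result = await summarise(input);

    // Not an exact-match assertion: we check the properties PM and devs signed off on up front.
    expect(result.length).toBeGreaterThan(20);               // not empty or trivially short
    expect(result.length).toBeLessThan(400);                 // not rambling
    expect(result.toLowerCase()).toContain('payment');       // keeps the key fact from the input
    expect(result).not.toMatch(/as an ai language model/i);  // no boilerplate leaking through
  });
});
```

Tests like this won’t catch every bad output, but they turn “the summary feels off” into a checkable boundary you can run on every build.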
The other challenge is that test run management gets messy fast when you’re running the same scenarios repeatedly to check consistency. I started using a tool recently that keeps run logs tight and surfaces patterns across executions without a ton of manual overhead; that part alone saves a lot of time when you’re doing repeated AI validation cycles. The tooling around AI testing matters almost as much as the test design itself at this point.
Pre-scriptum: I try to use “gen. AI” (i.e. generative AI) instead of just AI, whenever I can and think of it, because AI has been around for decades; only gen. AI has been broadcast so widely recently.
I’m lucky enough that in my company we’re slowly experimenting with gen. AI to find real use cases for us (internally) and for our product/feature/client (externally). We’re not rushing to it, so I’m happy to grow with the field.
No, I’m not feeling ready, but I’m confident I can keep up so far, which is good enough for me, I guess.
If you’re talking about using gen. AI yourself or within your company, then this is a “Human & Process” topic. And my opinion is that it is very complex to find the right balance. (Beware of personal or cognitive overload due to too much stuff being generated at too many steps.)
If you’re talking about providing gen. AI features to others, then this is really a “Quality” topic. And my opinion is that we are moving from a “binary, deterministic, and expert-based system” to a “probabilistic, stochastic system with inference”. Evaluating the Correctness, Goodness, or Worthiness of a feature with gen. AI will now require adjusting for this fundamental change.
Hi @soconnor2017,
Keep your validation focused on more than just “does it run.”
Always look at whether the AI outputs are consistent, accurate, and aligned with your requirements. Test edge cases and tricky inputs to see how it behaves under stress. Check integration quality, error handling, fallback responses, and the user experience when the AI fails, and watch for inappropriate outputs.
Document findings clearly so developers can trace and improve.
This way, you’re not only confirming functionality but also ensuring the AI is trustworthy and usable in real scenarios.
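As one concrete example of checking error handling and fallback responses, here is a hedged Vitest-style sketch. `askAssistant` and the injected client are hypothetical names used only for illustration, not a real library API:

```ts
// fallback.test.ts — a sketch of checking graceful failure; names and messages are illustrative.
import { describe, it, expect } from 'vitest';
import { askAssistant } from './assistant';

describe('askAssistant error handling', () => {
  it('returns a safe fallback message when the model call fails', async () => {
    // A fake client whose completion call always rejects, simulating an outage or timeout.
    const failingClient = {
      complete: async (): Promise<string> => {
        throw new Error('upstream timeout');
      },
    };

    const reply = await askAssistant('What is my order status?', failingClient);

    // The user should get a graceful, non-technical fallback rather than a raw error or a hang.
    expect(reply.ok).toBe(false);
    expect(reply.message).toMatch(/try again|temporarily unavailable/i);
  });
});
```

The design choice worth copying is the injected client: if the AI call can be swapped for a failing fake, QA can exercise the “AI is down or misbehaving” paths without waiting for a real outage.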
From my experience of using AI while expanding a test automation framework with elements in Java and Python…
When writing prompts, give examples and set constraints. It’s specification, not vibing. Ask your AI of choice (or whatever has been forced on you from above) to write tests. Don’t be too surprised if the tests leave you scratching your head and wondering what more you could add. Likewise, don’t be at all surprised if some of the testing is found to be incomplete. That’s just the nature of the beast.
Really like how you framed the shift to probabilistic systems. That idea of evaluating “worthiness” instead of strict correctness is exactly where things start to click for teams.
Also agree on the cognitive load point. Without some structure, AI across multiple steps can get messy fast. Test management tools like Tuskr or Qase help keep those validation cycles and patterns visible.
This is such a real challenge right now.
What we’re seeing is that validation becomes much harder once you move beyond controlled environments.
With AI-driven flows, things can behave very differently across users, devices, and contexts, so the question isn’t just “does it work?” but “how does it behave in the wild?”
Curious how others are approaching this in practice.