OpenAI and what it says about testing with agents

This is what OpenAI says and advises about agentic testing.

What are your thoughts?

What is good about it?

What is not so good?

What would you wish it said? :melting_face:

4. Test

Developers often struggle to ensure adequate test coverage because writing and maintaining comprehensive tests takes time, requires context switching, and demands a deep understanding of edge cases. Teams frequently face trade-offs between moving fast and writing thorough tests. When deadlines loom, test coverage is often the first thing to suffer.

Even when tests are written, keeping them updated as code evolves introduces ongoing friction. Tests can become brittle, fail for unclear reasons, and can require their own major refactors as the underlying product changes. High quality tests let teams ship faster with more confidence.

How coding agents help

AI coding tools can help developers author better tests in several powerful ways. First, they can suggest test cases based on reading a requirements document and the logic of the feature code. Models can be surprisingly good at suggesting edge cases and failure modes that may be easy for a developer to overlook, especially when they have been deeply focused on the feature and need a second opinion.

In addition, models can help keep tests up to date as code evolves, reducing the friction of refactoring and avoiding stale tests that become flaky. By handling the basic implementation details of test writing and surfacing edge cases, coding agents accelerate the process of developing tests.

What engineers do instead

Writing tests with AI tools doesn’t remove the need for developers to think about testing. In fact, as agents remove barriers to generating code, tests serve a more and more important function as a source of truth for application functionality. Since agents can run the test suite and iterate based on the output, defining high quality tests is often the first step to allowing an agent to build a feature.

Instead, developers focus more on seeing the high level patterns in test coverage, building on and challenging the model’s identification of test cases. Making test writing faster allows developers to ship features more quickly and also take on more ambitious features.

Delegate / Review / Own

Delegate: Engineers will delegate the initial pass at generating test cases based on feature specifications. They'll also use the model to take a first pass at generating tests. It can be helpful to have the model generate tests in a separate session from the feature implementation.

Review: Engineers must still thoroughly review model-generated tests to ensure that the model did not take shortcuts or implement stubbed tests. Engineers also ensure that tests are runnable by their agents: that the agent has the appropriate permissions to run them, and that it has context awareness of the different test suites it can run.

Own: Engineers own aligning test coverage with feature specifications and user experience expectations. Adversarial thinking, creativity in mapping edge cases, and focus on the intent of the tests remain critical skills.

Getting started checklist

  • Guide the model to implement tests as a separate step, and validate that new tests fail before moving to feature implementation (a sketch of this step follows the quoted section below).

  • Set guidelines for test coverage in your AGENTS.md file

  • Give the agent specific examples of code coverage tools it can call to understand test coverage

Source: Building an AI-Native Engineering Team
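As a rough illustration of that first checklist item, here is a minimal sketch of a test authored before the feature exists, so the first run should fail. The `applyDiscount` function, its module path, and the Vitest setup are assumptions for illustration, not something from the OpenAI article.

```ts
// Hypothetical sketch of "tests first, confirm they fail": this file is written
// before ./pricing exists, so the first run fails (red), which confirms the
// tests really exercise the behaviour before the agent implements the feature.
import { describe, expect, it } from "vitest";
import { applyDiscount } from "./pricing"; // not implemented yet

describe("applyDiscount", () => {
  it("applies a 50% discount to the order total", () => {
    expect(applyDiscount(100, 0.5)).toBe(50);
  });

  it("rejects discount rates outside the 0-1 range", () => {
    expect(() => applyDiscount(100, 1.5)).toThrow();
  });
});
```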

What an agent potentially can do can be very different from what you have set it up to do, different again from what you get out of the box, and there is no guarantee it will do things the same way every time.

This reads like developer-level testing, in a way.

There is a huge level of trust here that requirement documents are accurate and are understood by the agent in the same way every time. However, if you watch something like the Playwright test generators, you may notice they have not yet factored in how to pass requirements in; it's not there out of the box.

Code-based test generators will be a bit different from observed-behavior generators. I'm thinking the former give more mechanical coverage, edge cases and so on, and in my experience a lot of developer tests are fairly close to this. I see value almost straight away in this one.

Observation-based generators, i.e. those creating tests by running the browser, will likely get different coverage. My early experiments point to common known issues: yes, they seem to be able to leverage oracles and heuristics. For example, with the oracle "Clear indication of submission status", it will find the cases where that indication is missing. In theory you should be able to feed it your oracles, heuristics and bug taxonomies and let it find the easy-to-find issues based on those, and create tests around them. I'm not convinced this is a sensible path though, even as a basic entry point to deeper testing.
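For what it's worth, here is a minimal sketch of what a test derived from that oracle might look like in Playwright/TypeScript. The page URL, field labels and expected wording are hypothetical; the point is only that the oracle "clear indication of submission status" becomes an explicit assertion.

```ts
import { test, expect } from "@playwright/test";

test("submitting the form gives a clear indication of submission status", async ({ page }) => {
  // Hypothetical form page and field labels, for illustration only.
  await page.goto("https://example.test/contact");
  await page.getByLabel("Email").fill("user@example.test");
  await page.getByLabel("Message").fill("Hello");
  await page.getByRole("button", { name: "Submit" }).click();

  // The oracle made explicit: some visible status element must confirm
  // what happened to the submission.
  await expect(page.getByRole("status")).toContainText(/submitted|received|thank you/i);
});
```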

A key point I have seen discussed is the goal of these tests. Are they about finding bugs should they exist, or finding change when it happens? Or even, as I had one (healer) agent do last week, trading resilience against signal of its own accord (it removed the test's ability to find bugs).

Even with decent edge-case ability, are they still fairly focused on shallower testing? How does the agent know a good test from a not-so-good one? This is not clear in those statements.

No mention of mutation testing, which seems like it would be a good agent to add to the picture.
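For anyone not familiar with the term: mutation testing deliberately introduces small faults ("mutants") into the code and checks whether the existing tests catch them, which is a reasonable proxy for whether a suite can actually find bugs rather than just re-confirm current behaviour. A self-contained sketch of the idea, not tied to any particular tool:

```ts
// Self-contained sketch of the idea behind mutation testing: a deliberately
// introduced fault (a "mutant") should be caught ("killed") by at least one test.
import assert from "node:assert";

// Original implementation of a hypothetical rule.
const isAdult = (age: number): boolean => age >= 18;

// Mutant: the boundary operator has been changed from >= to >.
const isAdultMutant = (age: number): boolean => age > 18;

// A weak test passes against both versions, so the mutant survives,
// telling us the suite never pins down the boundary at exactly 18.
assert.strictEqual(isAdult(30), true);
assert.strictEqual(isAdultMutant(30), true);

// A stronger test checks the boundary value itself and kills the mutant.
assert.strictEqual(isAdult(18), true);
// assert.strictEqual(isAdultMutant(18), true); // would throw: mutant killed
```

Tools such as StrykerJS (JavaScript/TypeScript), PIT (Java) and mutmut (Python) automate this, and could plausibly be given to an agent as another feedback signal alongside coverage.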

More experimentation is needed, in my view, before that value becomes clear and before we know how much oversight is needed.

Your model of testing will impact things a lot: testing to confirm models, compared to testing to discover and learn, which may need more of the real-time learning that most agents do not have yet.

An article I read has pushed me down the path of considering wet- versus dry-brain-based testing, and whether I can use that to better determine what is suitable for the agent to cover mostly on its own. That's a thread on its own, though it would be interesting if we could add some actual science to our understanding of agents' abilities.

These could be starter queries for OpenAI to consider and use to flesh out what it says about agents. It will forget, though, and not learn from the discussion unless it is very lucky and those discussion logs get flagged. If you are training your own AI, it's worth asking, though.
