How do you keep testing grounded when you don’t have realistic production data?

In a lot of environments, especially where systems deal with sensitive information, regulated data, or government integrations, testers don’t get access to real production datasets. Instead we end up working with synthetic data, partial exports, or carefully sanitized samples.

On paper that sounds fine, but in practice it often means edge cases only appear once the system hits real usage: strange formatting, unexpected blanks, weird encodings, inconsistent records, or volumes that nobody anticipated.

I’m curious how others handle this in their testing strategy.

Do you invest heavily in synthetic data generation?
Do you build libraries of edge cases over time?


I think that often testing does not need to be grounded. The nature of data is a big way that we interact with the product. Considering edge cases should be part of nearly any decent test strategy. Production data is fantastic, but if we use only production data we only find edge cases where the vulnerabilities just happen to match the threats produced by data we don’t actually control. Especially if the testing consists of narrow capability checks. That’s only really acceptable if we don’t care too much about the state of the product.

Often it can be useful to use self-referential data to follow data flow and observe processing more easily. Sometimes extreme data testing in a defocused fashion can find more problems quickly. Sometimes known risks are exercised with data that has been given small changes. Or a combination of all of these during bug investigation or when following risks.
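
As a minimal sketch of what self-referential data can look like (the field names and record shape here are just illustrative): each value encodes its own origin, so any value you see downstream traces straight back to its source.

```python
# Self-referential test data: every value encodes which record and field
# it came from, so data flow can be followed by eye. Field names and the
# record shape are illustrative assumptions, not any particular system.

def make_traceable_records(count: int, fields: list[str]) -> list[dict[str, str]]:
    return [
        {field: f"rec{i:04d}.{field}" for field in fields}
        for i in range(count)
    ]

if __name__ == "__main__":
    for record in make_traceable_records(3, ["name", "address", "postcode"]):
        print(record)
    # If "rec0002.postcode" ever turns up in a name column, an export, or a
    # log line, you know exactly which record and field were crossed.
```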

Ideally I would want synthetic data generation. A smattering of databases in various states. Maybe a tool to put the database in a particular state. Perhaps a randomised data set for defocusing…

Well, ideally I want to be able to fully know the state of the product, be able to put it in any state I choose and process it any way I please. From that impossibility we work within our resources.

Libraries of cases are certainly part of my testing, as I want to build up an idea of what’s going wrong, what tends to fail, so that I can improve my testing. When a bug is found in production that’s something we can now anticipate next time - the process should absolutely be self-healing.

I’ll also say that randomisation, fuzzing, extreme data and behaviours, disregarding protocols, state abuse through multiple access points, hacking and other mayhem are fantastic ways to bring unanticipated problems to light. If the process is vaguely in place then it’s anti-fragile: the harder we hit it the stronger it will become. There can be an emotional factor where people want to treat the software delicately (especially if it’s tested by the people who design/build it) - but users will not.
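
To make the extreme-data side concrete, here’s a crude sketch. The payload list is just a starter set of classic troublemakers, and `submit` is a hypothetical stand-in for whatever entry point you’re actually hitting (an API call, a form fill, a file import):

```python
import random

# A crude "extreme data" generator for defocused testing. The payloads
# cover some classic troublemakers: blanks, whitespace, mixed encodings,
# control characters, oversized input, format abuse. `submit` is a
# hypothetical stand-in for the real entry point under test.

EXTREME_VALUES = [
    "",                         # empty
    " " * 1000,                 # whitespace only
    "\u0000\u0001\u0002",       # control characters
    "Ω≈ç√∫˜µ≤≥÷",               # non-ASCII
    "𝕳𝖊𝖑𝖑𝖔",                    # astral-plane Unicode
    "A" * 1_000_000,            # oversized input
    "'; DROP TABLE users; --",  # injection-shaped
    "0xDEADBEEF",
    "-1", "NaN", "null", "None",
]

def fuzz(submit, rounds: int = 100) -> None:
    for _ in range(rounds):
        value = random.choice(EXTREME_VALUES)
        try:
            submit(value)
        except Exception as exc:  # log and keep going: we want the full picture
            print(f"submit({value[:40]!r}...) raised {exc!r}")
```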

Beyond that, there should be time in testing to try to anticipate problems. That’s part of the built skill of testing: using an understanding of the product, company, environment and users (with some experience, empathy and some lists/mindmaps/notes) to consider alternatives to the accepted reality (and illusions) of the product. I use the HTSM (Heuristic Test Strategy Model) and similar lists for this all the time - what have I not considered, and would that be important here?

Obviously nothing comes for free, and if we don’t invest the time and resources into testing properly then we have to accept that capability checking with an exploratory afterthought will consistently leak problems into production. But perhaps that’s okay too. It’s not always most profitable to be good.


In my head:
Production data is best
Sanitized data is a good fallback

In all honesty, I usually have to craft some amount of synthetic data just to account for edge cases.

I do work on building a library of edge cases over time. As problems come up, log them and capture some relevant test data and scenarios so you don’t get burned by the same thing twice.
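
One way to keep such a library executable rather than tribal, sketched here with some assumptions (the file name, the entry schema, and the `normalize` function under test are all hypothetical): store captured cases as checked-in data that a parametrized test loads, with each entry pointing back at the bug that earned its place.

```python
import json
import pathlib

import pytest

# An edge-case library kept as data, not as memory. Each entry in
# edge_cases.json records a problem that once burned us: the input that
# triggered it, the expected behaviour, and a note pointing back at the
# original bug. Hypothetical schema:
#
# [
#   {"id": "BUG-421", "input": "  O'Brien\u00a0 ", "expected": "O'Brien",
#    "note": "non-breaking space broke name matching in prod"}
# ]

CASES = json.loads(pathlib.Path("edge_cases.json").read_text(encoding="utf-8"))

@pytest.mark.parametrize("case", CASES, ids=[c["id"] for c in CASES])
def test_edge_case_library(case):
    from myapp.cleaning import normalize  # hypothetical function under test
    assert normalize(case["input"]) == case["expected"], case["note"]
```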

I’d be curious about today’s landscape: can AI help address this area of concern? It could minimize the grunt work of things like data sanitization and synthetic data generation, perhaps based on production data sampling, with guardrails in place, of course, for how the source data is handled.

We had an interesting case of using production data for testing, which saved a lot of manual effort, but still required some manual intervention now and then.

Our reporting pipeline was backed by a big-data system, and the stage environment already had production data refreshed daily. So data access was fine; the oracle was the problem.

When something changed in the reporting API, we had no practical way to independently verify whether the output was correct. We had integration tests with simple synthetic data as smoke tests in CI. Going beyond that, however, would have meant understanding the equivalence classes of the pipeline well enough to generate realistic cases with known expected outputs, which effectively meant reimplementing a significant chunk of the pipeline logic.

So we had the data, we just couldn’t tell if the output was right.

Production data helped in two ways:

For case selection, I analyzed the prod API logs to identify which customers, request types, and parameters represented the most common and highest-risk scenarios.

For the oracle, I used current production output as the expected baseline. I built a tool that replayed those requests against two system versions (typically a branch-deployed version vs. stage, which always ran the latest production release) and compared both responses and response times. Differences or significant latency changes flagged a failure, and the whole thing ran in CI.
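
In spirit it looked roughly like the sketch below, covering both steps above: picking the most frequent request shapes out of the logs, then replaying them against both versions. To be clear about assumptions: the log format (one JSON object per line with "path" and "params"), the URLs, and the 2x latency threshold are all illustrative, not the actual tool.

```python
import collections
import json
import time

import requests  # third-party: pip install requests

def top_requests(log_path: str, n: int = 50) -> list[dict]:
    """Pick the most frequent request shapes out of a production API log."""
    counter = collections.Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            entry = json.loads(line)
            counter[(entry["path"], json.dumps(entry["params"], sort_keys=True))] += 1
    return [{"path": path, "params": json.loads(params)}
            for (path, params), _ in counter.most_common(n)]

def replay_and_diff(baseline_url: str, candidate_url: str, req: dict,
                    slow_factor: float = 2.0) -> list[str]:
    """Replay one request against both versions; flag response diffs and
    significant latency regressions. A human still triages the failures."""
    results = {}
    for name, root in (("baseline", baseline_url), ("candidate", candidate_url)):
        start = time.monotonic()
        resp = requests.get(root + req["path"], params=req["params"], timeout=30)
        results[name] = (resp.status_code, resp.text, time.monotonic() - start)

    b_status, b_body, b_time = results["baseline"]
    c_status, c_body, c_time = results["candidate"]
    failures = []
    if (b_status, b_body) != (c_status, c_body):
        failures.append(f"response diff on {req['path']}")
    if c_time > slow_factor * b_time:
        failures.append(f"latency on {req['path']}: {b_time:.2f}s -> {c_time:.2f}s")
    return failures

if __name__ == "__main__":
    for req in top_requests("prod_api.log"):
        for failure in replay_and_diff("https://stage.example.com",
                                       "https://branch.example.com", req):
            print(failure)  # real regression or just data drift? a human decides
```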

To be clear, this was about detecting regressions, not exploration: it wouldn’t catch a bug that was already in production. We were blindly trusting production output as correct and checking whether a new version still produced the same results. (And we did check for obvious failures like no output data. 😉)

One tradeoff was that this assumed production data was stable between runs. For historical date ranges that was mostly true, but edge cases existed where data had been updated, which caused false discrepancies.

Although flagged failures still needed a human to verify whether the diff was a real regression or a data change, this turned hours of manual regression testing into minutes of targeted review.

The self-healing process point is something I want to dig into more. The idea that every production bug becomes a test case you now anticipate is straightforward in theory, but in practice it only works if your team actually closes the loop consistently: logging what failed, why, and what was added to prevent recurrence. That discipline tends to slip under pressure.

The way we’ve tried to handle it is keeping run history and case libraries in the same place, so when something slips through to production there’s a clear path back to what was covered and what wasn’t. We use Tuskr for that, and just having the run record attached to the case makes the retrospective conversation a lot more grounded than trying to reconstruct from memory or CI logs.

Your point about emotional factors is underrated too. Teams that built the thing often test it against their mental model of how it should work rather than how a user would actually interact with it. Synthetic and randomised data helps break that pattern because it doesn’t respect the assumptions baked into the design.


The idea that every production bug becomes a test case you now anticipate is straightforward in theory, but in practice it only works if your team actually closes the loop consistently: logging what failed, why, and what was added to prevent recurrence. That discipline tends to slip under pressure.

And it doesn’t have to become a test case, of course. Firstly, and I think probably obviously, it should only become an explicit case (read: automated check, written test case, repeated charter, regression note, etc.) if it merits the cost of explicit cases existing. Sometimes bad things happen that won’t happen again, sometimes they’re not important enough, sometimes the area is in so much flux that it doesn’t make sense to check for it again, sometimes bugs are what we get from limiting resources in areas for good business reasons, and so on. It may be that discipline, applied to the goal of “no repeat offenders”, doesn’t make business sense, or that the effort doesn’t match the return. I’m not accusing you of doing so, I just think it’s a good thing to note in general and keep an eye on, and I see a certain worship of prematurely and overly formalised tools and processes in what is basically looking at things and telling people stuff.

Testing is also entirely human-powered, in an epistemological sense: the intent, models, observations, inferences and reporting live in the minds of humans. So the existence of previous problems may just feed back into the mental models of testers, giving them a better understanding of existing patterns of behaviour - what typically shows problems, the nature of those problems, and so on. That’s added value without any real admin at all. Things like “5 whys” exist for the same reason: to provide information to self-heal the system. I often map problems to the teams or developers that tend to produce them, giving me better direction when I work for them. There’s no official capacity in which this happens, it’s just something I do to try to test well, and something I try to teach new testers so they can self-improve. A good source of information on this, one that’s criminally underused, is support data: the nature of the kinds of calls/reports that are received, and by which customers. All are useful heuristic oracles (tautology for emphasis) to guide test strategy.

A bug is also more of a point of view than anything. That may explain the nature of arguments about what “is and isn’t a bug” against technical definitions, in environments where formality and standardisation replace the flexibility of thinking and good faith effort.

Breaking a bug down and understanding its parts is, I think, very useful for this sort of self-healing process. I see a bug in terms of threat, vulnerability, problem and victim. Not my invention - I think it was on the RST course. The idea is that some condition or input (the threat) triggers something about the product that causes failure (the vulnerability), which causes the product to do something a person wishes it did not (the problem), which harms a human in some way (the victim).

In this way I can see problems as being associated with certain users, and examine the viewpoints that may lead to someone calling something a bug by examining the victims of existing bugs or risks. Or I could examine bug patterns by the vulnerabilities in the product and look for typical areas of failure, which would then be added to a risk portfolio. Breaking a bug down like this means that we’re not looking for a case regression; we’re looking for the kinds of problems we expect to see more of, and the ones people care about more. And we’re also not looking for one-off issues. And we might find bugs in our processes - it could be that improving communication, adding risks to a kick-off checklist, or informing the design process would reduce problems earlier and more cheaply.

So there are a lot of directions to take in self-healing testing, and I think a lot of value in those considerations that can go hidden, both in terms of value added and cost prevented. Being able to test the testing is one of the benefits of having good testers, I think.