How do you prepare test data sets?

How do you go about preparing test data sets?

I’m specifically interested in how you account for diversity when preparing these data sets, so that you cover a wide range of potential users.


Very interesting topic!

#automate it.

I prep my test data in isolation… very important!
So for all my automated cases I create test data from scratch, which takes about a second.

When part X has to be tested, the data for X's test cases gets set up, which can cover multiple test cases. Since I know which tests will run, I also create the edge-case data so my tests have no issues running.

How do I cover a wide range of cases? That comes from the acceptance criteria & user experience. It's mainly the business that comes up with stories of X & Y who did 1, 2 & 3 in prod; then we create a case for it in our script, and we can add it to the regression set if required. Many edge cases are automated. Some still have to be tested manually, but we’ve automated the creation of test data for those as well (at the push of a button). Then our manual testers can go nuts with fresh data and not get stuck looking for good data.

Afterwards I delete the isolated test data that was created automatically, to keep a ‘clean environment’.
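
For anyone who wants to picture the create / test / delete cycle described above, here is a minimal sketch. It is not the poster's actual setup: the `users` table, the `run_id` tagging column and the `isolated_users` helper are all hypothetical, and an in-memory SQLite database stands in for the real test environment.

```python
import sqlite3
import uuid
from contextlib import contextmanager


def make_db():
    # Hypothetical schema; run_id marks which test run created a row.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, run_id TEXT)")
    return conn


@contextmanager
def isolated_users(conn, names):
    """Create fresh users for one test run and delete them afterwards."""
    run_id = uuid.uuid4().hex
    rows = [(uuid.uuid4().hex, name, run_id) for name in names]
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    try:
        yield [row[0] for row in rows]
    finally:
        # Only rows created by this run are removed, so other data is untouched.
        conn.execute("DELETE FROM users WHERE run_id = ?", (run_id,))


conn = make_db()
# Includes a couple of edge cases: an empty name and a name with an apostrophe.
with isolated_users(conn, ["Alice", "", "O'Brien"]) as ids:
    count_during = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
count_after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Tagging rows with a per-run id is just one way to make the cleanup safe; deleting by the returned ids would work as well.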


To manage the different combinations without drowning in data, use pairwise testing with tools like PICT (GitHub - microsoft/pict: Pairwise Independent Combinatorial Tool).
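
To show the idea behind pairwise testing, here is a small greedy sketch in plain Python. It is not PICT's algorithm and does not use PICT at all; the parameter names and values (`locale`, `age_group`, `account`) are made up for illustration, and the brute-force search only suits small models.

```python
from itertools import combinations, product


def covers(case, pair):
    """True if the test case contains both parameter values of the pair."""
    (name_a, value_a), (name_b, value_b) = pair
    return case[name_a] == value_a and case[name_b] == value_b


def pairwise_cases(parameters):
    """Greedily pick cases until every pair of values appears at least once."""
    names = list(parameters)
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in parameters[a]
        for vb in parameters[b]
    }
    candidates = [
        dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))
    ]
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda case: sum(covers(case, p) for p in uncovered))
        chosen.append(best)
        uncovered = {p for p in uncovered if not covers(best, p)}
    return chosen


# Hypothetical user attributes; the full product would be 3 * 3 * 2 = 18 cases.
parameters = {
    "locale": ["en", "de", "ja"],
    "age_group": ["minor", "adult", "senior"],
    "account": ["free", "paid"],
}
cases = pairwise_cases(parameters)
```

Every pair of values still shows up in at least one case, but the list is shorter than the full 18 combinations.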


Two relevant podcasts for this topic:

(shamelessly plugging here, because I was one of the guests :grinning_face_with_smiling_eyes: )


The focus of our test data is not so much on the user, but more on the patient data that is handled by our systems. For that we rely on real-life usage data (which comes back to us anonymised, GDPR compliant). This data gets aggregated and stored in an ever-growing sample database, which we inject into our test systems.

Also, we use fake data generators, like Faker for Python


(There was a thread about this some time back? Tools for supplying high quality data for test - #14 by adamattonic ) But I prefer to use a copy of a real customer database (anonymised, of course), too, because fake data rarely covers all the relationship combinations in systems that have interactions across the entire system. Things like exercising non-default security perms on objects in one component, as well as consistency of behaviour in a different component that lives in a different table or even a different database, are non-trivial to fake. Upgrade/migration testing also tends to work better on real data. I mean, that’s why customers buy your product: because it’s a system, not because it is components.
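
One way to keep those cross-table relationships intact while scrambling a copy is to map each real value to the same replacement everywhere it appears. The sketch below is only an illustration: the tables, the field names and the `pseudonym` helper are hypothetical, and keyed hashing like this is pseudonymisation, which on its own may not satisfy whatever anonymisation rules apply to your data.

```python
import hashlib
import hmac

# Hypothetical secret; in practice keep it out of version control.
SECRET_KEY = b"replace-me"


def pseudonym(value, prefix):
    """Map a real value to a stable replacement, identical wherever it occurs."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


# Made-up rows standing in for two related tables.
customers = [
    {"id": "C-1001", "name": "Jane Smith"},
    {"id": "C-1002", "name": "Ravi Patel"},
]
orders = [
    {"order_id": "O-1", "customer_id": "C-1001"},
    {"order_id": "O-2", "customer_id": "C-1001"},
    {"order_id": "O-3", "customer_id": "C-1002"},
]

anon_customers = [
    {"id": pseudonym(c["id"], "cust"), "name": pseudonym(c["name"], "name")}
    for c in customers
]
# The same customer id produces the same pseudonym, so the link survives.
anon_orders = [
    {"order_id": o["order_id"], "customer_id": pseudonym(o["customer_id"], "cust")}
    for o in orders
]
```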


I also prefer to use an anonymised copy of a real database if possible. As @conrad.braam says, fake data rarely covers all the relationship combinations. I also think it depends on how well the backend is understood by whoever is creating the data: if you create test data without really understanding the full extent of how the data should be created, it’s easy to end up with data that doesn’t have much value.


It entirely depends on the app under development. If your project is data-driven and operates in the real world, then you will have a ready-made source of test data to draw upon.

This was a big part of my job for 15 years, as I was working on a system to capture regulatory data from UK utility companies, and my test manager had emphasised that test data should be completely anonymous; certainly when we started out, the companies providing the data considered it all to be commercially confidential. I was able to review previous years’ datasets and compile a fake dataset based on likely values for things like length of water mains refurbished, number of complaints received and so on based on applying my “skill and judgement”, as they used to say in newspaper competitions.

I was then able to extend that dataset into a time series by making assumptions about how certain measures would increase or decrease over time - so, say, mains refurbishment would increase at 2% per year, customer complaints would decrease by 3% per year, and so on. Eventually, I had a fairly robust dataset that looked pretty much like the real thing.
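
As a rough sketch of that kind of extension (not the original spreadsheet or tooling, which isn't described here), the snippet below applies a fixed yearly rate to each measure. The base values and the measure names are invented, and compounding the rate year on year is an assumption.

```python
def extend_series(base_values, annual_rates, years):
    """Project each measure forward, compounding its annual rate of change."""
    series = []
    for year in range(years):
        series.append({
            measure: round(value * (1 + annual_rates[measure]) ** year, 2)
            for measure, value in base_values.items()
        })
    return series


# Hypothetical starting figures for year 0.
base = {"mains_refurbished_km": 100.0, "complaints": 1000.0}
# +2% per year for refurbishment, -3% per year for complaints.
rates = {"mains_refurbished_km": 0.02, "complaints": -0.03}

series = extend_series(base, rates, 5)
```

Each entry in `series` is one year's fake return, with year 0 equal to the base values.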

The strange thing was that I wasn’t the only person doing this. One company were doing something uncannily similar in drawing up their own actual return to the regulator, though this only came out in the Serious Fraud Office investigation. (The company was lucky, IMHO, to get away with a £35 million fine and have some of their former senior people avoid custodial sentences.) It was the fact that some of their numbers looked uncomfortably like my test data that first set alarm bells ringing. Sometimes testing has direct impact on the real world.


“some time back” – 2 years ago, strong memory! nice :stuck_out_tongue:

I can see why you prefer it, and in some cases it is indeed nice, like for migration tests etc.
But if you have to copy & scramble prod every single time to get data, that takes up a lot of time. That’s why we create our own data. If you have a specific case from prod, you just re-create it (automated) on your test environment in isolation. So it only takes a second to create that data, instead of doing a copy from prod over and over again. What if your data gets corrupted each test cycle?

If you have a regression test every week or two, doing a copy is a bit overkill, no?

For a fast-deploying company I prefer to create our own fake data based on cases from prod, because what if the case in prod disappears?

Don’t get me wrong here… in some cases I do prefer it also, depends on the situation I suppose :stuck_out_tongue:

True! But what kind of work are you doing then if you don’t know what you are creating?
Just doing stuff to do stuff & keep busy? :smiley:

So yea you defo need insights in test cases from prod :wink:


haha you caught me :stuck_out_tongue:


Defo have to anonymise any customer test data that you do use. It’s very context dependent, to be honest. But when a customer, heaven forbid, has a table with 20 000 rows where you only expected around 1000, it suddenly helps to have a pre-existing read-only database handy.
Most of the time, to be honest, I’m using data from one of the magic lists that other people share on this forum from time to time.


To be fair, the business emphasis in that role was on data accuracy and maintaining confidentiality; application performance was a means to an end. And back in the mid-1990s, there weren’t the ‘magic list’ resources available.