How do you handle your test data?

I made a racket about this: Test Data Management - Prod Data vs Test Isolation | Racket

Using:

  • Test isolation
  • Gold Copies
  • No copy from prod

Some copy pasta notes I had:

Gold Copy Data
What is Gold Copy Data?

In essence, a gold copy is a set of test data. Nothing more, nothing less. What sets it apart from other sets of test data is the way you use and guard it.

  • You only change a gold copy when you need to add or remove test cases.
  • You use a gold copy to set up the initial state of a test environment.
  • All automated and manual tests work on copies of the gold copy.

A gold copy also functions as the gold standard for all your tests and for everybody testing your application. It contains the data for all the test cases that you need to cover all the features of your product. It may not start out as comprehensive, but that's the goal.
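To make the "tests work on copies of the gold copy" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` table, its rows and the helper names are made up for illustration; a real gold copy would more likely be a database dump or snapshot.

```python
import sqlite3


def build_gold_copy() -> sqlite3.Connection:
    """Create the guarded gold copy (hypothetical schema and rows)."""
    gold = sqlite3.connect(":memory:")
    gold.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    gold.executemany(
        "INSERT INTO customers (name) VALUES (?)",
        [("Alice",), ("Bob",)],
    )
    gold.commit()
    return gold


def fresh_copy(gold: sqlite3.Connection) -> sqlite3.Connection:
    """Give a test its own working copy so the gold copy is never modified."""
    working = sqlite3.connect(":memory:")
    gold.backup(working)  # copies the whole gold database into `working`
    return working


def count_customers(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


gold = build_gold_copy()

# A destructive test only touches its own copy.
working = fresh_copy(gold)
working.execute("DELETE FROM customers")
working.commit()
```

After this runs, `working` is empty but `gold` still holds both rows, so the next test can start from the same known state.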

Managing the Coupling between Tests and Data

When it comes to test data, it is important that each individual test in a test suite has some state on which it can depend.

Only when the starting state is known can you compare it against the state after the test has finished, and thus verify the behavior under test. This is simple for a single test, but requires some thought to achieve for suites of tests, particularly for tests that rely upon a database. Broadly, there are three approaches to managing state for tests.

  • Test isolation: Organize tests so that each test's data is only visible to that test.
  • Adaptive tests: Each test is designed to evaluate its data environment and adapt its behavior to suit the data it sees.
  • Test sequencing: Tests are designed to run in a known sequence, each depending, for inputs, on the outputs of its predecessors.

Test isolation

Test isolation is a strategy for ensuring that each individual test is atomic. That is, it should not depend on the outcome of other tests to establish its state, and other tests should not affect its success or failure in any way. This level of isolation is relatively simple to achieve for commit tests, even those that test the persistence of data in a database. The simplest approach is to ensure that, at the conclusion of the test, you always return the data in the database to the state it was in before the test was run.
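One way to return the database to its starting state is to run each test inside a transaction and roll it back at the end. A minimal sketch with Python's built-in sqlite3 module follows; the `orders` table and the test itself are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("INSERT INTO orders (item) VALUES ('seed order')")
conn.commit()  # this committed data is the known starting state


def count_orders() -> int:
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


def test_adding_an_order() -> None:
    before = count_orders()
    try:
        # sqlite3 opens a transaction implicitly before the INSERT.
        conn.execute("INSERT INTO orders (item) VALUES ('test order')")
        assert count_orders() == before + 1
    finally:
        # Undo the test's changes whether it passed or failed.
        conn.rollback()


test_adding_an_order()
```

Most test frameworks let you put the rollback in a fixture or teardown hook so individual tests do not have to repeat it.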

A second approach to test isolation is to perform some kind of functional partitioning of the data. This is an effective strategy for both commit and acceptance tests. For tests that need to modify the state of the system as an outcome, make the principal entities that you create in your tests follow some test-specific naming convention, so that each test will only look for and see data that was created specifically for it.
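A rough sketch of the naming-convention idea, again with sqlite3: each test gets a unique prefix, creates its entities under that prefix, and only queries rows that carry it. The `users` table, the prefix format and the helper names are assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")


def new_test_prefix(test_name: str) -> str:
    """Build a prefix that is unique per test run (hypothetical convention)."""
    return f"{test_name}-{uuid.uuid4().hex[:8]}-"


def create_user(prefix: str, name: str) -> None:
    conn.execute("INSERT INTO users (name) VALUES (?)", (prefix + name,))


def users_for(prefix: str) -> list[str]:
    """Return only the users created under this test's prefix."""
    rows = conn.execute(
        "SELECT name FROM users WHERE substr(name, 1, ?) = ? ORDER BY name",
        (len(prefix), prefix),
    )
    return [row[0] for row in rows]


prefix_a = new_test_prefix("signup")
prefix_b = new_test_prefix("checkout")
create_user(prefix_a, "alice")
create_user(prefix_b, "bob")
create_user(prefix_b, "carol")
```

Both tests share one table, yet `users_for(prefix_a)` sees only the signup test's user and `users_for(prefix_b)` only the checkout test's two users.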


Notes

I'll send you my full notes on Test Data Management, it's quite long XD

5 Likes

I've mostly seen setups where you do a restore from a production database and obfuscate all of the sensitive data covered by GDPR, for example by scrambling credit card numbers and such.
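As a very small illustration of what "scrambling" can mean, the sketch below replaces every digit in a value with a random digit while keeping the length and separators. The function name is made up, and this is not what any particular obfuscation tool does; real setups usually need more than this (consistent mapping across tables, checksum-valid output, and so on).

```python
import random


def scramble_digits(value: str, rng: random.Random) -> str:
    """Replace each digit with a random digit, keeping length and separators."""
    return "".join(
        str(rng.randrange(10)) if ch.isdigit() else ch
        for ch in value
    )


# Seeding is only for repeatable test data, not for security.
rng = random.Random(42)
original = "4111 1111 1111 1111"
scrambled = scramble_digits(original, rng)
```

Note that the scrambled value will usually fail a card checksum, so an application that validates card numbers may reject it.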

For smaller projects I've seen people using libraries such as Faker to generate fake data.

3 Likes

Obviously it depends a lot on your application. But we also have a bunch of bot agents (random based) that we can run to generate random data. Similar to "1000 users do random things for 10 minutes".
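A toy sketch of the idea, not the actual bots: a number of simulated agents each pick a random action per step. The action names are hypothetical, and a fixed number of steps stands in for "10 minutes". Passing a seed makes a run repeatable, which helps when a random run turns up something worth investigating.

```python
import random

# Hypothetical actions; a real agent would call the application here.
ACTIONS = ["create_booking", "cancel_booking", "edit_profile"]


def run_agents(n_agents: int, steps: int, seed: int) -> list[tuple[int, str]]:
    """Let each agent pick one random action per step and log what happened."""
    rng = random.Random(seed)
    log = []
    for _ in range(steps):
        for agent_id in range(n_agents):
            log.append((agent_id, rng.choice(ACTIONS)))
    return log


log = run_agents(n_agents=1000, steps=5, seed=7)
```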

4 Likes

Our product creates and manages timetables for universities and higher education establishments. We have a tool that generates test datasets because our clients will be using the product with staff numbers in the thousands and student numbers in the tens of thousands.

The generator produces staff, student and establishment records, with staff and students generated at random. This does lead to some anomalous records - sometimes individual records are misgendered, sometimes you get an incidence of unusual name combinations that would be highly unlikely in real life - but more problematically, there are sometimes records that are internally inconsistent when applied to some of our more complex scheduling tools, such as those that create exam schedules or generate whole courses from lists of subject modules.

This in turn does mean that we testers have to be able to distinguish between issues in these tools that arise from inconsistencies in the test data and those which represent genuine bugs. The best guide to that is the frequency of observed issues; individual anomalies may not represent a bug. We have to rely on the product knowledge and experience of the test team to be able to separate bugs from data quality issues. If an issue can be replicated across a number of different test subjects, then it is more likely to be a bug.

Well, it helps keep life interesting!

4 Likes

Hi Melissa

I think as others have said, it depends on the type of data you need.
I work on an ecommerce site and each of our test environments has data that is "cut back" from production, with the aim of being able to do this after each code release if needed. This means our test environments have products and banners/layouts etc that are representative of how our site is currently being populated. No customer data is included in this though.

On top of that, we have a set of Excel files that are exports of data we have created for specific purposes, for example, standardised product data and promotions etc that may not be in production anymore. This gives us standard data for our automation and allows us to regression check certain scenarios without spending time creating all of the data. These files are all exported/imported using a tool in our CMS.

This doesn't cover every scenario, so we still have some manual data creation, but it's usually feature specific. This covers us for regression and all major scenarios though.
Any user data we just create on the fly.

Hope that's helped some!

5 Likes

In the Netherlands the use of random data has been discussed during several meetups and conferences of TestNet, the Dutch Special Interest Group in Software Testing.

  • Random data like Social Security Number can be generated using a random generator while checking on validation rules.
  • From a legal point of view there is a chance that you might use an existing Social Security Number. Even if the person whose number is used does not use the software, there is still private data in the test data.
  • Personally, I would not be surprised if privacy laws like GDPR and CCPA are broken in this particular case.
  • For testing purposes, there is a limited set of Social Security Numbers available in the Netherlands.
  • The random data issues might also apply to credit card numbers and the like.
  • I am not a legal expert, so I advise you to talk with the Legal department.
  • One last warning: I once tested a tool which obfuscated production data. During my exploration of the environment I found the production data in an unexpected location.
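For readers wondering what "checking on validation rules" can look like: my assumption is that for the Dutch number (the BSN) this refers to the checksum commonly called the "11-proef". A minimal sketch of that check is below; it only tests the checksum and says nothing about whether a number is actually issued, which is exactly the legal concern raised above.

```python
def passes_eleven_test(number: str) -> bool:
    """Check a 9-digit number against the Dutch "11-proef" checksum."""
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    # Weights 9 down to 2 for the first eight digits, -1 for the last one.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(w * int(d) for w, d in zip(weights, number))
    return total % 11 == 0


checksum_ok = passes_eleven_test("111222333")
checksum_bad = passes_eleven_test("123456789")
```

Here `111222333` gives a weighted sum of 66, which is divisible by 11, while `123456789` gives 147, which is not.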
4 Likes

Obviously, I'm more familiar with UK than with Netherlands law, but I would suggest that if a random generator inadvertently created a genuine social security number, that wouldn't breach personal data rules if it was associated with a randomly-generated name. On its own a social security number cannot be personal information if it is not associated with the person it was allocated to in the first place. An anonymous data item is not personal information; a dataset that enables an individual to be identified is.

3 Likes

As far as I know, the GDPR (General Data Protection Regulation) is applicable in the UK.

A Social Security Number can be used to identify an individual. If the name of the person is changed, then this single number can still be used to identify a person. Suppose you use an existing Social Security Number for testing, then this person must give his, her or their consent for the use.

In general there might be some exceptions for using personal data for testing purposes, but that needs a good legal basis.

Disclaimer: I am not a legal expert. My advice is to contact the Legal department.

1 Like

Right now, I'm testing a product with a smaller recordset, so on-the-fly works for us. In past jobs I have used a similar approach to @marissa, using a DB of data with customer permission and running a tool over it to randomize the content further. The customer database in our case has no personally identifying data, only machines, which are easy to randomize out more usefully anyway. A real customer DB is very useful for getting data that is as inconsistent (as evil, if you like) as possible, and I have used many versions of such a "copy" to simulate all manner of upgrade problems especially. Making real evil data is very hard to do on-the-fly for so many reasons.

3 Likes

I've had it differently depending on who I reported to.

I had a manager who wanted to see blackbox testing consistency, which meant the same data passed in should produce the same results out. We'd build our test data and our test assertions around that. This means the test data is mostly static, and we can build test data using simpler means such as the front end or queries. And we could just manage it using spreadsheets for the most part and pass it into our tools and scripts, since it was data driven. This allows us to have simpler tools and scripts.
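A small sketch of that data-driven style: static rows (here inline CSV standing in for a spreadsheet export) hold both the inputs and the expected output, and one generic runner checks them all. The `order_total` function and the column names are invented for the example.

```python
import csv
import io
from decimal import Decimal

# Static test data, as it might be exported from a spreadsheet.
CASES_CSV = """quantity,unit_price,expected_total
1,10.00,10.00
3,2.50,7.50
0,99.99,0.00
"""


def order_total(quantity: int, unit_price: Decimal) -> Decimal:
    """Hypothetical function under test."""
    return quantity * unit_price


def run_cases(csv_text: str) -> list[dict]:
    """Run every row through the function and return the rows that failed."""
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        actual = order_total(int(row["quantity"]), Decimal(row["unit_price"]))
        if actual != Decimal(row["expected_total"]):
            failures.append(row)
    return failures


failures = run_cases(CASES_CSV)
```

Adding a test case then means adding a row to the data, not writing new script code.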

More recently a newer manager preferred dynamically created data. We script data creation and this allows us to be more mobile with our test tools and scripts. We're aiming to have it compatible with local developer environments, docker images, and in our release/build pipelines. While this allows us to use the tests in more parts of our SDLC, it increases the maintenance of the scripts. It's one of the trade-offs.

5 Likes

You cannot beat Production data in your testing for quality output but, as mentioned, it needs to be heavily obfuscated or anonymised around PII.

While you might review and decide that yes, we need to generate certain types of data with special characters, Nulls, etc., the data from Production in our test databases does always bring out some data related functional bugs which we wouldn't have possibly seen with generated test data.

I have also seen that we integrate with other partners via their APIs and we have no control over their (generated) test data. That brings about issues we only see further on in Production, as their test data is not similar to Production.

3 Likes

I disagree (for some cases). Prod data isn't always the best choice; you lose so many edge cases, and even happy flows are often overlooked.

Production data is nice indeed but you want to re-create this data by scripts and not copy-paste from prod.
IMHO, if your generated test data doesn't pick up these functional bugs, you don't have quality test data and you should review your test-data-generation scripts. Can you cover it for 100%? Of course not, but you can work towards it. That's where the Gold Copy comes in in test data generation, which you build up with cases.

1 Like

Welcome to the MOT @davecoleman .

I can relate. As I hinted in my earlier post, a "real" customer database without lots of sanitizing is really good at finding consistency issues, especially the ones that force customers to "never upgrade because it breaks their system". That is an entire class of risks to the business which a generated environment, being risk averse and continuously updated by nature, might find hard to uncover without extra work.

3 Likes

@han_toan_lin raised the issue of randomly-generated data coincidentally giving rise to real world data existing within test datasets, and he drew my attention to the GDPR. I've looked at that, and I think that no liability attaches in this case. I speak here from the position of someone with (in a previous existence) 30 years of applying administrative law in real world situations in some (admittedly out-of-the-way) corners of the UK government.

Looking at the document Han Lin pointed us to, on page 8 the guidance says:

The GDPR applies to 'personal data' meaning any information relating to an identifiable person who can be directly or indirectly identified in particular by reference to an identifier.

So first, we are looking at data relating to an "identifiable" person. If you have generated your test data randomly, the fact that you happen to hit a valid social security number could be considered coincidental. The person that relates to is not "identifiable" without further action. For your purposes within the application under test, the person that the number applies to is not identifiable. It is only if you go looking using other tools that identifying the person becomes possible. And in such cases, I would say that it was the person going looking who is committing the breach, rather than the person or persons using a randomly-generated number not associated with a real person for purposes of testing a completely different application.

Later in the same paragraph, the guidance says:

Personal data that has been pseudonymised - eg key-coded - can fall within the scope of the GDPR depending on how difficult it is to attribute the pseudonym to a particular individual.

If your application under test is not intended to identify individuals from that one data item (the social security number), and the data allowing identification - a real name that can be associated with a real social security number - does not exist within the application under test, I would suggest that passes the 'difficulty' test.

On page 10, the guidance says:

Article 5 of the GDPR requires that personal data shall be:
"a) processed lawfully, fairly and in a transparent manner in relation to individuals;
b) collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; (and)
c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed..."

The key word here is "processed". Having a randomly-generated social security number, not associated with a real person's name, and using that to test an application meets the requirement to "process...lawfully (and) fairly...". The purpose of holding the information also applies here. The purpose of testing an application is perfectly legitimate. And simply using that randomly-generated number to populate a test dataset meets the requirement to be "limited to what is necessary in relation to the purposes for which they are processed".

In my opinion, it would be the act of setting out to associate that number with any individual's name which would go beyond these provisions and cause a GDPR breach.

At least, that's the explanation I would send to our legal team if this challenge arose in practice. Different organisations will obviously react differently. When I was working in the Government sector, I would be expected to be familiar with the law and its application in practice, and our legal team would give their opinion based on the position of expecting to start out on firm legal ground. A private company might want to err on the side of caution, depending on how confident their legal team are in their own opinion.

And now you see why lawyers charge such high fees!

1 Like

You might have a point, but I am not a legal expert.

Looking at the situation in the Netherlands it might be difficult to defend the use of a random generator, if a special set of Social Security Numbers is available for testing purposes.

During my short search on the internet I did not find a set of test Social Security Numbers for the UK.

If you want to find an equivalent test dataset for the UK, search on "UK National Insurance number" or "NI number".

1 Like

That's really interesting - the thought of separating data problems from product issues. Thanks for sharing that with me.

Thanks @marissa, that did help me understand how you do it.

1 Like

Yikes! That is a worry, to find production data in unexpected places. It makes me think that we need to reduce the risk and create our test data ourselves. It's quite a risk in our product (HR software), but I suppose we could explore it somehow.

Haha, love that "making real evil data".