Tools for supplying high-quality data for testing

Hi folks,

Curious what data folks at MoT use to test? I've seen a lot of companies where the dev and staging database environments were hand-built and lacked both the variety and the volume of data found in production. Have you seen this, and if so, how did you overcome it? One possible solution I've seen implemented is to just use production data in the testing environments, but that comes with a whole suite of other issues.

Full disclosure here: the reason I'm bringing this up is that I recently built a product that helps overcome this problem. It's called Tonic (https://tonic.ai), and it keeps your test databases in sync with production while simultaneously removing PII from the production data. This allows testers to work freely with production-like data and get their jobs done more quickly and efficiently.


Hello @adamattonic!

We do something similar. Production data is scrubbed, and placed into the non-production environment for many of our applications. I agree that production data is the best source of data diversity.

Joe

I agree it is the best. Depending on the type of data you have in production, though, you can hit some barriers.

  1. If the data is highly sensitive, some organizations won't allow it to be used in any lower environments.
  2. If the data is highly sensitive, some organizations will allow it in lower environments but will treat it as production data, and with that comes a lot of red tape, regulations, rules, etc., which keep the data safe but also slow down the people trying to test.

As a way around both issues #1 and #2, Tonic creates production-like data that has the same look and feel (i.e. statistical properties) as the real data but is entirely fake and devoid of PHI. In this way you get the best of both worlds: production-like data to test on, without having to deal with the rules and regulations.
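To make "same statistical properties" a little more concrete, here's a toy sketch of the general idea (my own simplified illustration, not Tonic's actual implementation; the column and values are made up): learn the value frequencies from the real column, then draw fake rows from that distribution, so the shape is preserved but no real record is copied.

```python
import random
from collections import Counter

# Toy sketch of "same statistical shape, but fake" (illustrative only, not
# Tonic's actual implementation). Learn the value frequencies from the real
# column, then draw replacement values from that distribution.

def build_sampler(real_values):
    """Return a function that samples from the column's empirical distribution."""
    counts = Counter(real_values)
    values = list(counts)
    weights = [counts[v] for v in values]
    return lambda: random.choices(values, weights=weights, k=1)[0]

# Stand-in for a production column: note 67 is the most common age.
real_ages = [34, 34, 67, 29, 67, 67, 52]
sample_age = build_sampler(real_ages)

# The fake column has roughly the same distribution but copies no real row.
fake_ages = [sample_age() for _ in real_ages]
print(fake_ages)
```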

@devtotest I’d be curious if issue #1 or #2 affects you, given that you work in insurance and possibly deal with PII or PHI.

@adamattonic

Both issues receive constant and vigorous scrutiny in general. Our scrubber has a similar outcome.

Joe

We have issues with keeping test data clean - due to a whole lot of legacy factors, the PII is used as search criteria.

We start with copies of the live data and adjust the elements that could cause emails to get sent to customers and so forth, but we've never found a way to anonymize PII that won't also break the searches our users rely on.

We’re currently working on field-level encryption along with whole-database encryption and hoping that will work for us.

We use a couple of methods to protect our data. Scrambling names, addresses, and dates of birth means we retain all the data, but searching by name no longer returns anything sensible.

We also randomise names, addresses and dates of birth where we do need some form of sensible search criteria for customers.

Both use an encrypted key, so they can't be 'undone'.
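If it helps to picture it, here's a rough sketch of what a keyed, one-way scramble can look like (purely illustrative, not our actual tooling, and the key handling is hand-waved):

```python
import hmac
import hashlib

# Rough sketch of a keyed, one-way scramble (purely illustrative, not the
# actual tooling described above). The same input always produces the same
# output, but without the key the mapping can't be reversed or reproduced.

SECRET_KEY = b"keep-this-in-a-vault-not-in-code"  # hypothetical key handling

def scramble_name(value: str, length: int = 8) -> str:
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    letters = "abcdefghijklmnopqrstuvwxyz"
    # Map pairs of hex digits onto letters so the result still looks name-ish.
    token = "".join(letters[int(digest[i:i + 2], 16) % 26]
                    for i in range(0, length * 2, 2))
    return token.capitalize()

print(scramble_name("Jane Smith"))  # stable for this key, but not reversible
```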

Hi @katepaulk,

I’m curious about your comment on field level encryption. Are you saying you guys might encrypt the test database? If so, how will you test with it?

Your other comment on keeping test data clean definitely rings true with my experience. Cleaning up data once is sometimes easy, but doing it on a regular cadence is trickier, and becomes trickier still as your schema changes over time. Tonic addresses this by integrating with CI/CD pipelines so you can get a fresh set of test data at a regular cadence, and by sending notifications to users as schemas change to ensure that PII never reaches the test db. Admittedly, though, it's a tricky problem and even our solution could use some improvement.
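For the schema-change part, the underlying idea is roughly this (a generic sketch, not Tonic's actual code; the table and column names are invented): diff the production columns against the set you've already classified and flag anything new before it can land in a test refresh.

```python
# Generic sketch of catching schema drift before new PII reaches a test
# database (not Tonic's actual code; table and column names are invented).
# Compare current production columns against the set already classified.

reviewed_columns = {
    ("customers", "first_name"),
    ("customers", "email"),
}

def find_unreviewed(production_columns):
    """Return columns present in production that nobody has classified yet."""
    return sorted(set(production_columns) - reviewed_columns)

# e.g. a new phone_number column appears after a release:
current = [
    ("customers", "first_name"),
    ("customers", "email"),
    ("customers", "phone_number"),
]
for table, column in find_unreviewed(current):
    print(f"Needs review before the next refresh: {table}.{column}")
```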

@adystokes Ady, nice talking with you. I’m a fan of your periodic table of testing.

I have a few questions if you don't mind. When you say that you 'randomize', do you mean you just shuffle values in a column around, so that even if the names are real you'll get a random address and birth date for the user?

If that is the case, how do you handle hierarchies such as <City, Country>? Do you shuffle cities and countries together so you won't end up with something nonsensical like <London, Germany>?
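Just to be clear about the distinction I'm asking about, a toy example (nobody's real process):

```python
import random

# Toy example of the distinction (nobody's real process). Shuffling whole
# pairs keeps <City, Country> sensible; shuffling each column independently
# can produce combinations like <London, Germany>.

rows = [("London", "UK"), ("Berlin", "Germany"), ("Lyon", "France")]

paired = rows[:]
random.shuffle(paired)              # pairs stay intact

cities = [city for city, _ in rows]
countries = [country for _, country in rows]
random.shuffle(cities)
random.shuffle(countries)
independent = list(zip(cities, countries))   # pairs may no longer make sense
```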

What about statistical relationships between columns? For example, does it matter if you have a high percentage of older customers in France as opposed to Italy? Because independently randomizing will lose interesting relationships like those.

Finally, when you scramble columns how do testers handle UI testing? I imagine the UI would look weird and it might be hard to test in that situation?

Thank you so much @adamattonic :blush: and of course I don’t mind questions. If anything below is unclear please let me know.

To answer your first question, essentially yes, our randomisation takes those columns and mixes them up so that a customer's main details, such as first/last name, date of birth and address, come from multiple records. I should say that as all our records are UK-based there is no concept of country, so the scenario you describe isn't applicable. And to further clarify, it's the first three lines of the address that are fully scrambled/randomised, and the second part of the postcode is changed to 9ZZ. The county is left as is, so the county and the first part of the postcode make a 'sensible match'.
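If it helps to picture the postcode part, it's roughly this (a simplified illustration, not our actual scripts):

```python
# Simplified illustration of the postcode handling (not the actual scripts).
# The outward code is kept so county-level matching still works; the inward
# code is replaced with the dummy value 9ZZ.

def mask_postcode(postcode: str) -> str:
    outward, _, _inward = postcode.strip().rpartition(" ")
    return f"{outward} 9ZZ" if outward else postcode

print(mask_postcode("LS1 4AP"))   # -> 'LS1 9ZZ'
```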

I like your question about interesting relationships and statistics. Our analysis is generally done by a specialist business intelligence team but we can pose questions to them if it will help our work.

In the UI with scrambled data the language looks unusual rather than weird, so Mrs Jane Smith could become Mrs Ifdj Zytsm! If needed, we can manipulate the customer records so that when I'm testing our online application and want to register etc., I can make it 'sensible' again. Unless I'm doing something specific with names, a lot of them end up with my surname, as I can remember that easily!

Hope that helps?

Honestly, I’m not sure. It’s a work in progress at the moment.

What we're most worried about is that, for reasons involving the legacy systems we need to integrate with, we have a lot of places where the primary lookup and connection field is the US Social Security Number. At this point we haven't found a way around this - we can't randomize it because we have to comply with format and value rules for it, and we have to make sure that the value for a given person in the database is the same value in every location where that value is stored.


@adystokes, it does help. That is a lot of clarity, thank you. One final question: how do you handle keeping your test data fresh? E.g. what happens when there is a schema change in production, or when you need to add more up-to-date records to the test database?

@katepaulk Is the SSN column an actual FK column in the DB? Regardless, though, that is a tricky problem, especially if you want to preserve joins and referential integrity.
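Either way, one pattern that can sometimes give you both properties you mention - a valid-looking format and the same value everywhere for the same person - is a deterministic keyed mapping. A rough sketch, purely illustrative, with the SSN validity rules heavily simplified and the key handling hypothetical:

```python
import hmac
import hashlib

# Rough sketch of a deterministic, format-valid SSN pseudonym (purely
# illustrative; the real validity rules are more involved and the key
# handling here is hypothetical). Because the mapping is keyed and
# deterministic, the same person gets the same fake SSN in every table,
# so programmatic joins still line up.

SECRET_KEY = b"example-key-kept-outside-the-codebase"

def pseudo_ssn(real_ssn: str) -> str:
    digest = hmac.new(SECRET_KEY, real_ssn.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    area = 1 + (n % 665)              # 001-665: avoids 000, 666 and 900+
    group = 1 + ((n >> 16) % 99)      # 01-99: avoids 00
    serial = 1 + ((n >> 32) % 9999)   # 0001-9999: avoids 0000
    return f"{area:03d}-{group:02d}-{serial:04d}"

print(pseudo_ssn("123-45-6789"))   # same input and key -> same output everywhere
```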

The legacy system is on a monthly release cycle, so test bases are refreshed monthly too. The test bases are a full replica of live, so records are no more than 30ish days old. Complementary systems vary and can be weekly or sooner depending on needs.

Thanks Ady. I'm all out of questions. I really like your team's approach.


@adamattonic - sadly, the database system doesn't have that much referential integrity. It's handled programmatically, partly because it's actually a valid case to have a person who is in the export data records but no longer in the primary table. (We do payroll - we have to keep history records of employees, but we also have a requirement to remove terminated employees, which makes it legitimate to run a report on history records where some of them don't match current employees. Running that same report and not being able to match up to any employees because SSN randomization disconnected things would be… an issue.)