GDPR and Test Data Privacy

Dear all,

Here in Europe everyone is talking about the GDPR and test data privacy. Honestly, I was expecting at least one topic on this discussion, but it seems I will have to start one :wink:
By May 25, 2018, any company that serves or possesses data on EU-based customers (including many US companies) needs to comply with all aspects of GDPR.

Have you testers and QA managers started preparing for the big day?
As testers we deal with tons of test data, which often contains personal data. We must rethink our test plans and test scenarios and start working with masked, disguised data.

Do you have a test plan or strategy for the upcoming period, covering e.g.:
-functional testing of the system after removing / masking personal data,
-test scenarios for personal data that is displayed or exported,
-reports containing user data,
-use of scanning tools?

Please feel free to share your ideas and opinions.
Thanks a Lot!


So I don’t deal with sensitive data (yet), but I am intrigued by how this is going to work across the board. I’ve often wondered how much data I as a tester should be exposed to without anonymisation. It feels uncomfortable for the dev team to be seeing customer or client data when those people might not realise that we’re seeing it all.

For me I guess what I would be thinking of when we do start dealing with data is more along the lines of what does the right to be forgotten entail? Obviously all the personal details will be removed from databases, but where else is that data being used and is it anonymised enough that it doesn’t need to be removed? Is it used in reports, or presentations, and how easy is it to track that down?


Personal names in my test datasets are drawn from some of my favourite novels: Mervyn Peake’s ‘Gormenghast’ and the Culture novels of the late Iain M. Banks feature strongly. These have the advantage of not being able to be mistaken for anyone in the Real World, unless you are personally acquainted with people with names like Cheradenine Zakalwe, Vyr Cossont or Shohobahaum Za…

When I was working with really confidential data in the days when I worked for the UK water regulator Ofwat, I created a test set of financial and physical company performance data by looking at past performance and creating datasets using my “skill and judgement” (as they say in all the best competition rules). So, for example, in creating a dataset for the measure “Length of mains refurbished”, I’d pick a number that seemed realistic and then increase that by 5% per year to generate a five-year set of values for that particular measure. I then repeated this across the 1,400 or so other data items. These numbers went on to inform regulatory decisions, such as whether bills needed to be increased for a particular company at any point between the regular five-yearly reviews of bills across the whole industry.
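That generation approach (pick a plausible starting value, grow it by a fixed rate per year) can be sketched in a few lines. This is purely illustrative: the measure name and the 5% growth rate come from the post above, but the value ranges and function names are my own assumptions, not anything Ofwat actually used.

```python
import random

def synthetic_series(base_low, base_high, years=5, growth=0.05, seed=None):
    """Generate a plausible multi-year series for one performance measure:
    pick a realistic starting value, then grow it by a fixed rate per year."""
    rng = random.Random(seed)
    value = rng.uniform(base_low, base_high)
    series = []
    for _ in range(years):
        series.append(round(value, 1))
        value *= 1 + growth
    return series

# e.g. "Length of mains refurbished" (km), starting somewhere realistic
series = synthetic_series(120, 180, seed=42)
print(series)
```

Repeating this per data item (with sensible ranges for each) gives a full synthetic return without touching any real company's figures.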

The only problem was that I began over time to notice that some companies’ actual returns to us had numbers that looked worryingly like my test dataset. This either meant that my methodology was pretty sound and realistic, or that some companies were employing people with minds as devious as mine. Possibly the second, as a couple of years later, following revelations by a whistleblower, we fined one company £35 million for falsifying their data return to us, on top of their being investigated by the Serious Fraud Office…


Under the current European data laws, access to and sharing of personal data is already rigorously controlled, so GDPR isn’t impacting our test data: we already can’t use personally identifiable information.

We take an approach similar to @robertday’s and replace (tokenization) any personally identifiable information with made-up data.

Usually the production data is so sensitive that even this isn’t good enough and we have to generate our test data. We generate it by querying our production dataset for the characteristics of extreme cases without returning the actual data, e.g. length/size, character range, number of parent-child relationships. These stats are then used to generate a synthetic dataset.
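A minimal sketch of that two-stage idea: run an aggregate query against production that returns only shape statistics (never rows), then generate synthetic values matching that shape. The stats values, field names and charset here are invented for illustration, not taken from any real system.

```python
import random
import string

# Stage 1 (run against production; returns only aggregates, never rows), e.g.:
#   SELECT MIN(LENGTH(name)), MAX(LENGTH(name)) FROM customers;
# Here we hard-code example stats such a query might return.
stats = {"min_len": 3, "max_len": 40, "charset": string.ascii_letters + " -'"}

def synthetic_value(stats, rng):
    """Generate a value matching the shape of production data
    (length range, character set) without copying any real record."""
    length = rng.randint(stats["min_len"], stats["max_len"])
    return "".join(rng.choice(stats["charset"]) for _ in range(length))

rng = random.Random(0)
dataset = [synthetic_value(stats, rng) for _ in range(5)]
print(dataset)
```

The same pattern extends to numeric ranges or parent-child counts: the only thing that ever leaves production is the aggregate statistic.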


It’s a big issue in my department. I work for a university where we as the IT department host and manage educational and administrative systems. Test data on acceptance environments used to be near-production data, but we have to rethink that strategy now. The easiest way around is to set the same security policies on acceptance as on production - but that would imply that an extra account for test automation can’t log in, or would have to use the credentials of an existing real user with access to the system.
We worked out a basic policy document on how to cope with GDPR and test data, which will be evaluated next quarter against a couple of scenarios. Both pseudonymisation and anonymisation tooling could be in scope.
For me as test manager, the GDPR legislation could be a blessing in disguise… I’ve been trying to work on a decent synthetic test set for a while, and these circumstances might give management a push towards a structural solution. :-)


I was at an event yesterday with plenty of startups doing SaaS-heavy stuff. As far as testing is concerned they were all relaxed about it - :sunglasses: it’s easy if the tester has to deal with it. In the past I used a lot of synthetic test data and test case design to get around the privacy topic. In some cases it is inevitable to use personal data or access data in your tests, especially when you do backend tests and have a system like Herman.

What concerns me are the effects on testing itself: big shifts in tooling away from non-GDPR-compliant SaaS apps, and the fact that you will have to test the effects of legislation whose interpretation is not yet clear. The fun fact is that there are also parts which can vary from country to country, which will make testing apps with a pan-European distribution even more complex.

Nevertheless, I think this will also bring more awareness for testing, since it can support your proof that you fulfil GDPR.


This talk was just posted on the MoT Slack by @rightsaidjames. It looks like an interesting one for getting to grips with GDPR.


Thanks for cross-posting that here @heather_reid :slight_smile:

For those who haven’t heard of her, Heather Burns (the speaker in the talk you linked) is a web designer-turned-legal-adviser (albeit not a qualified lawyer) who has some great resources on all sorts of topics where Tech/the Web collides with the law. She’s also one of the co-founders of Web Matters, a fledgling professional organisation for people who work on the Web (full disclosure: I’m a committee member in charge of comms). Anyway, the reason I’m mentioning all this is that I highly recommend checking out her site for lots of authoritative resources relating to UK and international law.

We currently have to comply with HIPAA (American health insurance) law as we work with American hospitals and patient data. We currently have no patient data in our test databases, and you need special permissions to access this data in our production environment. It’s completely locked down just now.

It does lead to problems though, and something which has always troubled me. If we can’t see the data that clients are uploading to our production environment, how can we reliably reproduce the problems they see? In our industry, the file format/structure often deviates from the documentation specification (a massive pain!). If clients are deviating, and we can’t see their data, how are we meant to test effectively?

We’re currently discussing whether we can loosen these rules and use client data as part of our development process.

In an earlier existence, I worked with forensic accountants on reviewing company performance across a single sector. Each company’s data was commercially confidential; we had access to it because that access was laid down in law, and their confidentiality was defended at our end through the stupendous powers of the UK’s Official Secrets Act 1911.

We wanted the accountants and consultants to grow their expertise in the sector and apply lessons and techniques learnt in the audit of one company to a different company when they changed projects or contracts. This had to be done whilst the accountants exercised professional compartmentalisation of their knowledge of each company’s information to preserve the client company’s confidentiality. The protocols that governed this were known within the profession as “Chinese Walls”.

Over a period of a few years, consultancy staff would rotate between projects and so achieve the sort of learning points that we as the regulator were looking for. But we wanted to see if that process could be short-circuited, as we wanted to try to achieve the benefits of shared expertise sooner than five years.

I proposed the idea of a “Chinese Window” - a window in a Chinese Wall, through which consultant A could show consultant B something interesting and helpful without consultant B actually getting their hands on the data.

It sounds as though you need to have a method of extracting metadata about file formats and record structures without actually accessing patient data. You could then replicate error conditions by creating test files to the same format as those that caused problems in the live load.
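That suggestion could look something like the following sketch: scan a delimited file for structural metadata only (column names, field lengths, row count), discarding the field contents, then emit a synthetic file with the same shape. The file layout and field names here are invented examples, not any real patient-data format.

```python
import csv
import io

def describe_file(f):
    """Extract structural metadata from a delimited file without
    retaining field contents: column names, per-field max lengths, row count."""
    reader = csv.reader(f)
    header = next(reader)
    lengths = [0] * len(header)
    rows = 0
    for row in reader:
        rows += 1
        for i, field in enumerate(row):
            lengths[i] = max(lengths[i], len(field))
    return {"columns": header, "max_lengths": lengths, "rows": rows}

def make_test_file(meta):
    """Build a synthetic file with the same shape as the described one,
    filling every field with placeholder characters."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(meta["columns"])
    for _ in range(meta["rows"]):
        writer.writerow(["X" * n for n in meta["max_lengths"]])
    return out.getvalue()

# Illustrative input; real files would be scanned inside the secure zone
sample = io.StringIO("patient_id,name\n001,Alice Smith\n002,Bob\n")
meta = describe_file(sample)
print(meta)
```

Only `meta` ever leaves the secure environment; the test file built from it reproduces the format (including any deviations from the documented spec) without any patient data.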


This is a very interesting chat to read, as most of my testing career has been in the insurance/financial sector, which is heavily regulated around the use of personal data. You never use personal production data unmasked. I also created test data from automation scripts or manually, depending on the complexity required. All created data used real street addresses (needed for insurance purposes) and the business postal box as a security measure.

The only time I looked at personal production data was in response to an issue being raised from the business and I needed to triage what had happened. Logins and fully monitored digital footprints were part of our normal day for both front & back-end.


Quick question: in an insurance company, would the policy number be seen as personal data, insofar as, once in possession of the policy number, it can be used to ‘mine’ personal data?


Interesting question, but no, the policy number was not considered personal data under the Privacy Act 1988 we were required to adhere to. A policy number is assigned by the organisation as a way of tracking internally, but provided to the policy holder for their convenience in accessing their information. Refer to this link for more information about the Australian Privacy Act 1988. Please be aware other countries operate under different regulations. In a previous professional life I was an insurance underwriter prior to finding my passion of testing. :grin:


Oops, forgot to mention what you said about mining… It will never happen. Only a very small number of people have access to the production DB, and as soon as they touch it everything is logged and monitored. I’m not sure if it’s daily or weekly, but a report is then saved recording who entered that DB, what they did and when, which needs to be kept secure for audits if required. And it is required, as we have an independent governmental department that audits financial institutions regularly, with the ability to shut down the organisation or heavily fine breaches.

Thanks for the reply Kim, makes sense - we need to have some identifiable field as a constant.

As far as mining goes, I was thinking about it in a similar context to being advised never to give your bank account number out, in so far as it is only a numeric value assigned to your specific details. Likewise, a policy number with personal details removed would still highlight details of investments, personal wealth etc., particularly in the assurance area of fund investments.


The thing is, the GDPR also extends the definition of personal data to include IP addresses and cookie IDs.
