Learning the data analysis and insights side of observability

I’m looking for resources around learning the data/analysis/insight part of observability. How do you know how to look for things that might be important, how to draw a story from data, as well as bias management, and cleaning up your data?

We’ve got the technical side sorted, but there are a couple of people on my team looking to learn more about using that observability data.


Here are my notes on TDM (Test Data Management), copied from reports/books/articles and edited here and there.
It is a brief summary, but it has great insights into test data management and cleaning up your data.
(all publicly available, so feel free to share & use it :smiley: )

Test Data Problems (challenges)

Low test coverage

53% of respondents in the 2018-19 World Quality Report stated that they lacked appropriate test data.

Data for testing new functionality

Production data is often drawn from the expected scenarios that users have exercised against a previously released system. It therefore lacks the data needed to test new functionality. Imagine that a new field has been added to an unreleased version of a web registration form: how can production data possibly already contain this data? Data has to be prepared for new functionality.

Outliers and edge cases

Production data is typically narrow and highly repetitious, focusing on expected user behavior. It lacks the outliers and edge cases required for sufficient test coverage.

Negative Scenarios

Test data copied from production is “happy path”. QA must mitigate the risk of unexpected behavior, for instance by testing error handling rigorously in advance of a release. A copy from production will therefore not work as test data for negative scenarios.

Poor Data Consistency

61% of respondents in the latest World Quality Report cite “maintaining test data consistency across different systems under test” as a test data challenge. End-to-end testing requires complete, interrelated data sets with which to test every combination of API and Database call, as well as any UI activity. The challenge is retaining the referential integrity of this complex, interrelated data when preparing data provisioning for test environments.

Slow and manual data refreshes

Copying complex data is itself a complex task, and manual data refreshes are therefore slow and error-prone.

Crude data sub-setting

Subsetting test data is valuable in lowering storage costs, as well as reducing data provisioning time and the time required to execute tests. However, simplistic subsetting techniques often neglect interrelationships in complex data.

Overly manual data masking

Masking is another valuable TDM process in that it helps mitigate against the risk of exposing sensitive Personally Identifiable Information (PII) to less secure test environments. However, masking must also respect the relationships that exist across tables: mask data in one row, and that change must be reflected consistently across tables. These relationships are often highly complex. Imagine you are testing an online banking system. Transaction logs for bank accounts contain temporal trends that reflect the time of withdrawals, transactions, and purchases. The data is also interrelated mathematically, including for example sum totals made up of numerous other variables. Reflecting these temporal and numerical relationships accurately while masking is highly complex; masking manually, or using certain commodity tooling, will rarely retain the relationships.

Test Data waiting times

36% of respondents to the 2019 Continuous Testing Report state that over half of testing time is spent searching for, managing, maintaining, and creating test data.

Test data bottlenecks can arise as testers wait for data to be provisioned. QA teams are often dependent on a central team responsible for moving production data for test environments. These Ops teams must perform several TDM tasks to move data, including subsetting, masking, and copying.

Hunting for the “right” data

When the data is provisioned to test environments, testers must spend further time finding the exact data combinations needed to execute their test cases. The challenge is that production copies are large, unwieldy, and repetitious. Finding exact combinations of interrelated data is therefore slow and frustrating, and 46% of respondents to the World Quality Report cite finding relevant test data in large data sets as a test data challenge. Finding particular combinations might be accelerated using database queries or a set of scripts. However, these must be tailored to individual tests, while the tests and data are subject to change. The queries or scripts must be updated or recreated each time test cases or test data change, eating up time within an iteration.

Manual data creation

The frustrating hunt for data combinations will often bear no fruit, as production data lacks the data needed for rigorous testing. Testers must then create the complex data needed to fulfill their test cases, particularly the outliers and unexpected results needed for sufficient test coverage. This is time-consuming and error-prone, particularly when performed by hand. Test failures arise from data inconsistencies, and because data is created for particular test cases, the time spent creating data must be repeated as the system under test changes.

Cross-Team constraints

The time and cost associated with moving production-size copies of data to test environments further means that there are never enough copies of data. This creates cross-team constraints, undermining parallel testing and development. With traditional TDM techniques, testers are often forced to wait for an upstream team to finish with the data set they need. Further delays mount when another tester uses or deletes data, or when useful data is lost during a data refresh. Testers must then repeat the time consuming and frustrating hunt for new data or must create new data by hand.

Data Storage Cost

Firstly, there is the infrastructure cost associated with storing several full-size copies of production data. The fast-decreasing cost of data storage helps, as do technologies like database virtualization. Nonetheless, the lack of discernible advantage to retaining unwieldy copies of low-variety, low coverage data means much of the storage cost is waste.

Test run requirements

More problematic for test teams is the resource-intensity of running the large data sets during test execution. Automated test execution with terabytes of data will be highly time consuming and costly, as will executing queries against the data. The test runs will furthermore produce unwieldy resultant data that then needs to be analysed. Test teams can waste a large chunk of their time assessing millions of rows of complex data, comparing it to expected results to produce run results, and 56% of respondents to the World Quality Report cite managing the size of test data sets as a test data challenge.

Increasing regulatory requirements, increasing risk

46% of organisations cite “complying with data security and data privacy regulations for test data” as a test data challenge.

TDM “best practices” at numerous organisations risk non-compliance with data protection legislation and increase the risk of costly data breaches. This is true in spite of the wealth of writing warning against using potentially sensitive production data in test environments, and the fact that global data protection legislation has been growing more stringent for over two decades.

Recent regulation includes the EU’s General Data Protection Regulation (GDPR), as well as the California Consumer Privacy Act of 2018. The GDPR carries staggering maximum fines of 20 million Euros or 4% of annual worldwide turnover, whichever is higher. Much has been written on the challenges of ensuring that test teams have consent to use sensitive data in test environments, while new data protection regulations present challenges that current storage techniques struggle to fulfill.

Throw in the fact that test environments are necessarily less secure than production environments, with a greater associated risk of data leaks, and the risk of non-compliance grows further.

Key Technologies for a Modern TDM Strategy

  1. Data modelling and “data crawling”, to retain referential integrity as high-speed TDM tasks are performed.
  2. Sub-setting with a view to reducing test data size while maintaining variety and the interrelationships in data.
  3. Reliable data masking to mitigate against the risk of costly non-compliance with data protection regulations.
  4. Data cloning to provision test data sets rapidly in parallel, while preserving rare or useful data sets.
  5. Synthetic data generation to supplement production data sources and create every combination of data needed for maximum test coverage.
  6. Automated “Find and Make” provisioning, where the exact combinations needed for given test cases are searched for automatically, with any missing data generated on demand.
  7. Dynamic data definition, generating test data at the same time as test cases.
  8. Automated and repeatable test data preparation, in which previously fulfilled test data requests are performed automatically.
  9. Data comparison, enabling test teams to compare expected results to the data produced during test execution, rapidly formulating accurate run results.
  10. Virtual data generation, creating every Request-Response pair needed for accurate service virtualization.

Why TDM is critical to a project’s success

  • Your test data determines the quality of testing
    • No matter how good your testing processes are, if the test data used is not right or of adequate quality, then the entire product’s quality will be affected.
  • Your test data should be highly secure
    • It is absolutely mandatory that your test data doesn’t contain unmasked production data. If the data is not secure enough, there is every chance that a data breach will happen, which can cost the organization dearly.
  • Test data needs to be as close to real time as possible
    • Not only does test data need to be of high quality, it should also be as close to real-time/production data as possible. Why? For the simple reason that we do not want to build a system/application/product for 6 months only to fail in production because there was no adequate, realistic data to test with.
  • Lowers test data creation time, which reduces overall test execution time
    • This is self-explanatory: faster data creation drastically reduces the overall test execution time.
  • Testers can focus on testing rather than test data creation
    • The main aim of automating the test data management process is to let testers focus on the actual testing, rather than worrying about how the data is created and the technicalities surrounding it. This keeps the team focused on the job at hand (the actual testing) so that it can be done more effectively.
  • Speeds up time to market of applications
    • Faster and more effective test data creation leads to faster and more effective testing, which in turn leads to faster time to market for the application. It is a cycle, and hence it has a compounding effect, release on release.
  • Increases efficiency of the process by reducing data related defects
    • Due to the accuracy of the test data, data related defects will reduce enormously, thereby increasing the efficiency of the process.
  • You can manage lower volumes of test data sets more efficiently
    • Managing lower data volumes is always easier and more cost-effective than managing higher volumes. The maintenance costs associated with higher volumes increase over time and affect operational costs.
  • Process remains the same even as team size increases
    • This is a critical point: you would not need to reinvent the wheel if the team is ramped up. The same process can be followed or extended as the team grows.

Test Data Life Cycle

Requirement Gathering & Analysis

This is pretty straightforward. In this phase, the test data requirements pertaining to the test requirements are gathered. They are grouped into several categories:

  • Pain Areas
  • Data Sources
  • Data Security/Masking
  • Data Volume requirements
  • Data Archival requirements
  • Test Data Refresh considerations
  • Gold Copy considerations

Planning & Design

As the name indicates, an appropriate solution is designed, based on the requirement analysis, to solve the various pain areas in the test data. After looking at the scale of the problem and the feasible solutions, a suitable test data process is suggested, choosing between an in-house solution, a commercial product, or a combination of both. An effort estimate for the entire project is also done in this phase, and a test data plan/strategy is developed that proposes the direction the project will take and the approaches that will be followed to solve the test data problems. That could be either in the form of process improvements or in the form of an automated solution.

Test Data Creation

In this phase, based on the Test Data Strategy, the solution is developed and test data is created through various techniques depending on the project test data requirements. It can be a combination of manual and automated techniques.

Test Data Validation

In this phase, the created test data is validated against the business requirements. This can be done by Business Analysts or using automated tools if the volumes are very high.

Test Data Maintenance

This is similar to a test maintenance phase, where there might be requests for changes in the test data according to the changes in the tests. Hence again the entire life cycle is followed for maintenance of the test data. This might include creation of Gold Copy for future use, Archives for size management, updating of Gold Copy, Restoration of older data for testing, etc.

Gold Copy Data

What is a Gold Copy Data?

In essence, a gold copy is a set of test data. Nothing more, nothing less. What sets it apart from other sets of test data is the way you use and guard it.

  • You only change a gold copy when you need to add or remove test cases.
  • You use a gold copy to set up the initial state of a test environment.
  • All automated and manual tests work on copies of the gold copy.

A gold copy also functions as the gold standard for all your tests and for everybody testing your application. It contains the data for all the test cases that you need to cover all the features of your product. It may not start out as comprehensive, but that’s the goal.

A Hybrid approach to Gold Copy Data

Sub-setting with Data-Masking & Modelling

Sub-setting production data reduces the time that must then be spent masking and copying it to test environments. It helps to quickly provision smaller but representative data sets, so that QA teams spend less time hunting for data. However, data subsets must be complete and coherent for testing, retaining all the relationships between tables in the production data. This can be complex when performed manually or using tools that require manual definition of all the subset criteria.

“Data crawling” refers to the automated technique by which you retain Primary and Foreign key relationships during sub-setting. You automatically “crawl” up and down Parent and Child tables, collecting all the data needed for a coherent data set. This is performed recursively until a complete data subset is made, producing smaller data sets that retain full referential integrity.
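
The crawling idea can be sketched in a few lines. Below is a minimal, tool-agnostic illustration (the tables, columns, and foreign keys are invented for the example): starting from a seed set of rows, it repeatedly follows foreign keys up to parents and down to children until no new rows are pulled in, yielding a referentially complete subset.

```python
# Toy schema held in memory: orders.customer_id -> customers.id.
# All table/column names here are hypothetical.
TABLES = {
    "customers": [{"id": 1}, {"id": 2}, {"id": 3}],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 1},
        {"id": 12, "customer_id": 3},
    ],
}
# (child table, fk column, parent table, pk column)
FOREIGN_KEYS = [("orders", "customer_id", "customers", "id")]

def crawl(seed):
    """Expand a {table: set(primary keys)} seed until referential integrity holds."""
    subset = {t: set(ids) for t, ids in seed.items()}
    changed = True
    while changed:  # repeat until a fixed point: no new rows were pulled in
        changed = False
        for child, fk, parent, pk in FOREIGN_KEYS:
            for row in TABLES[child]:
                # Crawl "up": every selected child row needs its parent row.
                if row["id"] in subset.get(child, set()):
                    if row[fk] not in subset.setdefault(parent, set()):
                        subset[parent].add(row[fk])
                        changed = True
                # Crawl "down": every selected parent pulls in its child rows.
                if row[fk] in subset.get(parent, set()):
                    if row["id"] not in subset.setdefault(child, set()):
                        subset[child].add(row["id"])
                        changed = True
    return subset

# Seeding with one order pulls in its customer, then that customer's other order.
assert crawl({"orders": {10}}) == {"orders": {10, 11}, "customers": {1}}
```

Real tools read the foreign keys from the schema catalogue instead of a hand-written list, but the recursion to a fixed point is the same idea.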

Reliable Data Masking

The sub-setted test data must be anonymous to ensure regulatory compliance with data protection laws. This is where data masking comes in. Any effective test data masking must perform two interrelated tasks: first, it must scan the data for sensitive information; second, it must mask this information. For testing, this masking must furthermore retain the referential integrity of the data. Modelling the relationships at the same time as scanning data for PII is an effective way of achieving this.
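
One common way to retain referential integrity while masking is deterministic masking: the same input always produces the same masked output, so values that matched across tables before masking still match afterwards. A minimal sketch (the salt and naming scheme are illustrative, not a recommendation for any particular tool):

```python
import hashlib

def mask_email(value: str, secret: str = "per-project-secret") -> str:
    """Deterministically mask an email address: identical inputs always yield
    identical outputs, so joins across tables still line up after masking."""
    digest = hashlib.sha256((secret + value).encode()).hexdigest()[:12]
    return f"user_{digest}@example.com"

# Hypothetical tables sharing the email as a cross-table key.
customers = [{"id": 1, "email": "alice@real.com"}]
logins = [{"customer_email": "alice@real.com", "ts": "2023-01-01"}]

for c in customers:
    c["email"] = mask_email(c["email"])
for l in logins:
    l["customer_email"] = mask_email(l["customer_email"])

# Referential integrity holds: the masked values still match across tables.
assert customers[0]["email"] == logins[0]["customer_email"]
```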

Data Cloning

Data cloning facilitates parallel testing and development by rapidly copying isolated data sets to multiple environments. Like sub-setting, cloning reads data from a source database or schema and copies it to a target. Also like sub-setting, effective cloning copies a complete and coherent data set, retaining full referential integrity. Adopting cloning at the same time as sub-setting and masking is a natural step for organisations with distributed test teams. Provisioning numerous copies of isolated data avoids the delays caused by cross-team constraint, while sub-setting the data prior to cloning reduces the cost of maintaining the numerous isolated copies. It furthermore reduces the time required to mask data.

Working from isolated data sets removes the frustration of useful and rare data being cannibalized by another team, while useful data can be cloned and preserved during a data refresh. Alongside these logistical benefits of rapid and parallel provisioning, data cloning furthermore enhances testing quality. It can be used to multiply the data needed for particular test scenarios. This is particularly useful for automated testing that burns rapidly through data, as it ensures that new data is always readily available.

Synthetic Data Generation

Cloning, masking, and sub-setting alone are capable only of satisfying test scenarios for which data pre-exists in production data. The first section of this article highlighted the low coverage associated with low-variety production data, and how using production data alone therefore leaves a system exposed to costly bugs. Adopting synthetic test data generation is therefore a key step in enhancing testing rigor, supplementing pre-existing data sources.

This streamlined approach to data generation is capable of creating data for complex and distributed systems, providing a range of techniques for inputting data into test systems:

  • Direct to databases: You can generate data directly into numerous databases, including SQL Server, MySQL, Postgres, and Oracle.
  • Via the front-end: There is not always direct access to the back-end database, and VIP therefore also uses automation frameworks like Selenium to input data via the front-end. (Not preferred.)
  • Via the middle layer: Alternatively, you can leverage the API layer, inputting data via SOAP and REST calls.
  • Using files: VIP can generate data in flat files, XML, EDI, and CSV.
  • Mainframe emulation: Mainframe data can be particularly difficult to create manually, and Curiosity still find organisations inputting data laboriously via green screens. VIP instead provides accelerators for creating complex Mainframe data via emulators, for instance creating synthetic data for X3270.
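
The coverage argument for synthetic generation can be made concrete with a small, tool-agnostic sketch (the field names and equivalence classes below are invented for illustration): enumerate every combination of a few equivalence classes, deliberately including the boundary and negative values that production data rarely contains.

```python
import csv
import io
import itertools

# Hypothetical equivalence classes for a banking example, including
# boundaries and a negative value production data is unlikely to hold.
account_types = ["current", "savings", "ISA"]
balances = [-0.01, 0.0, 0.01, 999_999_999.99]
statuses = ["active", "frozen", "closed"]

# Every combination of the classes: 3 * 4 * 3 = 36 rows.
rows = [
    {"account_type": t, "balance": b, "status": s}
    for t, b, s in itertools.product(account_types, balances, statuses)
]
assert len(rows) == 36

# Emit as CSV, one of the file-based input routes mentioned above.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["account_type", "balance", "status"])
writer.writeheader()
writer.writerows(rows)
```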

Test Driven Provisioning: Automatic “Find & Make”

Test data provisioning within modern software delivery projects must accordingly be test driven: data must be found rapidly for every test case that needs executing within a short iteration, and any required data that cannot be found in existing data sources must be generated on demand.
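
A minimal sketch of the “Find & Make” idea (field names and the generation rule are illustrative): look for a row matching the test’s criteria; if nothing matches, synthesize a row satisfying those criteria on demand.

```python
import random

# Hypothetical pool of existing test data.
existing = [
    {"age": 34, "country": "GB"},
    {"age": 51, "country": "DE"},
]

def find_and_make(rows, **criteria):
    """Return a row matching the criteria, generating one if none exists."""
    for row in rows:
        if all(row.get(k) == v for k, v in criteria.items()):
            return row                        # "find": reuse existing data
    made = {"age": random.randint(18, 90), "country": "GB", **criteria}
    rows.append(made)                         # "make": fill the gap on demand
    return made

assert find_and_make(existing, country="DE")["age"] == 51   # found in the pool
teen = find_and_make(existing, age=17)                      # made on demand
assert teen in existing
```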

Test Driven Provisioning: Dynamic Data Definition

Creating test data at the same time as test cases is another approach to test-driven data provisioning. Model-Based Testing is an effective approach to generating test cases and test data in tandem.

Repeatable preparation

TDM tasks are necessarily rule-based, with logical rules dictated by the source data and the test cases that need to be executed. TDM tasks are therefore ripe for automation, and 77% of organisations in the World Quality Report state that they are using or considering bots for test data generation. If TDM tasks are automated and rendered repeatable, QA teams can invoke them directly. This significantly reduces the burden on central data provisioning teams, who can focus solely on fulfilling new test data requirements. These requests thereby become repeatable in future, and QA teams increasingly do not need to wait for data to be provisioned.

Data Comparison

Analyzing data to formulate test run results can be time-consuming and cumbersome, especially when feeding large data sets through automation frameworks. QA teams can waste time scanning high volumes of complex data and comparing it to expected results. This is not only laborious, but also subject to human error, undermining the reliability of testing. Robotic Process Automation excels at performing rule-based and repetitious processes, and data comparison is no exception.

Virtual Data Generation

Service Virtualization can deliver significant time and efficiency gains to QA teams, providing on demand access to realistic test environments. It is also a technology that depends on effective data management. With Service Virtualization, test teams no longer need to wait for constrained or unfinished components of a system and can instead work in parallel from readily available, production-like environments. However, accurate service virtualization requires realistic virtual data with which to simulate components. This data must furthermore be capable of satisfying the full range of requests made during testing, and effective service virtualization for testing therefore requires a rich set of Request-Response pairs. The full range of RR Pairs are rarely found in production data and are absent completely for unreleased components. Just like test data, virtual data is therefore prime for synthetic data generation.

Test Data Ageing

This is useful for Time based testing. Let’s assume you create a customer and it requires 48 hours for activation of that particular customer. What if you have to test the scenario that will occur after 48 hours? Will you wait till 48 hours for that scenario to happen for your testing? The answer is No. Then how will you handle this scenario?

There are basically 2 approaches by which we can do this:

  • Tamper with the system dates
    • Although it is possible in some cases to tamper with the system dates and continue with the testing, this method will fail if the date is generated by a database server or an application server instead of the client.
  • Tamper with the dates in the back-end
    • This is the most viable and practical solution for such scenarios. In this approach, we modify the date at the back-end so that it reflects the new date. Care should be taken to ensure that neither data integrity nor data semantics are lost.

This method of modifying the date according to the scenario needs is known as Test Data Ageing. Depending on the scenario that needs to be tested, we can either Back date or Front date the given date.
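
The back-end approach can be sketched as follows (the table and column names are illustrative): shift every date column of a row by the same offset, so the intervals between events, and hence the data semantics, are preserved while the scenario is aged into the testable window.

```python
from datetime import datetime, timedelta

# Hypothetical customer row: activation happens 48 hours after creation.
customers = [
    {"id": 1,
     "created_at": datetime(2023, 6, 3, 9, 0),
     "activates_at": datetime(2023, 6, 5, 9, 0)},
]

def age_rows(rows, date_columns, offset):
    """Back-date (or front-date, with a negative offset) every date column
    by the same amount, preserving the intervals between events."""
    for row in rows:
        for col in date_columns:
            row[col] -= offset

# Pretend the 48-hour activation window has already elapsed.
age_rows(customers, ["created_at", "activates_at"], timedelta(hours=48))

now = datetime(2023, 6, 3, 9, 30)
assert customers[0]["activates_at"] <= now  # the scenario is now testable
# The semantics survive: activation is still exactly 48h after creation.
assert customers[0]["activates_at"] - customers[0]["created_at"] == timedelta(hours=48)
```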

Data Archive in the Scope of Test Data Management

  • Maintenance of test data
    • Typically used in maintenance of test data over a period of releases.
  • Archival of older release data
    • You can always archive your older release test data so that it remains intact for future use
  • Archival of multiple environment’s test data
    • If there are multiple test environments, the test database size grows proportionate to the number of environments. In this case, archiving the data would save a lot of disk space.
  • Restore whenever necessary
    • An archive should be easily restorable.
  • Release/Build/Cycle wise snapshots for easy restore
    • Snapshots can be maintained as per the project release cycles. This is useful for production support, where we would need an older environment for testing the production support release.

Part 2:

Managing the Coupling between Tests and Data

When it comes to test data, it is important that each individual test in a test suite has some state on which it can depend.

Only when the starting state is known can you compare it against the state after the test has finished, and thus verify the behavior under test. This is simple for a single test, but requires some thought to achieve for suites of tests, particularly for tests that rely upon a database. Broadly, there are three approaches to managing state for tests.

  • Test isolation: Organize tests so that each test’s data is only visible to that test.
  • Adaptive tests: Each test is designed to evaluate its data environment and adapt its behavior to suit the data it sees.
  • Test sequencing: Tests are designed to run in a known sequence, each depending, for inputs, on the outputs of its predecessors.

Test isolation

Test isolation is a strategy for ensuring that each individual test is atomic. That is, it should not depend on the outcome of other tests to establish its state, and other tests should not affect its success or failure in any way. This level of isolation is relatively simple to achieve for commit tests, even those that test the persistence of data in a database. The simplest approach is to ensure that, at the conclusion of the test, you always return the data in the database to the state it was in before the test was run.

A second approach to test isolation is to perform some kind of functional partitioning of the data. This is an effective strategy for both commit and acceptance tests. For tests that need to modify the state of the system as an outcome, make the principal entities that you create in your tests follow some test-specific naming convention, so that each test will only look for and see data that was created specifically for it.

Setup and Tear Down

Whatever strategy is chosen, the establishment of a known-good starting position for the test before it is run, and its reestablishment at its conclusion, is vital to avoid cross-test dependencies. For well-isolated tests, a setup stage is usually needed to populate the database with relevant test data. This may involve creating a new transaction that will be rolled back at the conclusion of the test, or simply writing a few records of test-specific information. Adaptive tests will be evaluating the data environment in order to establish the known starting position at startup.
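
The transaction-rollback flavour of setup and teardown can be shown with an in-memory SQLite database (the schema is illustrative): the test populates its own data inside a transaction, and teardown rolls it back, reestablishing the known-good starting state for the next test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

def run_isolated_test(conn, test):
    try:
        test(conn)       # the first DML statement implicitly opens a transaction
    finally:
        conn.rollback()  # teardown: the database returns to its pre-test state

def my_test(conn):
    # Setup: the test writes the records it depends on.
    conn.execute("INSERT INTO customers (name) VALUES ('temp-user')")
    count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    assert count == 1    # the test sees the data its own setup created

run_isolated_test(conn, my_test)
count_after = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
assert count_after == 0  # no trace is left behind for other tests
```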

Coherent Test Scenarios

There is often a temptation to create a coherent “story” that tests will follow. The intent of this approach is that the data created is coherent, so setting up and tearing down of test cases is minimized. This should mean that each test is, in itself, a little simpler, since it is no longer responsible for managing its own test data. This also means that the test suite as a whole will run faster because it doesn’t spend a lot of time creating and destroying test data.

The problem with this strategy is that in striving for a coherent story we tightly couple tests together. There are several important drawbacks to this tight coupling. Tests become more difficult to design as the size of the test suite grows. When one test fails, it can have a cascade effect on subsequent tests that depend on its outputs, making them fail too. Changes in the business scenario, or the technical implementation, can lead to painful reworking of the test suite.

More fundamentally though, this sequential, ordered view doesn’t really represent the reality of testing. In most cases, even where there is a clear sequence of steps that the application embodies, at each step we want to explore what happens for success, what happens for failures, what happens for boundary conditions, and so on. There is a range of different tests that we should be running with very similar startup conditions. Once we move to support this view, we will necessarily have to establish and reestablish the test data environment, so we are back in the realm of either creating adaptive tests or isolating tests from one another.

Data Management in the Deployment Pipeline

Creating and managing data to use with automated tests can be a significant overhead. Let us take a step back for a moment. What is the focus of our testing? We test our application to assert that it possesses a variety of behavioral characteristics that we desire. We run unit tests to protect ourselves from the effects of inadvertently making a change that breaks our application. We run acceptance tests to assert that the application delivers the expected value to users. We perform capacity testing to assert that the application meets our capacity requirements. Perhaps we run a suite of integration tests to confirm that our application communicates correctly with services it depends on. What is the test data that we need for each of these testing stages in the deployment pipeline, and how should we manage it?

Data in Commit Stage Tests

Commit testing is the first stage in the deployment pipeline. It is vital to the process that commit tests run quickly. The commit stage is the point at which developers are sitting waiting for a pass before moving on. Every 30 seconds added to this stage are costly.

In addition to the outright performance of commit stage testing, commit tests are the primary defense against inadvertent changes to the system. The more these tests are tied to the specifics of the implementation, the worse they are at performing that role. The problem is that when you need to refactor the implementation of some aspect of your system, you want the tests to protect you. If the tests are too tightly linked to the specifics of the implementation, you will find that making a small change in implementation results in a bigger change in the tests that surround it. Instead of defending the behavior of the system, and so facilitating necessary change, tests that are too tightly coupled to the specifics of the implementation will inhibit change. If you are forced to make significant changes to tests for relatively small changes in implementation, the tests are not effectively performing their role as executable specifications of behavior. This is one of those key points where the process of continuous integration delivers some seemingly unrelated positive behaviors.

Good commit tests avoid elaborate data setup. If you find yourself working hard to establish the data for a particular test, it is a sure indicator that your design needs to be better decomposed. You need to split the design into more components and test each independently, using test doubles to simulate dependencies.

The most effective tests are not really data-driven; they use the minimum of test data to assert that the unit under test exhibits the expected behavior. Those tests that do need more sophisticated data in place to demonstrate desired behavior should create it carefully and, as far as possible, reuse the test helpers or fixtures to create it, so that changes in the design of the data structures that the system supports do not represent a catastrophic blow to the system’s testability. Fundamentally, our objective is to minimize the data specific to each test to that which directly impacts the behavior the test is attempting to establish. This should be a goal for every test that you write.
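
One common shape for such a test helper is a builder with defaults (the entity and field names here are invented): the helper supplies every field the system needs, and each test overrides only the data that directly affects the behavior it asserts.

```python
def a_customer(**overrides):
    """Test data builder: defaults cover the 'supporting cast' fields, so a
    test names only the data relevant to the behavior under test."""
    defaults = {
        "name": "any-name",
        "country": "GB",
        "credit_limit": 100,
        "status": "active",
    }
    return {**defaults, **overrides}

# The test states only what it cares about: a frozen, zero-limit customer.
customer = a_customer(credit_limit=0, status="frozen")
assert customer["status"] == "frozen"
assert customer["country"] == "GB"  # irrelevant detail supplied by the helper
```

If the data structure changes, only the builder's defaults need updating, not every test, which is exactly the insulation the paragraph above describes.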

Data in Acceptance Tests

Acceptance tests, unlike commit tests, are system tests. This means that their test data is necessarily more complex and needs to be managed more carefully if you want to avoid the tests becoming unwieldy. Again, the goal is to minimize the dependence of our tests on large complex data structures as far as possible. We should be creating just enough data to test the expected behavior of the system. When considering how to set up the state of the application for an acceptance test, it is helpful to distinguish between three kinds of data.

  • Test-specific data: This is the data that drives the behavior under test. It represents the specifics of the case under test.
  • Test reference data: There is often a second class of data that is relevant for a test but actually has little bearing upon the behavior under test. It needs to be there, but it is part of the supporting cast, not the main player.
  • Application reference data: Often, there is data that is irrelevant to the behavior under test, but that needs to be there to allow the application to start up.

Test-specific data should be unique and use test isolation strategies to ensure that the test starts in a well-defined environment that is unaffected by the side effects of other tests.

Test reference data can be managed by using prepopulated seed data that is reused in a variety of tests to establish the general environment in which the tests run, but which remains unaffected by the operation of the tests. Application reference data can be any value at all, even null values, provided the values chosen continue to have no effect on the test outcome.

Application reference data and, if applicable, test reference data—whatever is needed for your application to start up—can be kept in the form of database dumps. Of course you will have to version these and ensure they are migrated as part of the application setup. This is a useful way to test your automated database migration strategy.

This categorization is not rigorous. Often, the boundaries between classes of data may be somewhat blurred in the context of a specific test. However, we have found it a useful tool to help us focus on the data that we need to actively manage to ensure that our test is reliable, as opposed to the data that simply needs to be there. Fundamentally, it is a mistake to make tests too dependent on the “universe” of data that represents the entire application. It is important to be able to consider each test with some degree of isolation, or the entire test suite becomes too brittle and will fail constantly with every small change in data.

However, unlike commit tests, we do not recommend using application code or database dumps to put the application into the correct initial state for the test. Instead, in keeping with the system-level nature of the tests, we recommend using the application's API to put it into the correct state.

This has several advantages:

  • Using the application code, or any other mechanism that bypasses the application’s business logic, can put the system into an inconsistent state.
  • Refactorings of the database or the application itself will have no effect on the acceptance tests since, by definition, refactorings do not alter the behavior of the application’s public API. This will make your acceptance tests significantly less brittle.
  • Your acceptance tests will also serve as tests of your application's API.
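A minimal sketch of this idea, with the application's public API modelled as an in-process facade (all names here are invented, not from the book): because setup goes through the API, the application's own business logic keeps the state consistent.

```python
# Hypothetical stand-in for the system under test's public API.
class TradingApp:
    def __init__(self):
        self._accounts = {}
        self._next_id = 1

    def register_account(self, currency="GBP"):
        account_id = self._next_id
        self._next_id += 1
        self._accounts[account_id] = {"currency": currency, "funds": 0}
        return account_id

    def deposit(self, account_id, amount):
        # Business rule enforced: a direct database insert would bypass this.
        if amount <= 0:
            raise ValueError("deposits must be positive")
        self._accounts[account_id]["funds"] += amount

    def position(self, account_id):
        return self._accounts[account_id]["funds"]

def setup_funded_account(app, funds):
    """Acceptance-test setup done through the public API, so the system
    cannot be left in a state its own logic would never have produced."""
    account_id = app.register_account()
    app.deposit(account_id, funds)
    return account_id
```

A refactoring of the underlying schema leaves `setup_funded_account` untouched, which is exactly the brittleness win the bullets above describe.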


Consider testing a financial trading application. If a specific test is focused on confirming that a user’s position is correctly updated when a trade is made, the starting position and finishing position are of prime importance to this test.

For a suite of stateful acceptance tests being run in an environment with a live database, this probably implies that the test will require a new user account with a known starting position. We consider the account and its position to be test-specific data, so for the purposes of an acceptance test, we may register a new account and provide it with some funds, to allow trading, as part of the test case setup.

The financial instrument or instruments used to establish the expected position during the course of the test are important contributors to the test, but could be treated as test reference data, in that having a collection of instruments that are reused by a succession of tests would not compromise the outcome of our “position test.” This data may well be prepopulated test reference data.

Finally, the details of the options needed to establish a new account are irrelevant to the position test, unless they directly affect the starting position or the calculation of a user’s position in some way. So for these items of application reference data, any default values will do.
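Putting the three categories side by side for the position test, a hypothetical sketch might look like this (all names and values are invented for illustration):

```python
# Application reference data: any defaults will do; these exist only so
# the application can start, and have no bearing on position calculations.
ACCOUNT_DEFAULTS = {"locale": "en_GB", "statement_frequency": "monthly"}

# Test reference data: a shared, prepopulated set of instruments that
# many tests reuse but none mutate, so reuse cannot skew this test.
SEED_INSTRUMENTS = {"ABC": 100.0, "XYZ": 250.0}  # symbol -> price

class PositionTestFixture:
    def __init__(self):
        self.accounts = {}
        self._next_id = 1

    def new_funded_account(self, funds):
        """Test-specific data: a fresh account with a known starting
        position, created per test to guarantee isolation."""
        account_id = self._next_id
        self._next_id += 1
        self.accounts[account_id] = {**ACCOUNT_DEFAULTS, "funds": funds}
        return account_id

    def buy(self, account_id, symbol, quantity):
        self.accounts[account_id]["funds"] -= SEED_INSTRUMENTS[symbol] * quantity

fixture = PositionTestFixture()
account = fixture.new_funded_account(10_000.0)
fixture.buy(account, "ABC", 10)
assert fixture.accounts[account]["funds"] == 9_000.0  # position updated
```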

Data in Capacity Tests

Capacity tests present a problem of scale in the data required by most applications. This problem shows up in two areas: delivering a sufficient volume of input data for the test, and providing suitable reference data to support many cases under test simultaneously. We see capacity testing as primarily an exercise in rerunning acceptance tests, but for many cases at the same time. If your application supports the concept of placing an order, we would expect to be placing many orders simultaneously when we are capacity-testing. Our preference is to automate the generation of these large volumes of data, both input and reference, using mechanisms like interaction templates.

This approach, in effect, allows us to amplify the data that we create and manage to support our acceptance tests. This strategy of data reuse is one that we tend to apply as widely as we can, our rationale being that the interactions that we encode as part of our acceptance test suite, and the data associated with those interactions, are primarily executable specifications of the behavior of the system. If our acceptance tests are effective in this role, they capture the important interactions that our application supports. Something is wrong if they don’t encode the important behaviors of the system that we will want to measure as part of our capacity test.

Further, if we have mechanisms and processes in place to keep these tests running in line with the application as it evolves over time, why dump all of that and start again when it comes to capacity testing, or indeed when it comes to any other postacceptance test stage? Our strategy, then, is to rely on our acceptance tests as a record of the interactions with our system that are of interest and then use that record as a starting point for subsequent test stages. For capacity testing, we use tools that will take the data associated with a selected acceptance test and scale it up to many different “cases” so that we can apply many interactions with the system based on that one test. This approach to test data generation allows us to concentrate our capacity test data management efforts on the core of the data that is, of necessity, unique to each individual interaction.
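A toy version of that amplification, assuming a recorded acceptance-test interaction shaped as a simple dict (the field names are invented): the fields that must be unique per case are rewritten, and everything else is reused from the template.

```python
import copy

# One recorded acceptance-test interaction: place an order.
TEMPLATE = {
    "action": "place_order",
    "account": "acct-TEMPLATE",
    "symbol": "ABC",
    "quantity": 10,
}

def amplify(template, n_cases):
    """Scale one interaction into n unique cases: only the data that must
    be unique per case (here the account) is rewritten per copy."""
    cases = []
    for i in range(n_cases):
        case = copy.deepcopy(template)
        case["account"] = f"acct-{i:06d}"   # unique per simulated user
        cases.append(case)
    return cases

# 10,000 orders ready to be replayed concurrently against the system.
load = amplify(TEMPLATE, 10_000)
```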

Data in Other Test Stages

At least at the level of design philosophy, if not specific technical approach, we apply the same approach to all postacceptance automated test stages. Our aim is to reuse the “specifications of behavior” that are our automated acceptance tests as the starting point for any testing whose focus is other than purely functional.

For manual testing stages, such as exploratory testing or user acceptance testing environments, there are a couple of approaches to test data. One is to run in a minimal set of test and application reference data to enable the application to start up in an empty initial state. Testers can then experiment with scenarios that occur when users initially start working with the application. Another approach is to load a much larger set of data so that testers can perform scenarios that assume the application has been in use for some time. It’s also useful to have a large dataset for doing integration testing.

While it's possible to take a dump of the production database for these scenarios, we do not recommend it in most cases, mainly because the dataset is so large as to be unwieldy: migrating a production dataset can sometimes take hours. Instead, we recommend creating a customized dataset to use for manual testing, based either on a subset of the production data, or on a dump of the database taken after a set of automated acceptance or capacity tests has been run. Nevertheless, there are cases where it is important to test with a dump of production, for example when testing the migration of the production database, or when determining at what point production data needs to be archived so it does not unduly slow down the application.
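The "subset of production" idea hinges on keeping referential integrity: if you sample parent rows, you must keep only the child rows that reference them. A minimal sketch, with invented table shapes:

```python
import random

def subset_dataset(accounts, trades, sample_size, seed=42):
    """Sample parent rows, then keep only child rows referencing them,
    so the smaller dataset stays referentially intact."""
    rng = random.Random(seed)  # fixed seed: the subset is reproducible
    keep = set(rng.sample(sorted(a["id"] for a in accounts), sample_size))
    small_accounts = [a for a in accounts if a["id"] in keep]
    small_trades = [t for t in trades if t["account_id"] in keep]
    return small_accounts, small_trades
```

Real subsetting tools follow foreign keys across many tables, but the principle is the same: sample at the root, then close over the references.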


Due to its lifecycle, the management of data presents a collection of problems different from those we have discussed in the context of the deployment pipeline. However, the fundamental principles that govern data management are the same. The key is to ensure that there is a fully automated process for creating and migrating databases. This process is used as part of the deployment process, ensuring it is repeatable and reliable. The same process should be used whether deploying the application to a development or acceptance testing environment with a minimal dataset, or whether migrating the production dataset as part of a deployment to production.

Even with an automated database migration process, it is still important to manage data used for testing purposes carefully. While a dump of the production database can be a tempting starting point, it is usually too large to be useful. Instead, have your tests create the state they need, and ensure they do this in such a way that each of your tests is independent of the others. Even for manual testing, there are few circumstances in which a dump of the production database is the best starting point. Testers should create and manage smaller datasets for their own purposes. Here are some of the more important principles and practices from this chapter:

  • Version your database and use a tool like DbDeploy to manage migrations automatically.
  • Strive to retain both forward and backward compatibility with schema changes so that you can separate data deployment and migration issues from application deployment issues.
  • Make sure tests create the data they rely on as part of the setup process, and that data is partitioned to ensure it does not affect other tests that might be running at the same time.
  • Reserve the sharing of setup between tests only for data required to have the application start, and perhaps some very general reference data.
  • Try to use the application’s public API to set up the correct state for tests wherever possible.
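The first bullet can be sketched with a DbDeploy-style changelog table (the schema and migration scripts here are invented; DbDeploy itself works with numbered SQL scripts, this is just the idea in miniature): each numbered script runs at most once, and the same process works on an empty development database and on production.

```python
import sqlite3

# Numbered migration scripts, applied strictly in order.
MIGRATIONS = {
    1: "CREATE TABLE account (id INTEGER PRIMARY KEY, funds REAL)",
    2: "ALTER TABLE account ADD COLUMN currency TEXT DEFAULT 'GBP'",
}

def migrate(conn):
    """Apply any migrations not yet recorded in the changelog table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (version INTEGER PRIMARY KEY)"
    )
    applied = {v for (v,) in conn.execute("SELECT version FROM changelog")}
    for version in sorted(MIGRATIONS):
        if version not in applied:           # skip what has already run
            conn.execute(MIGRATIONS[version])
            conn.execute("INSERT INTO changelog VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)   # safe to run again: nothing left to apply
```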

In most cases, don’t use dumps of the production dataset for testing purposes. Create custom datasets by carefully selecting a smaller subset of production data, or from acceptance or capacity test runs. Of course, these principles will need to be adapted to your situation. However, if they are used as the default approach, they will help any software project to minimize the effects of the most common problems and issues associated with data management in automated testing and production environments.


I’m afraid I’m not sure what kind of thing you’re after.

If it's "How do I get better at exploring data, e.g. observability data from a system?", I suggest that you see if a notebook is available for your data stores. By notebook I mean e.g. a .ipynb file, which allows you to combine human-readable text, queries, and the results of those queries (a bit like Excel, but better). You can write notebooks to keep track of queries, explain why the queries are relevant and important, and include graphs / tables / etc. that were produced by exactly those queries. If the data cleaning is ad hoc, you could do that as part of the notebook too (by including the relevant queries). Some blog waffle on notebooks and some analysis using one: Analysing the names of artists in the UK music charts using R – Random Tech Thoughts
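As a tiny flavour of what a notebook cell might contain (the events and field names are invented), here is a query-plus-summary step with an ad hoc cleaning pass first:

```python
from statistics import mean

# Raw observability events, as you might pull them from a data store.
events = [
    {"service": "checkout", "status": 500, "latency_ms": 900},
    {"service": "checkout", "status": 200, "latency_ms": 120},
    {"service": "search",   "status": 200, "latency_ms": 40},
    {"service": "checkout", "status": 500, "latency_ms": 1100},
]

# Cleaning: drop records with implausible latencies before analysing.
clean = [e for e in events if 0 < e["latency_ms"] < 60_000]

errors = [e for e in clean if e["status"] >= 500]
print(f"error rate: {len(errors) / len(clean):.0%}")
print(f"mean error latency: {mean(e['latency_ms'] for e in errors):.0f} ms")
```

In a notebook, the surrounding text would record *why* these queries matter, which is the part that turns data into a story.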

If it's "How do I test how good our system's observability is?", I suggest you imagine you're a support engineer (or talk to an actual one), and pretend that you've been dragged out of bed at 3 a.m. by error X. Does it give you enough context, or can you easily assemble that context yourself?

For instance, if the error says that a queue is full, do you also have things like min / max / average age of things on the queue? If the queue’s full of old things, that suggests that the consumer process is struggling / dead. If the queue’s full of new things, that suggests that the producer process has suddenly started producing much more quickly than usual. (This could be the producer acting correctly, in response to an increase in traffic upstream from the producer.)
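That queue-age heuristic could be sketched like this (the threshold and field names are invented):

```python
import time

def diagnose_full_queue(enqueue_times, now=None, old_threshold_s=300):
    """Summarise message ages on a full queue and suggest where to look."""
    now = time.time() if now is None else now
    ages = [now - t for t in enqueue_times]
    summary = {"min": min(ages), "max": max(ages),
               "avg": sum(ages) / len(ages)}
    if summary["min"] > old_threshold_s:
        # Even the youngest message is old: nothing is being consumed.
        summary["hint"] = "mostly old messages: consumer struggling or dead?"
    elif summary["max"] < old_threshold_s:
        # Even the oldest message is new: a sudden flood from the producer.
        summary["hint"] = "mostly new messages: producer burst or upstream traffic spike?"
    else:
        summary["hint"] = "mixed ages: look at consumer throughput over time"
    return summary
```

The point is less the code than the habit: pair each alert with the couple of extra measurements that let you tell its likely stories apart.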

If it’s “What stories can I extract from our current observability data?” I suggest that you go through people who might be interested in the system, and imagine what info about the system would make them particularly happy / sad / angry / surprised. Then pick one of those bits of info and see if you can find the data that supports it. If you can’t find that data, is that OK? If you can find that data, how could it be expressed in terms that the audience will already care about and understand? I.e. write a story in the audience’s language rather than in yours.

It might help to construct a user story for these people interested in the system. More blog waffle Visualisations and the stories behind them – Random Tech Thoughts
