FYI, I found a related forum post, "Tools for supplying high quality data for test," but it doesn’t quite fit my question.
I came across an interesting post recently: https://conferences.oreilly.com/velocity/devops-web-performance-ny-2015/public/schedule/detail/45012, which is similar to the forum post.
So you take production data (or a subset of it), scrub it, and then use it for testing.
On a related note to the main question: are there public data scrubber tools/frameworks for this, or do most folks build this in house?
Now, scrubbing production data and loading it into a single/main database for test use is something I can easily understand. The tricky part is the interaction between the original unscrubbed production data and the scrubbed test version.
Say, for example, a system that (in addition to the main SQL-based RDBMS) has real-time streaming input data (over Apache Kafka), intermediate cache storage in alternative database systems like Redis, and time-series or TTL-based databases (Cassandra, InfluxDB).
How does one effectively pipe the unscrubbed live streaming data into the test environment, scrubbing it in flight so it matches the pre-scrubbed main database? One could capture such streaming data, scrub it, save it, and replay it later, but that makes the data static: it grows outdated and less production-like over time. Or the capture-and-scrub process would have to be refreshed frequently to stay on parity with production.
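For the scrub step itself, here is a minimal sketch (in Python, with hypothetical field names — adjust to your own schema) of a per-message transform that could sit between a Kafka consumer and a producer pointed at the test cluster. The key idea: deterministically hash identifiers so the same production ID always maps to the same scrubbed ID (preserving joins with the pre-scrubbed main database), and mask PII fields outright.

```python
import hashlib
import json

# Hypothetical field classification -- adjust to your schema.
ID_FIELDS = {"client_id", "org_id"}     # pseudonymize, consistently
PII_FIELDS = {"email", "name", "ip"}    # mask outright

def pseudonymize(value: str, salt: str = "test-env-salt") -> str:
    """Deterministic hash: the same production ID always maps to the
    same scrubbed ID, so references stay consistent across messages
    and across the scrubbed main database."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def scrub_message(raw: bytes) -> bytes:
    """Transform one Kafka message (JSON-encoded dict) for the test env."""
    msg = json.loads(raw)
    for key, value in msg.items():
        if key in ID_FIELDS:
            msg[key] = pseudonymize(str(value))
        elif key in PII_FIELDS:
            msg[key] = "REDACTED"
    return json.dumps(msg).encode()

# In a real pipeline this function would run inside a
# consume -> scrub -> produce loop; shown standalone here.
example = json.dumps(
    {"client_id": "c-123", "email": "a@b.com", "amount": 42}
).encode()
scrubbed = json.loads(scrub_message(example))
```

Because the pseudonymization is deterministic, the same function can be reused for both the one-off database scrub and the live streaming scrub, which is what keeps the two in sync.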
Also consider the sheer volume of streaming data if captured over a long period (1 to 24 hours): GBs, TBs, and beyond — hence live streaming is preferable. If you capture a small snippet of 15 minutes or less and replay it in a loop, you lose the diversity/variability of a longer capture. And replaying GBs/TBs from file may not be effective for scale testing anyway, I think.
Also for the streaming data, suppose we want to upscale the original input to 10x or 100x of production traffic — not just in message rate, but in unique clients/IDs. That means partitioning the incoming streaming input so that for client A (who belongs to organization B) you can generate clients A1, A2, A3, …, An (all belonging to organizations B1, B2, …, Bn). In other words, for every unique client and organization in the input, produce 9 more of them (for 10x) while maintaining the ratio of uniqueness.
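As a sketch of that amplification step (again with hypothetical field names): for each incoming event, emit the original plus N−1 clones with derived client/org IDs, so that ID cardinality scales together with message rate and the client-to-org relationship is preserved per clone.

```python
import copy

def amplify(event: dict, factor: int = 10) -> list[dict]:
    """Fan one event out into `factor` events with distinct derived
    client/org IDs. Clone i of client A in org B becomes client A-i
    in org B-i, preserving the client-to-org mapping."""
    out = [event]
    for i in range(1, factor):
        clone = copy.deepcopy(event)
        clone["client_id"] = f'{event["client_id"]}-{i}'
        clone["org_id"] = f'{event["org_id"]}-{i}'
        out.append(clone)
    return out

events = amplify({"client_id": "A", "org_id": "B", "payload": "x"}, factor=10)
# 10 events: the original plus 9 clones, each with a unique client/org pair
```

Whether clones should land in derived organizations (B1…Bn, as above) or all stay in the original organization B depends on which cardinality you are trying to stress; the derivation rule is the only thing you'd change.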
And back to the cache and time-series databases: how do you clone and scrub them from production together with the main database, so everything stays consistent? Also, since the data is time-series/TTL-based, it expires over time, so for realism you'd have to refresh the scrubbed data more frequently than the main database.
I’m wondering how large organizations with microservices, streaming data, and multiple databases deal with this. It seems tricky to do at scale, and with more components in the system, when you deploy and test the whole end-to-end system (rather than testing just a subset of it and mocking many parts away).
Does anyone here have insights on this that they can share?