Using scrubbed production test data at scale beyond a single database

FYI, I found a related forum post, Tools for supplying high quality data for test, but it doesn't quite fit my question.

I also came across an interesting post recently, which is similar to the forum post.

So you take production data (or a subset of it), scrub it, and then make use of it for testing.

On a related note to the main question, are there public data scrubber tools/frameworks for this, or do most folks build this in house?

Now, scrubbing the production data and loading it into a single/main database for use is something I can easily understand.

The tricky part to comprehend is the interaction between the production data (the original unscrubbed data) and the scrubbed test version.

Say, for example, a system that has (in addition to the main SQL-based RDBMS) real-time streaming input data (over Apache Kafka), intermediate cache storage in alternative database systems like Redis, and time-series or TTL-based databases (Cassandra, InfluxDB).

How does one effectively pipe the unscrubbed live streaming data into the test environment, scrubbing it in the process so it matches the pre-scrubbed main database? One could capture such streaming data, scrub it, save it, and use it for future replay, but doing so makes the data more static, outdated, and less production-like over time. Or the capture process would have to be refreshed frequently to stay on parity with production.
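To illustrate the live-scrubbing option, here's a rough Python sketch of a scrub step that would sit between a consumer on the production stream and a producer on the test stream. The field names and the deterministic hashing scheme are my own assumptions, not any particular tool's behavior; the key idea is that the same production value always maps to the same scrubbed value, so the stream stays consistent with a pre-scrubbed main database:

```python
import hashlib
import json

# Hypothetical sensitive field names; real payloads will differ.
PII_FIELDS = ("email", "name", "ip")

def pseudonymize(value: str, salt: str = "test-env") -> str:
    """Deterministically replace a sensitive value: the same production
    value always scrubs to the same replacement, preserving referential
    integrity with the pre-scrubbed main database."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def scrub_message(raw: bytes) -> bytes:
    """Scrub one streaming JSON message in flight."""
    msg = json.loads(raw)
    for field in PII_FIELDS:
        if field in msg:
            msg[field] = pseudonymize(str(msg[field]))
    return json.dumps(msg).encode()

# In a real pipeline this function would sit between a Kafka consumer
# on the production cluster and a producer on the test cluster, e.g.:
#   for record in consumer:
#       producer.produce("test-topic", scrub_message(record.value()))
```

The extra per-message hashing is exactly the latency/throughput cost discussed later in this thread, so in practice you'd benchmark the scrub step itself.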

Also consider the massive amount of streaming data if captured over a long period (1 hour to 24 hours): GBs, TBs, and beyond. Hence live streaming is preferable. If you capture a small snippet of 15 minutes or less and replay it in a loop, you don't get the diversity/variability of a longer capture duration. And replaying GBs/TBs from file may not be very effective for scale testing, I think.

Also, for the streaming data, consider that we want to upscale the original input to 10x or 100x of production traffic, not just in messaging rate but in unique clients/IDs. That means you have to partition the incoming streaming input so that for client A (who belongs to organization B), you can generate clients A1, A2, A3, … An (all belonging to organization B1, or also B2, … Bn). For example, for every unique client and organization in the input, produce 9 more of them (for 10x) while maintaining the ratio of uniqueness.
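A minimal sketch of that fan-out idea in Python (the `client_id`/`org_id` field names are made up for illustration): each incoming message yields `factor` copies with derived IDs, so unique clients and orgs scale by the same factor and the client-to-org ratio is preserved.

```python
import json

def fan_out(raw: bytes, factor: int = 10) -> list:
    """Turn one production message into `factor` messages by deriving
    synthetic client/org IDs (client A -> A-1 .. A-9 for 10x), keeping
    the ratio of unique clients to unique organizations intact."""
    msg = json.loads(raw)
    out = [raw]  # keep the (scrubbed) original as copy 0
    for i in range(1, factor):
        clone = dict(msg)
        clone["client_id"] = f'{msg["client_id"]}-{i}'
        clone["org_id"] = f'{msg["org_id"]}-{i}'
        out.append(json.dumps(clone).encode())
    return out
```

Because the derived IDs are a pure function of the original IDs, the fan-out can run in parallel across stream partitions without coordination.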

And back to the cache and time-series databases: how do you clone them from production and scrub them together with the main database effectively? Also, since the data is time-series/TTL-based, it expires over time, so technically, for realism, you'd have to refresh the scrubbed data more frequently than the main database.

I'm wondering how large organizations with microservices, streaming data, and multiple databases deal with this. It seems tricky to do at scale, and with more components in the system, when you deploy and test the whole end-to-end system (rather than dealing with just a subset of it and mocking many parts away).

Anyone here have insights on this that they can share?

I complained about this and a vendor approached me with a tool they offer called condenser. I think if you are a small enough org, doing this for one database is totally possible. However, at a certain enterprise size, with multiple databases, endpoints, and tools, I doubt most businesses care to tackle this. It would need a full-time team dedicated to simulated data, and I doubt that's cost effective. I would love to be proven wrong, though.

Hello @daluu!

I’m not familiar with available tools for scrubbing. Many of your comments and questions caught my eye.

Yes, I believe this is the idea. I agree with you that the data may become static, but it depends on the information objective of your test. If you want to learn about your system's performance, then a static representation may fulfill the information objective. If you want to learn about your system's behavior with diverse data, then you may need something more dynamic. It will also depend on the type of test, performance or behavioral; in my opinion, the test data needs are different for each.

Streaming video is a representation of movement. If you point the camera at a tree trunk, there won’t be much diversity because all of the frames would appear the same. However, if you point the camera at a busy intersection, waves on a beach, or a fan blowing on some streamers, you might approach the diversity you need. It might be fun to profile the diversity in production and experiment with scenes that match the diversity but not necessarily the content. In this manner, scrubbing is not needed.

I tripped over the word “realism”. I’m not sure test data has to be real as long as it is representative. The system should not know if the data is “real”. If it does, you have a different problem.

I agree. I go back to your information objective. If you need to explore “realism” and you want to explore it “at scale”, you might want to submit a proposal to mirror the production database for your tests.
I’m not a big fan of end-to-end tests. They are mostly for smoke testing rather than evaluating application behaviors or performance. I favor more isolated, deep dives over end-to-end even for performance testing.


Thanks for the comments. By the way, by streaming data I did not actually mean streaming video (or audio), but rather real-time asynchronous data streams coming in as input, event-based or not; for example, data coming off Kafka, JSON text based or protocol buffers encoded. From a testing perspective, it's a near-constant flow of incoming data that needs dynamic scrubbing if we're replicating it from production to the test environment rather than using a time-sliced capture of the stream at some point in time.

Hello @daluu!

Thanks for the clarification! Streaming JSON simplifies the scrubbing. While you could dynamically scrub the data (that is, removing CHS information as the data arrives), an alternative is to log the JSON and scrub it later. That way, there is a smaller impact on an operational database.

I worked with a team that logged all requests into their endpoint, scrubbed the requests overnight, and placed the scrubbed data into a non-production database. I had all the data I needed for testing and it had production diversity.
For a performance test, the remaining challenge is a method to retrieve and present the data to the application.
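As a rough sketch of that "log now, scrub overnight" step, here's a Python batch pass over a JSON-lines request log. The file layout and field names are assumptions for illustration, not what that team actually ran:

```python
import hashlib
import json

def scrub_log_file(in_path: str, out_path: str,
                   fields: tuple = ("email", "name")) -> int:
    """Batch-scrub a JSON-lines request log, writing scrubbed records
    to a new file (e.g. for bulk load into a non-production database).
    Returns the number of records written."""
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            rec = json.loads(line)
            for f in fields:
                if f in rec:
                    rec[f] = hashlib.sha256(str(rec[f]).encode()).hexdigest()[:12]
            dst.write(json.dumps(rec) + "\n")
            count += 1
    return count
```

Running it as an offline batch job keeps the scrubbing cost entirely off the live request path, which is the main appeal of this approach.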


One of the benefits of a unified durable log is being able to replay the data through a system and verify the functionality is the same.

Further clarification: the streaming JSON, for example, isn't actually used to write (or update) entries in the primary database (maybe some, but a small percentage); rather, it gets processed live by microservices and goes into time-sensitive or TTL-based storage, where it expires over time. The streaming JSON data also has timestamps associated with it.

It is meant to be live processed. But for testing purposes, adapting your suggestions, the approach would be to store the messages over some (longer) duration and then replay them into the test environment, whether they're scrubbed before storage, after storage but before replay, or while being replayed. The timestamps would also be part of the scrubbing then, since it's not a live replay. As for storage, that could be a test database used to store test data, or just (log) files.
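To sketch the timestamp part of that scrubbing in Python: shift every captured timestamp so the replay starts "now" while keeping the original inter-message gaps, so TTL-based stores don't expire the replayed data on arrival. The numeric epoch field name `ts` is a hypothetical example:

```python
import json

def rebase_timestamps(messages: list, new_start: float) -> list:
    """Shift captured timestamps so the first replayed message lands at
    `new_start`, preserving the original inter-message gaps. Keeps
    TTL/time-series stores treating replayed data as fresh."""
    msgs = [json.loads(m) for m in messages]
    offset = new_start - msgs[0]["ts"]
    for m in msgs:
        m["ts"] += offset
    return [json.dumps(m) for m in msgs]
```

Since only the offset changes, relative ordering and burst patterns in the capture are preserved exactly, which matters for performance-style replays.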

So in the above scenario, scrubbing live doesn't impact the databases, but it may reduce the input messaging rate, since scrubbing takes additional time before the source data is passed into the test environment, adding latency and lowering the maximum messaging rate.
