How would you test a data warehouse?

I’ve been in our MI/BI team a while now, and it is all around data warehouses, databases and SQL. No web pages, no HTTPS, no APIs, just lots of SQL queries.

And it got me thinking, if you were put onto a project/team where you were testing data had been put into your database correctly, and that when reports are made to query that data, it was built correctly…

How would you test it?

Would you know where to start? Would you ask other testers in your team how they do it? What if you’re the only tester and don’t have that luxury?
Would you Google it, to try and understand, only to be confused by terms such as ETL, EDW, dimensional models, OLAP, data marts?

It then got me thinking, would a guide (or even course) onto what a data warehouse is, SQL fundamentals (as if you have never needed to query a database before, would you know how?), how you might test the report from the back-end compared to the front end and looking at the output, and so on.

For those of you who do MI/BI testing with a data warehouse, how did you get into it, and what keeps you with it? I know it is a niche area, but with data and reports on that data continuing to grow, it’s something I feel as testers we should be aware of, even if not able to use every day.

4 Likes

How would you test it?
Learn about the tools, about the data, about what it means, what it is used for, by what, how often, get some diagrams or re/create some flows, etc…
Get a mission: see some risks/problems and/or talk to the managers - is that something that would benefit or interest them?
Start looking for experiments and investigate for issues…

1 Like

First I would break down the different part and approach them differently. They typically contain three elements.
The first step is the producers. The product or services that produce the data. There typically need to be some kind of contract for the format of the data that is produced. Here you can test that the produced values correspond with the saved format in the data warehouse. That the different combination of values are read / stored correctly. And you can test resilience to errors, duplicates, updates.

The second step is typically aggregation, normalization and transformations where you take the produced data and reformat it to something that is more useful for reports. Similarly you can test these to see that it can handle different inputs and that no crucial parts are lost in transit. These typically also follows a specific contract.

For the most part I have personally recruited help from business people to actually look at the validity of the reports. Mainly because in the solutions I have worked with have allowed for a lot of customizations (another area that you can test in itself). I found it faster for the account managers to have a look if the important information is retained in the reports instead of trying to understand all different usages.

Then as with all things all the terminology and tools you need to learn and understand for you to be able to test here is something you need to acquire. Some of it is BI inherent and a lot of it is business / domain specific too.

Finally another great pattern here if available to you is self referencing data. If a producer produces a steady stream of related data at a specific interval and you can control the reference or id call it “stream_5min_entry1”. Or if a combination have omitted all optional values “all_optional_omitted”. Or “data_1030_15” if it was produced 10:30 and the value was 15. This helps a lot with trying to figure out if the data is retained in transformations or in any validation of values.

1 Like

I think with data warehouses, data lakes, and all things data related are still pretty ignored on the side of actual “testing” and validation. To this day a lot of companies do not have a concept of data lineage or data provenance. If the data consumers know the data well they can have a sense of whether it is correct or not. However, the task is too big for one QA person. There are a lot of tools coming online in the market to begin to tackle all this stuff in various ways. It comes down to a pipeline testing and validation issue.

The question to me becomes, what can be done to stop this being ignored?

Is it painful enough for businesses?