Data Warehouses and Automation: Success stories (if any!)

I’m currently working on ETL testing. You know the deal: data goes from A to B with some logic in the middle and voilà, testing needs to be done.

So far, the solutions I’ve seen to automate this have been in-house built frameworks. Sometimes with Python, sometimes with Java…

As someone who has been automating APIs and UIs for more than 10 years, I find the lack of standards and frameworks out there for this automation area curious.

So… for all of you testers doing automation in ETL: what have been your challenges, failures and successes attempting automation in this not-so-explored area?

Cheers!


Ha ha, I noticed no one answered.

Yeah, it’s a toughie… basically as you described.

It’s been a couple of years since I did a project like this, so I can’t give a definitive answer, but I can offer some strategies that might help.

  1. Tests are usually examples: given data X, I expect answer Y. So the first strategy is to start with known (synthetic) data that covers the various test partitions and boundaries. Hopefully these are clear from the acceptance criteria before the dev work was done! :slight_smile: Then you just populate the data and check the result.

  2. In theory you covered every possible scenario with the synthetic data above. Customers and the real world love to prove you wrong. So now take some actual data from production and watch your ETL jobs crash and burn. Try popular reports and ensure there are no obvious errors. Those are the ones customers get cross about.

  3. If there is an existing or legacy report, compare some reports against it. Customers hate changes in numbers, even if it’s an improvement or a correction.

  4. Having said tests are examples, ETL can be tricky to have known outcomes for. Sometimes you can write some code to validate the report results. This works because your code can be higher level, can focus on individual calculations and can ignore edge cases, so it’s simpler. It also doesn’t need to scale or be distributed. But sometimes it’s just a second opinion: OK, the reports don’t match, but 50% of the time you’re right and 50% of the time the dev was right, so you investigate and fix the wrong one. There’s a rough sketch of this idea just after the list.
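To make point 4 concrete, here’s a minimal sketch of that “second opinion” check in Python/pandas, assuming the raw source is available as a CSV extract and the ETL output lands in a table we can query. All the file, table and column names are made up for illustration:

```python
import sqlite3

import pandas as pd

# Hypothetical names: a raw source extract and the table the ETL job populates.
RAW_EXTRACT = "movies_raw.csv"
WAREHOUSE_DB = "warehouse.db"


def expected_totals(csv_path):
    """Independently recompute the report aggregate from the raw extract.

    Deliberately naive: no scaling, no distribution, just a second
    opinion on the headline numbers.
    """
    raw = pd.read_csv(csv_path)
    return raw.groupby("genre")["download_minutes"].sum()


def actual_totals(db_path):
    """Read the same aggregate from the table the ETL pipeline produced."""
    with sqlite3.connect(db_path) as conn:
        report = pd.read_sql(
            "SELECT genre, total_minutes FROM report_genre_totals", conn
        )
    return report.set_index("genre")["total_minutes"]


def test_genre_totals_match():
    expected = expected_totals(RAW_EXTRACT)
    actual = actual_totals(WAREHOUSE_DB)
    # A mismatch doesn't say which side is wrong - investigate both.
    pd.testing.assert_series_equal(
        actual.sort_index(),
        expected.sort_index(),
        check_names=False,
        check_dtype=False,
    )
```

The point is that the pandas version can stay naive (one machine, no edge cases), which is exactly what makes it a useful cross-check.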


Ah, I didn’t really read the question in full…

Regarding frameworks: I’m not sure it matters. It totally depends on the technical architecture. I’ve had success with big Oracle databases, Pentaho, AWS Redshift and Snowflake. It’s not much different to testing any other service or application.


Thanks for replying! Regarding frameworks, I think the lighter the better in this case, as testing ETL shouldn’t need much complexity (in the end, in my case it would be a glorified SQL executor that checks that a count is X or Y).
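For illustration, here’s roughly what I mean. This is a minimal sketch using sqlite3 as a stand-in for whatever DB-API driver the warehouse actually uses; the table names and expected counts are hypothetical:

```python
import sqlite3

# Hypothetical checks: each pairs a SQL query with the count we expect.
CHECKS = [
    ("SELECT COUNT(*) FROM customers_dim", 10_000),
    ("SELECT COUNT(*) FROM orders_fact WHERE order_total < 0", 0),
]


def run_checks(db_path="warehouse.db"):
    """Run each query and report whether the count matches expectations."""
    with sqlite3.connect(db_path) as conn:
        for query, expected in CHECKS:
            (actual,) = conn.execute(query).fetchone()
            status = "PASS" if actual == expected else "FAIL"
            print(f"{status}: {query} -> {actual} (expected {expected})")


if __name__ == "__main__":
    run_checks()
```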

I don’t think a lot of people do Data Warehouse testing, let alone automate it. :frowning:

A friend of mine did it, and all I know is that he built his own framework to test it.

So true. What vendor would make a tool for this, when people would (hopefully) only need it once every 25-30 years? :stuck_out_tongue:

Hi, I do ETL testing and am looking at automation too. I noticed the same lack of standards as you… Our team is looking at moving to the cloud, so we’ve been advised to learn Python and then, later on, API testing. As a team we don’t have a CI/CD pipeline, so it’s not clear yet where the tester’s role fits in (apart from the API bit).


Sorry not to reply to this before - I hadn’t noticed it, so I’m grateful to April for bumping it up to the top.

In the past I’ve done DIY automated checks using reconciliation. E.g. you count the number of rows going into stage X and the total number of rows going out of stage X, plus the total of some property on each row e.g. duration of a movie download, cost of an online purchase etc. These were just pairs of SQL queries for the before vs. after values. The idea is that the pipeline shouldn’t suddenly gain or lose rows, or its rows gain or lose duration / cost / etc.
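For illustration, one of those before/after query pairs could look something like this (sqlite3 standing in for the real warehouse driver, and the table and column names are hypothetical):

```python
import sqlite3

# Hypothetical stage: rows flow from downloads_in to downloads_out.
BEFORE = "SELECT COUNT(*), COALESCE(SUM(duration_minutes), 0) FROM downloads_in"
AFTER = "SELECT COUNT(*), COALESCE(SUM(duration_minutes), 0) FROM downloads_out"


def test_stage_reconciles(db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        rows_in, minutes_in = conn.execute(BEFORE).fetchone()
        rows_out, minutes_out = conn.execute(AFTER).fetchone()
    # The stage shouldn't silently gain or lose rows or total duration.
    assert rows_in == rows_out, f"row count drifted: {rows_in} -> {rows_out}"
    assert minutes_in == minutes_out, (
        f"total duration drifted: {minutes_in} -> {minutes_out}"
    )
```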

It needs the pipeline to have been designed for testability - in this case it means that there needs to be an explicit reject output stream for each step that could reject things. E.g. if you’re joining records against some reference data (following a foreign key from the record) then if it’s an inner join and the record finds no match, that record will silently disappear from the output. So the counts in the automated checks won’t always add up. From a testability point of view, and also for monitoring in production, it’s better to route such bad records to a reject output. The check can compare the input count against the total across all outputs (valid and invalid).
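Building on that, a check that accounts for an explicit reject stream might look like this (table names again hypothetical):

```python
import sqlite3

def test_input_fully_accounted_for(db_path="warehouse.db"):
    """Every input row should land in either the valid or the reject output."""
    with sqlite3.connect(db_path) as conn:
        (n_in,) = conn.execute("SELECT COUNT(*) FROM stage_input").fetchone()
        (n_ok,) = conn.execute("SELECT COUNT(*) FROM stage_output").fetchone()
        (n_bad,) = conn.execute("SELECT COUNT(*) FROM stage_rejects").fetchone()
    assert n_in == n_ok + n_bad, (
        f"{n_in} rows in, but {n_ok} valid + {n_bad} rejected = {n_ok + n_bad}"
    )
```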

In our case we refined these checks by subdividing the records (and their counts/sums) into sub-groups that made sense from a business point of view. That helped us to pinpoint errors more easily. But again, this was all DIY stuff.
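A sub-grouped version of the same reconciliation might look like this, assuming a hypothetical region column as the business grouping:

```python
import sqlite3

def grouped_counts(conn, table):
    """Count rows per business sub-group (region here is hypothetical)."""
    query = f"SELECT region, COUNT(*) FROM {table} GROUP BY region"
    return dict(conn.execute(query).fetchall())


def test_subgroups_reconcile(db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        before = grouped_counts(conn, "downloads_in")
        after = grouped_counts(conn, "downloads_out")
    # Per-group comparison pinpoints which slice of the data drifted.
    diffs = {
        region: (before.get(region), after.get(region))
        for region in set(before) | set(after)
        if before.get(region) != after.get(region)
    }
    assert not diffs, f"sub-groups drifted: {diffs}"
```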

Another thing that might help is separating out input from processing code as much as possible. E.g. the pipeline reads from API X, then does loads of stuff to the results before writing the data to a database table. I’m suggesting that you split this into two: a tiny pipeline that just reads from API X and immediately writes the data to a staging table, and a main pipeline that reads from the staging table and does the rest.

The cost of this is: 2 pipelines rather than 1, an extra db table, and extra reads and writes because of this extra table. The benefit is that the bulk of the logic reads from a table under your control (the staging table) rather than from an API that you possibly don’t control. This would make it easier to write an automated check - it can load the staging table with known inputs, so you can look for expected outputs. I don’t know of any frameworks to help with this, and it depends on how the pipeline is designed.
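To illustrate, a check against that staging table could look roughly like this. run_main_pipeline is a hypothetical hook for kicking off the second pipeline, and the tables and values are invented:

```python
import sqlite3

def seed_staging(conn):
    """Load the staging table with known, synthetic inputs.

    Because the main pipeline reads from this table rather than calling
    API X directly, the test fully controls what goes in.
    """
    conn.execute("DELETE FROM staging_api_x")
    conn.executemany(
        "INSERT INTO staging_api_x (id, amount) VALUES (?, ?)",
        [(1, 10.0), (2, 2.5), (3, -5.0)],  # includes a negative edge case
    )
    conn.commit()


def test_main_pipeline(db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        seed_staging(conn)
        # run_main_pipeline(conn) is a hypothetical hook that would kick
        # off the second pipeline (staging table -> final table).
        (total,) = conn.execute(
            "SELECT SUM(amount) FROM purchases_final"
        ).fetchone()
    assert total == 7.5  # expected output for the known inputs above
```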

A blog I wrote about the first thing (comparing before / after counts): Testing data preparation for a BI database – Random Tech Thoughts


Hi,

I am starting a class in Informatica soon and this topic interests me. It seems the big vendors have tools for this, and there are companies that offer open-source tools. This article goes through the ETL testing process and the testing tools: ETL Testing: Importance, Process, and ETL Testing Tools | by AltexSoft Inc | Medium

Hope this helps.

John


Back to the ETL party! I finally found a Python package called Great Expectations and it’s AWESOME. I’ve been doing test drives, demos and implementations with it, with great success. The setup is simple, and in fact you can use it just as a collection of Jupyter notebooks if you don’t want to touch code (shame on you!) or if Python is not your cup of tea.
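For a flavour, here’s a minimal sketch using the classic pandas-style API (ge.read_csv and the expect_* methods). Newer Great Expectations releases reorganise this API, so treat the exact calls as illustrative:

```python
import great_expectations as ge

# Hypothetical extract of the target table; read_csv returns a pandas
# DataFrame subclass with expect_* methods attached.
df = ge.read_csv("etl_output.csv")

df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between(
    "order_total", min_value=0, max_value=1_000_000
)
df.expect_table_row_count_to_be_between(min_value=1, max_value=10_000_000)

# Run all the expectations recorded above and report the outcome.
results = df.validate()
print(results)
```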