Any tips, courses, or other resources we can go away and look at for performance testing within a data world / data team?

A lot of performance testing courses are aimed at front-end / back-end ‘systems’, and it’s hard to find material on performance testing within a data team: from ingestion of data from sources, through distillation and processing, to insights / consumption, and which tools to use. Any suggestions would be hugely appreciated.

I don’t know of any blogs or courses on performance testing in the areas you mention. It’s probably a more specialized field of testing than the norm, so there isn’t much public coverage of it.

Because these areas are specialized, the tooling needed for the tests is often custom and developed in-house rather than open source or off the shelf. Where general, public, open source tooling is used, the other components of the test setup tend to be customized to the organization or team doing the testing.

The generic, publicly reusable tools would be things like load/stress test tools for Kafka, Redis, and Elasticsearch. For Kafka, there are existing tools like kafkameter for JMeter (and variants of it). I think there was also a Kafka integration for Gatling. If you search the web, you’ll find some articles and YouTube videos about kafkameter. Using that example, you can code/customize the kafkameter testing around what your organization does with Kafka. That’s only one part of the pipeline, though, so you then need glue code/tools to pair it with the testing of the rest of the pipeline.
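To give a feel for what such a load test boils down to, here is a rough sketch of a tiny load generator. It is not kafkameter or any real Kafka client: `run_load` and `fake_send` are names I made up, and `fake_send` just appends to an in-memory list as a stand-in for whatever producer call your pipeline actually uses. You would swap in your own send function.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_load(send: Callable[[bytes], None], messages: int, workers: int,
             payload_size: int = 256) -> Dict[str, float]:
    """Send `messages` payloads through `send` using `workers` threads."""
    payload = b"x" * payload_size

    def send_one(_: int) -> float:
        # Time a single send call.
        start = time.perf_counter()
        send(payload)
        return time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies: List[float] = list(pool.map(send_one, range(messages)))
    elapsed = time.perf_counter() - started

    return {
        "messages": messages,
        "seconds": elapsed,
        "messages_per_second": messages / elapsed,
        "avg_latency_seconds": sum(latencies) / len(latencies),
        "max_latency_seconds": max(latencies),
    }


# Hypothetical stand-in for a real producer: collects payloads in memory.
received: List[bytes] = []


def fake_send(payload: bytes) -> None:
    received.append(payload)


result = run_load(fake_send, messages=500, workers=4)
print(result)
```

The numbers from an in-memory sink mean nothing by themselves. The point is the shape: a pluggable send function, a configurable level of concurrency, and throughput plus latency figures coming out the other end.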

We could discuss this further if you provide specific technology/vendor examples of data sources/systems, data processing tools, and how the data is consumed or insights derived. Based on the tools used, one can then research what test tools and strategies around those exist.

Another good source of material for this area is to research and follow the articles, forums / discussion groups, mailing lists, Slack channels, etc. of the projects or groups related to your data source systems or tools. There might already be discussions there, or you can always start one to find out more. In some rare situations, you might find yourself being the first person to bring up the topic (spearheading the effort, so to speak)!

Also, performance testing the front end, or parts of the back end such as APIs, often indirectly tests the data ingestion and processing, because requests flow from the front end (or the start of the back end) down through the data pipeline before the resulting observations route back to the front end.

Thus it is simpler to cover the testing that way: basically end-to-end style performance testing, the equivalent of (UI/API) functional test automation.
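As a rough illustration of the end-to-end idea, one option is to stamp each record when it enters the pipeline and measure how long it takes to come out the other side. This is only a sketch under my own assumptions: `produce`, `process`, and `consume` are hypothetical in-process stand-ins for your real source, processing, and consumption layers, and the `sent_at` field is something I invented for the example.

```python
import statistics
import time
from typing import Any, Dict, List


def produce(count: int) -> List[Dict[str, Any]]:
    """Hypothetical source: stamp each record with its entry time."""
    return [{"id": i, "sent_at": time.perf_counter()} for i in range(count)]


def process(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical processing step standing in for the real pipeline."""
    return [{**record, "processed": True} for record in records]


def consume(records: List[Dict[str, Any]]) -> List[float]:
    """Hypothetical consumer: compute per-record end-to-end latency."""
    return [time.perf_counter() - record["sent_at"] for record in records]


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Reduce raw latencies to a few percentiles."""
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "count": len(latencies),
        "p50_seconds": cuts[49],
        "p95_seconds": cuts[94],
        "max_seconds": max(latencies),
    }


latencies = consume(process(produce(200)))
summary = summarize(latencies)
print(summary)
```

In a real setup the stamp would travel with the record through the actual systems, and the clocks on the producing and consuming sides would need to be comparable, which `time.perf_counter()` only guarantees within a single process.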

What I posted previously covers the more componentized version of performance testing the system, along the lines of an integration or unit test approach.

Performance testing individual parts of the data system is similar to regular performance testing, but you focus on monitoring tools and metrics for the data pipeline rather than end-to-end system or user/API-level metrics. To judge whether things are performing properly, check for correct output, response times, and CPU/memory usage of the parts of the data pipeline rather than of the front-end API server, web server, etc.
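A minimal sketch of what checking those things per stage could look like, using only the Python standard library. Everything here is an assumption for illustration: `measure_stage` is a helper I made up, `ingest` and `transform` are toy stages, and `tracemalloc` only sees Python-level allocations in the current process, so it is no substitute for proper monitoring of a real cluster.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, List


def measure_stage(name: str, stage: Callable[[List[Any]], List[Any]],
                  records: List[Any]) -> Dict[str, Any]:
    """Run one pipeline stage and collect basic performance metrics."""
    tracemalloc.start()
    start = time.perf_counter()
    output = stage(records)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "stage": name,
        "records_in": len(records),
        "records_out": len(output),
        "seconds": elapsed,
        "peak_memory_bytes": peak_bytes,
        "output": output,
    }


def ingest(rows: List[int]) -> List[Dict[str, int]]:
    """Hypothetical ingestion stage: wrap raw values in records."""
    return [{"id": i, "value": value} for i, value in enumerate(rows)]


def transform(rows: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Hypothetical processing stage: keep even values and square them."""
    return [{**row, "squared": row["value"] ** 2}
            for row in rows if row["value"] % 2 == 0]


raw = list(range(1000))
ingest_metrics = measure_stage("ingest", ingest, raw)
transform_metrics = measure_stage("transform", transform, ingest_metrics["output"])

for metrics in (ingest_metrics, transform_metrics):
    print(metrics["stage"], metrics["records_in"], metrics["records_out"],
          metrics["seconds"], metrics["peak_memory_bytes"])
```

Checking `records_out` against what you expect covers the "proper output" part, while the timing and memory figures give you per-stage numbers to compare between runs or against a budget.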
