Type of tests on AI / ML models

My current team has data engineers. So far, as a QA, my involvement has been automating the UI graphs, validating DB queries against UI values, API tests, and performance tests.
However, we are doing a PoC on an ML model, and the engineers are wondering what kinds of testing we should include. I have considered the following so far:

  • unit tests
  • Black box testing between input and output
  • edge cases (not sure which ones?!)
  • error handling
  • not testing the ML libraries / packages, as you would expect them to function as required
  • not sure what kind of performance or security testing is required, or how to execute it

Does anyone else have an idea on ML testing?

ML models are supposed to improve accuracy over time based on data sets. Understandably, we can only do that kind of verification / check as a long-term approach and not as a one-off test of the feature.


You’ve probably come across this already but Angie Jones has some resources on testing machine learning models.
I haven’t dug too deeply, but my understanding is that there’s a lot more intuition and exploration in testing machine learning models. I haven’t done any of it myself, but I’m very curious!


It depends on the context, so not all of these will be relevant.

Has the model been overfitted to its training data? For instance, a model meant to tell photos of cats from photos of dogs might accidentally be trained on photos of dogs that are outdoors and cats sat on furniture indoors, so the model actually detects grass vs. chairs rather than dogs vs. cats. It’s worth testing with valid data that’s unlike the training data, e.g. indoor dogs and outdoor cats.
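One way to probe for this kind of overfitting is to compare accuracy on data that resembles the training set against accuracy on valid-but-dissimilar data. Here is a minimal Python sketch of that comparison; the model, the `setting` field, and both datasets are invented placeholders for the cat/dog example:

```python
# Sketch: compare accuracy on training-like data vs. deliberately
# dissimilar valid data (e.g. indoor dogs, outdoor cats).
# The model and datasets below are toy stand-ins for your own.

def accuracy(model, samples):
    """Fraction of (input, label) samples the model labels correctly."""
    correct = sum(1 for x, label in samples if model(x) == label)
    return correct / len(samples)

# Hypothetical "grass vs. chairs" model that learned the wrong signal.
model = lambda x: "dog" if x["setting"] == "outdoor" else "cat"

familiar = [({"setting": "outdoor"}, "dog"), ({"setting": "indoor"}, "cat")]
unfamiliar = [({"setting": "indoor"}, "dog"), ({"setting": "outdoor"}, "cat")]

print(accuracy(model, familiar))    # high: matches the training bias
print(accuracy(model, unfamiliar))  # low: the overfitted model fails
```

A large gap between the two numbers is a hint that the model latched onto something incidental in the training data.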

If the model gives a yes / no, what are its most important rates - true positive, false positive, true negative, and false negative? Do all 4 combinations matter the same amount in your context?
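Counting those four rates is straightforward once you have predictions and ground truth side by side. A minimal sketch (the data here is invented for illustration):

```python
def confusion_rates(predictions, actuals):
    """Count TP/FP/TN/FN for a yes/no classifier (True = yes)."""
    tp = sum(1 for p, a in zip(predictions, actuals) if p and a)
    fp = sum(1 for p, a in zip(predictions, actuals) if p and not a)
    tn = sum(1 for p, a in zip(predictions, actuals) if not p and not a)
    fn = sum(1 for p, a in zip(predictions, actuals) if not p and a)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

preds  = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(confusion_rates(preds, actual))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```

Whether a false positive or a false negative is worse is a business question, not a maths one, so it’s worth asking it explicitly in your context.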

Do errors occur equally for all kinds of input, or are they more common for certain subsets of the input space? Does this matter in your context?
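A simple way to check this is to bucket prediction errors by some attribute of the input and compare error rates per bucket. A sketch with invented data and group labels:

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: (group_key, predicted, actual) triples.
    Returns the error rate per subset of the input space."""
    errors, totals = defaultdict(int), defaultdict(int)
    for group, pred, actual in records:
        totals[group] += 1
        if pred != actual:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

data = [
    ("indoor",  "cat", "cat"),
    ("indoor",  "cat", "dog"),
    ("outdoor", "dog", "dog"),
    ("outdoor", "dog", "dog"),
]
print(error_rate_by_group(data))  # {'indoor': 0.5, 'outdoor': 0.0}
```

A big spread between groups tells you the headline accuracy number is hiding something.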

The improvement over time probably won’t come for free. In production, the model will be frozen and so will give the same output if given the same input more than once. The behaviour will change only when the model is retrained.

This retraining could happen after X more time has passed, after X more data has been processed, or after the performance of the model has dropped by / below X. (The last one applies if the data is gradually changing over time, so a fixed model will become less and less accurate.)

Which kind of threshold to trigger retraining makes most sense for your context?
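Whichever threshold you choose, the check itself can be a simple guard. Here is a sketch combining the three hypothetical triggers described above; all threshold values and parameter names are placeholders to adapt to your context:

```python
def should_retrain(days_since_train, rows_since_train, current_accuracy,
                   baseline_accuracy,
                   max_days=30, max_rows=100_000, max_drop=0.05):
    """True if any retraining threshold has been crossed."""
    if days_since_train >= max_days:          # X more time has passed
        return True
    if rows_since_train >= max_rows:          # X more data processed
        return True
    if baseline_accuracy - current_accuracy >= max_drop:  # performance drop
        return True
    return False

print(should_retrain(10, 50_000, 0.90, 0.92))  # False: nothing crossed
print(should_retrain(10, 50_000, 0.85, 0.92))  # True: accuracy dropped 0.07
```

In practice a check like this would run on a schedule against monitoring data rather than being called by hand.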

It’s probably something you have at least partly sorted because of the data engineering that’s currently happening, but traceability, provenance, governance etc. are all important. If the model starts behaving differently (maybe worse) today than yesterday, can you work out why? Was a new model released? Why and by whom? Was this a manual process and someone forgot a step? What data was version X of the model trained on? Version X-1? Can you easily go back to version X-1? Can you produce a model that’s trained on the same data as version X-1, but then train it further on new data?

There are some architecture or design questions to consider (which I realise is much easier said than done). There’s more than one kind of model that you could use - would you get better performance with a different one? For a given model there can be important parameters (hyperparameters) to tune, e.g. how many layers to have in a neural network.

Even with the type and basic structure fixed, there’s the question of features. Instead of plugging all available data into the model, a pre-processed subset (the features) is used. First, a subset of the variables is chosen (by a human) because they are most strongly linked to the desired output. Once this filtering has happened, some simple maths can be applied before the features are ready, e.g. taking log to base 10, taking the minimum of two bits of data, etc.

Are the features the right ones? Are there inputs with the same or similar features, that should lead to different outputs?
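To make the feature idea concrete, here is a hypothetical feature-building function. The variable names and the maths applied are invented examples of the "select a subset, then transform" pattern described above:

```python
import math

def build_features(raw):
    """raw: dict of all available variables; returns the feature vector.
    Only the variables a human judged relevant are used."""
    return {
        # log to base 10 of a skewed variable
        "log_income": math.log10(raw["income"]),
        # minimum of two related bits of data
        "min_latency": min(raw["latency_a"], raw["latency_b"]),
    }

raw = {"income": 1000, "latency_a": 12, "latency_b": 9, "unused_field": 42}
print(build_features(raw))  # {'log_income': 3.0, 'min_latency': 9}
```

Note that `unused_field` is deliberately dropped - two raw inputs differing only in discarded variables will produce identical features, which is exactly the situation the question above is probing.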


I’ve used Probability Theory Testing & Metamorphic testing to test Machine Learning Models.

Probability Theory Testing is pure math.

Metamorphic Testing:
It’s basically creating 10,000 tests and, instead of validating each response (which you can’t, because it’s a prediction), you compare all the results against each other:
X > Y > Z
If the acceptance criterion is that the ML model has to be at least 80% accurate, then at least 80% of your tests have to be in the correct order.
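The ordering comparison can be automated: feed in inputs whose outputs should come out in a known order, and check what fraction of adjacent pairs the model ranks correctly. A sketch with a toy stand-in model (the model and inputs here are made up for illustration):

```python
def ordering_score(model, ordered_inputs):
    """Fraction of adjacent prediction pairs in the expected descending
    order (X > Y > Z) for inputs listed from largest to smallest."""
    preds = [model(x) for x in ordered_inputs]
    pairs = list(zip(preds, preds[1:]))
    in_order = sum(1 for a, b in pairs if a > b)
    return in_order / len(pairs)

# Toy stand-in: predicted house price from size; bigger house,
# higher price, so prices should descend along this list of sizes.
model = lambda size: 100 * size
sizes_desc = [200, 150, 100, 50]

score = ordering_score(model, sizes_desc)
print(score)          # 1.0 for this toy model
print(score >= 0.80)  # True: meets the 80% acceptance criterion
```

With 10,000 generated inputs instead of four, the same score gives you a pass/fail against the 80% threshold without ever needing an exact expected value per prediction.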


Tom van de Ven, Rik Marselis, and Shaukat Humayun wrote a book “Testing in the digital age” which contains several chapters about testing AI.

Here is a video of the book launch: