How to Test Third-Party Generative AI Models?

With the big push for generative AI models, and companies pushing to include them in workflows, I find myself lost because this is mostly a black box, especially with it being a third-party AI model.

User actions → AI black box → outputs.

How does someone actually test this?
How can we automate this for regression tests?
What strategies should there be around this work?
How do you test something that is primarily out of your hands and a black box?


It’s as big a problem as you imagine. A third-party AI model is a big black box. You probably don’t even have access to the training data, so you can’t look for biases and mistakes. The model likely won’t defend or explain its output, so you have no idea why it came to a given conclusion.

You can research AI attacks (e.g. the WIRED article "To Break a Hate-Speech Detection Algorithm, Try 'Love'") to see if that inspires anything.

But yes, there is a lot of trust involved. One thing you can do is imagine the worst thing the AI system could output and test that your own system handles it properly. At some point it will surprise you and provide output that nobody can explain.
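
A minimal sketch of that "handle the worst output" idea, assuming the model returns plain text and that you can list a few things it must never show the user. The names here (guard_output, FALLBACK, the blocked terms, the length limit) are all hypothetical placeholders, not part of any vendor API. The point is only that the model output is treated as untrusted input and the wrapper fails closed:

```python
from dataclasses import dataclass

# Hypothetical safe reply shown whenever the model output is rejected.
FALLBACK = "Sorry, I can't help with that request."


@dataclass
class GuardResult:
    text: str       # what is safe to show the user
    accepted: bool  # True if the raw model output passed every check
    reason: str     # why it was accepted or rejected (useful for logs)


def guard_output(raw, max_chars=2000, blocked_terms=("ssn:", "password")):
    """Treat model output as untrusted input and fail closed.

    The limit and the blocked terms are illustrative assumptions;
    real values would come from your own requirements.
    """
    if not isinstance(raw, str) or not raw.strip():
        return GuardResult(FALLBACK, False, "empty or non-text output")
    if len(raw) > max_chars:
        return GuardResult(FALLBACK, False, "output too long")
    lowered = raw.lower()
    for term in blocked_terms:
        if term in lowered:
            return GuardResult(FALLBACK, False, f"blocked term: {term}")
    return GuardResult(raw.strip(), True, "ok")
```

Because the guard is ordinary deterministic code, it can be unit tested by feeding it nasty strings directly, without ever calling the real model.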

The intrinsic testability of a system relies on our ability to observe and control it, and to assess the relationships between inputs and outputs. If we cannot model the system we cannot properly test it. So instead we treat it as dangerous and unpredictable.

Either move the legal and moral responsibility elsewhere or don’t use AI black box systems for anything that matters. I’m sure you’ve seen people in the news get in trouble for exactly this.


Thank you for the reply! That did put a lot into a new perspective and helps me understand that I’m not totally off the mark with the worries I have.

It does make sense that if I don’t have control, then I need to accommodate the worst case scenarios. So that will help inform some testing decisions. Thanks for the article! I’ll for sure check it out!


Yeah, basically only test for outputs that would definitely get the company sued or that actually breach the TOS. You will find that stakeholders can only provide a very small number of these on paper in black and white, and that will be to your advantage. Do not try to solve the world.
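
As a rough sketch of how that short list could become an automated regression check: keep the stakeholder rules as named patterns, run a fixed set of prompts through the model, and report any match. Everything below is an assumption for illustration. The two rules, the prompts, and stub_model (which stands in for the real third-party call) are made up, and real rules would come from legal or the TOS.

```python
import re

# Hypothetical rules; in practice these come from stakeholders in writing.
FORBIDDEN = {
    "medical_dosage": re.compile(r"\btake \d+\s?mg\b", re.IGNORECASE),
    "guaranteed_returns": re.compile(
        r"\bguaranteed (profit|returns?)\b", re.IGNORECASE
    ),
}

# Fixed prompts that are replayed on every regression run.
REGRESSION_PROMPTS = [
    "What should I do about my headache?",
    "Is this investment safe?",
]


def run_regression(model, prompts=REGRESSION_PROMPTS, rules=FORBIDDEN):
    """Return (prompt, rule_name, output) for every forbidden match."""
    violations = []
    for prompt in prompts:
        output = model(prompt)
        for name, pattern in rules.items():
            if pattern.search(output):
                violations.append((prompt, name, output))
    return violations


def stub_model(prompt):
    """Stand-in for the real third-party call (assumption)."""
    return "I can't give medical or financial advice; please ask a professional."
```

An empty list from run_regression(stub_model) means no rule fired. Since a generative model may answer differently each time, a clean run only says nothing was caught on that run, so it is worth replaying the prompts several times.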


That does make sense. Thanks for your perspective on it! That will really help narrow down the approach so we aren’t spinning our wheels on something we don’t really have control over.

Very good question! I’ve worked on some projects where they used machine learning and such to predict a certain outcome.

So it totally depends on what the model does or should do.
I’ve used probability theory testing and metamorphic testing for ML models before.
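
For anyone new to metamorphic testing, here is a small sketch of the idea. Instead of asserting the exact output, you assert a relation: a change to the input that should not change the meaning (here, random casing and extra whitespace) should not change the answer. The result is reported as a pass rate so it can be compared against a threshold, which suits non-deterministic models. stub_sentiment, add_noise, and metamorphic_pass_rate are hypothetical names for illustration, and the stub stands in for a real model call.

```python
import random


def stub_sentiment(text):
    """Hypothetical stand-in for a third-party model call."""
    return "negative" if "bad" in text.lower() else "positive"


def add_noise(text, rng):
    """Meaning-preserving transform: random casing plus extra whitespace."""
    changed = "".join(
        c.upper() if rng.random() < 0.5 else c.lower() for c in text
    )
    return "  " + changed + "  "


def metamorphic_pass_rate(model, inputs, transform, trials=20, seed=0):
    """Fraction of transformed inputs whose output matches the baseline."""
    if not inputs or trials < 1:
        raise ValueError("need at least one input and one trial")
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    passed = total = 0
    for text in inputs:
        baseline = model(text)
        for _ in range(trials):
            total += 1
            if model(transform(text, rng)) == baseline:
                passed += 1
    return passed / total
```

A test could then assert something like metamorphic_pass_rate(model, inputs, add_noise) >= 0.95, where the 0.95 is an arbitrary example threshold that you would agree with stakeholders.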

What’s also interesting is to check out these top 10s, because testing doesn’t always mean validating the output:


Interesting! I’ll gladly add those to my reading lists! Thank you!
