AI Testing Standard and Metrics

These past few weeks I have been studying AI, using vibe coding tools, attending conferences full of "how to use AI tools" talks, and researching AI-related material. So far, though, I haven't found any simplified information on testing AI: the strategies being used, test designs, and the metrics that define whether confidence is high enough to release an AI tool to production. If you have references, please send the links; I'm interested in giving input and learning from them.

I recently completed the ISTQB AI tester certification - that goes into a fair amount of detail on testing AI and what that entails. The focus is on ML but some of it is applicable to LLMs.

They talk about back-to-back testing and A/B testing, comparing the AI system under test (SUT) either with a pseudo-oracle (a system built to mirror the SUT's expected results, so you can validate its outputs) or with a previous version of the AI system.
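A minimal sketch of what back-to-back testing can look like in practice: run the SUT and the pseudo-oracle on the same inputs and measure how often they agree. The functions and the 0.5 thresholds here are made-up placeholders, not anything from the syllabus.

```python
# Back-to-back testing sketch: compare the AI SUT against a pseudo-oracle
# on identical inputs. Both predictors below are hypothetical stand-ins.

def sut_predict(x):
    # stand-in for the AI system under test
    return 1 if x >= 0.5 else 0

def oracle_predict(x):
    # stand-in for the pseudo-oracle (e.g. a previous model version)
    return 1 if x > 0.5 else 0

def agreement_rate(inputs):
    # fraction of inputs on which the SUT and the oracle agree
    matches = sum(sut_predict(x) == oracle_predict(x) for x in inputs)
    return matches / len(inputs)

inputs = [0.1, 0.3, 0.5, 0.7, 0.9]
print(f"agreement: {agreement_rate(inputs):.0%}")  # they disagree only at x == 0.5
```

In a real pipeline you would release-gate on the agreement rate (or on a statistical test over the disagreements) rather than expect 100% agreement.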

They also talk about data analysis (of the training data).

There are a few metrics used to quantify AI systems, such as Precision, Accuracy, Recall, F1 score, the Receiver Operating Characteristic curve and its associated Area Under the Curve (ROC/AUC), as well as inter- and intra-cluster metrics. Which ones apply depends on the type of model/algorithm: classification/regression (supervised learning) or clustering/association (unsupervised learning).
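For the classification metrics above, a small worked example computed by hand from a toy confusion matrix (the labels are invented for illustration):

```python
# Classification metrics from first principles: accuracy, precision,
# recall, and F1 computed from true/predicted binary labels.

def classification_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # model output: 1 FN, 1 FP
print(classification_metrics(y_true, y_pred))
```

In practice you would use a library such as scikit-learn for these, but seeing the TP/FP/FN arithmetic makes it clear why precision and recall can diverge on imbalanced data.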

The syllabus is free: Certified Tester AI Testing (CT-AI) - International Software Testing Qualifications Board

Thanks Mark, I have just started reading through it as well.

@ross,

Exactly! AI testing is still a domain without clear, universally agreed guidelines. What we have today is scattered across research papers, case studies, and company-internal practices.

From my perspective, the testing of AI revolves around:

Data quality metrics—bias detection, representativeness, completeness

Model performance metrics—accuracy, precision, recall, F1 score

Robustness testing—adversarial inputs, edge cases

Ethical and fairness checks

Explainability—how transparent and understandable the decision-making process is
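To make the first item concrete, here is a hedged sketch of one data quality check: representativeness measured as the positive-label rate per group, with a gap flagged against a chosen threshold. The records, group names, and threshold are all invented for illustration.

```python
# Data quality / bias-detection sketch: compare positive-label rates
# across groups in the training data and report the largest gap.

from collections import Counter

def positive_rate_by_group(records):
    """records: list of (group, binary_label) pairs."""
    totals, positives = Counter(), Counter()
    for group, label in records:
        totals[group] += 1
        positives[group] += label
    return {g: positives[g] / totals[g] for g in totals}

records = [("A", 1), ("A", 0), ("A", 1), ("B", 0), ("B", 0), ("B", 1)]
rates = positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
# flag the dataset for review if the gap exceeds an agreed threshold
print(rates, "gap:", round(gap, 2))
```

Real fairness tooling goes much further (disparate impact ratios, equalized odds), but even a simple per-group rate check like this catches gross imbalances before training.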

Two frameworks worth mentioning:

NIST AI Risk Management Framework

The NIST AI Risk Management Framework, published by the U.S. National Institute of Standards and Technology, presents a structured approach to identifying, assessing, and managing AI risks such as safety, bias, and trustworthiness.

ISO/IEC 24028

ISO/IEC 24028 is the international standard addressing AI trustworthiness in terms of security, privacy, reliability, and ethical considerations.