AI Testing TestChat

Tonight's TestChat about testing in AI finished up our month of AI.

Q1. What crossovers are there between current testing and testing AI? What existing testing methods can be used to test AI?

A brief summary of the answers:
@friendlytester said "exploratory testing will be key. As I believe it will be very rare there will be a single repeatable answer. So we’ll have to continuously be asking questions. "

Noemi rightly pointed out “I think really depends on the AI application itself. Analytics could give us a lot of information about the system being well implemented and achieving its purpose.” and, interestingly “Some existing methods would still be valid (for testing the output of the app) But what about using AI to test an AI application (output)?”

Q2. Part of testing AI involves validating the output; how would you prepare your test data for an AI application? For example, @billmatthews discussed using the Monte Carlo method to build test data.

“if given the same data, would it learn exactly the same? How long does the ‘learning’ take?”

I wondered how much the order of the data matters, e.g. if I sorted it by first name as input, would I get different outputs than if I sorted it by surname?

As @punkmik points out, what biases are you adding into your data? Biases, and cultural and social responsibility, come into play here.

Q3. As AI is a continuously learning system, one concern is how to confirm that the test results are reached in the “right” way. For example, are multiple tests needed as proof? Also, how do we deal with negative cases? Will this impact the AI?

“Is there actually a version control that can reasonably be applied to a neural network?”

@ayaa.akl said “maybe multiple of tests on same set of data…”

“One characteristic that AI possibly has is that the same thing can be ‘asked’ in different ways. So perhaps we could test that the same result (Whatever that result is) is obtained for the different ways of expressing it. In this way we could provide some way of…”

“TBH teaching an AI to react to negative input without breaking sounds a lot like wading in the kind of cesspool that social media mods have to.”

“the trickiest part is to ensure the app learns and the users don’t ‘break it’ by teaching the wrong things (I think this was the problem for some AI trials in which inappropriate outputs were showing after a while of being ‘live’)”

“is testing even applicable to true AI? Imagine trying to test humans as we test code. Maybe testing won’t be possible; instead it will be the role of guide and teacher in the early stages…before our intelligence is surpassed and we become the students.”

Q4. What skills do you believe are necessary to get a job as an AI tester? Is it a case of building on existing knowledge or exploring new emerging technology? Or both?

“I guess for some areas the skills will be hard computer science - on the training side (after all, the training inputs need to be validated/organized as well). On the other hand I see a lot of room for crossover from psychology or humanities. Even if I don’t really expect a strong AI, ever, if we want the little ones to emulate/enhance the human mind, the same ways of working could very well apply.”

“I think it will be a new area of knowledge which will get experience from some communication and cultural studies.”

@alex said “Some amount of Education as a discipline as well. How to formally assess knowledge, how knowledge is generated, etc, etc.”

Interesting conversations. I was struggling with the CrowdChat interface (specifically, I refuse to use social login, and I accept the consequences that may have on my life), but I'd like to add my comments on ML testing generally.

  1. Any methods you use to generate test data would be valid for making training data, so the resulting data should already fit the model. If you're doing data-driven testing on an ML model then you're developing the model, not testing it. This will result in lots of untested models in production.
  2. Compliance/social standards/sanity test. ML is liable to do things which might be considered illegal or uncouth in society by using discriminatory characteristics or proxies thereof (e.g. female-sounding names when avoiding gender, or image search not returning gorillas, see the Wired article). We are responsible for catching these things. This will be humans identifying these limits and then testing for them on the output, generally with statistical tests (see the first sketch after this list).
  3. White-box testing. Using (again) statistical analysis and visualisation to find areas of significance. By finding areas with extremely high gradient (i.e. sensitivity) it might be possible to find inputs which will cause issues. It is then (occasionally) possible to generalise these in such a way as to understand what is being misclassified.
  4. Null input testing. Crafting data which is absurd but non-random, and ensuring that it is categorised as such. A lot of models don't have a bin for ‘uncategorised/impossible’. Teaching a model when to admit it doesn't know is key to avoiding stupid decisions (see the second sketch after this list).
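
To make point 2 concrete, here is a minimal sketch of the kind of statistical output check I mean. Everything named here is hypothetical: assume a loan-style model whose decisions we collect for a set of test records, each of which carries a protected `group` field we must not discriminate on, directly or via proxies.

```python
# Sketch: chi-square test of whether the decision depends on the protected group.
from collections import Counter

from scipy.stats import chi2_contingency

def approval_rates_differ(records, decisions, alpha=0.01):
    """Return True if the outcome distribution differs significantly by group."""
    counts = Counter((r["group"], d) for r, d in zip(records, decisions))
    groups = sorted({g for g, _ in counts})
    outcomes = sorted({o for _, o in counts})
    # Build the group x outcome contingency table from observed counts.
    table = [[counts[(g, o)] for o in outcomes] for g in groups]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha  # significant difference: investigate before release

# usage sketch (model and test_records are hypothetical):
# decisions = [model.predict(r) for r in test_records]
# assert not approval_rates_differ(test_records, decisions)
```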

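And a sketch of the null input idea in point 4, assuming a scikit-learn-style classifier that exposes `predict_proba` and has no explicit ‘unknown’ class. The absurd-but-non-random inputs and the confidence threshold are the tester's to craft; the names below are invented for illustration.

```python
# Sketch: flag absurd inputs the model is suspiciously confident about.
import numpy as np

def overconfident_on_nonsense(model, absurd_inputs, max_confidence=0.7):
    """Return the absurd inputs whose top class probability exceeds the threshold."""
    probabilities = model.predict_proba(absurd_inputs)  # shape (n_samples, n_classes)
    top = probabilities.max(axis=1)                     # highest class probability per input
    return [x for x, p in zip(absurd_inputs, top) if p > max_confidence]

# usage sketch:
# failures = overconfident_on_nonsense(model, build_absurd_cases())
# assert not failures, "model should not be confident about nonsense inputs"
```
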
Doing it well will require imaginative critical thinking, good domain knowledge and a lot of statistics.

Bias warning: I’ve taught college courses, and my degree is in Education. I’ve also never tested an AI application.

I keep thinking about how you assess people for these sorts of things, and I keep coming back to a rubric (a lot like a test plan) for scoring the responses from the AI. The criteria for the rubric would probably be whatever matters most to you about the responses you get. In a way, you'd have a tester literally proctoring a test for the AI.

You'd also want to establish criteria for which responses are explicitly not okay (when do you “instant-fail” the AI?) and which score levels indicate that there might be a problem.
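
For illustration only, a toy sketch of what such a rubric might look like in code; the criteria, weights and thresholds are invented, not taken from any real assessment.

```python
# Sketch: weighted rubric criteria, an explicit instant-fail list,
# and a score threshold that flags responses worth a human look.
RUBRIC = {
    "factually_correct": 3,  # highest weight: what matters most
    "on_topic": 2,
    "polite_tone": 1,
}
INSTANT_FAIL = {"abusive_language", "leaks_personal_data"}
WARN_BELOW = 4  # a total score under this suggests a possible problem

def score_response(checks):
    """`checks` maps criterion/flag name -> bool, as judged by the proctoring tester."""
    if any(checks.get(flag) for flag in INSTANT_FAIL):
        return 0, "instant fail"
    total = sum(weight for name, weight in RUBRIC.items() if checks.get(name))
    return total, ("review" if total < WARN_BELOW else "pass")

# e.g. score_response({"factually_correct": True, "on_topic": True}) -> (5, "pass")
```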

Sadly I missed the test chat.

Some thoughts…
Re: Test Data
The stochastic nature of reinforcement learning means that, even given the same data, it is highly unlikely the learning would be the same. Even fixing random seeds, system environments, etc. can give different outcomes. So although we need to be able to prepare/use different sets of data to test these systems, we really need ways of interpreting the output to gain confidence that the outcome is what we expect, to some tolerance and pattern.
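
To show what “to some tolerance” could look like in practice, here is a minimal sketch. It assumes a hypothetical `train_and_evaluate(seed)` function standing in for your training pipeline and returning a single score such as mean reward or accuracy.

```python
# Sketch: run training several times and check the metric stays inside a band,
# rather than expecting bit-for-bit identical results.
import statistics

def outcome_within_tolerance(train_and_evaluate, runs=5,
                             expected=0.9, tolerance=0.05):
    scores = [train_and_evaluate(seed=i) for i in range(runs)]
    mean = statistics.mean(scores)
    spread = max(scores) - min(scores)
    # Pass if the average lands near the expected value and runs do not diverge wildly.
    return abs(mean - expected) <= tolerance and spread <= 2 * tolerance
```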

I am not sure I would go along with the idea that data-driven testing is developing the model, not testing it. You train a model with the training set, then you can use a separate test set to check the performance of the model; this data was not used to train the model. More data can be collected and generated (think metamorphic testing) to help improve confidence.
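
As a concrete (if simple) example of a metamorphic relation used to generate new test data, here is a sketch for a hypothetical image classifier, under the assumption that a horizontal flip does not change the true label of images in this domain.

```python
# Sketch: a metamorphic test, predictions should be unchanged by a horizontal flip.
import numpy as np

def flip_preserves_label(model, images):
    """`images` is assumed to be a numpy batch shaped (n, height, width[, channels])."""
    original = model.predict(images)
    mirrored = model.predict(images[:, :, ::-1])  # flip along the width axis
    return np.array_equal(original, mirrored)
```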

Though this kind of data generation is less helpful for more complex AI systems; think computer games, where it is hard to alter a large data set in a valid way for the associated environment.

Re: confirm that the test results are reached in the “right” way

This is harder when you move away from classification problems, where you have a nice right/wrong result (e.g. is this picture a cat?), and look at more complex AI.
If you have an AI that plays against you in a card game, what is a ‘test pass’? It can't simply be winning a game, as no-one wants to play an AI that always wins. So is it that it wins 50% of the time? But is that 50% for all players, or just you?
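
One way to make that checkable is a statistical test over many games rather than a single pass/fail. A sketch, assuming a hypothetical `play_one_game()` that returns True when the AI wins:

```python
# Sketch: is the observed win rate statistically consistent with the target (e.g. 0.5)?
from scipy.stats import binomtest

def win_rate_looks_fair(play_one_game, games=200, target=0.5, alpha=0.01):
    wins = sum(play_one_game() for _ in range(games))
    result = binomtest(wins, games, p=target)
    # A tiny p-value means the AI wins significantly more or less often than intended.
    return result.pvalue >= alpha
```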

Re: Skills for an AI tester?
Depends on what the AI being developed is for, what domain it is in, etc…
It is as open a question as what skills are required to be a tester.

When looking for candidates I am looking for different things depending on what team they might join, so it's about fitting the person to a need, rather than a specific job spec.

E.g. someone to help with testing in the Reinforcement Learning team needs to have a good maths background to apply statistical techniques, and be able to gain a broad understanding of how neural networks or RL systems work. However, someone working on the platform that executes the learning system in the live environment can have a more basic understanding of RL and treat it as a black box, concentrating on its API, its robustness, etc.