Testing approach for an AI-integrated application

Does anyone have experience testing an application that uses an AI integration?
E.g. when a user adds a topic, the system analyses the topic and generates a web page or content related to it. The content is created with the help of an AI tool such as ChatGPT.

Looking for some guidance on test planning and strategy in this context.

3 Likes

I don’t have any experience in this, but here’s what I’m thinking:

  • What level of confidence do you have in the AI itself? Has it been sold to you with a certain assurance of quality, or are you responsible for testing it from scratch?
  • What are the consequences of the system going wrong in different ways? Will users rely on the truth of the content? What happens if the content is irrelevant? Are there vulnerable users, like children, who should be shielded from certain topics or language?
  • Is the output static, i.e. the same input will always produce the same result, or does it learn from/adapt to use?
  • What are the requirements or acceptance criteria for the generated content to be acceptable, and how much time/skill does it take to assess this?

Of course this isn’t anything concrete, but I hope it helps if you are still stuck. Anyway, it was a good thought exercise, and good luck - it sounds like a big challenge!

2 Likes

I’ve always come up blank when trying to figure out how to test machine learning/AI. Everything around it - fine. Just not the thing itself.

1 Like

There’s going to come a point where we’ll need to write tests to verify the thing won’t try to bump off its users. Testing could become a very dangerous profession.

1 Like

You know, I’m not looking forward to the after-action reports on that one. “90% of tests were successful, but the ‘will not murder anyone’ test failed and we need to replace three of our testers.” Yeah, nope.

2 Likes

I’ve been putting together some thoughts on this, but I’m not ready to blog it yet, so I’ll put the rough notes here. I’m currently testing an AI integration similar to what you’re talking about, though not the same use case. Sorry that these notes will be rambling, but hopefully something in here helps to inspire you and lets you see the end-to-end process of testing an integration like this.

Testing the Idea and Prompt

First, before you can test the integration, you will need to test the design/engineering of the prompt. By prompt, I mean the set of instructions that an engineer writes for the AI to follow. It’s not always easy to write these in a way that gives a useful response, and you might have to go through multiple versions of the prompt text before it is good enough.
(As an aside, probably not relevant to your case, but to anyone else: you must ask whether you are legally and morally able to use the AI for your use case, especially when dealing with user data. ChatGPT and others are based in the US, so you must anonymise EU and UK user data that will be processed there, and you can only send data if you have a legitimate business concern as defined by the ICO. There are still enough tech jobs out there that no one should be doing a job where they are asked to produce morally/ethically/legally questionable features.)
At this stage, you will be testing whether or not the AI is fit for purpose. How good is it at doing this task? How good are the engineers in your team at writing prompts for AI? How reliably does it return acceptable content for a given topic? Are there any types of topic it’s bad at?

Ideally, you don’t want to wait until the end to test these questions. You want to be part of the team researching and implementing the prompt… or at least helping to inform their process.
In practice, at this stage you have to do some testing with known data. For your example, you’d put together a list of topics you think are easy and difficult to generate content for, including:

  • Highly specific topics, e.g. “sexism in pre-medieval Welsh folklore”
  • General topics, e.g. “sports”
  • Topics that can be understood in two different ways, e.g. “golden band” could mean a wedding ring or a musical group. (You can look up lists of homonyms - words with multiple meanings - online for ideas.)
  • Similarly, topics that are the name of a band or famous book etc., but have a general meaning too, e.g. “Witches Abroad” might tell you about Terry Pratchett’s Discworld novel, or about actual witches on holiday…
  • Recent trends: AI models are typically a year or more behind current knowledge and world events, so you could consider testing what happens when one is given a topic concerning recent world events.

You should probably create two lists with a mixture of these types of topics: one for developing the prompt against, and the other for testing it afterwards. This means that the person writing the prompt can ensure they create a prompt that returns acceptable data for a range of different inputs; then, when they think they have it exactly right, the prompt can be tested against data it hasn’t seen before. (This is not about training the AI, as GPT doesn’t store or remember what you send it. This is about training the person writing the prompt to write a very good one. xD)
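To make that concrete, here is a rough sketch of a harness for running the two topic lists through the prompt and flagging failures. Everything here is assumed for illustration: `generate_content` is a stand-in for your real call to the AI API, and `is_acceptable` is whatever your team decides “acceptable” means.

```python
# Hypothetical dev and holdout lists, using the topic types discussed above.
DEV_TOPICS = ["sports", "sexism in pre-medieval Welsh folklore", "golden band"]
HOLDOUT_TOPICS = ["Witches Abroad", "recent world events"]

def generate_content(topic: str) -> str:
    # Placeholder: in reality this would send your prompt plus the topic
    # to the AI and return its response.
    return f"<h1>{topic}</h1><p>Generated content about {topic}.</p>"

def is_acceptable(response: str, topic: str) -> bool:
    # Stand-in acceptance check: non-empty content that mentions the topic.
    # Your real criteria will be richer (format, tone, relevance, safety...).
    return bool(response.strip()) and topic.lower() in response.lower()

def run_topics(topics):
    # Return the topics whose responses failed the acceptance check.
    return [t for t in topics if not is_acceptable(generate_content(t), t)]

# Develop the prompt against DEV_TOPICS; only when those all pass,
# run the unseen HOLDOUT_TOPICS as a final check.
print(run_topics(DEV_TOPICS + HOLDOUT_TOPICS))  # ideally []
```

The point of the two lists shows up in `run_topics`: you iterate on the prompt until the dev list passes, then run the holdout list once to see whether the prompt generalises.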

Testing Outputs

Next, you will want to test the output of the AI. The prompt writer will ask the AI to return the response in a certain format - e.g. perhaps in your case it will be returned as valid HTML, or as text in JSON with keys for the page title, content, etc. Whatever is decided by the team, you will need to test that the AI responds in the format you want every time. (Spoiler: sometimes it will do its own thing, no matter what you tell it…)
Sub-topic here: error handling… The AI is not always going to respond exactly the way you want. This is not going to be as reliable as an integration with, for example, Google Maps. There will need to be some checking of the response on your system’s side, and handling of different error cases, e.g.:
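For the JSON case, the format check can be a small validator that your tests (and your production code) run over every response. This is a sketch under an assumed schema - the `title`/`content` keys are made up for illustration; use whatever your team agrees on.

```python
import json

REQUIRED_KEYS = {"title", "content"}  # assumed schema, agreed with the team

def validate_response(raw: str):
    """Return (ok, error) for a response expected to be JSON with known keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(data, dict):
        return False, "response is not a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, None

print(validate_response('{"title": "Sports", "content": "<p>...</p>"}'))
# The "doing its own thing" case - chatty prose instead of JSON:
print(validate_response("Sure! Here is your page:"))
```

Running the validator over every response in your test lists gives you a quick measure of how often the prompt actually holds the format.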

  • AI is overloaded with requests and cannot process this one
  • AI responds in an unexpected or invalid format
  • AI server is down for maintenance
  • You ran out of tokens/credit to make requests

It is hard to test these scenarios, so what we are doing is implementing a very simple mock which can be turned on and off via a feature flag on test environments. If the feature flag is turned on, we send the request to mockAI instead of the real AI, so we can return a chosen state, e.g. a 500 error, and test end to end what our system does in that case. There are probably other ways to do it.
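A minimal version of that mock-behind-a-flag idea could look like this. The environment variable names and response shape are assumptions for the sketch, not a real API - the point is only that one flag switches the integration to a mock, and a second knob selects which failure mode the mock simulates.

```python
import os

def mock_ai(topic: str) -> dict:
    # Simulated failure modes, selectable via an env var in e2e tests.
    mode = os.environ.get("MOCK_AI_MODE", "ok")
    if mode == "overloaded":
        return {"status": 429, "body": "Too Many Requests"}
    if mode == "down":
        return {"status": 500, "body": "Internal Server Error"}
    if mode == "bad_format":
        return {"status": 200, "body": "Sure! Here's your page:"}
    return {"status": 200, "body": '{"title": "ok", "content": "<p>ok</p>"}'}

def call_ai(topic: str, real_client=None) -> dict:
    # The feature flag decides whether we hit the real AI or the mock.
    if os.environ.get("USE_MOCK_AI") == "1":
        return mock_ai(topic)
    return real_client.generate(topic)  # hypothetical real integration call
```

In a test you would flip `USE_MOCK_AI` on, set `MOCK_AI_MODE` to, say, `down`, and then check that your system shows a sensible error to the user instead of crashing.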

Testing your System

OK, enough about testing the AI - let’s talk about what tests need to exist on your side. This is where it turns into your everyday testing of a third-party integration. Generally with integrations, you ask yourself: which side do I think will be reliable? If you’re integrating with Google Maps, then you probably don’t need to check their responses a lot, since they are not going to break. However, an AI integration is much less reliable, so you will need monitoring and alerts. For example:

  • Monitoring the number or percentage of responses which come back in unacceptable formats. You may find it is something like 0.1%, and then one day it goes up to 5%, which would trigger action on your side to see what’s gone wrong.
  • Periodically sending a set of data which you know it can handle, and checking that the responses are still ok.
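The first of those checks is just a rate plus a threshold. The 5% threshold below is an assumption taken from the example above; in practice you would tune it to your own baseline (and real monitoring would live in your observability stack, not in application code).

```python
def bad_response_rate(results) -> float:
    """results: list of booleans, True = response parsed and validated OK."""
    if not results:
        return 0.0
    return results.count(False) / len(results)

ALERT_THRESHOLD = 0.05  # 5% - assumed; tune to your observed baseline

def should_alert(results) -> bool:
    return bad_response_rate(results) > ALERT_THRESHOLD

# 999 good responses and 1 bad: 0.1% - within the baseline, no alert.
print(should_alert([True] * 999 + [False]))     # False
# 94 good and 6 bad: 6% - over the threshold, alert.
print(should_alert([True] * 94 + [False] * 6))  # True
```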

You can test the handling on your side with or without a mock, the same as you would for any other product: give it a topic, check that it creates the page content as you expect. I’m not going to go any deeper here, since this is more in the realm of day-to-day functional testing. Although, if you are not given the opportunity to be involved early in the process as I assume here, then you will need to test all the stuff above at this point, e.g. what happens when you get errors from the AI, and how it handles different types of topics. I hope that you will be given the opportunity to be involved from the start to the end of this project, though, because working with AI is very interesting.

I hope some or all of that makes sense. I haven’t finished putting my ideas together for the blog post yet, so hopefully I will write something more sensible later xD

2 Likes

Long may this continue!

2 Likes

Thank you so much @undevelopedbruce, this is a great explanation to start <3