Testing approach for an AI-integrated application

Does anyone have experience testing an application that uses an AI integration?
E.g. when a user adds a topic, the system analyses the topic and generates a web page or content related to it. The content is created with the help of an AI tool such as ChatGPT.

Looking for some guidance on test planning and strategy in this context.

3 Likes

I don’t have any experience in this, but here’s what I’m thinking:

  • What level of confidence do you have in the AI itself? Has it been sold to you with a certain assurance of quality, or are you responsible for testing it from scratch?
  • What are the consequences of the system going wrong in different ways? Will users rely on the truth of the content? What happens if the content is irrelevant? Are there vulnerable users, like children, who should be shielded from certain topics or language?
  • Is the output static, i.e. the same input will always produce the same result, or does it learn from/adapt to use?
  • What are the requirements or acceptance criteria for the generated content to be acceptable, and how much time/skill does it take to assess this?

Of course this isn’t anything concrete, but I hope it helps if you are still stuck. Anyway, it was a good thought exercise, and good luck - it sounds like a big challenge!

2 Likes

I’ve always come up blank when trying to figure out how to test machine learning/AI. Everything around it - fine. Just not the thing itself.

1 Like

There’s going to come a point where we’ll need to write tests to verify the thing won’t try to bump off its users. Testing could become a very dangerous profession.

1 Like

You know, I’m not looking forward to the after-action reports on that one. “90% of tests were successful, but the ‘will not murder anyone’ test failed and we need to replace three of our testers.” Yeah, nope.

2 Likes

I’ve been putting together some thoughts on this, but I’m not ready to blog it yet, so I’ll put the rough notes here. I’m currently testing an AI integration similar to what you’re talking about, though not the same use case. Sorry that these notes will be rambling, but hopefully something in here helps to inspire you and lets you see the end-to-end process of testing an integration like this.

Testing the Idea and Prompt

First, before you can test the integration, you will need to test the design/engineering of the prompt. By prompt, I mean the set of instructions that an engineer writes for the AI to follow. It’s not always easy to write these in a way that gives a useful response, and you might have to go through multiple versions of the prompt text before it is good enough.
(As an aside, probably not relevant to your case, but to anyone else: you must ask whether you are legally and morally able to use the AI for your use case, especially when dealing with user data. ChatGPT and others are based in the US, so you must anonymise EU and UK user data that will be processed there, and you can only send data if you have a legitimate business concern as defined by the ICO. There are still enough tech jobs out there that no one should be doing a job where they are asked to produce morally/ethically/legally questionable features.)
At this stage, you will be testing whether or not the AI is fit for purpose. How good is it at doing this task? How good are the engineers in your team at writing prompts for AI? How reliably does it return acceptable content for a given topic? Are there any types of topic it’s bad at?

Ideally, you don’t want to wait until the end to test these questions. You want to be part of the team researching and implementing the prompt… or at least helping to inform their process.
In practice, at this stage you have to do some testing with known data. For your example, you’d put together a list of topics you think are easy and difficult to generate content for, including:

  • Highly specific topics, e.g. “sexism in pre-medieval Welsh folklore”
  • General topics, e.g. “sports”
  • Topics that can be understood in two different ways, e.g. “golden band” could mean a wedding ring or a musical group. (You can look up lists of homonyms - words with multiple meanings - online for ideas.)
  • Similarly, topics that are the name of a band or famous book etc., but have a general meaning too, e.g. “Witches Abroad” might tell you about Terry Pratchett’s Discworld novel, or about actual witches on holiday…
  • Recent trends: AI models are typically a year or more behind current knowledge and world events, so you could consider testing what happens when one is given a topic concerning recent world events.

You should probably create two lists with a mixture of these types of topics: one for developing the prompt against, and the other for testing it afterwards. This means that the person writing the prompt can ensure they create a prompt that returns acceptable data for a range of different inputs; then, when they think they have it exactly right, the prompt can be tested against data it hasn’t seen before. (This is not about training the AI, as GPT doesn’t store or remember what you send it. This is about training the person writing the prompt to write a very good one. xD)
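To make that concrete, here is a rough sketch of a harness for running the two topic lists through the prompt and flagging failures. Everything here is assumed for illustration: `generate_content` is a stand-in for your real call to the AI API, and `is_acceptable` is whatever your team decides “acceptable” means.

```python
# Hypothetical dev and holdout lists, using the topic types discussed above.
DEV_TOPICS = ["sports", "sexism in pre-medieval Welsh folklore", "golden band"]
HOLDOUT_TOPICS = ["Witches Abroad", "recent world events"]

def generate_content(topic: str) -> str:
    # Placeholder: in reality this would send your prompt plus the topic
    # to the AI and return its response.
    return f"<h1>{topic}</h1><p>Generated content about {topic}.</p>"

def is_acceptable(response: str, topic: str) -> bool:
    # Stand-in acceptance check: non-empty content that mentions the topic.
    # Your real criteria will be richer (format, tone, relevance, safety...).
    return bool(response.strip()) and topic.lower() in response.lower()

def run_topics(topics):
    # Return the topics whose responses failed the acceptance check.
    return [t for t in topics if not is_acceptable(generate_content(t), t)]

# Develop the prompt against DEV_TOPICS; only when those all pass,
# run the unseen HOLDOUT_TOPICS as a final check.
print(run_topics(DEV_TOPICS + HOLDOUT_TOPICS))  # ideally []
```

The point of the two lists shows up in `run_topics`: you iterate on the prompt until the dev list passes, then run the holdout list once to see whether the prompt generalises.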

Testing Outputs

Next, you will want to test the output of the AI. The prompt writer will ask the AI to return the response in a certain format - e.g. perhaps in your case it will be returned as valid HTML, or as text in JSON with keys for the page title, content, etc. Whatever is decided by the team, you will need to test that the AI responds in the format you want every time. (Spoiler: sometimes it will do its own thing, no matter what you tell it…)
Sub-topic here: error handling… The AI is not always going to respond exactly the way you want. This is not going to be as reliable as an integration with, for example, Google Maps. There will need to be some checking of the response on your system’s side, and handling of different error cases, e.g.:
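For the JSON case, the format check can be a small validator that your tests (and your production code) run over every response. This is a sketch under an assumed schema - the `title`/`content` keys are made up for illustration; use whatever your team agrees on.

```python
import json

REQUIRED_KEYS = {"title", "content"}  # assumed schema, agreed with the team

def validate_response(raw: str):
    """Return (ok, error) for a response expected to be JSON with known keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(data, dict):
        return False, "response is not a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, None

print(validate_response('{"title": "Sports", "content": "<p>...</p>"}'))
# The "doing its own thing" case - chatty prose instead of JSON:
print(validate_response("Sure! Here is your page:"))
```

Running the validator over every response in your test lists gives you a quick measure of how often the prompt actually holds the format.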

  • AI is overloaded with requests and cannot process this one
  • AI responds in an unexpected or invalid format
  • AI server is down for maintenance
  • You ran out of tokens/credit to make requests

It is hard to test these scenarios, so what we are doing is implementing a very simple mock which can be turned on and off via a feature flag on test environments. If the feature flag is turned on, we send the request to mockAI instead of the real AI, so we can return a chosen state, e.g. a 500 error, and test end to end what our system does in that case. There are probably other ways to do it.
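A minimal version of that mock-behind-a-flag idea could look like this. The environment variable names and response shape are assumptions for the sketch, not a real API - the point is only that one flag switches the integration to a mock, and a second knob selects which failure mode the mock simulates.

```python
import os

def mock_ai(topic: str) -> dict:
    # Simulated failure modes, selectable via an env var in e2e tests.
    mode = os.environ.get("MOCK_AI_MODE", "ok")
    if mode == "overloaded":
        return {"status": 429, "body": "Too Many Requests"}
    if mode == "down":
        return {"status": 500, "body": "Internal Server Error"}
    if mode == "bad_format":
        return {"status": 200, "body": "Sure! Here's your page:"}
    return {"status": 200, "body": '{"title": "ok", "content": "<p>ok</p>"}'}

def call_ai(topic: str, real_client=None) -> dict:
    # The feature flag decides whether we hit the real AI or the mock.
    if os.environ.get("USE_MOCK_AI") == "1":
        return mock_ai(topic)
    return real_client.generate(topic)  # hypothetical real integration call
```

In a test you would flip `USE_MOCK_AI` on, set `MOCK_AI_MODE` to, say, `down`, and then check that your system shows a sensible error to the user instead of crashing.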

Testing your System

OK, enough about testing the AI - let’s talk about what tests need to exist on your side. This is where it turns into your everyday testing of a third-party integration. Generally with integrations, you ask yourself: which side do I think will be reliable? If you’re integrating with Google Maps, then you probably don’t need to check their responses a lot, since they are not going to break. However, an AI integration is much less reliable, so you will need monitoring and alerts. For example:

  • Monitoring the number or percentage of responses which come back in unacceptable formats. You may find it is something like 0.1%, and then one day it goes up to 5%, which would trigger action on your side to see what’s gone wrong.
  • Periodically sending a set of data which you know it can handle, and checking that the responses are still ok.
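The first of those checks is just a rate plus a threshold. The 5% threshold below is an assumption taken from the example above; in practice you would tune it to your own baseline (and real monitoring would live in your observability stack, not in application code).

```python
def bad_response_rate(results) -> float:
    """results: list of booleans, True = response parsed and validated OK."""
    if not results:
        return 0.0
    return results.count(False) / len(results)

ALERT_THRESHOLD = 0.05  # 5% - assumed; tune to your observed baseline

def should_alert(results) -> bool:
    return bad_response_rate(results) > ALERT_THRESHOLD

# 999 good responses and 1 bad: 0.1% - within the baseline, no alert.
print(should_alert([True] * 999 + [False]))     # False
# 94 good and 6 bad: 6% - over the threshold, alert.
print(should_alert([True] * 94 + [False] * 6))  # True
```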

You can test the handling on your side with or without a mock, the same as you would for any other product: give it a topic, check that it creates the page content as you expect. I’m not going to go any deeper here, since this is more in the realm of day-to-day functional testing. Although, if you are not given the opportunity to be involved early in the process as I assume here, then you will need to test all the stuff above at this point, e.g. what happens when you get errors from the AI, and how it handles different types of topics. I hope that you will be given the opportunity to be involved from the start to the end of this project, though, because working with AI is very interesting.

I hope some or all of that makes sense. I haven’t finished putting my ideas together for the blog post yet, so hopefully I will write something more sensible later xD

2 Likes

Long may this continue!

2 Likes

Thank you so much @undevelopedbruce, this is a great explanation to start <3