I've been putting together some thoughts on this, but I'm not ready to blog it yet, so I'll put the rough notes here. I'm currently testing an AI integration similar to what you're talking about, though not the same use case. Sorry that these notes will be rambling, but hopefully something in here inspires you and helps you see the end-to-end process of testing an integration like this.
Testing the Idea and Prompt
First, before you can test the integration, you will need to test the design/engineering of the prompt. By prompt, I mean the set of instructions that an engineer will write for the AI to follow. It's not always easy to write these in a way that gives a useful response, and you might have to go through multiple versions of the prompt text before it is good enough.
(As an aside, probably not relevant to your case, but to anyone else: you must ask whether you are legally or morally able to use the AI for your use case, especially when dealing with user data. ChatGPT and others are based in the US, so you must anonymise EU and UK user data that will be processed there, and you can only send data if you have a legitimate interest as defined by the ICO.) (There are still enough tech jobs out there that no one should be doing a job where they are asked to produce morally/ethically/legally questionable features.)
At this stage, you will be testing whether or not the AI is fit for purpose. How good is it at doing this task? How good are the engineers in your team at writing prompts for AI? How reliably does it return acceptable content for a given topic? Are there any types of topic it's bad at?
Ideally, you don't want to wait until the end to test these questions. You want to be part of the team researching and implementing the prompt… or at least helping to inform their process.
In practice, at this stage you have to do some testing with known data. For your example, you'd put together a list of topics you think are easy and difficult to generate the content for, including:
- Highly specific topics, eg "sexism in pre-medieval Welsh folklore"
- General topics, eg "sports"
- Topics that can be understood in two different ways, eg "golden band" could mean a wedding ring, or a musical group. (You can look up lists of homonyms online, which are words with multiple meanings, for ideas.)
- Similarly, topics that are the name of a band or famous book etc, but have a general meaning too, eg "Witches Abroad" might tell you about Terry Pratchett's Discworld novel, or about actual witches on holiday…
- Recent trends: AI models are typically months to a year behind current knowledge and world events, so you could consider testing what happens when it's given a topic concerning recent world events.
You should probably create two lists with a mixture of these types of topics: one for developing the prompt against, and the other for testing it afterwards. This means that the person writing the prompt can ensure that it returns acceptable data for a range of different inputs, and then, when they think they have it exactly right, the prompt can be tested against data it hasn't been used against before. (This is not about training the AI, as GPT doesn't store or remember what you send it. This is about training the person writing the prompt to write a very good one. xD)
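To make that concrete, here's a rough sketch in Python of what a little prompt-evaluation harness could look like. Everything here is made up for illustration: call_ai() stands in for your real API client (stubbed here so the sketch runs), and looks_acceptable() stands in for whatever your team decides "acceptable" means.

```python
# Hypothetical topic lists -- one for developing the prompt, one held back for testing.
DEV_TOPICS = ["sports", "golden band", "sexism in pre-medieval Welsh folklore"]
HOLDOUT_TOPICS = ["Witches Abroad", "some recent world event"]

def call_ai(prompt: str, topic: str) -> str:
    """Stubbed for illustration -- replace with your real AI API client."""
    return f"Generated page content about {topic}. " * 20

def looks_acceptable(response: str) -> bool:
    """Placeholder acceptance check -- eg non-empty and a sensible length.
    Your real check might look at structure, tone, or topic relevance."""
    return bool(response) and len(response) > 100

def evaluate(prompt: str, topics: list[str]) -> dict[str, bool]:
    """Run every topic through the prompt and record pass/fail per topic."""
    return {topic: looks_acceptable(call_ai(prompt, topic)) for topic in topics}
```

The point is that each new version of the prompt gets run against the whole dev list, and only the "finished" prompt gets run against the holdout list.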
Testing Outputs
Next, you will want to test the output of the AI. The prompt writer will ask the AI to return the response in a certain format, eg perhaps in your case it will be returned as valid HTML, or perhaps text in JSON with keys for the page title, content, etc. Whatever is decided by the team, you will need to test that the AI responds in the format you want every time. (Spoiler: sometimes it will do its own thing, no matter what you tell it…)
Sub-topic here: error handling… The AI is not always going to respond exactly the way you want. This is not going to be as reliable as an integration with, for example, Google Maps. There will need to be some checking of the response on your system's side, and handling of different error cases, eg:
- AI is overloaded with requests and cannot process this one
- AI responds in an unexpected or invalid format
- AI server is down for maintenance
- You ran out of tokens/credit to make requests
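One way to think about those cases (just a sketch -- the error names and the handling strategies are my own invention, and the right choices depend entirely on your product) is a simple mapping from error case to strategy:

```python
from enum import Enum, auto

class AIError(Enum):
    OVERLOADED = auto()     # eg HTTP 429, too many requests
    BAD_FORMAT = auto()     # response didn't parse into the agreed format
    SERVER_DOWN = auto()    # eg HTTP 503, maintenance
    OUT_OF_CREDIT = auto()  # quota or billing exhausted

def handle_error(error: AIError) -> str:
    """Illustrative mapping from error case to handling strategy."""
    if error in (AIError.OVERLOADED, AIError.SERVER_DOWN):
        # Transient problems: worth retrying after a delay.
        return "retry-with-backoff"
    if error is AIError.BAD_FORMAT:
        # Maybe the next attempt formats correctly; if not, a human looks at it.
        return "retry-once-then-flag-for-review"
    if error is AIError.OUT_OF_CREDIT:
        # Retrying won't help until someone tops up the account.
        return "alert-team-and-pause-requests"
    return "log-and-investigate"
```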
It is hard to test these scenarios, so what we are doing is implementing a very simple mock which can be turned on and off via a feature flag on test environments. If the feature flag is on, we send the request to the mock AI instead of the real AI, so we can return a chosen state, eg a 500 error, and test end to end what our system does in that case. There are probably other ways to do it.
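Our setup is roughly like this, sketched here with environment variables standing in for the feature flag and the forced state (in reality the flag would probably live in a proper flag service, and MOCK_AI_STATUS is an invented knob for this sketch):

```python
import os

def use_mock_ai() -> bool:
    """Feature flag -- here just an env var for illustration."""
    return os.environ.get("USE_MOCK_AI") == "1"

def mock_ai_request(topic: str) -> tuple[int, str]:
    """Return whatever state the current test needs, eg a forced 500."""
    forced_status = os.environ.get("MOCK_AI_STATUS", "200")
    if forced_status == "500":
        return 500, "Internal Server Error"
    return 200, f'{{"title": "{topic}", "content": "mocked content"}}'

def ai_request(topic: str) -> tuple[int, str]:
    """Route to the mock on test environments, the real AI otherwise."""
    if use_mock_ai():
        return mock_ai_request(topic)
    raise NotImplementedError("real AI client goes here")
```

With the flag on, an e2e test can force a 500 and then check what the rest of the system does with it.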
Testing your System
OK, enough about testing the AI; let's talk about what tests need to exist on your side. This is where it turns into your everyday testing of a third-party integration. Generally with integrations, you ask yourself: which side do I think will be reliable? If you're integrating with Google Maps, then you probably don't need to check their responses a lot, since they are not going to break. However, an AI integration is much less reliable, so you will need monitoring and alerts. For example:
- Monitoring the number or percentage of responses which come back in unacceptable formats. You may find it is something like 0.1%, and then one day it jumps to 5%, which should trigger actions on your side to see what's gone wrong
- Periodically sending a set of data which you know it can handle, and checking that the responses are still ok.
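The first of those can be as simple as a rate calculation plus a threshold. A sketch, where the 1% threshold is an arbitrary number I made up -- you'd tune it against your own baseline:

```python
def bad_format_rate(total: int, bad: int) -> float:
    """Percentage of responses that failed format checks."""
    return (bad / total) * 100 if total else 0.0

# Hypothetical alert threshold -- tune to your observed baseline (eg 0.1%).
ALERT_THRESHOLD_PCT = 1.0

def should_alert(total: int, bad: int) -> bool:
    """True when the bad-format rate has drifted above the threshold."""
    return bad_format_rate(total, bad) > ALERT_THRESHOLD_PCT
```

So a normal day at 0.1% stays quiet, and the 5% day pages someone.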
You can test the handling on your side with or without a mock, the same as you would any other product. Give it a topic, check that it creates the page content as you expect. I'm not going to go any deeper here, since this is more in the realm of day-to-day functional testing. Although, if you are not given the opportunity to be involved early in the process as I assume here, then you will need to test all the stuff above at this point, eg what happens when you get errors from the AI, and how it handles different types of topics. I hope that you will be given the opportunity to get involved from start to end of this project though, because working with AI is very interesting.
I hope some or all of that makes sense. I haven't finished putting my ideas together for the blog post yet, so hopefully I will write something more sensible later xD