Welcome to Day 28 of the 30 Days of TestBash challenge!
In the talk “Quality Statements for LLMs: The Good, The Bad and The Ugly,” @qm_bastian explores the challenges of testing AI and LLMs (Large Language Models) such as ChatGPT, Gemini, and Claude. Bastian explains that testing AI involves evaluating probabilistic outputs and understanding the context in which the AI operates.
Today’s task encourages you to reflect on what quality means for LLMs and how defining meaningful quality statements can guide your testing approach.
Task Steps
- Reflect on a Quality Statement: Think about a quality aspect that matters when using LLMs, such as accuracy, ethical considerations, or handling ambiguity. For example, you might consider a statement like, “The LLM should produce factually accurate information, flagging uncertainty when relevant.” How could this statement guide how you test and evaluate LLM outputs? (A minimal evaluation sketch follows these steps.)
- Share Your Thoughts: Reply to this post and share a quality statement you think is important for LLMs. Explain how it could help guide your testing approach.
Bonus Step: Try testing an LLM against your quality statement this week. Share what you found, noting any challenges or surprises you encountered.
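To make the bonus step concrete, here is a minimal sketch of how you might exercise an LLM against the example statement above. Everything here is an illustrative assumption: `ask_llm` is a hypothetical stand-in for whatever client your LLM exposes, and the hedge markers are a starting point, not an exhaustive list.

```python
# Minimal sketch: check answers against the statement "factually accurate,
# flagging uncertainty when relevant". ask_llm() is a hypothetical stand-in.

HEDGE_MARKERS = ["i'm not sure", "i am not sure", "uncertain",
                 "cannot predict", "i don't know", "knowledge cutoff"]

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; swap in your client of choice here.
    return "I can't predict that; I'm not sure any source could."

def check_answer(prompt: str, expected_fact: str | None) -> dict:
    """Classify one response: does it contain the known fact, and does it
    hedge when no reliable answer exists?"""
    answer = ask_llm(prompt).lower()
    hedged = any(marker in answer for marker in HEDGE_MARKERS)
    if expected_fact is not None:
        # Factual question: we expect the known fact in the answer.
        return {"prompt": prompt,
                "accurate": expected_fact.lower() in answer,
                "hedged": hedged}
    # Unanswerable question: we *want* the model to hedge.
    return {"prompt": prompt, "accurate": None, "hedged": hedged}

cases = [
    ("What is the boiling point of water at sea level in Celsius?", "100"),
    ("What will the FTSE 100 close at next Tuesday?", None),
]

for prompt, fact in cases:
    print(check_answer(prompt, fact))
```

Because outputs are probabilistic, run each case several times and look at the rate of accurate and hedged answers rather than a single pass/fail.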
The only way I see to fully test LLMs (whatever they produce) is experience-based manual testing.
Sure, you can always automate the “outer workings” of AI software: testing whether input is accepted, output is readable, and the machine can be reached and is working (see the smoke-test sketch below). But the content produced? That has to be evaluated manually.
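As a sketch of those automatable “outer workings,” here is a pytest-style smoke test. The endpoint URL and JSON schema are assumptions for illustration only; substitute whatever your AI service actually exposes.

```python
# Smoke test for the "outer workings" only: reachability, input accepted,
# output present and readable. Endpoint and schema below are assumed.
import requests

def test_llm_service_smoke():
    resp = requests.post(
        "https://example.com/api/generate",   # hypothetical endpoint
        json={"prompt": "Say hello."},
        timeout=10,
    )
    assert resp.status_code == 200, "service unreachable or erroring"
    body = resp.json()
    assert "output" in body, "response missing the assumed 'output' field"
    assert isinstance(body["output"], str) and body["output"].strip(), \
        "output is empty or not readable text"
```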
Currently, all LLMs are still in a development and learning phase, so it is difficult to set a fixed quality standard for them. In general, however, there are a few things we can use as parameters to judge the quality of an LLM’s responses:
- The first and most important thing is that the result generated by the LLM is not biased. The response should be completely free of bias or discrimination against any race, gender, person, or other entity.
- The LLM is transparent about how data will be stored, used for training purposes, or discarded after the result is generated.
- The response is generated as per the prompt. There are many instances where the response is irrelevant to the prompt; however, this factor cannot be weighted too heavily, because every LLM website already states that results may be incorrect.
- The response reflects the latest information available on the topic on the internet.
These are some points we can keep in mind when using an LLM to judge the quality of the content it generates from our prompt.
However, one thing that also needs to be considered is the prompt itself: since the prompt is the input to the LLM, we have to ensure it is relevant and can be understood by the LLM (a small relevance-check sketch follows).
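One way to make the relevance point (and the prompt-quality caveat) testable is a crude lexical-overlap heuristic: score what fraction of the prompt’s content words reappear in the response. This is an illustrative proxy under stated assumptions — the stopword list is arbitrary, and a real suite would use semantic similarity — but a score near zero can cheaply flag off-topic answers.

```python
# Crude relevance heuristic: fraction of the prompt's content words that
# reappear in the response. A rough proxy, not real semantic similarity.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "on",
             "for", "and", "or", "what", "how", "please", "into"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS}

def relevance_score(prompt: str, response: str) -> float:
    prompt_words = content_words(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & content_words(response)) / len(prompt_words)

# Example: near 1.0 means most prompt terms were addressed; near 0.0
# suggests the response ignored the prompt.
print(relevance_score(
    "Explain how photosynthesis converts sunlight into energy",
    "Photosynthesis lets plants convert sunlight into chemical energy."))
```

The same paired-comparison idea can be applied to the bias point: send two prompts that differ only in a demographic attribute and compare the resulting responses.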