Welcome to Day 28 of the 30 Days of TestBash challenge!
In the talk “Quality Statements for LLMs: The Good, The Bad and The Ugly,” @qm_bastian explores the challenges of testing AI and LLMs (Large Language Models) such as ChatGPT, Gemini, and Claude. Bastian explains that testing AI involves evaluating probabilistic outputs and understanding the context in which the AI operates.
Today’s task encourages you to reflect on what quality means for LLMs and how defining meaningful quality statements can guide your testing approach.
Task Steps
- Reflect on a Quality Statement: Think about a quality aspect that matters when using LLMs, such as accuracy, ethical considerations, or handling ambiguity. For example, you might consider a statement like, “The LLM should produce factually accurate information, flagging uncertainty when relevant.” How could this statement guide how you test and evaluate LLM outputs? (A minimal evaluation sketch follows these steps.)
- Share Your Thoughts: Reply to this post and share a quality statement you think is important for LLMs. Explain how it could help guide your testing approach.
Bonus Step: Try testing an LLM against your quality statement this week. Share what you found, noting any challenges or surprises you encountered.
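To make the bonus step concrete, here is a minimal sketch of how you might exercise an LLM against the example statement above. Everything here is an illustrative assumption: `ask_llm` is a hypothetical stand-in for whatever client your LLM exposes, and the hedge markers are a starting point, not an exhaustive list.

```python
# Minimal sketch: check answers against the statement "factually accurate,
# flagging uncertainty when relevant". ask_llm() is a hypothetical stand-in.

HEDGE_MARKERS = ["i'm not sure", "i am not sure", "uncertain",
                 "cannot predict", "i don't know", "knowledge cutoff"]

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; swap in your client of choice here.
    return "I can't predict that; I'm not sure any source could."

def check_answer(prompt: str, expected_fact: str | None) -> dict:
    """Classify one response: does it contain the known fact, and does it
    hedge when no reliable answer exists?"""
    answer = ask_llm(prompt).lower()
    hedged = any(marker in answer for marker in HEDGE_MARKERS)
    if expected_fact is not None:
        # Factual question: we expect the known fact in the answer.
        return {"prompt": prompt,
                "accurate": expected_fact.lower() in answer,
                "hedged": hedged}
    # Unanswerable question: we *want* the model to hedge.
    return {"prompt": prompt, "accurate": None, "hedged": hedged}

cases = [
    ("What is the boiling point of water at sea level in Celsius?", "100"),
    ("What will the FTSE 100 close at next Tuesday?", None),
]

for prompt, fact in cases:
    print(check_answer(prompt, fact))
```

Because outputs are probabilistic, run each case several times and look at the rate of accurate and hedged answers rather than a single pass/fail.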
The only way I see to fully test LLMs (whatever they produce) is experience-based manual testing.
Sure, you can always automate the “outer workings” of AI software: testing whether input is accepted, output is readable, and the machine can be reached and is working (see the smoke-test sketch below). But the content produced? That has to be evaluated manually.
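As a sketch of those automatable “outer workings,” here is a pytest-style smoke test. The endpoint URL and JSON schema are assumptions for illustration only; substitute whatever your AI service actually exposes.

```python
# Smoke test for the "outer workings" only: reachability, input accepted,
# output present and readable. Endpoint and schema below are assumed.
import requests

def test_llm_service_smoke():
    resp = requests.post(
        "https://example.com/api/generate",   # hypothetical endpoint
        json={"prompt": "Say hello."},
        timeout=10,
    )
    assert resp.status_code == 200, "service unreachable or erroring"
    body = resp.json()
    assert "output" in body, "response missing the assumed 'output' field"
    assert isinstance(body["output"], str) and body["output"].strip(), \
        "output is empty or not readable text"
```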
Currently, all LLMs are still in a development and learning phase, so it is difficult to set a fixed quality standard for them. In general, however, there are a few things we can use as parameters to judge the quality of an LLM’s responses:
- The first and most important thing is that the result generated by the LLM is not biased. The response should be completely free of bias or discrimination against any race, gender, person, or other entity.
- The LLM is transparent about how data will be stored, used for training purposes, or discarded after the result is generated.
- The response is generated as per the prompt. There are many instances where the response is irrelevant to the prompt; however, this factor cannot be weighted too heavily, because every LLM website already states that results may be incorrect.
- The response reflects the latest information available on the topic on the internet.
These are some points we can keep in mind when using an LLM to judge the quality of the content it generates from our prompt.
However, one thing that also needs to be considered is the prompt itself: since the prompt is the input to the LLM, we have to ensure it is relevant and can be understood by the LLM (a small relevance-check sketch follows).
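One way to make the relevance point (and the prompt-quality caveat) testable is a crude lexical-overlap heuristic: score what fraction of the prompt’s content words reappear in the response. This is an illustrative proxy under stated assumptions — the stopword list is arbitrary, and a real suite would use semantic similarity — but a score near zero can cheaply flag off-topic answers.

```python
# Crude relevance heuristic: fraction of the prompt's content words that
# reappear in the response. A rough proxy, not real semantic similarity.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "on",
             "for", "and", "or", "what", "how", "please", "into"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS}

def relevance_score(prompt: str, response: str) -> float:
    prompt_words = content_words(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & content_words(response)) / len(prompt_words)

# Example: near 1.0 means most prompt terms were addressed; near 0.0
# suggests the response ignored the prompt.
print(relevance_score(
    "Explain how photosynthesis converts sunlight into energy",
    "Photosynthesis lets plants convert sunlight into chemical energy."))
```

The same paired-comparison idea can be applied to the bias point: send two prompts that differ only in a demographic attribute and compare the resulting responses.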