In our latest read, “Testing The Quality Of ChatGPT Responses: A Report From The Field,” @paulmaxwellwalters takes us through a hands-on exploration of assessing AI grading systems. This article gives a practical perspective on evaluating Large Language Models (LLMs) like GPT-3.5 and GPT-4, focusing on their ability to grade student assignments accurately and effectively.
The piece shares insights into setting up tests, measuring accuracy and relevance, and assessing the quality of the feedback produced. It’s a must-read for anyone looking to understand how to gauge LLM performance in real-world applications.
Have you got any thoughts on how this could apply to your work with evaluating AI responses? Perhaps you’ve got your own strategies and experiences to share. The more insights shared, the better; you can help shape others’ approaches to LLM evaluation.
Like anything, we need to separate the response into its component parts: the “generative” part needs separating from the data-aggregation/generalization part, and the truthiness component becomes easier to test for at that point. How you do that separation is a job for the implementation to expose as metadata. Basically, don’t test the entire thing; break it down.
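To make that concrete, here is a minimal sketch of what testing only the verifiable component might look like, assuming the grading pipeline can be made to emit structured metadata alongside its prose feedback. The JSON layout, field names, and example facts below are hypothetical, not something the article’s setup actually exposes:

```python
import json

# Hypothetical example of a graded response split into component parts:
# free-form "generative" feedback, plus the extracted facts and score
# that can be checked deterministically.
raw_response = json.dumps({
    "feedback": "Good structure, but the treaty date cited in paragraph 2 is wrong.",
    "extracted_facts": [
        {"question": "Year of the Treaty of Versailles",
         "model_value": "1920", "ground_truth": "1919"},
    ],
    "score": 7,
})

def test_verifiable_parts(response_json: str, rubric_max: int = 10) -> list[str]:
    """Check only the verifiable parts; leave the prose feedback to human review."""
    data = json.loads(response_json)
    failures = []
    # The score must fall inside the rubric range.
    if not 0 <= data["score"] <= rubric_max:
        failures.append(f"score {data['score']} outside 0-{rubric_max}")
    # Each extracted fact carries its own ground truth to compare against.
    for fact in data["extracted_facts"]:
        if fact["model_value"] != fact["ground_truth"]:
            failures.append(
                f"{fact['question']}: got {fact['model_value']}, "
                f"expected {fact['ground_truth']}"
            )
    return failures

print(test_verifiable_parts(raw_response))
```

The generative feedback still needs human or rubric-based review, but the factual and numeric components can be asserted on automatically once they are exposed this way.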
Hi Paul, this is very interesting research that you have conducted on the accuracy of LLMs. I fully agree with your conclusion; we came to pretty much the same one - these models are not (yet) able to provide correct answers with “zero-shot” learning. It would be interesting if you tried “few-shot” learning with the LLM to see whether model performance improves. According to our research, accuracy should go up. Look at this article for more ideas - Zero-Shot Learning vs. Few-Shot Learning vs. Fine-Tuning: A technical walkthrough using OpenAI's APIs & models
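For readers unfamiliar with the term: few-shot prompting just means prepending a handful of worked examples to the prompt before the real task. A minimal sketch using the OpenAI chat completions API is below; the rubric, model choice, and grading examples are made up for illustration and are not taken from Paul's setup or the linked article:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_messages = [
    {"role": "system",
     "content": "You grade short answers from 0-10 and give one sentence of feedback."},
    # Worked example 1: a strong answer and the grade it should receive.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: The gravitational pull of the Moon and Sun."},
    {"role": "assistant",
     "content": "Score: 9/10. Correct mechanism; could mention the Earth's rotation."},
    # Worked example 2: a weak answer and the grade it should receive.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: Wind blowing across the ocean."},
    {"role": "assistant",
     "content": "Score: 2/10. Wind drives waves, not tides; tides come from gravitational pull."},
    # The actual answer to grade, asked in the same format as the examples.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: The Moon's gravity pulling on the oceans."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=few_shot_messages,
)
print(response.choices[0].message.content)
```

Running the same grading tests with and without the worked examples would show whether few-shot prompting closes the accuracy gap the article reports.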