How Do You Evaluate the Quality of LLM Responses?

In our latest read, “Testing The Quality Of ChatGPT Responses: A Report From The Field,” @paulmaxwellwalters takes us through a hands-on exploration of assessing AI grading systems. This article gives a practical perspective on evaluating Large Language Models (LLMs) like GPT-3.5 and GPT-4, focusing on their ability to grade student assignments accurately and effectively.

The piece shares insights into setting up the tests, measuring accuracy and relevance, and assessing the quality of the feedback provided. It’s a must-read for anyone looking to understand how to gauge LLM performance in real-world applications.
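If you want to try something similar yourself, here is a minimal sketch (my own illustration, not from the article) of one way to score an LLM grader against human reference grades, assuming both sets of marks are already on the same scale:

```python
# Minimal sketch: comparing LLM-assigned grades against human reference grades.
# All grade values below are made up purely for illustration.

def evaluate_grader(llm_grades, human_grades):
    """Return exact-match accuracy and mean absolute error between two grade lists."""
    if len(llm_grades) != len(human_grades):
        raise ValueError("Grade lists must be the same length")
    matches = sum(1 for l, h in zip(llm_grades, human_grades) if l == h)
    mae = sum(abs(l - h) for l, h in zip(llm_grades, human_grades)) / len(human_grades)
    return matches / len(human_grades), mae

# Hypothetical marks out of 10 for five student assignments.
human = [7, 9, 4, 6, 8]
llm   = [7, 8, 4, 7, 8]

accuracy, mae = evaluate_grader(llm, human)
print(f"Exact-match accuracy: {accuracy:.0%}, mean absolute error: {mae:.1f}")
```

Exact-match accuracy is a blunt instrument on its own, so pairing it with an error measure like MAE gives a better feel for how far off the model is when it disagrees with the human grader.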

Have you got any thoughts on how this could apply to your work with evaluating AI responses? Perhaps you’ve got your own strategies and experiences to share. The more insights shared, the better; you can help shape others’ approaches to LLM evaluation.

5 Likes

Oooo, this is something I have been waiting so long to get my hands on! Thanks for the link; it interests me immensely how they are testing AI :smile:

On exactly that topic, I have a short, funny anecdote that I wrote up here:

TLDR: most people using AI commercially have no clue how to test it :grimacing: :laughing:

4 Likes

Thanks for the kind words! If you have any questions or feedback, I would love to hear them!

Like anything, we need to separate the response into its component parts: the “generative” part needs separating from the data-aggregation/generalisation part, and the truthiness component becomes easier to test for at that point. How you do that separating is a job for the implementation to expose as metadata. Basically, don’t test the entire thing; break it down?
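To make that concrete, here is a rough sketch of what I mean, assuming the implementation exposes the facts it aggregated as metadata alongside the generated prose. The field names (`generated_text`, `extracted_facts`) are placeholders I made up; your implementation would expose its own structure.

```python
# Rough sketch: test the factual/aggregation component separately from the
# generative wording. The metadata fields here are hypothetical.

def check_facts(response, reference_facts):
    """Compare the facts the model claims to have used against a known-good set."""
    claimed = set(response["extracted_facts"])
    reference = set(reference_facts)
    true_positives = claimed & reference
    hallucinated = claimed - reference   # facts asserted by the model but not verifiable
    missed = reference - claimed         # facts we expected but the model omitted
    precision = len(true_positives) / len(claimed) if claimed else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall, hallucinated, missed

# Hypothetical response metadata and reference facts.
response = {
    "generated_text": "The essay argues X, supported by Y...",
    "extracted_facts": ["essay cites three sources", "word count is 1200"],
}
reference = ["essay cites three sources", "word count is 1500"]

precision, recall, hallucinated, missed = check_facts(response, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("hallucinated:", hallucinated)
print("missed:", missed)
```

The generative wording can then be judged on its own terms (tone, readability, relevance) without the factual check getting tangled up in it.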