In our latest read, “Testing The Quality Of ChatGPT Responses: A Report From The Field,” @paulmaxwellwalters takes us through a hands-on exploration of assessing AI grading systems. This article gives a practical perspective on evaluating Large Language Models (LLMs) like GPT-3.5 and GPT-4, focusing on their ability to grade student assignments accurately and effectively.
The piece shares insights into setting up tests, measuring accuracy and relevance, and assessing the quality of the feedback produced. It’s a must-read for anyone looking to understand how to gauge LLM performance in real-world applications.
Have you got any thoughts on how this could apply to your work with evaluating AI responses? Perhaps you’ve got your own strategies and experiences to share. The more insights shared, the better; you can help shape others’ approaches to LLM evaluation.
Like anything, we need to separate the response into its component parts: the “generative” part needs separating from the data-aggregation/generalization part, and the truthiness component becomes easier to test for at that point. How you do that separation is a job for the implementation to expose as metadata. Basically, don’t test the entire thing; break it down.
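To make that concrete, here is a minimal sketch of what testing only the verifiable component might look like, assuming the grading pipeline can be made to emit structured metadata alongside its prose feedback. The JSON layout, field names, and example facts below are hypothetical, not something the article’s setup actually exposes:

```python
import json

# Hypothetical example of a graded response split into component parts:
# free-form "generative" feedback, plus the extracted facts and score
# that can be checked deterministically.
raw_response = json.dumps({
    "feedback": "Good structure, but the treaty date cited in paragraph 2 is wrong.",
    "extracted_facts": [
        {"question": "Year of the Treaty of Versailles",
         "model_value": "1920", "ground_truth": "1919"},
    ],
    "score": 7,
})

def test_verifiable_parts(response_json: str, rubric_max: int = 10) -> list[str]:
    """Check only the verifiable parts; leave the prose feedback to human review."""
    data = json.loads(response_json)
    failures = []
    # The score must fall inside the rubric range.
    if not 0 <= data["score"] <= rubric_max:
        failures.append(f"score {data['score']} outside 0-{rubric_max}")
    # Each extracted fact carries its own ground truth to compare against.
    for fact in data["extracted_facts"]:
        if fact["model_value"] != fact["ground_truth"]:
            failures.append(
                f"{fact['question']}: got {fact['model_value']}, "
                f"expected {fact['ground_truth']}"
            )
    return failures

print(test_verifiable_parts(raw_response))
```

The generative feedback still needs human or rubric-based review, but the factual and numeric components can be asserted on automatically once they are exposed this way.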
Hi Paul, this is very interesting research that you have conducted on the accuracy of LLMs. I fully agree with your conclusion; we came to pretty much the same one - these models are not (yet) able to provide correct answers with “zero-shot” learning. It would be interesting if you tried “few-shot” learning with the LLM to see whether model performance improves. According to our research, accuracy should go up. Look at this article for more ideas - Zero-Shot Learning vs. Few-Shot Learning vs. Fine-Tuning: A technical walkthrough using OpenAI's APIs & models
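For readers unfamiliar with the term: few-shot prompting just means prepending a handful of worked examples to the prompt before the real task. A minimal sketch using the OpenAI chat completions API is below; the rubric, model choice, and grading examples are made up for illustration and are not taken from Paul's setup or the linked article:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_messages = [
    {"role": "system",
     "content": "You grade short answers from 0-10 and give one sentence of feedback."},
    # Worked example 1: a strong answer and the grade it should receive.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: The gravitational pull of the Moon and Sun."},
    {"role": "assistant",
     "content": "Score: 9/10. Correct mechanism; could mention the Earth's rotation."},
    # Worked example 2: a weak answer and the grade it should receive.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: Wind blowing across the ocean."},
    {"role": "assistant",
     "content": "Score: 2/10. Wind drives waves, not tides; tides come from gravitational pull."},
    # The actual answer to grade, asked in the same format as the examples.
    {"role": "user",
     "content": "Question: What causes tides?\nAnswer: The Moon's gravity pulling on the oceans."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=few_shot_messages,
)
print(response.choices[0].message.content)
```

Running the same grading tests with and without the worked examples would show whether few-shot prompting closes the accuracy gap the article reports.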