I enjoyed this article:
LLM-as-a-Judge: A Practical Guide
I like how clearly it lays out how you might use one LLM to evaluate another LLM’s work.
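The core idea boils down to something like the sketch below: one model produces an answer, and a second call grades it against a rubric. This is just a minimal illustration, not the article's exact setup; it assumes the OpenAI Python SDK, and the model name, rubric, and example question are placeholders I picked.

```python
# Minimal LLM-as-a-judge sketch: a second LLM call grades an answer
# against a rubric. Model name and rubric are illustrative only.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator.
Rate the ANSWER to the QUESTION on a 1-5 scale for factual accuracy
and relevance. Reply with a single integer only.

QUESTION: {question}
ANSWER: {answer}"""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",        # judge model; any capable model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,              # keep the grading as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

score = judge(
    "What year did Apollo 11 land on the Moon?",
    "Apollo 11 landed on the Moon in 1969.",
)
print(score)  # e.g. 5
```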
It also mentions some tools:
OpenAI Evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
DeepEval: A simple-to-use framework for evaluating and testing large language model systems (e.g., RAG pipelines, chatbots, AI agents). It is similar to Pytest but specialized for unit testing LLM outputs (see the sketch after this list).
TruLens: Systematically evaluate and track LLM experiments. Core functionality includes Feedback Functions, The RAG Triad, and Honest, Harmless and Helpful Evals.
Promptfoo: A developer-friendly local tool for testing LLM applications. Supports testing of prompts, agents, and RAG pipelines, plus red teaming, pentesting, and vulnerability scanning for LLMs.
LangSmith: Evaluation utilities provided by LangChain, a popular framework for building LLM applications. Supports LLM-as-a-judge evaluators for both offline and online evaluation.
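To give a flavour of how one of these slots into a test suite, here is roughly what a DeepEval unit test looks like, based on my reading of its docs. The metric choice, threshold, and example strings are just illustrative, and in a real test the output would come from your own pipeline.

```python
# Rough sketch of a DeepEval unit test (typically run via `deepeval test run <file>.py`).
# Metric, threshold, and strings are illustrative only.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is your return policy?",
        # In practice this would be the output of your RAG pipeline / chatbot.
        actual_output="You can return any item within 30 days for a full refund.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)  # uses an LLM judge under the hood
    assert_test(test_case, [metric])
```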
Have you used any of them before? In what context? How did they help? What problems/issues did you encounter?