Have you tried any of these LLM-as-a-Judge tools?

I enjoyed this article:

LLM-as-a-Judge: A Practical Guide

I like how clearly it lays out how you might use one LLM to evaluate another LLM’s work.
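The core pattern is easy to sketch. Here’s a minimal illustration in Python, assuming the OpenAI SDK; the judge prompt, the 1–5 rubric, and the model names are just my own placeholder choices, not anything from the article:

```python
# Minimal LLM-as-a-Judge sketch: one model answers, a second model grades it.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict grader. Score the ANSWER to the QUESTION
on a 1-5 scale for factual accuracy and relevance. Reply with only the number.

QUESTION: {question}
ANSWER: {answer}"""


def generate_answer(question: str) -> str:
    """The model under test produces an answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def judge_answer(question: str, answer: str) -> int:
    """A second model acts as the judge and returns a 1-5 score."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(resp.choices[0].message.content.strip())


if __name__ == "__main__":
    q = "What does RAG stand for in the context of LLM applications?"
    a = generate_answer(q)
    print(q, a, f"Judge score: {judge_answer(q, a)}", sep="\n")
```

In practice you’d wrap this in a loop over a test set and aggregate the scores, which is essentially what the tools below automate.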

It also mentions some tools:

OpenAI Evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

DeepEval: A simple-to-use LLM evaluation framework for evaluating and testing large language model systems (e.g., RAG pipelines, chatbots, AI agents). It is similar to Pytest but specialized for unit testing LLM outputs (see the sketch after this list).

TruLens: Systematically evaluate and track LLM experiments. Core functionality includes Feedback Functions, The RAG Triad, and Honest, Harmless and Helpful Evals.

Promptfoo: A developer-friendly local tool for testing LLM applications. It supports testing of prompts, agents, and RAG pipelines, along with red teaming, pentesting, and vulnerability scanning for LLMs.

LangSmith: Evaluation utilities provided by LangChain, a popular framework for building LLM applications. Supports LLM-as-a-judge evaluators for both offline and online evaluation.
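To make the DeepEval entry above more concrete, here’s roughly what a pytest-style test looks like, based on its quickstart as I remember it. Metric and class names may have shifted between versions, so treat this as a sketch rather than the definitive API:

```python
# Hypothetical DeepEval-style test; run with `deepeval test run test_chatbot.py`.
# Names follow DeepEval's quickstart as I recall it; verify against current docs.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_policy_answer():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # In a real test this would come from your chatbot / RAG pipeline.
        actual_output="You have 30 days to get a full refund at no extra cost.",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost."
        ],
    )
    # An LLM judge scores how relevant the output is to the input,
    # and the test fails if the score drops below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```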

Have you used any of them before? In what context? How did they help? What problems/issues did you encounter?

2 Likes

This one is also pretty fun for making lists and weighing pros and cons to “be a consultant”! 🙂
It’s not much of an LLM tool, but it could be used for some of the interesting parts.

1 Like

Not sure if you are aware, but I wrote the Council of LLMs model.

It’s along the same lines, keeping real ownership and power with the human.

1 Like