# LLM-as-Judge Evaluation
Use one LLM to evaluate the outputs of another, enabling scalable automated quality assessment for AI systems.
## Overview
LLM-as-Judge uses a (typically stronger) language model to evaluate the outputs of AI systems. Instead of relying solely on human evaluation (expensive and slow) or surface-overlap metrics like BLEU and ROUGE (often unreliable for open-ended text), you give an LLM a well-designed rubric and have it grade responses on dimensions like correctness, helpfulness, safety, and style.
## When to Use
- Scaling evaluation beyond what human reviewers can handle
- Evaluating open-ended outputs where exact-match metrics fail
- Building CI/CD pipelines for LLM applications
- A/B testing different prompts, models, or RAG configurations
- Quick iteration on prompt engineering with automated feedback
## Architecture
```mermaid
flowchart LR
    I[Input/Question] --> S[System Under Test]
    S --> O[Generated Output]
    I --> J[Judge LLM]
    O --> J
    R[Reference Answer<br>optional] --> J
    RB[Rubric/Criteria] --> J
    J --> SC[Score + Reasoning]
    SC --> DB[(Results DB)]
    DB --> D[Dashboard/Reports]
```
## Judging Approaches
| Approach | Description | Best For |
|---|---|---|
| Pointwise | Score a single output on a scale | General quality assessment |
| Pairwise | Compare two outputs, pick the better one | A/B testing models/prompts |
| Reference-based | Compare output to a gold standard | Factual accuracy |
| Multi-criteria | Score on multiple dimensions separately | Comprehensive evaluation |
## Implementation
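The core loop needs little machinery: a rubric-bearing prompt, one call to the judge model, and a parser for the score. Below is a minimal pointwise sketch; the prompt wording is illustrative, and `call_judge` is a placeholder for whatever client sends a prompt to your judge LLM.

```python
import re
from typing import Callable, Optional

# Illustrative prompt: ask for reasoning first, score last, so the score
# benefits from the judge's own chain of thought (see Gotchas below).
POINTWISE_PROMPT = """You are an impartial judge. Evaluate the response below \
for correctness and helpfulness on a 1-5 scale.
First explain your reasoning, then end with a line of the form: SCORE: <1-5>

Question:
{question}

Response:
{response}"""


def parse_score(judge_output: str) -> Optional[int]:
    """Pull the last 'SCORE: n' line out of the judge's free-form output."""
    matches = re.findall(r"SCORE:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None


def judge_pointwise(question: str, response: str,
                    call_judge: Callable[[str], str]) -> dict:
    """call_judge: any prompt -> completion function for your judge model."""
    raw = call_judge(POINTWISE_PROMPT.format(question=question,
                                             response=response))
    return {"score": parse_score(raw), "reasoning": raw}
```

Keeping the score on a fixed final line makes parsing robust even when the judge's reasoning itself mentions numbers.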
## Gotchas & Best Practices
LLM judges have systematic biases: they prefer longer, more verbose answers and responses that match their own style. Calibrate against human evaluations and use position debiasing for pairwise comparisons.
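A common debiasing tactic is to run each pairwise comparison twice with the answer order swapped and accept only verdicts that survive the swap. A sketch, assuming a `call_judge` hook and an illustrative prompt format:

```python
from typing import Callable

# Illustrative pairwise prompt; real prompts usually add a rubric too.
PAIRWISE_PROMPT = ("Question: {q}\n\nAnswer 1:\n{first}\n\n"
                   "Answer 2:\n{second}\n\n"
                   "Which answer is better? Reply with exactly '1' or '2'.")


def debiased_verdict(q: str, a: str, b: str,
                     call_judge: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'tie'. Position bias shows up as the judge
    picking the same slot regardless of content, which we map to a tie."""
    r1 = call_judge(PAIRWISE_PROMPT.format(q=q, first=a, second=b)).strip()
    r2 = call_judge(PAIRWISE_PROMPT.format(q=q, first=b, second=a)).strip()
    w1 = "A" if r1 == "1" else "B"   # first run: a shown in slot 1
    w2 = "A" if r2 == "2" else "B"   # second run: positions swapped
    return w1 if w1 == w2 else "tie"  # disagreement => position-biased
</antml_block>```

This doubles judge cost, so some pipelines only re-run swapped comparisons on close calls.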
Using the same model to judge its own outputs is unreliable — it tends to rate itself highly. Use a different (ideally stronger) model as the judge.
Vague criteria like “Is this good?” produce inconsistent scores. Define concrete rubrics with specific scoring anchors for each level (1-5).
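Concretely, "anchored" means every score level has an observable description the judge can match against. A hypothetical helpfulness rubric, stored as data and rendered into the judge prompt:

```python
# Hypothetical anchors -- tune the wording for your own task.
HELPFULNESS_RUBRIC = {
    5: "Fully answers the question with accurate, well-organized detail.",
    4: "Answers correctly but omits minor details the user would want.",
    3: "Partially answers; contains gaps or minor inaccuracies.",
    2: "Mostly off-target or contains significant errors.",
    1: "Irrelevant, incorrect, or unsafe.",
}


def render_rubric(name: str, anchors: dict[int, str]) -> str:
    """Render score anchors as prompt text, highest score first."""
    lines = [f"Rubric: {name}"]
    for score in sorted(anchors, reverse=True):
        lines.append(f"{score}: {anchors[score]}")
    return "\n".join(lines)
```

Keeping the rubric as data (rather than baked into a prompt string) makes it easy to version, review, and reuse across pointwise and multi-criteria judges.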
Ask the judge to explain its reasoning before giving a score. This “judge chain-of-thought” produces more accurate and consistent evaluations.
Build a set of human-labeled examples and measure judge agreement (Cohen’s kappa). Use these as few-shot examples to align the judge with human preferences.
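Agreement is cheap to compute once you have matched human and judge labels for the same items. A stdlib-only sketch of Cohen's kappa (libraries such as scikit-learn's `cohen_kappa_score` give the same number):

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa between two raters labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(human) == len(judge) and human, "need matched, non-empty lists"
    n = len(human)
    p_obs = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    p_chance = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    if p_chance == 1:          # degenerate case: both raters constant
        return 1.0
    return (p_obs - p_chance) / (1 - p_chance)
```

As a rough guide, kappa above ~0.6 is usually considered substantial agreement; lower values suggest the rubric or few-shot examples need work before trusting the judge at scale.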
## Variations
- Pointwise Grading — Score single outputs against criteria
- Pairwise Comparison — Pick the winner between two outputs
- Multi-Dimension — Score across multiple independent criteria
- Cascade — Fast/cheap judge first, expensive judge for edge cases
- Constitutional AI — Self-judge against a set of principles
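The cascade variation can be as simple as a score-band check: accept confident cheap-judge scores and escalate only borderline or unparseable ones. The thresholds and judge hooks below are placeholders for your own setup.

```python
from typing import Callable, Optional


def cascade_judge(question: str, response: str,
                  cheap_judge: Callable[[str, str], Optional[int]],
                  strong_judge: Callable[[str, str], int],
                  low: int = 2, high: int = 4) -> int:
    """Run the cheap judge first; escalate to the expensive judge when the
    score is unparseable (None) or borderline (strictly between low/high)."""
    score = cheap_judge(question, response)
    if score is None or low < score < high:
        return strong_judge(question, response)
    return score
```

If most outputs are clearly good or clearly bad, the strong judge runs on only a small fraction of traffic, which is where the cost savings come from.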