evaluation · intermediate

LLM-as-Judge Evaluation

Use one LLM to evaluate the outputs of another, enabling scalable automated quality assessment for AI systems.

evaluation · llm-judge · quality · testing · metrics · grading

Overview

LLM-as-Judge uses a (typically stronger) language model to evaluate the outputs of AI systems. Instead of relying solely on human evaluation (expensive, slow) or simple metrics (BLEU, ROUGE — often unreliable), you use an LLM with a well-designed rubric to grade responses on dimensions like correctness, helpfulness, safety, and style.

When to Use

  • Scaling evaluation beyond what human reviewers can handle
  • Evaluating open-ended outputs where exact-match metrics fail
  • Building CI/CD pipelines for LLM applications
  • A/B testing different prompts, models, or RAG configurations
  • Quick iteration on prompt engineering with automated feedback

Architecture

flowchart LR
    I[Input/Question] --> S[System Under Test]
    S --> O[Generated Output]
    I --> J[Judge LLM]
    O --> J
    R[Reference Answer<br>optional] --> J
    RB[Rubric/Criteria] --> J
    J --> SC[Score + Reasoning]
    SC --> DB[(Results DB)]
    DB --> D[Dashboard/Reports]

Judging Approaches

| Approach | Description | Best For |
| --- | --- | --- |
| Pointwise | Score a single output on a scale | General quality assessment |
| Pairwise | Compare two outputs, pick the better one | A/B testing models/prompts |
| Reference-based | Compare output to a gold standard | Factual accuracy |
| Multi-criteria | Score on multiple dimensions separately | Comprehensive evaluation |
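At the prompt level, the main difference between these protocols is what the judge is asked to emit. A minimal sketch of pointwise vs. pairwise prompt construction (the wording is illustrative, not from any specific library):

```python
def pointwise_prompt(question, answer):
    """Ask the judge to grade one answer on a 1-5 scale."""
    return (
        "Rate the following answer on a 1-5 scale for correctness and helpfulness.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond with 'Score: <1-5>'."
    )

def pairwise_prompt(question, answer_a, answer_b):
    """Ask the judge to pick the better of two candidate answers."""
    return (
        "Compare the two answers below and pick the better one.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Respond with 'Winner: A' or 'Winner: B'."
    )
```

Reference-based and multi-criteria judging reuse the same shape, adding a gold answer or one rubric block per dimension to the prompt.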

Implementation

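As a standalone sketch, a pointwise judge needs three pieces: a rubric-bearing prompt, a call to the judge model, and robust score parsing. Here `call_judge` is a placeholder for any chat-completion wrapper (a function from prompt string to response string); the rubric wording is illustrative:

```python
import re

RUBRIC = """Score the answer from 1 to 5:
5 = fully correct and directly helpful
3 = partially correct or incomplete
1 = wrong or irrelevant
First explain your reasoning, then end with a line 'Score: <n>'."""

def build_judge_prompt(question, answer, reference=None):
    """Assemble the judge prompt; a reference answer is optional."""
    parts = [RUBRIC, f"Question: {question}", f"Answer: {answer}"]
    if reference:
        parts.append(f"Reference answer: {reference}")
    return "\n\n".join(parts)

def parse_score(judge_output):
    """Extract 'Score: n'; return None if the judge went off-script."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

def judge(question, answer, call_judge, reference=None):
    """call_judge: any function str -> str, e.g. a chat-completion API wrapper."""
    output = call_judge(build_judge_prompt(question, answer, reference))
    return parse_score(output), output
```

Returning the raw output alongside the score keeps the judge's reasoning available for audits, and treating an unparseable response as `None` (rather than a default score) keeps off-script judge outputs from silently polluting results.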

Gotchas & Best Practices

🚨 Judge Bias

LLM judges have systematic biases: they prefer longer, more verbose answers, favor outputs written in their own style, and in pairwise setups tend to pick whichever answer appears first. Calibrate against human evaluations and use position-debiasing for pairwise comparisons.
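A common position-debiasing tactic is to run each pairwise comparison twice with the answer order swapped and only count a win when the judge agrees with itself. A sketch, assuming `compare` is any judge function that returns "A" or "B":

```python
def debiased_compare(question, answer_1, answer_2, compare):
    """Run a pairwise judge in both orders to cancel position bias.

    compare(question, first, second) -> "A" (first wins) or "B" (second wins).
    Returns 1 if answer_1 wins in both orders, 2 if answer_2 does,
    and 0 (tie) when the verdict flips with position.
    """
    forward = compare(question, answer_1, answer_2)   # answer_1 in slot A
    backward = compare(question, answer_2, answer_1)  # answer_1 in slot B
    if forward == "A" and backward == "B":
        return 1
    if forward == "B" and backward == "A":
        return 2
    return 0  # judge's verdict depended on position: treat as a tie
```

This doubles the judging cost but turns a systematic bias into explicit ties you can report.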

🚨 Self-Evaluation Blindspot

Using the same model to judge its own outputs is unreliable — it tends to rate itself highly. Use a different (ideally stronger) model as the judge.

⚠️ Rubric Specificity

Vague criteria like “Is this good?” produce inconsistent scores. Define concrete rubrics with specific scoring anchors for each level (1-5).
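Concretely, "scoring anchors" means attaching a behavioral description to every level, not just the endpoints. One way to encode and render such a rubric (the dimension and wording are illustrative):

```python
HELPFULNESS_ANCHORS = {
    5: "Fully answers the question with accurate, actionable detail.",
    4: "Answers correctly but omits minor useful detail.",
    3: "Partially answers; notable gaps or hedging.",
    2: "Mostly misses the question or contains significant errors.",
    1: "Irrelevant, wrong, or refuses without cause.",
}

def render_rubric(dimension, anchors):
    """Turn a {score: description} dict into prompt text for the judge."""
    lines = [f"Rubric for {dimension} (score 1-5):"]
    for score in sorted(anchors, reverse=True):
        lines.append(f"{score}: {anchors[score]}")
    return "\n".join(lines)
```

Keeping anchors as data rather than hard-coded prompt text makes it easy to version rubrics and reuse the same renderer across dimensions in a multi-criteria setup.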

💡 Use CoT for Better Judging

Ask the judge to explain its reasoning before giving a score. This “judge chain-of-thought” produces more accurate and consistent evaluations.

💡 Calibrate with Human Labels

Build a set of human-labeled examples and measure judge agreement (Cohen’s kappa). Use these as few-shot examples to align the judge with human preferences.
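Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance at their base rates. A self-contained computation over aligned human/judge label sequences:

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Chance-corrected agreement between two aligned label sequences."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their base rates
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    return (observed - expected) / (1 - expected)
```

As a rough convention, kappa above about 0.6 is read as substantial agreement; if your judge scores lower than that against the human set, tighten the rubric or add few-shot examples before trusting it at scale.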

Variations

  • Pointwise Grading — Score single outputs against criteria
  • Pairwise Comparison — Pick the winner between two outputs
  • Multi-Dimension — Score across multiple independent criteria
  • Cascade — Fast/cheap judge first, expensive judge for edge cases
  • Constitutional AI — Self-judge against a set of principles
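The cascade variant above can be sketched as confidence-gated escalation: a cheap judge scores first, and only ambiguous mid-range cases are re-judged by the expensive model. `cheap_judge` and `strong_judge` are placeholders for real model calls, and the thresholds are illustrative:

```python
def cascade_judge(item, cheap_judge, strong_judge, low=2, high=4):
    """Escalate to the expensive judge only when the cheap score is ambiguous.

    cheap_judge / strong_judge: item -> score in 1-5.
    Clear failures (<= low) and clear passes (>= high) are accepted as-is.
    """
    score = cheap_judge(item)
    if score <= low or score >= high:
        return score, "cheap"
    return strong_judge(item), "escalated"
```

If most outputs are clearly good or clearly bad, this keeps the expensive judge's usage proportional to the genuinely hard cases.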

Further Reading