# Challenge 6.3: LLM-as-a-Judge
How do you evaluate whether a summary is “good” — when there is no single correct answer? “Paris” as the capital of France can be checked with `includes()`. But “Explain Machine Learning” has a thousand valid answers.
## OVERVIEW

The principle: A second LLM (the “Judge”) receives the output and the expected value and evaluates how factually correct the answer is. The result is a structured score with a rationale.
Without LLM-as-Judge: You would have to read and evaluate every open-ended answer yourself. With 50 test cases, that takes hours. And again with every prompt change. Human evaluation doesn’t scale.
With LLM-as-Judge: You define once WHAT “good” means (a score scale), and the judge LLM evaluates automatically. 50 test cases in seconds instead of hours. Not perfect — but good enough to detect regressions and measure progress.
## WALKTHROUGH

### Layer 1: The Problem with Open-Ended Answers

Consider this test case:

```ts
{
  input: 'Explain what Machine Learning is.',
  expected: 'Machine Learning is a subfield of AI where algorithms learn from data.',
}
```

A valid answer could be:
- “ML is an area of artificial intelligence that uses statistical methods to recognize patterns in data.”
- “Machine Learning enables computers to learn from experience without being explicitly programmed.”
Both are correct, but neither matches the expected value exactly. Levenshtein would give low scores. An LLM can evaluate: “Factually equivalent, just phrased differently.”
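You can verify the Levenshtein claim without a judge. The sketch below implements a plain Levenshtein edit distance and a normalized similarity (the helper names `levenshtein` and `similarity` are ours, not from any library) and compares the expected value against the first paraphrase:

```typescript
// Plain Levenshtein edit distance (dynamic programming)
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalized similarity: 1 = identical, 0 = completely different
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

const expected =
  'Machine Learning is a subfield of AI where algorithms learn from data.';
const paraphrase =
  'ML is an area of artificial intelligence that uses statistical methods to recognize patterns in data.';

console.log(similarity(expected, paraphrase).toFixed(2));
```

The score lands far below 1.0 even though the paraphrase is factually fine: string distance measures characters, not meaning.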
### Layer 2: The Score Scale

The Factuality scorer uses a 5-level scale:
| Grade | Meaning | Score | Description |
|---|---|---|---|
| A | Subset | 0.4 | Answer is a subset of the expert opinion — correct, but incomplete |
| B | Superset | 0.6 | Answer contains everything from the expert opinion plus more — correct and more detailed |
| C | Identical | 1.0 | Answer and expert opinion are factually identical |
| D | Conflict | 0.0 | Answer contradicts the expert opinion — factually wrong |
| E | Irrelevant Diff | 1.0 | Answers differ, but the differences don’t matter from the perspective of factuality |
The scale allows gradations: An incomplete answer (A: 0.4) is better than a wrong one (D: 0.0), but worse than a perfect one (C: 1.0).
### Layer 3: The Factuality Scorer — Step by Step

We use gpt-4o as the judge model — it’s particularly well-suited for nuanced content evaluations. The scorer uses `generateObject` from the AI SDK to get a structured evaluation from the judge LLM:
```ts
import { createScorer } from 'evalite';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const Factuality = createScorer<string, string, string>({
  name: 'Factuality',
  description: 'Evaluates factual correctness of the answer compared to the expert answer.',
  scorer: async ({ input, expected, output }) => {
    // 1. The judge LLM receives input, expected, and output
    const { object } = await generateObject({
      model: openai('gpt-4o'),
      prompt: `You are comparing a submitted answer to an expert answer on a given question.

[BEGIN DATA]
[Question]: ${input}
[Expert]: ${expected}
[Submission]: ${output}
[END DATA]

Compare the factual content of the submitted answer with the expert answer.
Ignore differences in style, grammar, or punctuation.
Select one:
(A) The submission is a subset of the expert answer and is fully consistent with it.
(B) The submission is a superset of the expert answer and is fully consistent with it.
(C) The submission contains all the same details as the expert answer.
(D) There is a disagreement between the submission and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.`,
      schema: z.object({
        answer: z.enum(['A', 'B', 'C', 'D', 'E']), // <- Structured answer
        rationale: z.string(), // <- Rationale
      }),
    });

    // 2. Convert grade to numeric score
    const scores: Record<string, number> = {
      A: 0.4, // Subset — correct, but incomplete
      B: 0.6, // Superset — correct and more detailed
      C: 1.0, // Identical — perfect
      D: 0.0, // Conflict — wrong
      E: 1.0, // Irrelevant diff — differences don't matter
    };

    // 3. Return score + metadata
    return {
      score: scores[object.answer],
      metadata: {
        rationale: object.rationale, // <- Traceability!
        grade: object.answer,
      },
    };
  },
});
```

Three critical points:

- `generateObject` with a Zod schema enforces a structured answer — no free-text parsing needed
- The `rationale` gives you traceability — you can see WHY the judge evaluated the way it did
- The score mapping (A->0.4, B->0.6, etc.) translates the qualitative evaluation into a numeric value
### Layer 4: Using the Factuality Scorer

```ts
import { evalite } from 'evalite';
import { traceAISDKModel } from 'evalite/ai-sdk';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

evalite('Knowledge Check', {
  data: async () => [
    {
      input: 'What is TypeScript?',
      expected: 'TypeScript is a typed superset of JavaScript that compiles to plain JavaScript.',
    },
    {
      input: 'What is the difference between let and const?',
      expected: 'let allows reassignment, const does not. Both are block-scoped.',
    },
  ],
  task: async (input) => {
    const result = await generateText({
      model: traceAISDKModel(openai('gpt-4o-mini')),
      system: 'Answer technical questions concisely in one sentence.',
      prompt: input,
    });
    return result.text;
  },
  scorers: [Factuality],
});
```

In the dashboard you’ll now see the Factuality score AND the judge’s rationale per test case. When a score is low, you read the rationale and understand what went wrong.
### Layer 5: Costs and Trade-offs

LLM-as-Judge is powerful, but not free:
| Aspect | Deterministic | LLM-as-Judge |
|---|---|---|
| Speed | Microseconds | Seconds |
| Cost | 0 | Tokens per evaluation |
| Reproducibility | 100% identical | ~95% consistent |
| Flexibility | Only exact checks | Open-ended evaluations |
Rule of thumb: If you can write the evaluation as an `if` statement, use a deterministic scorer. If not, use LLM-as-Judge.
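The rule of thumb can be made concrete. A check like the one below is an “if statement” scorer in that sense; the function `containsAll` is a hypothetical helper of ours, not part of evalite. Anything this style of check cannot express is a candidate for the judge:

```typescript
// A deterministic "if statement" scorer: 1 if all required keywords
// appear in the output (case-insensitive), else 0.
// `containsAll` is a hypothetical helper, not part of evalite.
function containsAll(output: string, keywords: string[]): number {
  const haystack = output.toLowerCase();
  return keywords.every((k) => haystack.includes(k.toLowerCase())) ? 1 : 0;
}

console.log(containsAll('Paris is the capital of France.', ['Paris'])); // 1
console.log(containsAll('TypeScript adds types to JavaScript.', ['superset'])); // 0
```

It costs nothing, runs in microseconds, and is 100% reproducible — exactly the left column of the table above.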
**Task:** Build a Factuality scorer and test it with correct and incorrect answers.

Create the file `factuality.eval.ts` and run it with `pnpm eval:dev`. This scorer requires an `OPENAI_API_KEY` (see Briefing).
```ts
import { evalite, createScorer } from 'evalite';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// TODO 1: Implement the Factuality scorer with createScorer
// - Use generateObject with the prompt from the walkthrough
// - Schema: { answer: z.enum(['A','B','C','D','E']), rationale: z.string() }
// - Map the grades to scores: A->0.4, B->0.6, C->1.0, D->0.0, E->1.0
// - Return { score, metadata: { rationale, grade } }

// TODO 2: Create an evalite() with the name 'Factuality Test'
// - data: 3 test cases — one correct, one partially correct,
//   one wrong answer (as a simulated task)
// - scorers: [Factuality]

// TODO 3: Check the judge's rationale in the dashboard
// - Does the rationale make sense? Is it comprehensible?
```

Checklist:

- Factuality scorer implemented with `createScorer`
- `generateObject` with Zod schema for structured evaluation
- Score mapping (A-E -> 0-1) implemented
- `metadata` with `rationale` returned
- At least 3 test cases with different quality levels
Show solution
```ts
import { evalite, createScorer } from 'evalite';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const Factuality = createScorer<string, string, string>({
  name: 'Factuality',
  description: 'Evaluates factual correctness with LLM-as-Judge.',
  scorer: async ({ input, expected, output }) => {
    const { object } = await generateObject({
      model: openai('gpt-4o'),
      prompt: `You are comparing a submitted answer to an expert answer on a given question.

[BEGIN DATA]
[Question]: ${input}
[Expert]: ${expected}
[Submission]: ${output}
[END DATA]

Compare the factual content of the submitted answer with the expert answer.
Ignore differences in style, grammar, or punctuation.
Select one:
(A) The submission is a subset of the expert answer and is fully consistent with it.
(B) The submission is a superset of the expert answer and is fully consistent with it.
(C) The submission contains all the same details as the expert answer.
(D) There is a disagreement between the submission and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.`,
      schema: z.object({
        answer: z.enum(['A', 'B', 'C', 'D', 'E']),
        rationale: z.string(),
      }),
    });

    const scores: Record<string, number> = { A: 0.4, B: 0.6, C: 1.0, D: 0.0, E: 1.0 };
    return {
      score: scores[object.answer],
      metadata: { rationale: object.rationale, grade: object.answer },
    };
  },
});

evalite('Factuality Test', {
  data: async () => [
    {
      input: 'What is the capital of France?',
      expected: 'Paris is the capital of France.',
    },
    {
      input: 'What is TypeScript?',
      expected: 'TypeScript is a typed superset of JavaScript that compiles to plain JavaScript.',
    },
    {
      input: 'What is the speed of light?',
      expected: 'The speed of light is approximately 300,000 km/s.',
    },
  ],
  task: async (input) => {
    // Simulated answers with different quality levels
    const answers: Record<string, string> = {
      'What is the capital of France?': 'The capital of France is Paris.', // <- Correct (C)
      'What is TypeScript?': 'TypeScript adds types to JavaScript.', // <- Partial (A)
      'What is the speed of light?': 'The speed of light is 500,000 km/s.', // <- Wrong (D)
    };
    return answers[input] ?? 'I do not know.';
  },
  scorers: [Factuality],
});
```

Explanation: The three test cases cover different quality levels:
- “The capital of France is Paris.” -> Grade C (1.0) — factually identical
- “TypeScript adds types to JavaScript.” -> Grade A (0.4) — correct, but incomplete (missing “compiles to plain JavaScript”)
- “The speed of light is 500,000 km/s.” -> Grade D (0.0) — factually wrong
In the dashboard you’ll see the rationale explaining why the judge decided the way it did.
## COMBINE

Exercise: Combine a deterministic scorer (Challenge 6.2) with the Factuality scorer (Challenge 6.3) in one eval. Test with a real LLM call.

- Create an eval with 5 knowledge questions (e.g., programming concepts)
- Use `generateText` with `traceAISDKModel` as the `task`
- Scorers: `containsKeyword` (from 6.2) AND `Factuality` (from 6.3)
- Compare in the dashboard: Where do both scorers agree? Where don’t they?
Food for thought: Are there cases where containsKeyword gives a 1, but Factuality gives a 0? And vice versa?
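Both directions exist, and the first can be sketched without any LLM call. The snippet below uses a hypothetical keyword scorer in the spirit of Challenge 6.2 (the name `containsKeyword` and both example answers are ours): an answer can contain the right keyword while being factually wrong, so the keyword check returns 1 where a judge would grade D (0.0), and a correct paraphrase can miss the keyword entirely:

```typescript
// Hypothetical keyword scorer in the spirit of Challenge 6.2
function containsKeyword(output: string, keyword: string): number {
  return output.toLowerCase().includes(keyword.toLowerCase()) ? 1 : 0;
}

// Factually wrong, but contains the keyword:
// a judge would grade this D (0.0), the keyword check still returns 1.
const wrongButKeyworded = 'TypeScript is a database developed by Microsoft.';
console.log(containsKeyword(wrongButKeyworded, 'TypeScript')); // 1

// The reverse: a factually acceptable paraphrase without the expected keyword,
// so the keyword check returns 0 while a judge could accept it (grade E).
const correctButParaphrased = 'It adds static types on top of JavaScript.';
console.log(containsKeyword(correctButParaphrased, 'TypeScript')); // 0
```

This is exactly why combining both scorer types in one eval is useful: disagreement between them points you at the interesting test cases.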