
Level 6: Evals — Briefing

Evals are automated tests for LLM applications. Instead of right/wrong, they measure quality on a scale. Evalite is your Vitest for AI.

Skill Tree — Level 6: Evals (current level)
  • Evalite Basics — Set up the TypeScript-native eval framework and write your first eval
  • Deterministic Eval — Fast, cheap scorers without an LLM: string comparison, contains checks, custom scorers
  • LLM-as-a-Judge — An LLM evaluates the output of another LLM — for open-ended answers without a single correct solution
  • Dataset Management — Collect, maintain, and critically evaluate representative test data
  • Langfuse — Production observability: monitor traces, costs, and quality in real time

“Your App Is Only As Good As Its Evals” — Matt Pocock

Without evals, you change a prompt and hope it gets better. You manually read through outputs to check if the answers are still good. One change breaks something else — and you only notice when a user complains.

The concrete problem: LLM outputs are non-deterministic. The same prompt can produce different results. Classic unit tests with expect(result).toBe("...") don’t work. You need a new testing paradigm — scorers that measure quality on a scale instead of binary right/wrong.

  • Level 1: AI SDK Basics — You should be comfortable with generateText, as it serves as your task function in evals
  • Level 5: Context Engineering — System prompts and prompt templates that you’ll iteratively improve with evals
  • OpenAI API Key — This level uses OpenAI models (gpt-4o, gpt-4o-mini) as judge and task models. Evalite and the Autoevals library use OpenAI by default. Create a key at platform.openai.com/api-keys and store it in a .env file:
    OPENAI_API_KEY=sk-...
  • pnpm — Evalite uses pnpm as package manager. If you’ve been using npm so far: npm install -g pnpm. Alternatively, npm install -D works just as well instead of pnpm add -D.
  • Continue working in your project directory from Level 1, or create a new directory for this level.

Skip hint: Already working with Evalite or another eval framework and know the difference between deterministic and LLM-as-Judge scorers? Jump straight to the Boss Fight and build a complete eval pipeline.

Build an eval pipeline for chat titles: A complete evaluation system for a chat title generator. You create a dataset with diverse chat histories, combine deterministic scorers (title length) with LLM-as-Judge (relevance), and track everything with traceAISDKModel. All five building blocks in one pipeline.

Part of AI Learning — free courses from prompt to production. Jan on LinkedIn