Level 6: Evals — Briefing
Evals are automated tests for LLM applications. Instead of right/wrong, they measure quality on a scale. Evalite is your Vitest for AI.
Skill Tree
What You’ll Learn
- Evalite Basics — Set up the TypeScript-native eval framework and write your first eval
- Deterministic Eval — Fast, cheap scorers without an LLM: string comparison, contains checks, custom scorers
- LLM-as-a-Judge — An LLM evaluates the output of another LLM — for open-ended answers without a single correct solution
- Dataset Management — Collect, maintain, and critically evaluate representative test data
- Langfuse — Production observability: monitor traces, costs, and quality in real time
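To make these building blocks concrete, here is a minimal sketch of what an Evalite eval file can look like, using the Levenshtein string-similarity scorer from Autoevals. The file name, the dataset entry, and the stub task are placeholders; in practice the task would call generateText from Level 1.

```typescript
// chat-title.eval.ts — a minimal Evalite eval, run by the Evalite CLI
import { evalite } from "evalite";
import { Levenshtein } from "autoevals";

evalite("Chat Title Generator", {
  // The dataset: inputs plus expected outputs (illustrative entry)
  data: async () => [
    {
      input: "User asks how to center a div with CSS",
      expected: "Centering a div with CSS",
    },
  ],
  // The task under test — a stub here; a real task would call the LLM
  task: async (input) => {
    return `Title for: ${input}`;
  },
  // Deterministic scorer: string similarity on a 0–1 scale, no LLM needed
  scorers: [Levenshtein],
});
```

Because the file is picked up by the Evalite runner rather than executed directly, iterating on the prompt becomes a watch-mode loop instead of manual output reading.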
Why This Matters
“Your App Is Only As Good As Its Evals” — Matt Pocock
Without evals, you change a prompt and hope it gets better. You manually read through outputs to check if the answers are still good. One change breaks something else — and you only notice when a user complains.
The concrete problem: LLM outputs are non-deterministic. The same prompt can produce different results. Classic unit tests with `expect(result).toBe("...")` don’t work. You need a new testing paradigm — scorers that measure quality on a scale instead of binary right/wrong.
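A quick sketch of the idea, with no framework involved: instead of an exact-match assertion, a scorer returns a value between 0 and 1. The keyword-overlap metric and the example strings below are illustrative, not part of the course material.

```typescript
// A deterministic scorer: rather than expect(result).toBe("..."),
// measure quality on a 0–1 scale. Here: what fraction of the
// expected answer's keywords appear in the output.
function keywordOverlap(output: string, expected: string): number {
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const out = tokenize(output);
  const exp = tokenize(expected);
  if (exp.size === 0) return 0;
  let hits = 0;
  for (const word of exp) if (out.has(word)) hits++;
  return hits / exp.size; // 1.0 = every expected keyword is present
}

// Two differently-worded outputs can both score perfectly:
console.log(keywordOverlap("Paris is the capital of France", "capital of France: Paris")); // 1
console.log(keywordOverlap("I don't know", "capital of France: Paris")); // 0
```

An exact-match test would fail the first output despite it being correct; a scaled scorer rewards it.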
Prerequisites
- Level 1: AI SDK Basics — You should be comfortable with `generateText`, as it serves as your task function in evals
- Level 5: Context Engineering — System prompts and prompt templates that you’ll iteratively improve with evals
- OpenAI API Key — This level uses OpenAI models (`gpt-4o`, `gpt-4o-mini`) as judge and task models. Evalite and the Autoevals library use OpenAI by default. Create a key at platform.openai.com/api-keys and store it in a `.env` file: `OPENAI_API_KEY=sk-...`
- pnpm — Evalite uses `pnpm` as package manager. If you’ve been using `npm` so far: `npm install -g pnpm`. Alternatively, `npm install -D` works just as well instead of `pnpm add -D`.
- Continue working in your project directory from Level 1, or create a new directory for this level.
Skip hint: Already working with Evalite or another eval framework and know the difference between deterministic and LLM-as-Judge scorers? Jump straight to the Boss Fight and build a complete eval pipeline.
Challenges
Boss Fight
Build an eval pipeline for chat titles: A complete evaluation system for a chat title generator. You create a dataset with diverse chat histories, combine deterministic scorers (title length) with LLM-as-Judge (relevance), and track everything with `traceAISDKModel`. All five building blocks in one pipeline.
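One way such a pipeline could be wired up is sketched below. This is an assumption-laden outline, not the reference solution: the dataset entry, the 50-character threshold, and the choice of Factuality as the LLM judge are all illustrative, and the course may use a different relevance judge.

```typescript
// chat-title.eval.ts — sketch of the full pipeline
// (assumes OPENAI_API_KEY is set; dataset entry is illustrative)
import { evalite, createScorer } from "evalite";
import { traceAISDKModel } from "evalite/ai-sdk";
import { Factuality } from "autoevals";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Deterministic scorer: titles should stay short (threshold is a choice)
const titleLength = createScorer<string, string>({
  name: "Title Length",
  description: "1 if the title is 50 characters or fewer, else 0",
  scorer: ({ output }) => (output.length <= 50 ? 1 : 0),
});

evalite("Chat Title Pipeline", {
  data: async () => [
    {
      input: "User: How do I center a div?\nAssistant: Use flexbox with justify-content and align-items.",
      expected: "Centering a div with flexbox",
    },
  ],
  task: async (input) => {
    const { text } = await generateText({
      // traceAISDKModel records calls, tokens, and cost in the Evalite UI
      model: traceAISDKModel(openai("gpt-4o-mini")),
      system: "Generate a concise title for this chat. Max 50 characters.",
      prompt: input,
    });
    return text;
  },
  // Deterministic + LLM-as-Judge scorers combined in one eval
  scorers: [titleLength, Factuality],
});
```

The deterministic scorer runs for free on every case, while the judge model is only as expensive as the dataset is large, which is why keeping the dataset representative (the Dataset Management challenge) matters.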