
Level 6: Evals — Briefing

Evals are automated tests for LLM applications. Instead of right/wrong, they measure quality on a scale. Evalite is your Vitest for AI.

Skill Tree — Level 6: Evals (current level)
  • Evalite Basics — Set up the TypeScript-native eval framework and write your first eval
  • Deterministic Eval — Fast, cheap scorers without an LLM: string comparison, contains checks, custom scorers
  • LLM-as-a-Judge — An LLM evaluates the output of another LLM — for open-ended answers without a single correct solution
  • Dataset Management — Collect, maintain, and critically evaluate representative test data
  • Langfuse — Production observability: monitor traces, costs, and quality in real time

“Your App Is Only As Good As Its Evals” — Matt Pocock

Without evals, you change a prompt and hope it gets better. You manually read through outputs to check if the answers are still good. One change breaks something else — and you only notice when a user complains.

The concrete problem: LLM outputs are non-deterministic. The same prompt can produce different results. Classic unit tests with expect(result).toBe("...") don’t work. You need a new testing paradigm — scorers that measure quality on a scale instead of binary right/wrong.

  • Level 1: AI SDK Basics — You should be comfortable with generateText, as it serves as your task function in evals
  • Level 5: Context Engineering — System prompts and prompt templates that you’ll iteratively improve with evals
  • OpenAI API Key — This level uses OpenAI models (gpt-4o, gpt-4o-mini) as judge and task models. Evalite and the Autoevals library use OpenAI by default. Create a key at platform.openai.com/api-keys and store it in a .env file:
    OPENAI_API_KEY=sk-...
  • pnpm — Evalite uses pnpm as package manager. If you’ve been using npm so far: npm install -g pnpm. Alternatively, npm install -D works just as well instead of pnpm add -D.
  • Continue working in your project directory from Level 1, or create a new directory for this level.

Skip hint: Already working with Evalite or another eval framework and know the difference between deterministic and LLM-as-Judge scorers? Jump straight to the Boss Fight and build a complete eval pipeline.

Build an eval pipeline for chat titles: A complete evaluation system for a chat title generator. You create a dataset with diverse chat histories, combine deterministic scorers (title length) with LLM-as-Judge (relevance), and track everything with traceAISDKModel. All five building blocks in one pipeline.

Part of AI Learning — free courses from prompt to production. Jan on LinkedIn