Challenge 6.1: Evalite Basics
How do you test whether your LLM system gives good answers — manually read through them every time? What if you change 50 prompts and want to check whether quality went up or down?
OVERVIEW
evalite() takes three things: data (what goes in and what should come out), task (the function under test), and scorers (how the output is evaluated). Results are stored in a local SQLite database and displayed in a browser dashboard.
Without evals: You change a prompt and hope the answers get better. You manually read through 5 answers, they look fine, so you deploy. Two weeks later you notice: for a certain question the LLM now hallucinates. No idea since when.
With evals: You change a prompt, run evalite watch, and immediately see: score went from 0.82 to 0.91. Or: score dropped from 0.82 to 0.65 — regression caught before any user notices. Measure, compare, iterate.
WALKTHROUGH
Layer 1: Installation
Install Evalite and the Autoevals library as dev dependencies:
```sh
pnpm add -D evalite vitest autoevals ai @ai-sdk/openai zod
```

Then set up scripts in package.json:
```json
{
  "scripts": {
    "eval": "evalite",
    "eval:dev": "evalite watch"
  }
}
```

`evalite watch` starts watch mode — like `vitest --watch`. Whenever a `.eval.ts` file changes, the evals are automatically re-run.
Layer 2: The .eval.ts File Convention
Evalite looks for files ending in `.eval.ts` — analogous to `.test.ts` in Vitest:
```
src/
  my-feature.ts       <- Your code
  my-feature.test.ts  <- Unit tests (Vitest)
  my-feature.eval.ts  <- Evals (Evalite)
```

Each `.eval.ts` file contains one or more `evalite()` calls.
Layer 3: evalite() Basic Structure
The three building blocks — data, task, scorers:
```ts
import { evalite } from 'evalite';
// Levenshtein measures edit distance — how many character changes
// are needed to get from the output to the expected value
import { Levenshtein } from 'autoevals';

evalite('My First Eval', {
  // 1. Test data: what goes in, what should come out?
  data: [
    { input: 'Hello', expected: 'Hello World!' },
  ],
  // 2. The function under test — here still without an LLM
  task: async (input) => {
    return input + ' World!';
  },
  // 3. Evaluation: how close is the output to the expected value?
  scorers: [Levenshtein],
});
```

Flow:
- `data` provides the test cases (input + expected output)
- `task` is executed once per test case with the `input`
- `scorers` compare the actual output with the `expected` value
- Results are stored in SQLite (`node_modules/.evalite`)
- Dashboard shows the scores at http://localhost:3006
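The bundled autoevals scorers are just functions: conceptually, a scorer receives the actual output plus the expected value and returns a score between 0 and 1. A minimal hand-rolled sketch — the argument shape here is illustrative, so check Evalite's docs for the exact scorer signature it expects:

```typescript
// Minimal sketch of a custom scorer: exact match after normalizing
// whitespace and casing. (Hypothetical shape, for intuition only.)
type ScorerArgs = { output: string; expected?: string };

function exactMatch({ output, expected }: ScorerArgs): { score: number } {
  if (expected === undefined) return { score: 0 };
  const norm = (s: string) => s.trim().toLowerCase();
  return { score: norm(output) === norm(expected) ? 1 : 0 };
}
```

Unlike Levenshtein, this scorer is all-or-nothing: ' Paris ' against 'paris' scores 1, while any other deviation scores 0.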
Layer 4: data as an async Function
Instead of a static array, data can also be an async function — useful for loading test cases dynamically:
```ts
evalite('Capital Cities', {
  data: async () => [
    { input: 'What is the capital of France?', expected: 'Paris' },
    { input: 'What is the capital of Germany?', expected: 'Berlin' },
    { input: 'What is the capital of Japan?', expected: 'Tokyo' },
  ],
  task: async (input) => {
    // Your LLM call goes here later
    return input;
  },
  scorers: [Levenshtein],
});
```
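Because data is a function, the test cases don't have to be hard-coded — they can be built programmatically, or loaded from a file or database. A small sketch with illustrative country/capital pairs:

```typescript
// Sketch: generating eval cases from a list of pairs instead of
// hard-coding each object. The pairs are illustrative.
const capitals: Array<[country: string, city: string]> = [
  ['France', 'Paris'],
  ['Germany', 'Berlin'],
  ['Japan', 'Tokyo'],
];

// Produces the same shape as the `data: async () => [...]` form above.
const data = async () =>
  capitals.map(([country, city]) => ({
    input: `What is the capital of ${country}?`,
    expected: city,
  }));
```

This keeps the question template in one place, so rephrasing it later means changing a single string instead of every test case.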
Layer 5: traceAISDKModel — AI SDK Integration
When you use a real LLM call as your task, you wrap the model with traceAISDKModel. This lets Evalite capture all LLM calls (tokens, latency, cost) in the dashboard:
```ts
import { evalite } from 'evalite';
import { traceAISDKModel } from 'evalite/ai-sdk';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { Levenshtein } from 'autoevals';

evalite('Capital Cities', {
  data: async () => [
    { input: 'What is the capital of France?', expected: 'Paris' },
    { input: 'What is the capital of Germany?', expected: 'Berlin' },
  ],
  task: async (input) => {
    const result = await generateText({
      model: traceAISDKModel(openai('gpt-4o-mini')), // <- Tracing!
      system: 'Answer concisely. No periods.',
      prompt: input,
    });
    return result.text;
  },
  scorers: [Levenshtein],
});
```

`traceAISDKModel` is a wrapper that adds Evalite tracing to an AI SDK model. In the dashboard you'll then see not only the score, but also token usage and latency per test case.
Layer 6: Result UI
Start the evals with `pnpm eval:dev` and open http://localhost:3006:
- Overview: all evals with their average score
- Detail view: each test case with input, output, expected, and score
- History: scores over time — see improvements or regressions
- Traces: with `traceAISDKModel` you see the LLM calls with token usage
Task: Set up Evalite and write a simple eval with the Levenshtein scorer.
Create the file hello.eval.ts and run it with pnpm eval:dev. You’ll see the results in the dashboard at http://localhost:3006.
```ts
import { evalite } from 'evalite';
import { Levenshtein } from 'autoevals';

// TODO 1: Create an evalite() with the name 'Greeting Eval'

// TODO 2: Define data with 3 test cases:
// - input: 'Hi'    -> expected: 'Hi! How can I help?'
// - input: 'Hello' -> expected: 'Hello! How can I help?'
// - input: 'Hey'   -> expected: 'Hey! How can I help?'

// TODO 3: Implement a task function that takes the input
// and appends ' How can I help?' (with an exclamation mark after the input)

// TODO 4: Use Levenshtein as the scorer
```

Checklist:
- `evalite` and `Levenshtein` imported
- `data` with 3 test cases defined
- `task` returns the input with the appended text
- `Levenshtein` set as scorer
- `pnpm eval:dev` shows results in the dashboard
Show solution
```ts
import { evalite } from 'evalite';
import { Levenshtein } from 'autoevals';

evalite('Greeting Eval', {
  data: [
    { input: 'Hi', expected: 'Hi! How can I help?' },
    { input: 'Hello', expected: 'Hello! How can I help?' },
    { input: 'Hey', expected: 'Hey! How can I help?' },
  ],
  task: async (input) => {
    return `${input}! How can I help?`;
  },
  scorers: [Levenshtein],
});
```

Expected output: all three test cases should achieve a score of 1.0 — the task function produces exactly the expected output. In the dashboard at localhost:3006 you'll see "Greeting Eval" with an average score of 1.0.
Explanation: The Levenshtein scorer measures the similarity between the actual output and the expected output. A score of 1.0 means identical; a score of 0.0 means completely different.
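To make the scoring concrete, here is a rough sketch of how an edit distance can be normalized into a 0-to-1 score — autoevals' actual implementation may differ in details, this is for intuition only:

```typescript
// Classic single-row Levenshtein DP: counts the minimum number of
// insertions, deletions, and substitutions to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // holds dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Sketch of a normalized score: 1 means identical, 0 means every
// character had to change.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // both empty -> identical
  return 1 - editDistance(output, expected) / maxLen;
}
```

Under this normalization, 'Paris' vs 'Paris, France' needs 8 insertions against a max length of 13, so the score lands around 0.38 even though the answer is arguably correct.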
COMBINE
Exercise: Use generateText from Level 1.3 as the task function in an Evalite eval. Test whether an LLM correctly names capital cities.
- Create an `.eval.ts` file with 5 capital-city questions
- Use `generateText` with `traceAISDKModel` as the `task`
- System prompt: `'Answer with only the city name. No periods, no extra text.'`
- Scorer: `Levenshtein`
- Run `pnpm eval:dev` and check the scores in the dashboard
Food for thought: Why is Levenshtein not the ideal scorer here? What happens if the LLM answers “Paris, France” instead of “Paris”?
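One possible answer, as a sketch: a hand-rolled containment scorer that gives full credit to any answer mentioning the expected city, so "Paris, France" would still score 1. The scorer shape here is illustrative; for semantic comparison, autoevals also ships LLM-based scorers such as Factuality.

```typescript
// Hypothetical containment scorer: full credit if the expected string
// appears anywhere in the output (case-insensitive).
function containsExpected(
  { output, expected }: { output: string; expected?: string },
): { score: number } {
  if (!expected) return { score: 0 };
  return {
    score: output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0,
  };
}
```

Note the trade-off: containment is lenient in the other direction — "The capital is definitely not Paris" would also score 1. No single string-based scorer is perfect, which is exactly why it pays to measure.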