Boss Fight: Eval Pipeline for Chat Titles
The Scenario
You are building a complete eval pipeline for a chat title generator. The system takes a user message and generates a short, descriptive title for the chat. Your pipeline automatically evaluates the quality of these titles — with deterministic scorers AND LLM-as-Judge.
This is how it should work:
- Input: `"I need help with TypeScript Generics and Constraints"` -> Title: `"TypeScript Generics"`
- Input: `"How do I configure Docker Compose for a multi-container setup?"` -> Title: `"Docker Compose Setup"`
- Input: `""` -> Title: `"New Chat"`

This project connects all five building blocks:
Requirements
- **Dataset (Challenge 6.4)** — Create a dataset with at least 20 diverse chat histories. Cover: technical questions, short inputs, long inputs, empty inputs, ambiguous inputs, multilingual inputs.
- **Task: `generateText` (Challenge 6.1)** — Use `generateText` with a system prompt that controls the title generator. The system prompt should define rules: maximum length, no punctuation at the end, descriptive but concise.
- **`traceAISDKModel` (Challenge 6.1)** — Wrap the model with `traceAISDKModel` to see token usage and latency in the dashboard.
- **Deterministic Scorer: Title Length (Challenge 6.2)** — Use `createScorer` to build a scorer that evaluates title length:
  - 1-50 characters -> score 1.0
  - 51-80 characters -> score 0.5
  - 0 or more than 80 characters -> score 0.0
- **Deterministic Scorer: No Trailing Period (Challenge 6.2)** — Create a scorer that checks whether the title does NOT end with a period.
- **LLM-as-Judge Scorer: Relevance (Challenge 6.3)** — Create a scorer that evaluates whether the generated title matches the user message. Use `generateObject` with a Zod schema and a score scale.
- **Multiple Scorers Combined** — All three scorers (title length, no trailing period, relevance) must run in a single `scorers` array.
- **Eval-Driven Iteration** — Run the evals once, analyze the results, adjust the system prompt, and run the evals again. Did the scores improve?
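The two deterministic scorer rules above reduce to plain string logic, which you can pin down before wrapping anything in `createScorer`. A minimal sketch, assuming hypothetical helper names (`scoreTitleLength`, `scoreNoTrailingPeriod`) that would later back the actual scorers:

```typescript
// Pure scoring logic for the two deterministic scorers (helper names are illustrative).

// Graduated score based on character count.
function scoreTitleLength(title: string): number {
  const len = title.length;
  if (len >= 1 && len <= 50) return 1.0;  // ideal length
  if (len >= 51 && len <= 80) return 0.5; // too long, but usable
  return 0.0;                             // empty or over 80 characters
}

// Binary check: no trailing period allowed.
function scoreNoTrailingPeriod(title: string): number {
  return title.endsWith('.') ? 0.0 : 1.0;
}

console.log(scoreTitleLength('TypeScript Generics'));   // 19 characters, prints 1
console.log(scoreNoTrailingPeriod('Docker Compose Setup.')); // prints 0
```

Keeping the logic pure makes it unit-testable without any model calls; the `createScorer` wrapper only needs to pass the generated output through.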
Starter Code
Create the file `chat-titles.eval.ts` and run it with `pnpm eval:dev`.
Note: Langfuse (Challenge 6.5) would be added in production — here we focus on the eval pipeline with Evalite.
```ts
import { evalite, createScorer } from 'evalite';
import { traceAISDKModel } from 'evalite/ai-sdk';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// TODO 1: Create the titleLength scorer (createScorer)
// - 1-50 characters -> 1.0
// - 51-80 characters -> 0.5
// - 0 or >80 -> 0.0

// TODO 2: Create the noTrailingPeriod scorer (createScorer)
// - Does NOT end with a period -> 1.0
// - Ends with a period -> 0.0

// TODO 3: Create the titleRelevance scorer (createScorer with LLM-as-Judge)
// - Use generateObject with a judge prompt
// - Score scale: A (perfectly relevant), B (partially relevant),
//   C (vaguely relevant), D (not relevant)
// - Return score + metadata (rationale)

// TODO 4: Define the system prompt for the title generator

// TODO 5: Create the dataset with 20+ test cases

// TODO 6: Create the evalite() with:
// - data: your dataset
// - task: generateText with traceAISDKModel
// - scorers: [titleLength, noTrailingPeriod, titleRelevance]
```

Evaluation Criteria
Your Boss Fight is passed when:

- Dataset with at least 20 diverse test cases (technical, conversational, edge cases, ambiguous)
- `generateText` with `traceAISDKModel` as `task` — titles are generated by the LLM
- System prompt defines rules for title generation (length, format, style)
- `titleLength` scorer evaluates length with graduated scores
- `noTrailingPeriod` scorer checks for no trailing period
- `titleRelevance` scorer uses LLM-as-Judge with `generateObject` and a Zod schema
- All three scorers run in a single `scorers` array
- At least one iteration: adjust the system prompt, re-evaluate, compare scores
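The iteration criterion boils down to comparing the average score of a run before and after the prompt change. Evalite reports these averages for you in the dashboard; as a plain-code illustration of the comparison (the helper and the sample numbers are hypothetical):

```typescript
// Average of per-case scores for one eval run (illustrative helper).
function averageScore(scores: number[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Hypothetical per-case scores from two runs of the same eval.
const runBefore = [1.0, 0.5, 0.0, 1.0];
const runAfter = [1.0, 1.0, 0.5, 1.0];

console.log(averageScore(runBefore)); // prints 0.625
console.log(averageScore(runAfter));  // prints 0.875
```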
Hint 1: System Prompt Design
A good system prompt for title generation could look like this:
```
Generate a short, descriptive title for the following chat message.

Rules:
- Maximum 50 characters
- No periods at the end
- Be specific, not generic
- If the input is empty, return "New Chat"
- Use the language of the input
```

Start with this and iterate based on the eval results.
Hint 2: Relevance Scorer Prompt
The judge prompt for relevance could ask the question: “How well does this title describe the chat message?” and offer a scale:
- A (1.0): Title captures the main topic precisely
- B (0.7): Title is related but too vague or too specific
- C (0.3): Title is only loosely related
- D (0.0): Title is unrelated to the message
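One way to keep the judge's scoring deterministic is to have `generateObject` return just the letter grade plus a rationale, then map the grade to a number in plain code. A minimal sketch of that mapping (the `gradeToScore` name is an assumption, not part of the challenge):

```typescript
// Maps the judge's letter grade to the numeric score from the scale above.
type Grade = 'A' | 'B' | 'C' | 'D';

function gradeToScore(grade: Grade): number {
  const scale: Record<Grade, number> = { A: 1.0, B: 0.7, C: 0.3, D: 0.0 };
  return scale[grade];
}

console.log(gradeToScore('B')); // prints 0.7
```

Asking the model for a grade instead of a raw number tends to give more consistent judgments, since the scale anchors each grade to a concrete description.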
Hint 3: Edge Cases in the Dataset
Don’t forget:
- Empty strings (`''`)
- Whitespace only (`'   '`)
- Very short inputs (`'Hi'`, `'?'`)
- Very long inputs (200+ characters)
- Inputs with special characters, emojis, or code snippets
- Inputs in different languages
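These edge cases can sit right next to the happy-path entries in the dataset array. An illustrative slice (the entries and the `TestCase` shape are examples, not prescribed by the challenge):

```typescript
// Illustrative edge-case entries for the dataset (TODO 5).
type TestCase = { input: string };

const edgeCases: TestCase[] = [
  { input: '' },    // empty string, the prompt should yield "New Chat"
  { input: '   ' }, // whitespace only
  { input: 'Hi' },  // very short
  { input: '?' },
  // special characters, code, and an emoji in one input:
  { input: 'How do I fix "TypeError: x is not a function" in a React useEffect? 😅' },
  // different language (German):
  { input: 'Wie konfiguriere ich Docker Compose für mehrere Container?' },
];

console.log(edgeCases.length); // prints 6
```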