Boss Fight: Production-Ready AI System
The Scenario
You’re building a production-ready AI system — the ultimate project that combines everything you’ve learned across 9 levels. The system takes a topic, autonomously researches it, selects the optimal model for each step, protects itself with guardrails, and delivers a quality-assured report.
Your system should feel like this:
```
[Input Guard]        "Edge Computing Trends" — injection check: OK, PII check: OK

[Phase 1: Research]  Routing: gemini-flash (research = simple search)
  Iteration 1: search("edge computing advantages 2026") — 312 tokens
  Iteration 2: search("edge computing vs cloud computing") — 287 tokens
  Iteration 3: done (finishReason: stop)
[Phase 1]            3 iterations, 1,041 tokens, break reason: complete

[Phase 2: Summarize] Routing: claude-sonnet (analysis = complex)
  5 key findings generated — 423 tokens

[Phase 3: Format]    Routing: claude-sonnet (formatting = complex)
  Streaming report... "Edge computing has established itself as a key technology..."
[Phase 3]            891 tokens

[Output Guard]       Length: OK (2,847 characters), format: OK
[Quality Compare]    Same prompt to 2 models → judge score: 8/10
[Stats]              Pipeline: 3.2 s, 2,355 tokens, estimated cost: $0.023
```

This project connects all 9 levels:
Requirements
- Input Guardrails (Challenge 9.1) — Check every user input for prompt injection and PII before the pipeline starts. Invalid inputs are rejected with a clear error message.
- Model Router (Challenge 9.2) — Use the optimal model for each phase of the pipeline. Research: cheap flash model. Analysis and formatting: powerful Sonnet/Opus model. Classification: smallest available model.
- Research Loop (Level 3 + Level 8) — A custom agent loop with at least one tool (search). The loop has three break conditions: max iterations (5), timeout (30s), cost guard (5,000 tokens). The agent decides on its own when enough research is done.
- Workflow (Level 8.1) — The processing phase chains at least 2 sequential `generateText` calls. Output from step N becomes input for step N+1. Each step has its own system prompt with XML-structured Context Engineering (Level 5).
- Output Guardrails (Challenge 9.1) — Check the final report for length, format, and content. Empty or overly long outputs are caught.
- Comparing Outputs (Challenge 9.3) — For the final report: generate the summary with 2 different models in parallel and choose the better result (via a simple scorer or LLM-as-a-Judge).
- Streaming (Level 7 + 8.2) — Stream the final report in real time. Send progress data parts for each phase of the pipeline. (Use `createDataStream` + `writeData` as in Challenge 8.2.)
- Usage Tracking (Level 2.2) — Track token usage per phase and in total. Calculate estimated costs.
- Structured Output (Level 1.5) — The report is returned as a typed object with a Zod schema (title, summary, key findings, conclusion).
- Eval Coverage (Level 6) — Write at least 2 Evalite tests: one that checks whether the report has the expected sections, and one that evaluates the length of the summary.
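The three break conditions named in the Research Loop requirement (max iterations, timeout, cost guard) are easiest to keep straight as one small pure helper that your custom `while` loop calls every turn. A minimal sketch using the limits listed above; the type and function names are illustrative:

```typescript
type LoopState = {
  iterations: number;   // completed loop turns
  startedAt: number;    // Date.now() captured before the first turn
  totalTokens: number;  // accumulated token usage across all turns
};

type BreakReason = 'max-iterations' | 'timeout' | 'cost-guard' | null;

// Returns why the loop must stop, or null if it may continue.
function shouldBreak(
  state: LoopState,
  limits = { maxIterations: 5, timeoutMs: 30_000, maxTokens: 5_000 },
): BreakReason {
  if (state.iterations >= limits.maxIterations) return 'max-iterations';
  if (Date.now() - state.startedAt >= limits.timeoutMs) return 'timeout';
  if (state.totalTokens >= limits.maxTokens) return 'cost-guard';
  return null;
}
```

Call `shouldBreak(state)` at the top of every iteration and record the returned reason in your stats, so the pipeline log can report why research ended.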
Starter Code
```ts
import { createDataStream, generateText, streamText, tool, Output } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { google } from '@ai-sdk/google';
import { z } from 'zod';

// --- Schemas ---
const ReportSchema = z.object({
  title: z.string(),
  summary: z.string(),
  keyFindings: z.array(z.string()).min(3).max(7),
  conclusion: z.string(),
});

// --- Guardrails ---
// TODO: Input guardrails (injection, PII)
// TODO: Output guardrails (length, format)

// --- Model Router ---
// TODO: selectModel(phase: 'research' | 'analysis' | 'format' | 'classify')

// --- Tools ---
// TODO: search tool

// --- Pipeline Phases ---
// TODO: researchPhase(topic) — Custom loop + tools + break conditions
// TODO: processingPhase(research, topic) — Workflow + Context Engineering + Structured Output
// TODO: qualityPhase(report) — Output guardrails + Comparing Outputs

// --- Main Pipeline ---
// TODO: researchPipeline(topic) — Orchestrates all phases, streams progress, tracks costs

// --- Evals ---
// TODO: evalite('research-pipeline', { ... })
```

Evaluation Criteria
Your Boss Fight is passed when:
- Input guardrails check for prompt injection and PII
- Model Router selects different models for Research vs. Analysis
- Research Loop uses a custom while loop with at least one tool
- At least 2 break conditions are implemented (Max Iterations + one more)
- Workflow chains at least 2 sequential `generateText` calls
- System prompts use XML structure (Context Engineering)
- Output guardrails check the final report
- Comparing Outputs generates with at least 2 models in parallel and selects the better result
- Token usage is tracked per phase and total
- At least one Evalite test checks pipeline quality
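The first checklist item can be smoke-tested without any model call. Here is a deliberately naive sketch of the input guard; the regex patterns are illustrative placeholders, not a production-grade injection or PII detector:

```typescript
// Naive input guard: reject obvious injection phrases and common PII shapes.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /reveal .*system prompt/i,
];

const PII_PATTERNS = [
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/, // email address
  /\b\d{3}-\d{2}-\d{4}\b/,       // US-SSN-like number
];

function checkInput(topic: string): { ok: boolean; reason?: string } {
  if (INJECTION_PATTERNS.some((p) => p.test(topic))) {
    return { ok: false, reason: 'possible prompt injection' };
  }
  if (PII_PATTERNS.some((p) => p.test(topic))) {
    return { ok: false, reason: 'input contains PII' };
  }
  return { ok: true };
}
```

Rejecting before the pipeline starts means a hostile input never costs you a single token.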
Hint 1: Pipeline Structure
Build the pipeline as an async function `researchPipeline(topic: string)` that executes all three phases sequentially. Each phase is its own function that returns a result object. At the end, collect statistics and return the report.
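The structure from this hint can be sketched as a small orchestrator. The phase functions are injected here so the sequencing and stats logic stays testable without real model calls; in your solution they would be `researchPhase`, `processingPhase`, and `qualityPhase` making AI SDK calls and reporting their `usage` tokens:

```typescript
type Stats = { totalTokens: number; totalMs: number; phases: string[] };

// Runs the named phases in order, threading the previous output into
// the next phase and accumulating token and timing stats.
async function researchPipeline(
  topic: string,
  phases: Array<{
    name: string;
    run: (input: unknown) => Promise<{ output: unknown; tokens: number }>;
  }>,
): Promise<{ report: unknown; stats: Stats }> {
  const stats: Stats = { totalTokens: 0, totalMs: 0, phases: [] };
  const start = Date.now();
  let current: unknown = topic; // output of step N becomes input of step N+1
  for (const phase of phases) {
    const result = await phase.run(current);
    current = result.output;
    stats.totalTokens += result.tokens;
    stats.phases.push(phase.name);
  }
  stats.totalMs = Date.now() - start;
  return { report: current, stats };
}
```

The estimated cost from the stats line in the example output is then just `stats.totalTokens` multiplied by your per-token price.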
Hint 2: Model Router Integration
Create a `selectModel` function that takes the phase name as a parameter. Research gets a flash model, Analysis and Format get Sonnet. The router is called in each phase, not centrally — so each phase can get its optimal model.
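One way to sketch the router from this hint, using plain model ID strings. The IDs below mirror the placeholder names from the example output and are illustrative; substitute whatever models your API keys can reach, and wrap the returned ID in the matching provider factory from the starter code (`google(...)` or `anthropic(...)`):

```typescript
type Phase = 'research' | 'analysis' | 'format' | 'classify';

// Cheap flash-class model for research and classification, a stronger
// model for analysis and formatting. IDs are illustrative examples.
const MODEL_BY_PHASE: Record<Phase, string> = {
  research: 'gemini-flash',
  classify: 'gemini-flash-lite',
  analysis: 'claude-sonnet',
  format: 'claude-sonnet',
};

function selectModel(phase: Phase): string {
  return MODEL_BY_PHASE[phase];
}
```

Because each phase asks the router itself, adding a new phase later means one new map entry, not a change to the pipeline.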
Hint 3: Integrating Comparing Outputs
The comparison doesn’t have to run the entire pipeline twice. It’s sufficient to generate the summary with 2 models in parallel (`Promise.all`) in the processing phase and then choose the longer or better output. The research step only runs once.
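A sketch of that idea with the generation function injected, so the selection logic itself has no model dependency. In the real pipeline, `generate` would wrap a `generateText` call with the given model; the "longer output wins" scorer is the simplest placeholder and can be swapped for an LLM-as-a-Judge:

```typescript
// Generate the summary with two models in parallel and keep the better one.
async function compareSummaries(
  prompt: string,
  generate: (model: string, prompt: string) => Promise<string>,
  models: [string, string],
): Promise<{ winner: string; text: string }> {
  const [a, b] = await Promise.all(models.map((m) => generate(m, prompt)));
  // Simple scorer: the longer output wins. Replace with a judge model
  // call if you want quality rather than length.
  return a.length >= b.length
    ? { winner: models[0], text: a }
    : { winner: models[1], text: b };
}
```

Only this one step runs twice, so the extra cost is a single additional summary generation, not a second research loop.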
Hint 4: Keep Evals Separate
Write the evals in a separate file. Import the pipeline function and test it with Evalite. One scorer checks whether `keyFindings.length >= 3`, another whether `summary.length > 100`. This keeps the pipeline and its tests cleanly separated.
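The two scorers can start as plain functions over the report object (shaped like `ReportSchema` from the starter code) that return 1 for pass and 0 for fail; you can then adapt them to Evalite’s scorer shape in your eval file. A sketch of just the scoring logic:

```typescript
type Report = {
  title: string;
  summary: string;
  keyFindings: string[];
  conclusion: string;
};

// Scorer 1: the report contains enough key findings.
function hasEnoughFindings(report: Report): number {
  return report.keyFindings.length >= 3 ? 1 : 0;
}

// Scorer 2: the summary is substantial rather than a one-liner.
function summaryIsLongEnough(report: Report): number {
  return report.summary.length > 100 ? 1 : 0;
}
```

Keeping the scorers as pure functions also means you can unit-test them without running the pipeline at all.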