Challenge 6.5: Langfuse Basics
Your evals run locally — but what happens with LLM calls in production? How do you notice that your system is getting slower, costs are exploding, or quality is dropping — before a user complains?
OVERVIEW
Langfuse is an open-source observability tool for LLM applications. It captures every LLM call in production and shows you latency, token costs, and quality in a dashboard. Like Datadog or Sentry — but specifically for LLMs.
Without observability: Your LLM system is running in production. A user reports: “The answers are bad.” You look at the logs — nothing. You don’t know which prompt was sent, how the LLM responded, or how much it cost. You’re flying blind.
With observability: You open the Langfuse dashboard and see: Trace #4721, prompt “Explain X”, response ”…”, latency 3.2s, cost $0.003, score 0.4. You immediately see where the problem is — and can reproduce it in an eval.
WALKTHROUGH
Layer 1: Langfuse Core Concepts
Langfuse organizes data with four core concepts:
| Concept | What | Example |
|---|---|---|
| Trace | A complete request lifecycle | User asks “What is TypeScript?” -> Answer |
| Generation | A single LLM call within a trace | generateText({ prompt: '...' }) |
| Score | An evaluation of a trace or generation | Factuality: 0.8, Latency: 2.1s |
| Span | An arbitrary code section (non-LLM) | Database query, retrieval step |
A trace can contain multiple generations — for example with an agent that makes multiple tool calls (Level 3).
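The hierarchy is easiest to see as data. Below is an illustrative sketch in plain TypeScript — not the Langfuse wire format — of an agent trace that contains two generations (LLM calls) and one span (a non-LLM tool call) in between:

```ts
// Illustrative shape of an agent trace (hypothetical, not the Langfuse API)
type Observation = { type: 'generation' | 'span'; name: string };

const agentTrace = {
  name: 'agent-request', // one trace = one complete request lifecycle
  observations: [
    { type: 'generation', name: 'plan-next-step' },     // LLM call #1
    { type: 'span', name: 'web-search-tool' },          // non-LLM work
    { type: 'generation', name: 'summarize-results' },  // LLM call #2
  ] as Observation[],
};

// One trace, two generations:
const generationCount = agentTrace.observations.filter(
  (o) => o.type === 'generation',
).length;
console.log(generationCount); // 2
```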
Layer 2: What Langfuse Captures
Per generation, Langfuse stores:
```ts
// This is what Langfuse automatically captures for every LLM call:
{
  // Input
  model: 'gpt-4o',
  input: {
    system: 'You are a helpful assistant.',
    messages: [{ role: 'user', content: 'What is TypeScript?' }],
  },

  // Output
  output: 'TypeScript is a typed extension of JavaScript...',

  // Metrics
  usage: {
    promptTokens: 42,
    completionTokens: 128,
    totalTokens: 170,
  },
  latency: 1842, // ms
  cost: 0.00085, // USD (calculated from token prices)

  // Metadata
  traceId: 'trace_abc123',
  timestamp: '2026-03-08T14:30:00Z',
}
```

Layer 3: Integration with the AI SDK
Langfuse offers two integration paths: the recommended OpenTelemetry-based approach (LangfuseExporter) and the manual SDK approach. The following code shows the manual approach to make the concepts (trace, generation, score) explicit. In production you’d use the OTel integration — it captures AI SDK calls automatically without manual logging.
```ts
import { Langfuse } from 'langfuse';

// 1. Initialize Langfuse client
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: 'https://cloud.langfuse.com', // or self-hosted URL
});

// 2. Create a trace
const trace = langfuse.trace({
  name: 'chat-completion',
  userId: 'user_123',
  metadata: { feature: 'chat-titles' },
});

// 3. Log a generation
const generation = trace.generation({
  name: 'generate-title',
  model: 'gpt-4o-mini',
  input: { prompt: 'Generate a title for: ...' },
});

// 4. After the LLM call: log output and metrics
generation.end({
  output: 'TypeScript Generics Explained',
  usage: { promptTokens: 42, completionTokens: 8 },
});

// 5. Optional: Add a score
trace.score({
  name: 'title-quality',
  value: 0.85,
  comment: 'Title is concise and relevant.',
});

// 6. At the end: flush (important for serverless/Edge!)
await langfuse.flushAsync();
```

Important: In serverless environments (Vercel, Cloudflare Workers) you must call flushAsync() before the function terminates. Otherwise traces will be lost.
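For comparison, here is a sketch of the recommended OTel path on Next.js/Vercel. It assumes the `@vercel/otel` and `langfuse-vercel` packages; the service name and `functionId` are placeholders. Once the exporter is registered, any AI SDK call with telemetry enabled is traced automatically:

```ts
// instrumentation.ts (Next.js) — register the Langfuse exporter once
import { registerOTel } from '@vercel/otel';
import { LangfuseExporter } from 'langfuse-vercel';

export function register() {
  registerOTel({
    serviceName: 'my-app', // placeholder
    traceExporter: new LangfuseExporter(),
  });
}

// In a route handler: opt the AI SDK call into telemetry —
// no manual trace/generation logging needed
// const result = await generateText({
//   model: openai('gpt-4o-mini'),
//   prompt: '...',
//   experimental_telemetry: { isEnabled: true, functionId: 'generate-title' },
// });
```

This is configuration, not a runnable unit on its own — the exact setup varies by hosting environment, which is why this challenge stays conceptual.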
Layer 4: Dashboard Features
The Langfuse dashboard offers:
| Feature | What You See |
|---|---|
| Traces | All requests with timeline — click in for details |
| Generations | Each LLM call with input, output, tokens, latency |
| Scores | Average quality over time |
| Cost Analytics | Token usage and costs per day, model, feature |
| Latency Distribution | P50, P95, P99 latencies — where are your bottlenecks? |
| User Analytics | Which users generate the most costs? |
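To make the latency row concrete: a P95 latency is the value below which 95% of requests fall, so a high P95 with a low P50 means most requests are fast but a tail is slow. A minimal sketch of the nearest-rank percentile computation, with made-up sample latencies:

```ts
// Nearest-rank percentile over a list of request latencies (ms)
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  // index of the smallest value covering p% of the samples
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = [120, 180, 200, 250, 300, 320, 400, 900, 1500, 4000];
console.log(percentile(latencies, 50)); // 300 — the typical request
console.log(percentile(latencies, 95)); // 4000 — the slow tail
```

The median looks healthy here; only the P95/P99 view reveals the bottleneck.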
Layer 5: Evalite vs. Langfuse
Evalite and Langfuse are complementary — they cover different phases:
| Aspect | Evalite | Langfuse |
|---|---|---|
| When | Development, CI/CD | Production |
| Where | Local, your machine | Cloud or self-hosted |
| What | Defined test cases with known expected values | Real user requests |
| Purpose | Prompt iteration, regression detection | Monitoring, debugging, cost control |
| Data | Your dataset | Real production data |
The workflow: You iterate with Evalite (locally) until the score is good enough. Then you deploy. Langfuse monitors in production. When scores drop there, you go back to Evalite and fix the issue.
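One practical way to connect the two phases is to reuse the same scorer in both: run it over your Evalite dataset during development, and log its value as a Langfuse score in production. A minimal sketch with a naive, hypothetical heuristic scorer (the word-count rule is purely illustrative):

```ts
// A shared scorer, usable in an Evalite eval and as a production score
function titleQuality(title: string): number {
  // Naive heuristic: concise titles of 3–8 words score full marks
  const words = title.trim().split(/\s+/).length;
  return words >= 3 && words <= 8 ? 1 : 0.5;
}

console.log(titleQuality('TypeScript Generics Explained')); // 1
```

In production you would then attach the result to the trace from the manual SDK example, e.g. `trace.score({ name: 'title-quality', value: titleQuality(output) })` — so the dashboard's "Scores" view tracks the same metric you optimized locally.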
Task: Familiarize yourself with the Langfuse concepts — traces, generations, scores.
- Go to langfuse.com and create a free account
- Explore the demo project in the dashboard:
- Open a trace and follow the request lifecycle
- Look at the generation details: input, output, tokens, latency
- Find the cost analytics: Which model costs the most?
- Answer for yourself:
- What is the difference between a trace and a generation?
- Where would you add a score — on the trace or on the generation?
- When would you set up an alert?
Checklist:
- Langfuse account created (or demo project explored)
- Opened a trace in the dashboard and understood it
- Explained the difference between trace and generation
- Found and interpreted cost analytics
- Own notes: Where would you use Langfuse in your project?
Hint: No coding in this challenge
This challenge is intentionally conceptual. Integrating Langfuse with the AI SDK requires an OpenTelemetry setup that varies depending on the hosting environment (Node.js, Vercel, Cloudflare). The focus here is on understanding the concepts — the practical integration comes when you deploy your first project to production.
If you still want hands-on work: Langfuse offers a Quickstart Guide that lets you create a local trace in under 5 minutes.
COMBINE
Exercise: Sketch the eval-driven development workflow for yourself:
- You have a chat title generator (from the previous challenges)
- How would you set up the Evalite-Langfuse workflow?
- When does Evalite run? (local, CI/CD)
- When does Langfuse run? (production)
- What happens when a Langfuse alert fires?
- Consider: Which scores would you track in production?
- Latency? Cost per request? Title quality?
Food for thought: Langfuse captures real user requests. Could you automatically feed that data back into the Evalite dataset to improve the test data?
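In principle, yes: you could pull production traces via the Langfuse API and turn the low-scoring ones into new Evalite test cases. The sketch below shows only the pure mapping step, with hypothetical types standing in for the real API response; fetching and hand-labeling the expected outputs would still be up to you:

```ts
// Hypothetical, simplified shapes — not the Langfuse API response format
type ProdTrace = { input: string; output: string; score: number };
type EvalCase = { input: string; expected?: string };

// Keep only problematic requests; expected outputs get labeled by hand later
function toEvalCases(traces: ProdTrace[], threshold = 0.5): EvalCase[] {
  return traces
    .filter((t) => t.score < threshold)
    .map((t) => ({ input: t.input }));
}

const prod: ProdTrace[] = [
  { input: 'Explain X', output: '...', score: 0.4 },
  { input: 'What is TS?', output: '...', score: 0.9 },
];
console.log(toEvalCases(prod)); // [{ input: 'Explain X' }]
```

This closes the loop from the workflow above: real failures in production become regression tests in development.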