
Challenge 6.5: Langfuse Basics

Your evals run locally — but what happens with LLM calls in production? How do you notice that your system is getting slower, costs are exploding, or quality is dropping — before a user complains?

Diagram: a production app sends an LLM call, which is captured in a Langfuse trace; the dashboard displays latency, cost, and quality.

Langfuse is an open-source observability tool for LLM applications. It captures every LLM call in production and shows you latency, token costs, and quality in a dashboard. Like Datadog or Sentry — but specifically for LLMs.

Without observability: Your LLM system is running in production. A user reports: “The answers are bad.” You look at the logs — nothing. You don’t know which prompt was sent, how the LLM responded, or how much it cost. You’re flying blind.

With observability: You open the Langfuse dashboard and see: Trace #4721, prompt “Explain X”, response ”…”, latency 3.2s, cost $0.003, score 0.4. You immediately see where the problem is — and can reproduce it in an eval.

Langfuse organizes data in three levels:

Diagram: one trace (a complete user request) containing Generation 1 (system prompt + user input), Generation 2 (tool call), and Generation 3 (final answer), with scores attached — e.g. factuality 0.8 on Generation 1 and latency 2.1s on Generation 3.
| Concept | What | Example |
|---|---|---|
| Trace | A complete request lifecycle | User asks "What is TypeScript?" -> Answer |
| Generation | A single LLM call within a trace | generateText({ prompt: '...' }) |
| Score | An evaluation of a trace or generation | Factuality: 0.8, Latency: 2.1s |
| Span | An arbitrary code section (non-LLM) | Database query, retrieval step |

A trace can contain multiple generations — for example with an agent that makes multiple tool calls (Level 3).
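This hierarchy can be sketched as a toy data model. Note: the type names below are illustrative for this sketch, not the Langfuse SDK's own types.

```typescript
// Toy model of the Langfuse hierarchy (illustrative names, not SDK types):
// a trace holds observations (generations and spans) plus scores.
type Observation =
  | { kind: 'generation'; name: string; model: string }
  | { kind: 'span'; name: string };

interface Trace {
  name: string;
  observations: Observation[];
  scores: { name: string; value: number }[];
}

// An agent request: one trace with a retrieval span and three generations
const agentTrace: Trace = {
  name: 'agent-request',
  observations: [
    { kind: 'generation', name: 'plan', model: 'gpt-4o' },
    { kind: 'span', name: 'vector-search' },
    { kind: 'generation', name: 'tool-call', model: 'gpt-4o' },
    { kind: 'generation', name: 'final-answer', model: 'gpt-4o' },
  ],
  scores: [{ name: 'factuality', value: 0.8 }],
};

// Count the LLM calls inside this one request
const generationCount = agentTrace.observations.filter(
  (o) => o.kind === 'generation',
).length;
```

The point of the model: a span (the vector search) sits alongside generations in the same trace, so non-LLM steps show up on the same timeline as the LLM calls.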

Per generation, Langfuse stores:

// This is what Langfuse automatically captures for every LLM call:
{
  // Input
  model: 'gpt-4o',
  input: {
    system: 'You are a helpful assistant.',
    messages: [{ role: 'user', content: 'What is TypeScript?' }],
  },
  // Output
  output: 'TypeScript is a typed extension of JavaScript...',
  // Metrics
  usage: {
    promptTokens: 42,
    completionTokens: 128,
    totalTokens: 170,
  },
  latency: 1842, // ms
  cost: 0.00085, // USD (calculated from token prices)
  // Metadata
  traceId: 'trace_abc123',
  timestamp: '2026-03-08T14:30:00Z',
}
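The cost field is derived, not reported by the model: Langfuse multiplies token counts by per-model prices. A minimal sketch of that calculation — the price figures below are made-up illustration values, not current OpenAI pricing:

```typescript
// Illustrative per-1M-token prices (assumed values for this sketch)
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
};

// Cost in USD = prompt tokens * input price + completion tokens * output price
function estimateCost(
  model: string,
  promptTokens: number,
  completionTokens: number,
): number {
  const p = PRICES[model];
  return (promptTokens * p.input + completionTokens * p.output) / 1_000_000;
}
```

With the assumed prices, 42 prompt tokens and 128 completion tokens come out to 0.001385 USD — tiny per call, which is exactly why costs only become visible when aggregated across thousands of requests.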

Langfuse offers two integration paths: the recommended OpenTelemetry-based approach (LangfuseExporter) and the manual SDK approach. The following code shows the manual approach, to make the concepts (trace, generation, score) explicit. In production you’d use the OTel integration — it captures AI SDK calls automatically, without manual logging.

import { Langfuse } from 'langfuse';

// 1. Initialize Langfuse client
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: 'https://cloud.langfuse.com', // or self-hosted URL
});

// 2. Create a trace
const trace = langfuse.trace({
  name: 'chat-completion',
  userId: 'user_123',
  metadata: { feature: 'chat-titles' },
});

// 3. Log a generation
const generation = trace.generation({
  name: 'generate-title',
  model: 'gpt-4o-mini',
  input: { prompt: 'Generate a title for: ...' },
});

// 4. After the LLM call: log output and metrics
generation.end({
  output: 'TypeScript Generics Explained',
  usage: { promptTokens: 42, completionTokens: 8 },
});

// 5. Optional: add a score
trace.score({
  name: 'title-quality',
  value: 0.85,
  comment: 'Title is concise and relevant.',
});

// 6. At the end: flush (important for serverless/Edge!)
await langfuse.flushAsync();

Important: In serverless environments (Vercel, Cloudflare Workers) you must call flushAsync() before the function terminates. Otherwise traces will be lost.
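One way to make that reliable is to wrap the handler body so the flush always runs, even when the handler throws. A minimal sketch of that pattern — `withFlush` is a helper invented here, and `client` stands for anything with a `flushAsync()` method (like the Langfuse client):

```typescript
// Ensure flushAsync() runs on both success and error paths,
// before the serverless runtime freezes or kills the function.
async function withFlush<T>(
  client: { flushAsync(): Promise<void> },
  fn: () => Promise<T>,
): Promise<T> {
  try {
    return await fn();
  } finally {
    // runs even if fn() threw, so pending traces are not lost
    await client.flushAsync();
  }
}
```

In a route handler you would then write something like `return withFlush(langfuse, () => handleRequest(req))` instead of remembering to flush on every exit path.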

The Langfuse dashboard offers:

| Feature | What You See |
|---|---|
| Traces | All requests with timeline — click in for details |
| Generations | Each LLM call with input, output, tokens, latency |
| Scores | Average quality over time |
| Cost Analytics | Token usage and costs per day, model, feature |
| Latency Distribution | P50, P95, P99 latencies — where are your bottlenecks? |
| User Analytics | Which users generate the most costs? |
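To make the percentile rows concrete: a P95 of 9 seconds means 5% of your users wait longer than that, even if the median looks fine. A sketch of how such a percentile is derived from raw trace latencies (nearest-rank method; Langfuse's exact computation may differ):

```typescript
// Nearest-rank percentile over a list of latencies (ms)
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}
```

With latencies [1200, 1500, 1800, 2100, 9000] ms the P50 is 1800 but the P95 is 9000: a single slow outlier dominates the tail, which is why dashboards show P95/P99 alongside the median.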

Evalite and Langfuse are complementary — they cover different phases:

Diagram: the development loop (change code -> Evalite local evals -> compare scores -> change code) and the production loop (app -> Langfuse monitoring -> alerts on score drops trigger code changes).
| Aspect | Evalite | Langfuse |
|---|---|---|
| When | Development, CI/CD | Production |
| Where | Local, your machine | Cloud or self-hosted |
| What | Defined test cases with known expected values | Real user requests |
| Purpose | Prompt iteration, regression detection | Monitoring, debugging, cost control |
| Data | Your dataset | Real production data |

The workflow: You iterate with Evalite (locally) until the score is good enough. Then you deploy. Langfuse monitors in production. When scores drop there, you go back to Evalite and fix the issue.

Task: Familiarize yourself with the Langfuse concepts — traces, generations, scores.

  1. Go to langfuse.com and create a free account
  2. Explore the demo project in the dashboard:
    • Open a trace and follow the request lifecycle
    • Look at the generation details: input, output, tokens, latency
    • Find the cost analytics: Which model costs the most?
  3. Answer for yourself:
    • What is the difference between a trace and a generation?
    • Where would you add a score — on the trace or on the generation?
    • When would you set up an alert?

Checklist:

  • Langfuse account created (or demo project explored)
  • Opened a trace in the dashboard and understood it
  • Explained the difference between trace and generation
  • Found and interpreted cost analytics
  • Own notes: Where would you use Langfuse in your project?
Hint: No coding in this challenge

This challenge is intentionally conceptual. Integrating Langfuse with the AI SDK requires an OpenTelemetry setup that varies depending on the hosting environment (Node.js, Vercel, Cloudflare). The focus here is on understanding the concepts — the practical integration comes when you deploy your first project to production.

If you still want hands-on work: Langfuse offers a Quickstart Guide that lets you create a local trace in under 5 minutes.
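For orientation, the OTel-based integration mentioned above looks roughly like this for a Node.js runtime. This is a sketch based on the `langfuse-vercel` exporter package — check the quickstart for the setup that matches your environment:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { LangfuseExporter } from 'langfuse-vercel';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Register the Langfuse exporter once at application startup
const sdk = new NodeSDK({ traceExporter: new LangfuseExporter() });
sdk.start();

// AI SDK calls are then captured automatically once telemetry is enabled:
const { text } = await generateText({
  model: openai('gpt-4o-mini'),
  prompt: 'Generate a title for: ...',
  experimental_telemetry: { isEnabled: true, functionId: 'chat-title' },
});
```

Note how there is no manual `trace()` or `generation()` call here — the exporter turns the AI SDK's telemetry spans into Langfuse traces on its own.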

Development leads to Evalite (6.1-6.4), then Improve prompt, Deploy, Langfuse (6.5), Monitoring — when score drops, the loop returns to Development

Exercise: Sketch the eval-driven development workflow for yourself:

  1. You have a chat title generator (from the previous challenges)
  2. How would you set up the Evalite-Langfuse workflow?
    • When does Evalite run? (local, CI/CD)
    • When does Langfuse run? (production)
    • What happens when a Langfuse alert fires?
  3. Consider: Which scores would you track in production?
    • Latency? Cost per request? Title quality?

Food for thought: Langfuse captures real user requests. Could you automatically feed that data back into the Evalite dataset to improve the test data?
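One way to sketch that feedback loop: pull traces from Langfuse, keep only the ones rated highly, and turn them into eval cases. The `ProdTrace` shape below is an assumption about what you'd export, not Langfuse's actual API types:

```typescript
// Assumed shape of an exported production trace (illustrative)
interface ProdTrace { input: string; output: string; score?: number }
// Shape of a row in your Evalite dataset (illustrative)
interface EvalCase { input: string; expected: string }

// Keep only traces a scorer (or human reviewer) rated highly, so bad
// production outputs don't become "expected" answers in the dataset.
function toEvalCases(traces: ProdTrace[], minScore = 0.8): EvalCase[] {
  return traces
    .filter((t) => (t.score ?? 0) >= minScore)
    .map((t) => ({ input: t.input, expected: t.output }));
}
```

The score threshold is the important design choice: without it you would be training your evals to reproduce whatever the production system already does, including its failures.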

Part of AI Learning — free courses from prompt to production. Jan on LinkedIn