
Metrics

Your data science team presents results: “The new model has an F1 score of 0.88, ROUGE-L is at 0.45, and AUC is 0.94.” The VP Product looks at you and asks: “Is that good?” Your answer determines whether the feature ships.

As a PM, you don’t need to understand every formula. But you need to know which metric matters for which product type, what the numbers mean for users and the business, and when a number looks deceptively good. Metric selection is a product decision — not a technical one.

Precision vs Recall Tradeoff

Most AI products include classification components — spam detection, intent routing, content moderation. Three metrics you must understand:

Precision: Of all items the model predicted as positive — how many were actually positive? High precision means few false alarms. Prioritize precision when false positives are costly (e.g., flagging legitimate transactions as fraud).

Recall: Of all actually positive items — how many did the model find? High recall means few missed cases. Prioritize recall when false negatives are dangerous (e.g., missing a cancer diagnosis).

F1 Score: Harmonic mean of precision and recall. Use F1 when neither false positives nor false negatives clearly dominate.
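These three definitions are simple enough to verify by hand. A minimal sketch in Python, using hypothetical confusion-matrix counts for a spam filter:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Derive precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything flagged, how much was right?
    recall = tp / (tp + fn)     # of everything real, how much was found?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical filter: 85 spam caught, 8 false alarms, 15 spam missed
p, r, f1 = precision_recall_f1(tp=85, fp=8, fn=15)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note that the harmonic mean punishes imbalance: a model with precision 1.0 but recall 0.1 gets an F1 of about 0.18, not the 0.55 an arithmetic mean would suggest.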

Typical production F1 targets:

| Application | F1 target |
|---|---|
| Fraud detection | 0.80-0.85 |
| Document classification | 0.75+ |
| Content moderation | 0.85+ |
| Medical diagnosis support | 0.90+ |

The accuracy trap: Accuracy is misleading for imbalanced datasets. A spam filter with 99% accuracy sounds great — until you realize 99% of emails are not spam. The model could just label everything “not spam” and hit 99%.
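The trap is easy to reproduce. A sketch with a do-nothing "model" on a 99:1 dataset:

```python
# 10,000 emails, of which only 100 (1%) are spam
labels = ["spam"] * 100 + ["not spam"] * 9_900
# A "model" that labels everything "not spam"
preds = ["not spam"] * len(labels)

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
caught = sum(y == "spam" and p == "spam" for y, p in zip(labels, preds))
recall = caught / 100

print(accuracy)  # 0.99, looks great
print(recall)    # 0.0, it never catches a single spam mail
```

Recall (or per-class F1) exposes immediately what accuracy hides.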

Generation Metrics

For products that generate text (summarization, translation, content creation):

| Metric | Strength | Limitation |
|---|---|---|
| BLEU | Translation quality with reference | Surface-level word overlap only |
| ROUGE | Summarization quality with reference | Surface-level word overlap only |
| BERTScore | Semantic similarity (catches paraphrases) | Requires embedding model |
| LLM-as-Judge | Open-ended quality, tone, helpfulness | Cost, latency, judge bias |

Current consensus: LLM-as-Judge has become the preferred metric for final quality validation in generative AI. BLEU and ROUGE remain useful for fast regression checks in CI/CD pipelines.
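The "surface-level overlap" limitation is easy to see with a toy unigram-overlap score in the spirit of ROUGE-1 (a deliberate simplification, not the full metric):

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style recall: fraction of reference words present in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    return sum(w in cand_words for w in ref_words) / len(ref_words)

reference = "the model reduces costs significantly"
print(unigram_overlap("the model reduces costs significantly", reference))  # 1.0, identical wording
print(unigram_overlap("expenses drop sharply with this approach", reference))  # 0.0, same meaning, zero overlap
```

A perfectly good paraphrase scores zero, which is exactly why semantic metrics like BERTScore and LLM-as-Judge exist.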

The most valuable evaluations are task-specific:

RAG-specific (RAGAS framework): Context Relevance (are retrieved documents relevant?), Faithfulness (is the answer grounded in sources or hallucinated?), Answer Relevance (does the answer address the question?).
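As a rough intuition for faithfulness (not the RAGAS implementation, which uses an LLM judge to verify individual claims), a toy check of how much of an answer is covered by the retrieved context:

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Toy stand-in for faithfulness: share of answer words found in the context.
    RAGAS itself extracts claims and verifies each with an LLM, not word overlap."""
    ctx_words = set(context.lower().split())
    words = answer.lower().split()
    return sum(w in ctx_words for w in words) / len(words)

context = "the refund policy allows returns within 30 days of purchase"
print(toy_faithfulness("returns allowed within 30 days", context))  # mostly grounded
print(toy_faithfulness("refunds take 5 business days", context))    # low: likely hallucinated
```

Even this crude version shows the principle: an answer that scores low against its own sources deserves scrutiny before it reaches users.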

Agent-specific: Task Completion Rate (did the agent finish the task?), Tool Call Accuracy (right tool, valid arguments), Step Efficiency (no redundant or looping steps), Recovery Rate (how often the agent recovers after a failed step).

Product-level (what stakeholders care about): User Satisfaction (CSAT, thumbs up/down), Task Completion Time (with AI vs. without), Adoption Rate, Escalation Rate, Cost per Successful Interaction.

| Technical metric | Stakeholder translation |
|---|---|
| Precision = 0.92 | "Out of every 100 items the AI flags, 92 are correct" |
| Recall = 0.85 | "The AI catches 85 out of every 100 real cases" |
| F1 = 0.88 | "Balances finding things (85%) with being right when it does (92%)" |
| AUC = 0.94 | "The model correctly ranks a positive above a negative 94% of the time" |
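The AUC row has a direct computational reading: AUC is the fraction of (positive, negative) pairs in which the model scores the positive higher, with ties counted as half. A sketch with hypothetical scores:

```python
def pairwise_auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# 3 positives, 3 negatives; one positive (0.4) is ranked below one negative (0.7)
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(auc)  # 8 of 9 pairs ranked correctly, about 0.89
```

This pairwise view is often the easiest way to explain AUC to non-technical stakeholders.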

Metric selection by product type:

| Product type | Primary metrics | Secondary metrics |
|---|---|---|
| Content moderation | Precision, Recall (per category) | Latency, false positive rate by content type |
| Search / retrieval | NDCG (Normalized Discounted Cumulative Gain — ranking quality), MRR (Mean Reciprocal Rank — position of first relevant result), Context Relevance | Retrieval latency, zero-result rate |
| Summarization | LLM-as-Judge (faithfulness, coverage) | User satisfaction, time saved |
| Chatbot / assistant | Task completion rate, user satisfaction | Escalation rate, response time |
| Classification | F1, AUC, per-class precision/recall | Threshold sensitivity analysis |
| Code generation | Functional correctness (tests pass) | User acceptance rate |

Rules for stakeholder communication:

  1. Always translate to business impact: “92% precision means 8 false alerts per 100 flags — roughly 2 hours of analyst time daily”
  2. Show tradeoffs, not single numbers: “We can increase catch rate from 85% to 95%, but false alerts will triple”
  3. Benchmark against the current process, not against perfection
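Rule 1's arithmetic is worth scripting so that it updates as volumes change. A sketch using the figures from the rule above; the 15-minute triage time per false alert is a hypothetical input, chosen to match the 2-hours-for-8-alerts figure:

```python
def false_alert_load(flags_per_day: int, precision: float,
                     minutes_per_alert: float = 15.0) -> tuple[float, float]:
    """Translate a precision number into daily false alerts and analyst hours."""
    false_alerts = flags_per_day * (1 - precision)
    analyst_hours = false_alerts * minutes_per_alert / 60
    return false_alerts, analyst_hours

alerts, hours = false_alert_load(flags_per_day=100, precision=0.92)
print(f"{alerts:.0f} false alerts/day, about {hours:.1f} analyst hours")
```

The same two lines of arithmetic answer the inevitable follow-up question: "What happens at 1,000 flags a day?"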

Practice scenario: you are a PM at an e-commerce company. Your AI feature classifies product reviews as genuine or fake, and the data science team presents two model variants.

The situation:

  • 50,000 reviews/month, estimated 8% fake reviews
  • Currently manual review by 3 moderators (cost: $13,000/month)
  • Each undetected fake review costs an average of $50 (trust loss, returns)
  • Each falsely deleted genuine review costs an average of $16 (annoyed customer, support)

Model A: Precision 0.95, Recall 0.70 — few false alarms, but misses 30% of fakes.

Model B: Precision 0.27, Recall 0.92 — catches almost all fakes, but falsely deletes 22% of genuine reviews (at an 8% fake rate, that false-positive rate means roughly three out of four flags are wrong, hence the low precision).

How would you decide?

The best decision: Model A with human review for uncaught cases.

Why (the math):

  • Model B — cost of false positives: Out of 46,000 genuine reviews, 22% are falsely deleted = 10,120 reviews. At $16 per case = $161,920/month. Unacceptable.
  • Model A — cost of false negatives: Out of 4,000 fake reviews, 30% are missed = 1,200 reviews. At $50 per case = $60,000/month.
  • Model A + human review: route borderline and low-confidence cases — where most of the 1,200 missed fakes cluster — to one moderator. Cost: roughly $4,500/month for a part-time moderator.
  • Total cost Model A + human review: well below the $13,000 fully manual process — and massively below Model B’s $161,920.
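The bullet-point math can be double-checked in a few lines (scenario figures only; Model A's own false-positive cost, which the bullets leave out, turns out to be negligible):

```python
reviews, fake_rate = 50_000, 0.08
fakes = reviews * fake_rate        # 4,000 fake reviews per month
genuine = reviews - fakes          # 46,000 genuine reviews

cost_fn, cost_fp = 50, 16          # $ per missed fake / per wrongly deleted review

# Model B: recall 0.92, falsely deletes 22% of genuine reviews
b_fp_cost = genuine * 0.22 * cost_fp       # 10,120 wrongly deleted
b_fn_cost = fakes * (1 - 0.92) * cost_fn   # 320 fakes still missed

# Model A: precision 0.95, recall 0.70
a_fn_cost = fakes * (1 - 0.70) * cost_fn   # 1,200 fakes missed
a_flagged = fakes * 0.70 / 0.95            # ~2,947 reviews flagged in total
a_fp_cost = (a_flagged - fakes * 0.70) * cost_fp  # ~147 wrongly deleted

print(f"Model B error cost: ${b_fp_cost + b_fn_cost:,.0f}/month")
print(f"Model A error cost: ${a_fn_cost + a_fp_cost:,.0f}/month, before human review")
```

The asymmetry jumps out of the numbers: Model B's false positives alone cost more than ten times Model A's entire error budget.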

Common mistakes:

  • “Higher recall is always better” — not when false positives hit real users. The cost asymmetry decides.
  • “Optimize for one metric” — real products need balance. The PM defines the acceptable tradeoff.
  • “Accuracy is enough” — at 8% fake rate, a model labeling everything “genuine” would hit 92% accuracy.

Metric selection is a product decision, not a technical one, because every metric encodes a tradeoff — and the PM must decide which tradeoff users can tolerate.

  • Precision vs. recall is not a technical detail — it is the question of whether false positives or false negatives are more harmful for your product.
  • Accuracy is misleading for imbalanced datasets. Use F1, precision, and recall — broken down by category.
  • Translate every metric into business impact. “F1 = 0.88” means nothing to stakeholders. “8 false alerts per 100 flags” does.

Sources: Google ML Crash Course — Classification Metrics (2024), Evidently AI — Classification Metrics Guide (2025), RAGAS Framework Documentation (2025), Galileo — Accuracy Metrics for ML Engineers (2025), Deepchecks — F1 Score, Accuracy, ROC-AUC (2025)

Part of AI Learning — free courses from prompt to production. Jan on LinkedIn