Metrics
Context
Your data science team presents results: “The new model has an F1 score of 0.88, ROUGE-L is at 0.45, and AUC is 0.94.” The VP Product looks at you and asks: “Is that good?” Your answer determines whether the feature ships.
As a PM, you don’t need to understand every formula. But you need to know which metric matters for which product type, what the numbers mean for users and the business, and when a number looks deceptively good. Metric selection is a product decision — not a technical one.
Concept
Classification Metrics
Most AI products include classification components — spam detection, intent routing, content moderation. Three metrics you must understand:
Precision: Of all items the model predicted as positive — how many were actually positive? High precision means few false alarms. Prioritize precision when false positives are costly (e.g., flagging legitimate transactions as fraud).
Recall: Of all actually positive items — how many did the model find? High recall means few missed cases. Prioritize recall when false negatives are dangerous (e.g., missing a cancer diagnosis).
F1 Score: Harmonic mean of precision and recall. Use F1 when neither false positives nor false negatives clearly dominate.
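As a sanity check, all three metrics can be computed directly from raw prediction counts. A minimal sketch with made-up labels:

```python
# Precision, recall, and F1 from raw prediction counts.
# Labels: 1 = positive (e.g. "fraud"), 0 = negative. Data is illustrative.

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 3 TP, 1 FP, 1 FN
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In production you would use a library implementation (e.g. scikit-learn), but the arithmetic is exactly this.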
Typical production F1 targets:
| Application | F1 target |
|---|---|
| Fraud detection | 0.80-0.85 |
| Document classification | 0.75+ |
| Content moderation | 0.85+ |
| Medical diagnosis support | 0.90+ |
The accuracy trap: Accuracy is misleading for imbalanced datasets. A spam filter with 99% accuracy sounds great — until you realize 99% of emails are not spam. The model could just label everything “not spam” and hit 99%.
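The trap is easy to demonstrate with illustrative data: a do-nothing classifier on a 99%-ham inbox scores 99% accuracy while catching zero spam.

```python
# The accuracy trap: "label everything not-spam" looks excellent on an
# imbalanced dataset. Illustrative data only.

emails = [1] * 10 + [0] * 990   # 1 = spam; 1% spam rate
always_ham = [0] * len(emails)  # degenerate model: never flags anything

accuracy = sum(t == p for t, p in zip(emails, always_ham)) / len(emails)
spam_caught = 0                 # recall on the spam class is zero
print(f"accuracy={accuracy:.0%}, spam caught={spam_caught:.0%}")
```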
Generation Metrics
For products that generate text (summarization, translation, content creation):
| Metric | Strength | Limitation |
|---|---|---|
| BLEU | Translation quality with reference | Surface-level word overlap only |
| ROUGE | Summarization quality with reference | Surface-level word overlap only |
| BERTScore | Semantic similarity (catches paraphrases) | Requires embedding model |
| LLM-as-Judge | Open-ended quality, tone, helpfulness | Cost, latency, judge bias |
Current consensus: LLM-as-Judge has become the preferred metric for final quality validation in generative AI. BLEU and ROUGE remain useful for fast regression checks in CI/CD pipelines.
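To make the “surface-level word overlap” limitation concrete, here is a toy sketch of the idea behind ROUGE-1 recall (real implementations add stemming, higher-order n-grams, and an F-measure; this is only the intuition):

```python
# Toy sketch of ROUGE-1 recall: what fraction of the reference's words
# appear in the candidate. Illustrative only, not a production metric.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / sum(ref.values())

ref = "the model flags fraudulent transactions quickly"
cand = "the model quickly flags fraud"
print(f"ROUGE-1 recall ~ {rouge1_recall(cand, ref):.2f}")
```

Note that “fraud” gets no credit for “fraudulent”: a good paraphrase scores poorly, which is exactly why BERTScore and LLM-as-Judge exist.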
Task-Specific Metrics
The most valuable evaluations are task-specific:
RAG-specific (RAGAS framework): Context Relevance (are retrieved documents relevant?), Faithfulness (is the answer grounded in sources or hallucinated?), Answer Relevance (does the answer address the question?).
Agent-specific: Task Completion Rate, Tool Call Accuracy, Step Efficiency, Recovery Rate.
Product-level (what stakeholders care about): User Satisfaction (CSAT, thumbs up/down), Task Completion Time (with AI vs. without), Adoption Rate, Escalation Rate, Cost per Successful Interaction.
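RAGAS judges faithfulness with an LLM; as a rough intuition only, here is a toy lexical proxy (an assumption for illustration, not the RAGAS method): a sentence counts as grounded if enough of its words appear in the retrieved context.

```python
# Toy proxy for RAGAS-style faithfulness: fraction of answer sentences
# whose words mostly appear in the retrieved context. The real RAGAS
# framework uses an LLM to verify each claim; this is only the intuition.

def faithfulness_proxy(answer: str, context: str, threshold: float = 0.5) -> float:
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    grounded = 0
    for s in sentences:
        words = s.lower().split()
        overlap = sum(w in ctx_words for w in words) / len(words)
        grounded += overlap >= threshold
    return grounded / len(sentences)

context = "refunds are processed within 5 business days after approval"
answer = "refunds are processed within 5 business days. shipping is always free"
print(f"faithfulness ~ {faithfulness_proxy(answer, context):.2f}")
```

The second sentence (“shipping is always free”) has no support in the context, so only half the answer counts as grounded.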
Translating Metrics for Stakeholders
| Technical metric | Stakeholder translation |
|---|---|
| Precision = 0.92 | “Out of every 100 items the AI flags, 92 are correct” |
| Recall = 0.85 | “The AI catches 85 out of every 100 real cases” |
| F1 = 0.88 | “Balances finding things (85%) with being right when it does (92%)” |
| AUC = 0.94 | “The model correctly ranks a positive above a negative 94% of the time” |
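The AUC translation in the last row is literal: AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, which can be sketched as a pairwise comparison over toy scores:

```python
# AUC as a ranking statement: fraction of (positive, negative) pairs
# where the positive gets the higher score. Ties count as half.
# Toy scores, illustrative only.

def auc_pairwise(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.6]   # model scores for actual positives
neg = [0.7, 0.4, 0.2]   # model scores for actual negatives
print(f"AUC = {auc_pairwise(pos, neg):.2f}")  # 8 of 9 pairs ranked correctly
```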
Framework
Metric selection by product type:
| Product type | Primary metrics | Secondary metrics |
|---|---|---|
| Content moderation | Precision, Recall (per category) | Latency, false positive rate by content type |
| Search / retrieval | NDCG (Normalized Discounted Cumulative Gain — ranking quality), MRR (Mean Reciprocal Rank — position of first relevant result), Context Relevance | Retrieval latency, zero-result rate |
| Summarization | LLM-as-Judge (faithfulness, coverage) | User satisfaction, time saved |
| Chatbot / assistant | Task completion rate, user satisfaction | Escalation rate, response time |
| Classification | F1, AUC, per-class precision/recall | Threshold sensitivity analysis |
| Code generation | Functional correctness (tests pass) | User acceptance rate |
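Of the retrieval metrics above, MRR is the simplest to reason about: one reciprocal rank per query, averaged. A minimal sketch with made-up ranks:

```python
# Toy sketch of Mean Reciprocal Rank (MRR): for each query, take
# 1 / (rank of the first relevant result), then average over queries.

def mrr(first_relevant_ranks):
    # ranks are 1-based; None means no relevant result was returned
    return sum(1.0 / r for r in first_relevant_ranks if r) / len(first_relevant_ranks)

# first relevant result at ranks 1, 3, and 2 for three queries;
# the fourth query returned nothing relevant (the zero-result case)
ranks = [1, 3, 2, None]
print(f"MRR = {mrr(ranks):.3f}")
```

Note how the zero-result query drags the average down, which is why the table pairs MRR with a zero-result rate as a secondary metric.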
Rules for stakeholder communication:
- Always translate to business impact: “92% precision means 8 false alerts per 100 flags — roughly 2 hours of analyst time daily”
- Show tradeoffs, not single numbers: “We can increase catch rate from 85% to 95%, but false alerts will triple”
- Benchmark against the current process, not against perfection
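The tradeoff rule can be made concrete by sweeping the decision threshold over toy model scores: as the threshold drops, catch rate (recall) rises while precision falls.

```python
# Sketch of the precision/recall tradeoff behind "we can raise catch
# rate, but false alerts will multiply". Toy (score, label) pairs.

scores = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.60, 0),
          (0.55, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]

for threshold in (0.9, 0.7, 0.5, 0.15):
    flagged = [(s, y) for s, y in scores if s >= threshold]
    tp = sum(y for _, y in flagged)
    total_pos = sum(y for _, y in scores)
    precision = tp / len(flagged)
    recall = tp / total_pos
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Presenting two or three rows of this sweep to stakeholders is usually more persuasive than any single number.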
Scenario
You are a PM at an e-commerce company. Your AI feature classifies product reviews as genuine or fake. The data science team presents two model variants:
The situation:
- 50,000 reviews/month, estimated 8% fake reviews
- Currently manual review by 3 moderators (cost: $13,000/month)
- Each undetected fake review costs an average of $50 (trust loss, returns)
- Each falsely deleted genuine review costs an average of $16 (annoyed customer, support)
Model A: Precision 0.95, Recall 0.70 — few false alarms, but misses 30% of fakes.
Model B: Precision 0.78, Recall 0.92 — catches almost all fakes, but 22% of the reviews it flags are genuine.
Decide
How would you decide?
The best decision: Model A with human review for uncaught cases.
Why (the math):
- Model B — cost of false positives: at 92% recall, Model B catches 3,680 of the 4,000 fakes. At 78% precision, those catches come bundled with roughly 1,040 falsely flagged genuine reviews (3,680 / 0.78 ≈ 4,720 total flags). That is ≈ $16,600/month at $16 per case, plus 320 missed fakes at $50 = $16,000, for a total of ≈ $32,600/month.
- Model A — cost of false negatives: out of 4,000 fake reviews, 30% are missed = 1,200 reviews. At $50 per case that is $60,000/month if nothing catches them. False positives are negligible at 95% precision (≈ 150 reviews, ≈ $2,400/month).
- Model A + human review: route low-confidence and borderline cases to one part-time moderator to catch most of the 1,200 missed fakes. Cost: roughly $4,500/month.
- Total cost Model A + human review: roughly $7,000/month — well below the $13,000 fully manual process, and well below Model B’s ≈ $32,600.
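The arithmetic follows directly from the precision and recall definitions; a minimal sketch under the scenario’s assumptions (volumes and dollar figures are the ones stated above):

```python
# Expected monthly error cost of each model, derived from precision and
# recall. All constants come from the scenario's assumptions.

REVIEWS, FAKE_RATE = 50_000, 0.08
COST_FN, COST_FP = 50, 16          # $ per missed fake / per false deletion

def monthly_cost(precision, recall):
    fakes = REVIEWS * FAKE_RATE
    tp = fakes * recall             # fakes caught
    fp = tp / precision - tp        # genuine reviews falsely flagged
    fn = fakes - tp                 # fakes missed
    return fn * COST_FN + fp * COST_FP

print(f"Model A: ${monthly_cost(0.95, 0.70):,.0f}/month")
print(f"Model B: ${monthly_cost(0.78, 0.92):,.0f}/month")
```

Changing the two cost constants flips which model wins, which is the point: the cost asymmetry, not the metrics alone, drives the decision.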
Common mistakes:
- “Higher recall is always better” — not when false positives hit real users. The cost asymmetry decides.
- “Optimize for one metric” — real products need balance. The PM defines the acceptable tradeoff.
- “Accuracy is enough” — at 8% fake rate, a model labeling everything “genuine” would hit 92% accuracy.
Reflect
Metric selection is a product decision, not a technical one, because every metric encodes a tradeoff, and the PM must decide which tradeoff users can tolerate.
- Precision vs. recall is not a technical detail — it is the question of whether false positives or false negatives are more harmful for your product.
- Accuracy is misleading for imbalanced datasets. Use F1, precision, and recall — broken down by category.
- Translate every metric into business impact. “F1 = 0.88” means nothing to stakeholders. “8 false alerts per 100 flags” does.
Sources: Google ML Crash Course — Classification Metrics (2024), Evidently AI — Classification Metrics Guide (2025), RAGAS Framework Documentation (2025), Galileo — Accuracy Metrics for ML Engineers (2025), Deepchecks — F1 Score, Accuracy, ROC-AUC (2025)