Writing AI PRDs
Context
Your team wants to build an AI feature: automatic summaries for support tickets. The engineering lead asks for the PRD. You open your usual template — problem statement, user stories, acceptance criteria, launch timeline — and realize something is missing.
What does “acceptance criteria” mean for a feature that gives a different answer every time? When is a summary “good enough”? And who decides — a QA engineer with a checklist or an eval dataset with 500 examples?
Traditional PRDs define deterministic behavior: given input X, the system produces output Y. AI PRDs must define a quality range for probabilistic outputs: given input X, the system produces outputs that meet quality threshold Z at least P% of the time.
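That acceptance criterion can be made concrete in a few lines. The sketch below is illustrative only: the judge function and both thresholds are assumptions, not taken from any real ticket-summary system.

```python
# Illustrative probabilistic acceptance check for an AI feature.
# `score_summary` stands in for a quality judge (a human rubric or an
# LLM-as-judge) that returns a score in [0, 1] for one model output.

QUALITY_THRESHOLD = 0.8    # Z: minimum score for a single output to count as "good"
REQUIRED_PASS_RATE = 0.95  # P: share of eval cases that must meet Z

def meets_acceptance(outputs, score_summary):
    """Accept the feature if at least P% of outputs score at or above Z."""
    passes = sum(1 for o in outputs if score_summary(o) >= QUALITY_THRESHOLD)
    return passes / len(outputs) >= REQUIRED_PASS_RATE
```

Note that the check is over a population of outputs, not a single run: any individual output may fall below Z without failing the feature, which is exactly what a traditional pass/fail test cannot express.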
Concept
What makes an AI PRD different
The biggest difference: an AI PRD contains an evaluation section that restructures the entire document.
| Element | Traditional PRD | AI PRD |
|---|---|---|
| Requirements | Exact behavior specification | Quality thresholds + eval criteria |
| Success criteria | Feature works yes/no | Accuracy, latency, cost targets per use case |
| Edge cases | Enumerate and handle each | Failure modes and graceful degradation |
| Testing | Pass/fail test cases | Eval datasets, benchmarks, human judgment |
| Acceptance | QA sign-off | Eval metrics above threshold + human review |
The 7 sections of an AI PRD
1. Problem Statement & User Context — Same as a traditional PRD, but quantify the manual effort the AI will replace.
2. AI Approach & Rationale — Why AI is the right solution (vs. rules or traditional code). Which approach: LLM API, RAG, fine-tuning, agent workflow.
3. Evaluation Criteria — The most important new section. Golden dataset, metrics (accuracy, hallucination rate, latency, cost-per-query), minimum thresholds for launch.
4. Model & Infrastructure — Model selection rationale, expected volume, cost projection.
5. User Experience — How AI output is presented. Confidence indicators, fallback behavior, feedback mechanism (thumbs up/down, regenerate).
6. Risk & Mitigation — Failure modes, guardrails, bias considerations, privacy.
7. Success Metrics & Iteration Plan — Launch metrics, post-launch monitoring, improvement cadence.
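To make section 3 concrete, here is a minimal sketch of a launch gate over a golden dataset. Every metric name and threshold is an illustrative assumption that a real PRD would pin down explicitly; none comes from a specific product.

```python
# Illustrative launch gate for section 3 (Evaluation Criteria).
# Thresholds are assumptions, not recommendations.
THRESHOLDS = {
    "accuracy": 0.90,            # min share of factually correct outputs
    "hallucination_rate": 0.02,  # max share of outputs with invented facts
    "p95_latency_s": 3.0,        # max 95th-percentile latency in seconds
    "cost_per_query_usd": 0.01,  # max average cost per request
}

# For these metrics, lower is better, so the comparison flips.
LOWER_IS_BETTER = {"hallucination_rate", "p95_latency_s", "cost_per_query_usd"}

def failing_metrics(measured):
    """Return the metrics that miss their PRD threshold (empty list = ship)."""
    failures = []
    for metric, limit in THRESHOLDS.items():
        value = measured[metric]
        ok = value <= limit if metric in LOWER_IS_BETTER else value >= limit
        if not ok:
            failures.append(metric)
    return failures
```

The point of writing the gate down this way is that "launch readiness" stops being a debate and becomes a lookup: either the failure list is empty or it names exactly what to fix.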
Prompts as product specifications
In LLM-based features, the system prompt is effectively the product specification. This is a paradigm shift:
- Prompt changes are product changes — they need review, testing, and versioning
- A/B testing prompts is equivalent to A/B testing features
- Prompt regression testing (running evals after prompt changes) replaces traditional regression testing
Google treats prompts as code artifacts: version control, mandatory review, eval suites on changes. Anthropic recommends clearly defining the assistant’s role, constraints, and output format in the system prompt.
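One way to operationalize "prompts as code" is a regression gate that any prompt change must clear before merge. The sketch below is a simplified illustration; the golden set, the `summarize` callable, and the scoring rule are all hypothetical, not Google's or Anthropic's actual tooling.

```python
# Illustrative prompt regression test: a changed prompt must score at least
# as well as the current prompt on the same golden set. Names are hypothetical.

GOLDEN_SET = [
    {"ticket": "App crashes on login since v2.3", "must_mention": "crash"},
    {"ticket": "Refund requested for double charge", "must_mention": "refund"},
]

def run_eval(summarize, golden_set):
    """Score a candidate prompt's summarizer: share of cases whose summary
    mentions the key fact a support agent would need."""
    hits = sum(
        1 for case in golden_set
        if case["must_mention"] in summarize(case["ticket"]).lower()
    )
    return hits / len(golden_set)

def approve_prompt_change(summarize, baseline_score, golden_set=GOLDEN_SET):
    """Block the change if it regresses below the current prompt's score."""
    return run_eval(summarize, golden_set) >= baseline_score
```

In practice `summarize` would call the model with the candidate prompt, and the eval would use richer judges than substring matching, but the workflow is the same: prompt change, eval run, compare against baseline, then merge or reject.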
Framework
When to use which PRD format:
| Feature type | PRD format | Rationale |
|---|---|---|
| Deterministic output | Traditional PRD | No probabilistic behavior |
| AI with binary classification | Hybrid PRD (traditional + eval section) | Output is yes/no, but accuracy varies |
| Text/image generation | Full AI PRD | Probabilistic, complex outputs |
| Agent workflow | Full AI PRD + agent architecture | Multiple AI components, complex failure modes |
Golden rule: As soon as an AI component is involved, the PRD needs evaluation criteria — even if the rest stays traditional.
AI PRD Template
The following template is a starting point — adapt it to your product and organization.
Scenario: Duolingo Max — Writing an AI PRD for Language Learning with GPT-4
Early 2023. OpenAI releases GPT-4. Duolingo — the world’s largest language learning app with 500+ million registered users — sees a once-in-a-generation opportunity: AI-powered conversations could solve the biggest unsolved problem in language learning. Users practice vocabulary and grammar, but almost nobody has real conversations. An LLM could change that.
The team moves fast. Within weeks, “Duolingo Max” takes shape — a new premium tier at $30/month featuring two GPT-4-powered capabilities.
The facts:
- Roleplay: Conversational practice with an AI partner in realistic scenarios (ordering coffee, asking for directions)
- Explain My Answer: AI explains why an answer was right or wrong — personalized, not pulled from a database
- Tech stack: GPT-4 API combined with Duolingo’s proprietary “Birdbrain” ML model that tracks each user’s language competence
- Launch: March 2023, initially for Spanish and French on iOS
- Result: Only ~5% of paying subscribers upgraded to Max
- Margin impact: ~120 basis points of margin loss from GPT-4 API costs
- Pivot: Duolingo Max was later rebranded to “Duolingo Pro” — AI features were folded into all paid tiers rather than kept as a premium add-on
The question: Imagine you’re writing the AI PRD for Duolingo Max before launch. What evaluation criteria would you define for “good conversation”? What cost ceiling would you set? And how do you measure whether Roleplay actually helps users learn a language?
Decide
What happened with Duolingo Max — and what could the PRD have prevented?
What happened: Duolingo Max shipped fast, driven by competitive pressure and GPT-4’s availability. The features worked technically. But the unit economics didn’t: GPT-4 inference costs were high, adoption was low at ~5% of subscribers, and $30/month exceeded most users’ willingness to pay. Duolingo later had to overhaul its entire pricing approach, folding AI features into all paid tiers.
What an AI PRD should have flagged — applied to the template from this lesson:
3. Evaluation Criteria — the missing core question: How do you measure whether an AI conversation was “good”? Traditional NLP metrics (BLEU, ROUGE) measure text similarity — not whether a learner actually learned something. Duolingo needed proxy metrics:
- Engagement: How many Roleplay sessions per week? How long?
- Retention: Do Roleplay users come back more frequently?
- Learning progress: Do users improve faster on Birdbrain competence scores?
- User satisfaction: NPS or qualitative feedback after sessions
Without defining these criteria upfront, there was no baseline. The team couldn’t answer: “Is this feature worth the premium?”
4. Model & Infrastructure — the missing cost ceiling: GPT-4 was one of the most expensive LLMs in early 2023. An AI PRD should have included a clear cost projection: cost per Roleplay session, maximum share of subscription revenue, break-even at what adoption rate. The ~120 basis points of margin loss suggest this calculation was either not done or not taken seriously.
5. User Experience — the pricing disconnect: $30/month for two AI features — on top of the existing subscription. The PRD should have forced the question: what perceived value do these features deliver? Users pay for outcomes (learning a language), not for technology (GPT-4).
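The missing cost calculation from section 4 is simple enough to sketch. All numbers below except the $30/month price (stated in the scenario) are illustrative assumptions, not Duolingo’s actual figures.

```python
# Back-of-the-envelope cost ceiling check for section 4 (Model & Infrastructure).
# Only price_per_month comes from the scenario; the rest are assumptions.

price_per_month = 30.00        # Duolingo Max subscription price
cost_per_session = 0.25        # ASSUMED GPT-4 cost per Roleplay session
sessions_per_user_month = 40   # ASSUMED heavy-usage scenario

inference_cost = cost_per_session * sessions_per_user_month  # $ per user per month
revenue_share = inference_cost / price_per_month             # share of revenue consumed

# A PRD cost ceiling turns this into a binary launch question:
COST_CEILING_SHARE = 0.20  # e.g. "inference may consume at most 20% of revenue"
within_ceiling = revenue_share <= COST_CEILING_SHARE
```

Under these assumptions, inference eats a third of the subscription price for a heavy user and blows through the ceiling — exactly the kind of result a PRD review should surface before launch, not after the margin hit lands in an earnings call.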
The core lesson: Duolingo had the technology and the users. What was missing was PRD discipline: clear eval criteria for learning quality, a cost ceiling for inference costs, and realistic pricing validation before launch.
Reflect
- Evaluation criteria are the hardest PM work on AI features: The central challenge with Duolingo Max wasn’t technical — GPT-4 could hold conversations just fine. The challenge was defining what a “good” conversation means in the context of language learning. Without that definition, every downstream decision lacked a foundation.
- Cost ceilings belong in the PRD, not in the retrospective: Duolingo’s ~120 basis points of margin loss wasn’t a surprising outcome — it was a predictable consequence of missing cost planning. An AI PRD with a clear cost projection would have forced earlier exploration of alternative models or pricing strategies.
- “Ship fast” is not a substitute for “define right”: The competitive pressure from GPT-4’s release was real. But speed without eval criteria produces a product you can’t measure, can’t improve, and can’t defend — as the later pivot from Max to Pro demonstrated.
- Without predefined quality thresholds, there is no baseline for improvement: An AI PRD is not a traditional PRD with “AI” in the title. The evaluation section changes the entire document — it forces you to define what “good enough” means before you build.
Sources: Duolingo Earnings Calls Q1-Q3 2023, Duolingo Engineering Blog (“How Duolingo Uses AI”), The Verge — “Duolingo Max Review” (2023), Stratechery — “Duolingo and the AI Opportunity” (2023), OpenAI GPT-4 Technical Report (2023)