Writing AI PRDs
Context
Your team wants to build an AI feature: automatic summaries for support tickets. The engineering lead asks for the PRD. You open your usual template — problem statement, user stories, acceptance criteria, launch timeline — and realize something is missing.
What does “acceptance criteria” mean for a feature that gives a different answer every time? When is a summary “good enough”? And who decides — a QA engineer with a checklist or an eval dataset with 500 examples?
Traditional PRDs define deterministic behavior: given input X, the system produces output Y. AI PRDs must define a quality range for probabilistic outputs: given input X, the system produces outputs that meet quality threshold Z at least P% of the time.
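That acceptance criterion can be made concrete in a few lines. The sketch below is illustrative only: the judge function and both thresholds are assumptions, not taken from any real ticket-summary system.

```python
# Illustrative probabilistic acceptance check for an AI feature.
# `score_summary` stands in for a quality judge (a human rubric or an
# LLM-as-judge) that returns a score in [0, 1] for one model output.

QUALITY_THRESHOLD = 0.8    # Z: minimum score for a single output to count as "good"
REQUIRED_PASS_RATE = 0.95  # P: share of eval cases that must meet Z

def meets_acceptance(outputs, score_summary):
    """Accept the feature if at least P% of outputs score at or above Z."""
    passes = sum(1 for o in outputs if score_summary(o) >= QUALITY_THRESHOLD)
    return passes / len(outputs) >= REQUIRED_PASS_RATE
```

Note that the check is over a population of outputs, not a single run: any individual output may fall below Z without failing the feature, which is exactly what a traditional pass/fail test cannot express.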
Concept
What makes an AI PRD different
The biggest difference: an AI PRD contains an evaluation section that restructures the entire document.
| Element | Traditional PRD | AI PRD |
|---|---|---|
| Requirements | Exact behavior specification | Quality thresholds + eval criteria |
| Success criteria | Feature works yes/no | Accuracy, latency, cost targets per use case |
| Edge cases | Enumerate and handle each | Failure modes and graceful degradation |
| Testing | Pass/fail test cases | Eval datasets, benchmarks, human judgment |
| Acceptance | QA sign-off | Eval metrics above threshold + human review |
The 7 sections of an AI PRD
1. Problem Statement & User Context — Same as a traditional PRD, but quantify the manual effort the AI will replace.
2. AI Approach & Rationale — Why AI is the right solution (vs. rules or traditional code). Which approach: LLM API, RAG, fine-tuning, agent workflow.
3. Evaluation Criteria — The most important new section. Golden dataset, metrics (accuracy, hallucination rate, latency, cost-per-query), minimum thresholds for launch.
4. Model & Infrastructure — Model selection rationale, expected volume, cost projection.
5. User Experience — How AI output is presented. Confidence indicators, fallback behavior, feedback mechanism (thumbs up/down, regenerate).
6. Risk & Mitigation — Failure modes, guardrails, bias considerations, privacy.
7. Success Metrics & Iteration Plan — Launch metrics, post-launch monitoring, improvement cadence.
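To make section 3 concrete, here is a minimal sketch of a launch gate over a golden dataset. Every metric name and threshold is an illustrative assumption that a real PRD would pin down explicitly; none comes from a specific product.

```python
# Illustrative launch gate for section 3 (Evaluation Criteria).
# Thresholds are assumptions, not recommendations.
THRESHOLDS = {
    "accuracy": 0.90,            # min share of factually correct outputs
    "hallucination_rate": 0.02,  # max share of outputs with invented facts
    "p95_latency_s": 3.0,        # max 95th-percentile latency in seconds
    "cost_per_query_usd": 0.01,  # max average cost per request
}

# For these metrics, lower is better, so the comparison flips.
LOWER_IS_BETTER = {"hallucination_rate", "p95_latency_s", "cost_per_query_usd"}

def failing_metrics(measured):
    """Return the metrics that miss their PRD threshold (empty list = ship)."""
    failures = []
    for metric, limit in THRESHOLDS.items():
        value = measured[metric]
        ok = value <= limit if metric in LOWER_IS_BETTER else value >= limit
        if not ok:
            failures.append(metric)
    return failures
```

The point of writing the gate down this way is that "launch readiness" stops being a debate and becomes a lookup: either the failure list is empty or it names exactly what to fix.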
Prompts as product specifications
In LLM-based features, the system prompt is effectively the product specification. This is a paradigm shift:
- Prompt changes are product changes — they need review, testing, and versioning
- A/B testing prompts is equivalent to A/B testing features
- Prompt regression testing (running evals after prompt changes) replaces traditional regression testing
Google treats prompts as code artifacts: version control, mandatory review, eval suites on changes. Anthropic recommends clearly defining the assistant’s role, constraints, and output format in the system prompt.
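One way to operationalize "prompts as code" is a regression gate that any prompt change must clear before merge. The sketch below is a simplified illustration; the golden set, the `summarize` callable, and the scoring rule are all hypothetical, not Google's or Anthropic's actual tooling.

```python
# Illustrative prompt regression test: a changed prompt must score at least
# as well as the current prompt on the same golden set. Names are hypothetical.

GOLDEN_SET = [
    {"ticket": "App crashes on login since v2.3", "must_mention": "crash"},
    {"ticket": "Refund requested for double charge", "must_mention": "refund"},
]

def run_eval(summarize, golden_set):
    """Score a candidate prompt's summarizer: share of cases whose summary
    mentions the key fact a support agent would need."""
    hits = sum(
        1 for case in golden_set
        if case["must_mention"] in summarize(case["ticket"]).lower()
    )
    return hits / len(golden_set)

def approve_prompt_change(summarize, baseline_score, golden_set=GOLDEN_SET):
    """Block the change if it regresses below the current prompt's score."""
    return run_eval(summarize, golden_set) >= baseline_score
```

In practice `summarize` would call the model with the candidate prompt, and the eval would use richer judges than substring matching, but the workflow is the same: prompt change, eval run, compare against baseline, then merge or reject.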
Framework
When to use which PRD format:
| Feature type | PRD format | Rationale |
|---|---|---|
| Deterministic output | Traditional PRD | No probabilistic behavior |
| AI with binary classification | Hybrid PRD (traditional + eval section) | Output is yes/no, but accuracy varies |
| Text/image generation | Full AI PRD | Probabilistic, complex outputs |
| Agent workflow | Full AI PRD + agent architecture | Multiple AI components, complex failure modes |
Golden rule: As soon as an AI component is involved, the PRD needs evaluation criteria — even if the rest stays traditional.
AI PRD Template
The following template is a starting point — adapt it to your product and organization.
Scenario: Duolingo Max — Writing an AI PRD for Language Learning with GPT-4
Early 2023. OpenAI releases GPT-4. Duolingo — the world’s largest language learning app with 500+ million registered users — sees a once-in-a-generation opportunity: AI-powered conversations could solve the biggest unsolved problem in language learning. Users practice vocabulary and grammar, but almost nobody has real conversations. An LLM could change that.
The team moves fast. Within weeks, “Duolingo Max” takes shape — a new premium tier at $30/month featuring two GPT-4-powered capabilities.
The facts:
- Roleplay: Conversational practice with an AI partner in realistic scenarios (ordering coffee, asking for directions)
- Explain My Answer: AI explains why an answer was right or wrong — personalized, not pulled from a database
- Tech stack: GPT-4 API combined with Duolingo’s proprietary “Birdbrain” ML model that tracks each user’s language competence
- Launch: March 2023, initially for Spanish and French on iOS
- Result: Only ~5% of paying subscribers upgraded to Max
- Margin impact: ~120 basis points of margin loss from GPT-4 API costs
- Pivot: Duolingo Max was later rebranded to “Duolingo Pro” — AI features were folded into all paid tiers rather than kept as a premium add-on
The question: Imagine you’re writing the AI PRD for Duolingo Max before launch. What evaluation criteria would you define for “good conversation”? What cost ceiling would you set? And how do you measure whether Roleplay actually helps users learn a language?
Decide
What happened with Duolingo Max — and what could the PRD have prevented?
What happened: Duolingo Max shipped fast, driven by competitive pressure and GPT-4’s availability. The features worked technically. But the unit economics didn’t: GPT-4 inference costs were high, adoption was low at ~5% of subscribers, and $30/month exceeded most users’ willingness to pay. Duolingo later had to overhaul its entire pricing approach, folding AI features into all paid tiers.
What an AI PRD should have flagged — applied to the template from this lesson:
3. Evaluation Criteria — the missing core question: How do you measure whether an AI conversation was “good”? Traditional NLP metrics (BLEU, ROUGE) measure text similarity — not whether a learner actually learned something. Duolingo needed proxy metrics:
- Engagement: How many Roleplay sessions per week? How long?
- Retention: Do Roleplay users come back more frequently?
- Learning progress: Do users improve faster on Birdbrain competence scores?
- User satisfaction: NPS or qualitative feedback after sessions
Without defining these criteria upfront, there was no baseline. The team couldn’t answer: “Is this feature worth the premium?”
4. Model & Infrastructure — the missing cost ceiling: GPT-4 was one of the most expensive LLMs in early 2023. An AI PRD should have included a clear cost projection: cost per Roleplay session, maximum share of subscription revenue, break-even at what adoption rate. The ~120 basis points of margin loss suggest this calculation was either not done or not taken seriously.
5. User Experience — the pricing disconnect: $30/month for two AI features — on top of the existing subscription. The PRD should have forced the question: what perceived value do these features deliver? Users pay for outcomes (learning a language), not for technology (GPT-4).
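The missing cost calculation from section 4 is simple enough to sketch. All numbers below except the $30/month price (stated in the scenario) are illustrative assumptions, not Duolingo’s actual figures.

```python
# Back-of-the-envelope cost ceiling check for section 4 (Model & Infrastructure).
# Only price_per_month comes from the scenario; the rest are assumptions.

price_per_month = 30.00        # Duolingo Max subscription price
cost_per_session = 0.25        # ASSUMED GPT-4 cost per Roleplay session
sessions_per_user_month = 40   # ASSUMED heavy-usage scenario

inference_cost = cost_per_session * sessions_per_user_month  # $ per user per month
revenue_share = inference_cost / price_per_month             # share of revenue consumed

# A PRD cost ceiling turns this into a binary launch question:
COST_CEILING_SHARE = 0.20  # e.g. "inference may consume at most 20% of revenue"
within_ceiling = revenue_share <= COST_CEILING_SHARE
```

Under these assumptions, inference eats a third of the subscription price for a heavy user and blows through the ceiling — exactly the kind of result a PRD review should surface before launch, not after the margin hit lands in an earnings call.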
The core lesson: Duolingo had the technology and the users. What was missing was PRD discipline: clear eval criteria for learning quality, a cost ceiling for inference costs, and realistic pricing validation before launch.
Reflect
- Evaluation criteria are the hardest PM work on AI features: The central challenge with Duolingo Max wasn’t technical — GPT-4 could hold conversations just fine. The challenge was defining what a “good” conversation means in the context of language learning. Without that definition, every downstream decision lacked a foundation.
- Cost ceilings belong in the PRD, not in the retrospective: Duolingo’s ~120 basis points of margin loss wasn’t a surprising outcome — it was a predictable consequence of missing cost planning. An AI PRD with a clear cost projection would have forced earlier exploration of alternative models or pricing strategies.
- “Ship fast” is not a substitute for “define right”: The competitive pressure from GPT-4’s release was real. But speed without eval criteria produces a product you can’t measure, can’t improve, and can’t defend — as the later pivot from Max to Pro demonstrated.
- Without predefined quality thresholds, there is no baseline for improvement: An AI PRD is not a traditional PRD with “AI” in the title. The evaluation section changes the entire document — it forces you to define what “good enough” means before you build.
Sources: Duolingo Earnings Calls Q1-Q3 2023, Duolingo Engineering Blog (“How Duolingo Uses AI”), The Verge — “Duolingo Max Review” (2023), Stratechery — “Duolingo and the AI Opportunity” (2023), OpenAI GPT-4 Technical Report (2023)