Prompt Engineering
Context
Section titled “Context”Your AI feature delivers inconsistent results. Sometimes the summary is perfect, sometimes completely off. The engineering team proposes fine-tuning — three weeks of work, $15,000 budget. Your tech lead asks: “Have you optimized the prompt yet?”
Prompt engineering is the fastest, cheapest lever for improving AI output quality. For PMs, it’s more than a technical detail: the prompt is the product specification. It defines behavior, tone, format, and boundaries of your AI feature. If you understand the prompt, you understand the product.
Concept
Section titled “Concept”The Prompting Hierarchy
Section titled “The Prompting Hierarchy”Not every task needs the same technique. The skill is starting with the simplest method and escalating only when needed.
Zero-Shot: Describe only the task, no examples. Works for well-understood tasks like summarization or translation. Always start here — minimal tokens, fastest iteration.
Few-Shot: Provide 1-5 examples of desired input-output pairs. Research shows strong accuracy gains from 1-2 examples, with diminishing returns beyond 4-5. PM trap: adding 10+ examples “just to be safe” wastes tokens and can confuse the model.
Chain-of-Thought (CoT): Instruct the model to reason step by step. Delivers up to a 19-point improvement on complex reasoning tasks according to the original benchmark study (MMLU-Pro). But caution: for reasoning models (o-series, Claude Extended Thinking), explicit CoT is redundant — they already do it internally.
System Prompts: The invisible frame that defines role, constraints, and behavior. System prompts are typically cached (Anthropic charges 0.1x base rate for cached reads) — cost-efficient for repeated use.
Advanced Techniques
Section titled “Advanced Techniques”Structured Output: Force JSON, XML, or YAML as the response format. All major providers support this natively. Essential when AI output feeds into downstream systems (APIs, databases, UI rendering).
Self-Consistency: Run the same prompt multiple times, take the majority answer. Increases accuracy at 3-5x cost — only for high-stakes decisions.
Blended Prompting (current best practice): Combine few-shot + role instruction + format constraints + CoT in a single prompt. Most production prompts use this approach.
Prompt Security — What PMs Need to Know
Section titled “Prompt Security — What PMs Need to Know”Prompts are not protected commands — they are attackable. PMs need to understand the key risks:
- Prompt injection: User input overrides the system instruction (“Ignore all previous instructions and…”). Defense: input sanitization, clearly separated system/user prompts, output validation
- Jailbreaking: Creative circumvention of safety guardrails. No prompt alone protects against this — multiple defense layers needed (input filters, output filters, monitoring)
- Data exfiltration: The model leaks information from the system prompt or context not intended for the user
PM decision: What actions can your AI feature perform? The more autonomy it has (sending emails, modifying data), the more critical prompt security becomes. For high-stakes features, security testing (red teaming, see Chapter 5) belongs in the launch process.
What PMs Get Wrong
Section titled “What PMs Get Wrong”- “Longer prompts = better results.” False. Overly verbose prompts dilute the signal. Concise, specific instructions outperform lengthy ones.
- “Prompt engineering is an engineering task.” Partially false. The prompt defines product behavior — PMs should own prompt design (behavior, constraints, tone), engineers handle integration.
- “One perfect prompt works forever.” False. Model updates change behavior. Prompts need versioning and monitoring like any feature.
Framework
Section titled “Framework”The Complexity-Stakes Matrix:
| Simple task (classification, extraction) | Complex task (reasoning, generation) | |
|---|---|---|
| Low stakes | Zero-shot, temperature 0-0.2 | CoT, temperature 0.3-0.7 |
| High stakes | Few-shot + validation layer | CoT + self-consistency + human review |
Escalation path: Zero-shot, then few-shot, then CoT, then prompt chaining, then self-consistency. Stop at the first level that meets your quality requirements.
| Technique | Token overhead | Latency impact | When to use |
|---|---|---|---|
| Zero-shot | Minimal | Lowest | Always first |
| Few-shot (3 examples) | +200-500 tokens | Low | For specific zero-shot failures |
| Chain-of-thought | +100-2000 tokens output | Medium | Complex reasoning tasks |
| System prompt (cached) | First call: full cost; then: 0.1x | None after first | Always for product features |
| Self-consistency (5 runs) | 5x total cost | 5x latency | High-stakes decisions only |
Scenario
Section titled “Scenario”You’re a PM at a legal tech SaaS (B2B, 2,000 law firms). Your next feature: automated contract clause analysis. The first prototype uses zero-shot and correctly identifies risk levels for 60% of clauses.
The situation:
- Target accuracy: 90%+ (legal use requires high precision)
- Budget: $8,000 for the first iteration
- Volume: 25,000 clauses/month
- Time pressure: feature launch in 4 weeks
- Engineering team proposes fine-tuning (3 weeks, $15,000)
Options:
- Fine-tuning: 3 weeks development, $15,000, requires 500+ labeled examples
- Few-shot + CoT: 3-5 expert analysis examples as templates, force step-by-step reasoning. 2-3 days of work, under $500 in prompt costs
- Blended prompt: System prompt (role: senior lawyer) + 3 few-shot examples + CoT + structured output (JSON with risk level and reasoning). 1 week including testing
Decide
Section titled “Decide”How would you decide?
The best decision: Option 3 — Blended prompt.
Why:
- Follow the escalation path: Zero-shot delivers 60%. Before you start fine-tuning, you must exhaust prompt options. This isn’t optional — it’s best practice
- Cost risk: Fine-tuning for $15,000 on a feature that might reach 90%+ with better prompting is premature. Practical experience across many teams shows: fine-tuning a weaker model often loses to good prompting on a stronger model
- Structured output is critical: JSON with risk level + reasoning makes output downstream-ready (UI rendering, database) and enforces consistent formatting
- Timeline: 1 week instead of 3. If the blended prompt hits 85%, you can add more few-shot examples. If it hits 90%+, fine-tuning is unnecessary
- Expected impact: Practitioner experience (not single studies) shows that few-shot + CoT together can deliver 15-25 percentage point improvement over zero-shot — varies significantly by task
Common mistake: Jumping straight to fine-tuning without exhausting prompting. This costs weeks and thousands of dollars — and fine-tuned output is less flexible than a well-designed prompt.
Reflect
Section titled “Reflect”- The prompt is the product specification. If you don’t understand the prompt, you don’t understand what the AI feature does. PMs must own prompt design — not delegate it.
- Always start with zero-shot and escalate only for measurable failures. Each level costs more tokens and complexity.
- Blended prompting (few-shot + role + CoT + structured output) is the current production standard — not any single technique in isolation.
- Prompts need versioning and monitoring. A prompt that works on GPT-4 may fail on GPT-5.
Sources: DAIR.AI Prompt Engineering Guide, Lakera Prompt Engineering Guide (2026), IBM RAG vs Fine-Tuning vs Prompt Engineering, CodeSignal Prompt Engineering Best Practices (2025), K2View Prompt Engineering Techniques (2026)