
Cost/Quality Tradeoffs

Your AI feature has been live for two months. Usage is growing — but so is the API bill. Last month: $14,000. Forecast for next month: $21,000. Your CFO asks: “Every other software feature costs us practically nothing per user interaction. Why is AI so expensive?”

The answer: AI features have marginal cost per use. Every API call costs money. This is fundamentally different from traditional SaaS, where marginal cost per user interaction is near zero. This “AI tax” changes unit economics, pricing, and margin calculations — and PMs need to understand it from day one.

The fundamental cost driver: Everything in LLMs is measured in tokens. Pricing, budgeting, and optimization all revolve around token consumption.

Key pricing dynamics (2026):

  • Output tokens cost 2-5x more than input tokens across all major providers (generation requires more compute than input processing)
  • Cached input tokens cost 0.1x the base rate (Anthropic) or are free (some providers)
  • Reasoning tokens (internal chain-of-thought in o-series models) are billed as output tokens but invisible to the user — a hidden cost multiplier
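These mechanics can be expressed as a small cost function. The default rates below are illustrative placeholders (roughly the $3/$15-per-1M-token shape used in the examples later), not any provider's current price list:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 input_rate=3.00, output_rate=15.00, cache_discount=0.10):
    """Cost in USD for one request; rates are per 1M tokens.

    Illustrative defaults: output billed 5x input, cached input
    reads billed at 0.1x the base input rate. Reasoning tokens,
    where present, would be added to output_tokens.
    """
    fresh = input_tokens - cached_tokens  # uncached portion of the prompt
    cost = fresh * input_rate / 1e6
    cost += cached_tokens * input_rate * cache_discount / 1e6
    cost += output_tokens * output_rate / 1e6
    return cost

# Same request with and without an 800-token cached prompt prefix:
print(round(request_cost(2_000, 500), 4))                      # 0.0135
print(round(request_cost(2_000, 500, cached_tokens=800), 4))   # 0.0113
```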

LLMflation — the price trend: Per a16z's “LLMflation” analysis (2024, based on token prices from major providers in 2023-2024; the trend continues but varies by provider and model class), LLM inference costs have declined approximately 10x annually:

  • GPT-4-equivalent performance: $20/1M tokens (late 2022) to $0.40/1M tokens (2025)
  • PM implication: features that are uneconomical today may be viable in 6-12 months

Example 1: AI customer support bot

  • 100,000 conversations/month, averaging 2,000 input tokens + 500 output tokens
  • With Claude Sonnet 4.6: input $600 + output $750 = $1,350/month
  • With Gemini 2.5 Flash: input $30 + output $30 = $60/month
  • 22.5x cost factor for the premium model

Example 2: AI-powered search (RAG)

  • 500,000 queries/month, each with 500-token query + 2,000-token context + 300-token response
  • With Gemini 2.5 Flash: input $187 + output $90 + embedding $5 + vector DB $200 = ~$483/month
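A rough sketch of the arithmetic behind both examples, with per-1M-token rates assumed from the figures above (Sonnet $3/$15 input/output, Flash $0.15/$0.60):

```python
def monthly_cost(requests, in_tok, out_tok, in_rate, out_rate, fixed=0.0):
    """Monthly API cost in USD. Rates are per 1M tokens; `fixed` covers
    non-token line items (embeddings, vector DB, etc.)."""
    return requests * (in_tok * in_rate + out_tok * out_rate) / 1e6 + fixed

# Example 1: support bot, 100k conversations/month
sonnet = monthly_cost(100_000, 2_000, 500, 3.00, 15.00)   # 1350.0
flash  = monthly_cost(100_000, 2_000, 500, 0.15, 0.60)    # 60.0
print(round(sonnet / flash, 1))                            # 22.5

# Example 2: RAG search, 500k queries/month, 2,500 input tokens each,
# plus ~$205/month fixed (embeddings + vector DB)
rag = monthly_cost(500_000, 2_500, 300, 0.15, 0.60, fixed=205.0)
print(round(rag, 1))                                       # 482.5
```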

Six levers bring these costs down. In order of typical impact:

1. Model routing (biggest lever): Route each request to the cheapest model capable of handling it. 70-80% of traffic goes to the fast/cheap tier, the rest to a frontier model. Savings: 5-10x.
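A minimal routing sketch. The tier names and the keyword heuristic are hypothetical; production routers typically use a small classifier model or confidence scores rather than string matching:

```python
# Hypothetical model tiers for illustration.
CHEAP, FRONTIER = "fast-tier-model", "frontier-model"

def route(request: str) -> str:
    """Send simple requests to the cheap tier, the rest to frontier."""
    complex_markers = ("analyze", "multi-step", "legal", "code review")
    if len(request) > 2_000 or any(m in request.lower() for m in complex_markers):
        return FRONTIER
    return CHEAP

print(route("Classify this ticket: login button broken"))    # fast-tier-model
print(route("Analyze this contract for liability clauses"))  # frontier-model
```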

2. Prompt caching: Cache static prompt portions (system prompts, few-shot examples). Anthropic: cached reads cost 0.1x. Up to 73% cost reduction for repetitive workloads (Redis LangCache benchmark).
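The savings can be estimated from the 0.1x cached-read rate quoted above. This sketch covers input-side cost only and ignores the cache-write surcharge some providers apply:

```python
def cached_input_cost(requests, prompt_tokens, cached_tokens,
                      input_rate=3.00, cache_discount=0.10):
    """Monthly input-side cost with a cached prompt prefix (rates per 1M tokens)."""
    fresh = prompt_tokens - cached_tokens
    per_req = (fresh * input_rate + cached_tokens * input_rate * cache_discount) / 1e6
    return requests * per_req

base   = cached_input_cost(100_000, 2_000, 0)       # no caching: $600
cached = cached_input_cost(100_000, 2_000, 1_500)   # 1,500-token cached prefix: $195
print(round(base, 2), round(cached, 2))
```

With three quarters of the prompt cached, input cost drops by about two thirds, which is in the range of the benchmark figure above.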

3. Output length control: Set max_tokens to the minimum needed. Use structured output (JSON) instead of verbose prose. Output tokens cost 2-5x more than input — every unnecessary output token is expensive.

4. Batching: Group multiple requests into batch API calls (OpenAI and Anthropic). Typically 50% cost reduction for non-real-time workloads. Trade-off: higher latency.

5. Token reduction: Compress prompts, summarize conversation history instead of sending full transcripts, use embeddings for retrieval instead of stuffing everything into context.
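One way to sketch history summarization: keep the last few turns verbatim and replace everything older with a summary. The `summarizer` argument is a placeholder where a cheap-model call would go in practice:

```python
def compact_history(turns, keep_last=4, summarizer=None):
    """Replace older turns with a summary; keep recent turns verbatim."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summarize = summarizer or (lambda text: text[:200])  # stand-in for a model call
    summary = summarize(" ".join(older))
    return [f"[Summary of earlier conversation] {summary}"] + recent

history = [f"turn {i}" for i in range(10)]
print(len(compact_history(history)))  # 5: one summary line + 4 recent turns
```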

6. Self-hosting open models: Break-even vs. API typically at 40+ GPU-hours/week sustained usage. Midjourney case study: migrated from NVIDIA A100/H100 to TPU v6e, reducing monthly inference from $2.1M to under $700K. Only viable at significant scale with dedicated MLOps capability.
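A back-of-envelope break-even, assuming a flat GPU rental rate (the $2.50/GPU-hour figure is hypothetical; a real comparison must also include MLOps headcount, utilization gaps, and quality differences between hosted and open models):

```python
def self_host_breakeven_hours(api_cost_per_month, gpu_hour_rate, fixed_monthly=0.0):
    """GPU-hours/month at which self-hosting spend matches the API bill."""
    return (api_cost_per_month - fixed_monthly) / gpu_hour_rate

# Assumed: $3,000/month API bill vs. $2.50/GPU-hour rental
print(self_host_breakeven_hours(3_000, 2.50))  # 1200.0 GPU-hours/month
```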

| Line item | Calculation | Example |
|---|---|---|
| Revenue per user/month | Subscription or usage fee | $20/user/month |
| AI cost per user/month | Avg queries × tokens per query × price per token | $0.50-$5.00/user/month |
| AI cost as % of revenue | AI cost / revenue | 2.5-25% |

Healthy benchmarks: AI inference cost should be less than 10% of the feature’s revenue contribution. Above 20%: optimize (routing, caching) or adjust pricing. For freemium products: free-tier AI costs must be covered by paid conversion.

| Quality level | Typical approach | Use case |
|---|---|---|
| “Good enough” (80%) | Small model, zero-shot | Autocomplete, classification, simple extraction |
| “High quality” (90%) | Mid-tier model, few-shot + RAG | Customer support, document analysis |
| “Near-perfect” (95%+) | Frontier model, CoT + RAG + human review | Medical, legal, financial — high stakes |

The diminishing returns curve: Going from 80% to 90% quality costs roughly 3x. From 90% to 95% costs roughly 10x. From 95% to 99% costs roughly 50x. PMs must define “good enough” before engineering starts optimizing.

Cost optimization — in this order (highest ROI first):

| Priority | Lever | Expected savings | Effort |
|---|---|---|---|
| 1 | Model routing | 5-10x | Medium (routing logic + testing) |
| 2 | Prompt caching | 50-90% on cached portions | Low (configuration) |
| 3 | Output length control | 20-50% | Low (max_tokens + structured output) |
| 4 | Batching | 50% for non-real-time | Low (API switch) |
| 5 | Prompt compression | 10-30% | Low (prompt optimization) |
| 6 | Self-hosting | Variable, only at scale | High (infrastructure + MLOps) |

AI feature P&L check:

  • AI cost below 10% of feature revenue: Healthy
  • AI cost 10-20%: Needs optimization, still viable
  • AI cost above 20%: Optimize immediately or adjust pricing
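The thresholds above as a checklist function:

```python
def ai_cost_health(monthly_ai_cost, monthly_feature_revenue):
    """Classify AI spend against the P&L benchmarks above."""
    ratio = monthly_ai_cost / monthly_feature_revenue
    if ratio < 0.10:
        return "healthy"
    if ratio <= 0.20:
        return "needs optimization, still viable"
    return "optimize immediately or adjust pricing"

print(ai_cost_health(5_000, 100_000))   # healthy (5% of revenue)
print(ai_cost_health(25_000, 100_000))  # optimize immediately or adjust pricing
```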

You’re a PM at a content marketing SaaS (B2B, 3,000 customers). Your AI feature: automatic blog post generation. Currently you use Claude Sonnet 4.6 for all requests.

The situation:

  • 60,000 blog posts/month generated
  • Average 1,500 input tokens (briefing) + 3,000 output tokens (post)
  • Current monthly cost: input (90M tokens x $3/1M) = $270 + output (180M tokens x $15/1M) = $2,700 = $2,970/month
  • Subscription price: $49/user/month, average 20 posts per user
  • 45% of generated posts are “Quick Drafts” (bullet-point summaries, 200 words)
  • 40% are “Standard Posts” (800 words, SEO-optimized)
  • 15% are “Deep Dives” (2,000+ words, research-intensive)

Options:

  1. Keep the status quo: $2,970/month, same model for everything
  2. Model routing: Quick Drafts on Gemini 2.5 Flash, Standard on Claude Sonnet, Deep Dives on Claude Sonnet with Extended Thinking
  3. Model routing + caching: Like Option 2, plus prompt caching for system prompts and recurring briefing templates

How would you decide?

The best decision: Option 3 — Model routing + caching.

Why:

  • Quick Drafts on Flash (45% of volume): 27,000 posts x (1,500 x $0.15/1M + 800 x $0.60/1M) = ~$19/month. Vs. currently ~$1,335 for the same share on Sonnet. Quality for bullet points is sufficient on Flash
  • Standard on Sonnet (40%): 24,000 posts stay on Sonnet = ~$1,188/month. Sonnet quality is justified here
  • Deep Dives with Extended Thinking (15%): 9,000 posts x higher cost = ~$670/month. Better quality for premium content
  • Total cost Option 2: ~$1,877/month — 37% savings
  • Prompt caching on top: System prompts (SEO rules, style guide, brand voice) are sent with every request. Caching reduces these costs by 90%. With 60,000 requests with 800-token system prompts, that saves another ~$200/month
  • Total cost Option 3: ~$1,650/month — 44% savings vs. status quo
  • Unit economics check: $1,650 / 3,000 customers = $0.55/customer/month. At $49 subscription = 1.1% of revenue. Healthy
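The per-option arithmetic can be verified with a short script (rates as assumed above: Sonnet $3/$15 per 1M input/output tokens, Flash $0.15/$0.60; the Deep Dive figure is taken as the ~$670 estimate rather than recomputed):

```python
def cost(posts, in_tok, out_tok, in_rate, out_rate):
    """Monthly cost in USD; rates per 1M tokens."""
    return posts * (in_tok * in_rate + out_tok * out_rate) / 1e6

status_quo = cost(60_000, 1_500, 3_000, 3.00, 15.00)  # $2,970: all posts on Sonnet
quick      = cost(27_000, 1_500,   800, 0.15,  0.60)  # ~$19: Quick Drafts on Flash
standard   = cost(24_000, 1_500, 3_000, 3.00, 15.00)  # $1,188: Standard on Sonnet
deep_dive  = 670.0  # Extended Thinking estimate, taken from above

option2 = quick + standard + deep_dive
print(round(option2), round(1 - option2 / status_quo, 2))  # 1877 0.37
```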

Common mistake: Waiting for LLMflation without actively optimizing. Costs do decline ~10x annually, but usage volume typically grows faster. Without active optimization, costs rise despite falling prices.

  • AI features have marginal cost per use — that’s what fundamentally differentiates them from traditional software. This “AI tax” must be factored into unit economics from day 1, not after the bill arrives.
  • Model routing is the single biggest lever. 70-80% of requests don’t need a frontier model. 5-10x savings are realistic without users noticing any quality drop.
  • Define “good enough” before you optimize. Going from 80% to 90% quality costs 3x, from 90% to 95% costs 10x. The PM’s job is to define the threshold — not to demand maximum quality.
  • LLMflation (10x annual price decline) is real, but it’s no reason to skip optimization. Usage typically grows faster than prices fall.

Sources: a16z LLMflation — LLM Inference Cost Is Going Down Fast, Introl Cost Per Token Analysis, Introl Inference Unit Economics, Redis LLM Token Optimization (2026), Silicon Data LLM Cost Per Token Guide (2026)

Part of AI Learning — free courses from prompt to production.