
Model Selection

Your head of engineering created a spreadsheet with 15 LLMs, sorted by MMLU score. “We’ll use the best model” is their recommendation. The problem: the #1 model costs 10x as much as #5 — and for your use case (ticket classification), the quality difference is marginal. Your CFO will see the API bill in month two and ask why you’re using the most expensive model for a task that a cheaper one handles just as well.

Model selection is not a one-time technical decision. It’s a product decision that balances quality, cost, latency, and compliance requirements — and it may change quarterly because the model landscape moves so fast.

As of: March 2026. Model pricing changes quarterly. The principles (multi-model routing, your own evals, design for switching) remain stable — specific prices don’t.

The market has consolidated around a few major providers:

| Provider | Frontier model | Strength | Input $/1M tokens | Output $/1M tokens |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-5.2 / GPT-5.4 | General-purpose, strong tooling ecosystem | $1.75-$3.00 | $7.00-$14.00 |
| Anthropic | Claude Opus 4.6 | Writing, safety, agentic coding, 200K context | $5.00 | $25.00 |
| Anthropic | Claude Sonnet 4.6 | Best performance-to-cost ratio | $3.00 | $15.00 |
| Google | Gemini 3 Pro | Multimodal, 1M context, caching | $1.25 | $10.00 |
| Google | Gemini 2.5 Flash | Speed-optimized, cost-efficient | $0.15 | $0.60 |
| Meta | Llama 4 Scout (open) | Open source, 10M context, fast | $0.11 | $0.34 |
| DeepSeek | DeepSeek-V3 / R1 | Aggressive price-performance | $0.07-$0.55 | $0.28-$2.19 |

Reasoning models (separate tier): OpenAI o3/o4-mini, Claude Extended Thinking, Gemini Thinking Mode. Deep reasoning, higher cost, higher latency.
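
To make the price spread concrete, it helps to look at cost per request rather than per million tokens. A quick sketch at the table's rates; the 1,000-input/500-output token counts are illustrative assumptions, not benchmarks:

```python
# Per-request cost at the table's rates, assuming 1,000 input and 500
# output tokens per request (illustrative token counts).
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-2.5-flash":  (0.15, 0.60),
}

def per_request_cost(model: str, tokens_in: int = 1_000, tokens_out: int = 500) -> float:
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

print(per_request_cost("claude-sonnet-4.6"))  # 0.0105  -> $0.0105/request
print(per_request_cost("gemini-2.5-flash"))   # 0.00045 -> ~23x cheaper
```

Fractions of a cent either way, which is exactly why the difference only becomes visible at volume: multiply by millions of requests per month and the gap is the CFO conversation from the intro.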

What PMs need to know about benchmarks:

  • MMLU/MMLU-Pro: General knowledge across 57+ domains. The most widely cited benchmark, but increasingly saturated at the top
  • SWE-bench: Real-world software engineering tasks (fixing GitHub issues). More meaningful than HumanEval for production coding assessment
  • ARC-AGI: Abstract reasoning. Tests pattern recognition that humans find easy but LLMs find hard
  • LMArena (formerly LMSYS Chatbot Arena): Community-based Elo ranking through blind user votes. Closer to real user experience than automated benchmarks — but susceptible to style-over-substance bias (longer, prettier answers get preferred)

PM caveat: Benchmarks measure model capabilities under controlled conditions. They do not measure performance on YOUR specific task. A model ranked #1 on MMLU may underperform on your customer support use case. Always evaluate on your own data.

The emerging production pattern: use different models for different tasks rather than one model for everything.

Tiered routing architecture:

  • Fast/cheap tier (Gemini Flash, GPT-4o-mini, Llama): Simple classification, extraction, formatting — 70-80% of requests
  • Strong tier (Claude Sonnet, GPT-4o, Gemini Pro): Complex reasoning, generation, analysis — 15-25% of requests
  • Reasoning tier (o3, Claude Extended Thinking): Multi-step reasoning, research — 1-5% of requests

Cost impact: Multi-model routing can reduce costs 5-10x compared to sending everything to the frontier model.
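
The router itself can be deliberately simple. A minimal sketch of the tiered routing above; the task taxonomy and the routing heuristic are assumptions, and in production the signal would typically come from a lightweight classifier or request metadata:

```python
# Minimal sketch of tiered routing: each request is mapped to a model
# tier by task type. Task types and routing rules are illustrative
# assumptions; model names follow the tiers described above.

TIER_MODELS = {
    "fast":      "gemini-2.5-flash",   # ~70-80% of requests
    "strong":    "claude-sonnet-4.6",  # ~15-25% of requests
    "reasoning": "o3",                 # ~1-5% of requests
}

def route(task_type: str) -> str:
    """Pick a model for a request based on its (assumed) task type."""
    if task_type in {"classify", "extract", "format"}:
        return TIER_MODELS["fast"]
    if task_type in {"generate", "analyze", "summarize"}:
        return TIER_MODELS["strong"]
    return TIER_MODELS["reasoning"]  # multi-step reasoning, research
```

The point of the sketch: the routing logic is a handful of lines, and the savings come from the request distribution, not from clever code.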

Practical Selection Criteria Beyond Benchmarks

| Criterion | Why it matters | How to evaluate |
| --- | --- | --- |
| Task-specific quality | Benchmarks don’t predict your use case | Run 50-100 representative queries, blind-evaluate outputs |
| Latency (TTFT + TPS) | User experience for real-time features | Measure time-to-first-token and tokens-per-second |
| Cost at YOUR volume | 10x price difference between models | Calculate monthly cost at projected volume |
| Data privacy/compliance | Regulatory requirements | Review data processing terms; consider self-hosted open models |
| Ecosystem/tooling | Developer productivity | Function calling, JSON mode, streaming, SDK quality |
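
The latency metrics in the table (TTFT and TPS) are straightforward to measure against any streaming API. A generic sketch; the stream source is whatever your provider's streaming SDK yields, so wiring it up is left to you:

```python
import time

def measure_stream(stream):
    """Measure time-to-first-token (TTFT) and tokens-per-second (TPS)
    over any iterable of streamed chunks. Plug in whatever your
    provider's streaming SDK yields."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens += 1
    total = time.perf_counter() - start
    generation_time = total - (ttft or 0.0)
    tps = tokens / generation_time if generation_time > 0 else float("inf")
    return {"ttft_s": ttft, "tokens": tokens, "tps": tps}
```

Measure both separately: a model can have excellent TPS but a sluggish TTFT, which is what users actually feel in a chat-style UI.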

Model selection in 3 steps:

Step 1 — Define requirements:

| Dimension | Questions |
| --- | --- |
| Quality | What’s “good enough” for this feature? |
| Latency | Real-time (under 2s)? Background processing OK? |
| Cost | Monthly budget at projected volume? |
| Compliance | Data residency? Privacy? Industry regulation? |

Step 2 — Evaluate candidates:

  • Run 50-100 representative queries on 3-4 candidate models
  • Blind-evaluate (remove model names, rate quality 1-5)
  • Measure latency and calculate cost at projected volume
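
The blind part of the evaluation is easy to get wrong: raters recognize models if labels or ordering leak. A small sketch that relabels and shuffles model outputs before rating; model names are simply whatever candidates you ran:

```python
import random

def blind_sheet(outputs: dict, seed: int = 0):
    """Prepare a blind evaluation sheet: outputs keyed by model name are
    shuffled and relabeled A, B, C... so raters can't tell which model
    produced what. Returns (sheet, key) -- keep `key` sealed until all
    ratings are in."""
    rng = random.Random(seed)
    models = list(outputs)
    rng.shuffle(models)
    labels = [chr(ord("A") + i) for i in range(len(models))]
    key = dict(zip(labels, models))  # label -> model, revealed at the end
    sheet = {label: outputs[model] for label, model in key.items()}
    return sheet, key
```

Raters score the sheet by label on the 1-5 scale; only after the scores are collected do you unseal `key` and map ratings back to models.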

Step 3 — Design for switching:

  • Use model-agnostic abstractions (LiteLLM, OpenRouter, or provider SDKs with adapters)
  • Version prompts per model (different models respond differently to the same prompt)
  • Monitor quality metrics continuously
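
What "design for switching" can look like in practice is a config-driven lookup: the model per tier lives in configuration, not code. A sketch; the wiring into LiteLLM/OpenRouter is omitted, and the prompt version labels are invented for illustration:

```python
# Config-driven model switching: swapping a model is a one-line config
# change, not a rewrite. Tier names and prompt version labels below are
# illustrative assumptions.

MODEL_CONFIG = {
    "basic":   "gemini-2.5-flash",   # cheap tier
    "premium": "claude-sonnet-4.6",  # strong tier
}

# Prompts are versioned per model: different models respond differently
# to the same prompt, so each model carries its own tuned prompt version.
PROMPT_VERSIONS = {
    "gemini-2.5-flash":  "basic-v3-flash",
    "claude-sonnet-4.6": "premium-v2-sonnet",
}

def resolve(tier: str) -> tuple:
    """Look up (model, prompt_version) for a tier from config alone."""
    model = MODEL_CONFIG[tier]
    return model, PROMPT_VERSIONS[model]
```

Because the prompt version travels with the model, a model swap automatically pulls in the matching prompt instead of silently reusing one tuned for the old model.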

You’re a PM at an e-commerce SaaS (B2B, 1,200 shops). Your next feature: AI-generated product descriptions. The feature should offer 3 quality tiers: Basic (bullet points), Standard (SEO-optimized), and Premium (storytelling + SEO).

The situation:

  • Volume: 400,000 product descriptions/month
  • Basic: 70% of volume (simple products)
  • Standard: 25% (main catalog)
  • Premium: 5% (flagship products)
  • Budget: $8,000/month for AI costs
  • Requirement: latency under 5 seconds for all tiers
  • Languages: English, German, French

Options:

  1. Single model: Claude Sonnet 4.6 for all tiers. Estimated cost: $22,000/month
  2. Two tiers: Gemini 2.5 Flash for Basic + Standard, Claude Sonnet 4.6 for Premium. Estimated cost: $5,800/month
  3. Three tiers: Llama 4 Scout (via API) for Basic, Gemini 2.5 Flash for Standard, Claude Opus 4.6 for Premium. Estimated cost: $4,200/month

How would you decide?
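
Whichever option you lean toward, start with the arithmetic. A back-of-envelope cost model for the scenario; the per-description token counts below are my assumptions for illustration, so the output will not exactly reproduce the options' quoted estimates:

```python
# Cost model for the scenario: 400,000 descriptions/month, split 70/25/5.
# Prices are the per-1M-token rates from the table above; token counts
# per description are illustrative assumptions.

def monthly_cost(volume: int, mix: dict, models: dict) -> float:
    """mix: tier -> share of volume.
    models: tier -> (input $/1M, output $/1M, input tokens, output tokens)."""
    total = 0.0
    for tier, share in mix.items():
        p_in, p_out, t_in, t_out = models[tier]
        requests = volume * share
        total += requests * (t_in * p_in + t_out * p_out) / 1_000_000
    return total

mix = {"basic": 0.70, "standard": 0.25, "premium": 0.05}
option_2 = {  # Option 2: Flash for Basic + Standard, Sonnet for Premium
    "basic":    (0.15, 0.60, 400, 150),
    "standard": (0.15, 0.60, 600, 400),
    "premium":  (3.00, 15.00, 800, 600),
}
print(f"Option 2 = ${monthly_cost(400_000, mix, option_2):,.2f}/month")
```

Swap in each option's model prices and your own measured token counts per tier; the model makes the budget conversation concrete before any code ships.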

The best decision: Option 2 — Two tiers.

Why:

  • Option 1 is 2.75x over budget: $22,000 vs. an $8,000 budget. Using Claude Sonnet for bullet points is like taking a Ferrari to fetch groceries
  • Option 2 hits the sweet spot: $5,800 is within budget. Gemini 2.5 Flash is sufficient for simple generation and massively cheaper, and it covers 95% of the volume (Basic plus Standard). Claude Sonnet delivers premium quality only where it counts: the 5% of flagship products
  • Option 3 saves more, but: Llama 4 Scout requires either self-hosting (infrastructure overhead) or a third-party API provider. The added complexity of a third model saves only $1,600/month — that doesn’t justify the engineering overhead for routing, testing, and monitoring
  • Blind evaluation is critical: Before launch, generate 100 product descriptions per model, have them blind-evaluated. If Flash scores 4/5 for standard products, the two-tier architecture is validated
  • Design for switching: Build with LiteLLM or similar abstraction. If Gemini Flash gets 2x better in 6 months (or a new model appears), model switching should be a config change, not a rewrite

Common mistake: Choosing the most expensive model because it leads the leaderboard. The quality difference between tier-1 and tier-2 models is marginal for many tasks, but the cost difference is dramatic.

  • Benchmarks are orientation, not a decision basis. Always evaluate on your own data — 50-100 representative queries, blind-evaluated. The leaderboard winner isn’t automatically the best choice for your task.
  • Multi-model routing is the production standard. 70-80% of requests go to the cheap model, only complex tasks to the expensive one. 5-10x cost savings are realistic.
  • Design for model switching. The model landscape changes quarterly. Lock-in to one model costs you later.
  • Data privacy and compliance can constrain model choice. Self-hosted open models (Llama, Mistral) are the answer when no data may leave the organization.

Sources: Artificial Analysis LLM Leaderboard, Klu 2026 LLM Leaderboard, DEV Community “Choosing an LLM in 2026”, Claude5.ai LLM API Pricing (2026), Shakudo Top 9 LLMs (March 2026)
