
Fine-Tuning

Your customer success team is complaining: the AI assistant sounds “too generic.” It doesn’t use your industry’s terminology, responds too formally, and formats answers differently than your style guide dictates. Your CTO proposes fine-tuning. Your VP of Engineering asks: “How much does it cost, and when will we see results?”

Fine-tuning is the third lever in the AI optimization hierarchy — after prompt engineering and RAG. It changes the model’s weights, meaning its internal behavior. That’s powerful, but expensive and inflexible. PMs need to understand when fine-tuning is justified and when it’s an expensive shortcut for better prompting.

Fine-tuning takes a pre-trained LLM and trains it further on a domain-specific dataset. It changes the weights — the internal behavior — rather than just providing context at runtime.

What fine-tuning is good for:

  • Consistently changing tone, style, or format
  • Embedding domain-specific terminology and reasoning patterns
  • Reducing prompt length (baked-in behavior doesn’t need prompt instructions)
  • Improving performance on narrow, well-defined tasks

What fine-tuning is NOT good for:

  • Adding new factual knowledge (use RAG — fine-tuned knowledge goes stale)
  • One-off customization (use prompting)
  • Tasks where requirements change frequently (retraining is expensive)

LoRA (Low-Rank Adaptation): The dominant method since 2025. Freezes original weights and trains small adapter matrices. Achieves 95% of full fine-tuning performance at 10% of the cost. Adapters are small (10-100 MB) and swappable — multiple specializations from one base model. Cost: $500-$5,000 per run.
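Why adapters are so small: LoRA freezes a weight matrix W and learns two low-rank matrices B and A, so the effective weight becomes W + (alpha/r)·B·A and only B and A are trained. A minimal sketch of the parameter-count arithmetic (the 4096×4096 matrix and rank r=8 are illustrative values typical for a 7B model, not from the source):

```python
# For a frozen weight matrix W (d_out x d_in), LoRA trains
# B (d_out x r) and A (r x d_in); the effective weight is
# W + (alpha / r) * B @ A. Only B and A hold trainable params.

def lora_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, adapter) trainable-parameter counts for one matrix."""
    full = d_out * d_in
    adapter = r * (d_out + d_in)  # B: d_out*r params, A: r*d_in params
    return full, adapter

# Illustrative 4096x4096 projection with rank r=8:
full, adapter = lora_params(4096, 4096, r=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
# adapter: 65,536 vs full: 16,777,216 -- about 0.39% of the original
```

Summed over every adapted layer, this is why the resulting adapters fit in tens of megabytes and can be swapped on top of one base model.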

QLoRA: Combines LoRA with quantization (16-bit to 4-bit). Enables fine-tuning a 7B model on a consumer GPU (RTX 4090, ~$1,500). Slight quality trade-off vs. standard LoRA. Ideal for experimentation and proof of concept.

Full Fine-Tuning: Updates all parameters. Highest quality ceiling but prohibitively expensive: $10,000-$30,000+ per run for 7B+ models. Justified only when budget is not the constraint.

Managed Services (2026): OpenAI offers a fine-tuning API for GPT-4o and GPT-4o-mini; Anthropic offers it for Claude (enterprise tier); Google Vertex AI for Gemini. Upload a JSONL file and pay per training token; no infrastructure management needed.
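The uploaded JSONL holds one training example per line. A minimal sketch of building such a file in the chat format OpenAI documents for fine-tuning (the transaction examples and the filename `train.jsonl` are made up for illustration):

```python
import json

# One training example per JSONL line, in chat format.
# The transactions and labels below are invented for illustration.
examples = [
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "AMZN Mktp US*2K4 $23.99"},
        {"role": "assistant", "content": "Shopping"},
    ]},
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "SHELL OIL 5744 $41.20"},
        {"role": "assistant", "content": "Transport"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Each example shows the model the exact input-output behavior you want baked in; the service handles tokenization, training, and hosting.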

| Model size | Minimum examples | Recommended | Quality bar |
| --- | --- | --- | --- |
| 7B (Mistral, Llama) | 100-500 | 1,000-5,000 | Consistent format, correct labels |
| 13B-70B | 500-1,000 | 5,000-10,000 | Domain expert validated |
| Managed API (GPT-4o) | 10 (minimum) | 50-100 | High-quality input-output pairs |

The golden rule: 500 expert-curated examples outperform 50,000 noisy ones — especially in specialized domains. For more general tasks, more data can still help. The most common failure mode in fine-tuning isn’t insufficient data — it’s bad data.

The Decision Hierarchy: Prompt, Then RAG, Then Fine-Tune


The IBM framework (widely adopted):

  1. Prompt engineering (hours, $0-$100): If the model produces acceptable results with the right instructions — stop here
  2. RAG ($70-$1,000/month ongoing): If the model needs access to current/proprietary data
  3. Fine-tuning ($5,000-$50,000+ upfront, ongoing maintenance): Only when behavioral change is needed AND prompt engineering can’t achieve it

Fine-tuning decision matrix — answer these questions in order:

| Question | Yes | No |
| --- | --- | --- |
| Can prompt engineering solve this? | Stop. No fine-tuning needed | Continue |
| Does the model need current/proprietary data? | Add RAG, then reassess | Continue |
| Do you need consistent behavioral change? | Continue evaluating | No fine-tuning needed |
| Do you have 500+ high-quality labeled examples? | Continue evaluating | Invest in data first |
| Will you process 50,000+ queries/month? | Continue evaluating | ROI unlikely |
| Can you maintain the fine-tuned model over time? | Start fine-tuning | Budget for retraining |
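The matrix above can be walked mechanically, top to bottom. A minimal sketch (the function name and return strings are my own, mirroring the table's wording):

```python
def finetune_decision(prompting_solves: bool,
                      needs_current_data: bool,
                      needs_behavior_change: bool,
                      examples_500_plus: bool,
                      queries_50k_plus: bool,
                      can_maintain: bool) -> str:
    """Walk the decision matrix in order; first disqualifier wins."""
    if prompting_solves:
        return "Stop: prompt engineering is enough"
    if needs_current_data:
        return "Add RAG, then reassess"
    if not needs_behavior_change:
        return "No fine-tuning needed"
    if not examples_500_plus:
        return "Invest in data first"
    if not queries_50k_plus:
        return "ROI unlikely below 50,000 queries/month"
    if not can_maintain:
        return "Budget for retraining before starting"
    return "Start fine-tuning"
```

The ordering is the point: cheap, reversible levers (prompting, RAG) must fail before the expensive, sticky one is considered.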

Cost and timeline (2026):

| Method | Compute cost | Engineering cost | Total timeline |
| --- | --- | --- | --- |
| LoRA (7B) | $500-$3,000 | $4,000-$12,000 (data + eval) | 2-4 weeks |
| LoRA (13B) | $2,000-$5,000 | $4,000-$12,000 | 3-6 weeks |
| Full fine-tuning (7B) | $10,000-$30,000 | $8,000-$20,000 | 4-8 weeks |
| Managed API (OpenAI) | $0.80-$3.00/1M training tokens | Minimal | 1-3 days |

ROI benchmark: Fine-tuning typically pays back in 4-8 months for companies processing 50,000+ queries/month. Below that volume, prompt engineering is usually more cost-effective.
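The payback arithmetic is simple: upfront cost divided by monthly savings. A sketch with illustrative numbers (the $8,000 and $1,500 figures are assumptions chosen to fall inside the cost ranges above, not benchmarks):

```python
# Payback period = upfront fine-tuning cost / monthly savings.
# Both figures below are illustrative assumptions.
upfront_cost = 8_000      # LoRA compute + data/eval work, USD
monthly_savings = 1_500   # shorter prompts -> fewer input tokens, USD/month

payback_months = upfront_cost / monthly_savings
print(f"payback: {payback_months:.1f} months")  # ~5.3 months
```

Under these assumptions payback lands mid-range of the 4-8 month benchmark; at lower query volume the savings term shrinks and payback stretches past the point where prompting wins.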

You’re a PM at a fintech startup (B2C, 150,000 MAU). Your AI feature: automatic bank transaction categorization. The current system uses GPT-4o-mini with few-shot prompting and achieves 82% accuracy.

The situation:

  • Target accuracy: 92%+ (users complain about wrong categories)
  • Volume: 3 million transactions/month
  • Current costs: $4,200/month (GPT-4o-mini API)
  • Your data team has collected 12,000 manually categorized transactions
  • Competition: two rivals recently launched “AI categorization”

Options:

  1. Better prompts: An enriched prompt with more context (transaction history, merchant database). Estimated improvement: 82% to 87%. Effort: 1 week
  2. Fine-tuning (LoRA): Fine-tune GPT-4o-mini on 12,000 examples. Estimated improvement: 92%+. Effort: 3 weeks, $6,000
  3. Model switch: Switch to Claude Sonnet 4.6 with optimized prompt. Estimated improvement: 88%. Cost: $18,000/month (4x higher)

How would you decide?

The best decision: Option 1 first, then Option 2.

Why:

  • Follow the hierarchy: Prompt optimization first (1 week). If 87% isn’t enough, you have the baseline for fine-tuning
  • Fine-tuning is justified here: 3 million transactions/month far exceeds the ROI threshold (50,000+). 12,000 labeled examples are sufficient. The task (categorization) is narrowly defined — exactly what fine-tuning is designed for
  • LoRA on GPT-4o-mini is cost-optimal: Fine-tuned models need shorter prompts (behavior is baked in rather than prompted), which can reduce ongoing API costs
  • Option 3 is a cost problem: $18,000 vs. $4,200/month for 6 percentage points of improvement. Fine-tuning costs $6,000 once and saves long-term
  • Expected path: Week 1: optimize prompt to 87%. Weeks 2-4: fine-tune to 92%+. Fine-tuning costs pay back in under 2 months through shorter prompts

Common mistake: Fine-tuning a weaker model instead of better prompting on a stronger one. In this case, GPT-4o-mini with fine-tuning is the right call because the volume justifies the ROI and the task is narrow enough.
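The "pays back in under 2 months" claim can be sanity-checked with the case numbers. A back-of-envelope sketch (the prompt-share and prompt-reduction figures are my assumptions, not from the case):

```python
# Assumptions (not from the case data): 80% of the $4,200/month API bill
# is prompt (input) tokens, and baking behavior into the fine-tuned model
# removes the few-shot examples, shrinking the prompt by ~90%.
monthly_api_cost = 4_200
prompt_share = 0.80
prompt_reduction = 0.90

monthly_savings = monthly_api_cost * prompt_share * prompt_reduction
payback_months = 6_000 / monthly_savings  # $6,000 LoRA run from Option 2
print(f"savings: ${monthly_savings:,.0f}/month, "
      f"payback: {payback_months:.2f} months")
```

With these assumptions, savings are roughly $3,000/month and payback lands just under 2 months; smaller prompt savings would stretch it, but at 3 million transactions/month the volume term dominates either way.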

  • Fine-tuning changes behavior, not knowledge. It doesn’t make the model “smarter” — it adjusts style, format, and domain-specific patterns. For new knowledge, you need RAG.
  • The hierarchy of prompt, then RAG, then fine-tune isn’t a suggestion — it’s cost protection. Each step is an order of magnitude more expensive and less flexible.
  • Data quality beats data quantity. 500 expert examples are worth more than 50,000 noisy ones.
  • Fine-tuning isn’t a one-time cost. Models need retraining when data changes or base models update. Budget for ongoing maintenance.

Sources: IBM RAG vs Fine-Tuning vs Prompt Engineering, Stratagem Systems LoRA Fine-Tuning Cost Analysis (2026), Introl Fine-Tuning Infrastructure Guide (2025), Heavybit LLM Fine-Tuning Guide, Stratagem Systems LLM Fine-Tuning Business Guide (2026)

Part of AI Learning — free courses from prompt to production. Jan on LinkedIn