# Fine-Tuning
## Context

Your customer success team is complaining: the AI assistant sounds “too generic.” It doesn’t use your industry’s terminology, responds too formally, and formats answers differently than your style guide dictates. Your CTO proposes fine-tuning. Your VP of Engineering asks: “How much does it cost, and when will we see results?”
Fine-tuning is the third lever in the AI optimization hierarchy — after prompt engineering and RAG. It changes the model’s weights, meaning its internal behavior. That’s powerful, but expensive and inflexible. PMs need to understand when fine-tuning is justified and when it’s an expensive substitute for better prompting.
## Concept

### What Fine-Tuning Actually Does

Fine-tuning takes a pre-trained LLM and trains it further on a domain-specific dataset. It changes the weights — the internal behavior — rather than just providing context at runtime.
What fine-tuning is good for:
- Consistently changing tone, style, or format
- Embedding domain-specific terminology and reasoning patterns
- Reducing prompt length (baked-in behavior doesn’t need prompt instructions)
- Improving performance on narrow, well-defined tasks
What fine-tuning is NOT good for:
- Adding new factual knowledge (use RAG — fine-tuned knowledge goes stale)
- One-off customization (use prompting)
- Tasks where requirements change frequently (retraining is expensive)
### Fine-Tuning Methods

LoRA (Low-Rank Adaptation): The dominant method since 2025. Freezes original weights and trains small adapter matrices. Achieves 95% of full fine-tuning performance at 10% of the cost. Adapters are small (10-100 MB) and swappable — multiple specializations from one base model. Cost: $500-$5,000 per run.
QLoRA: Combines LoRA with quantization (16-bit to 4-bit). Enables fine-tuning a 7B model on a consumer GPU (RTX 4090, ~$1,500). Slight quality trade-off vs. standard LoRA. Ideal for experimentation and proof of concept.
Full Fine-Tuning: Updates all parameters. Highest quality ceiling, but prohibitively expensive: $10,000-$30,000+ per run for 7B+ models. Justified only when the marginal quality gain matters more than cost.
Managed Services (2026): OpenAI offers fine-tuning API for GPT-4o and GPT-4o-mini. Anthropic for Claude (enterprise tier). Google Vertex AI for Gemini. Upload JSONL, pay per training token — no infrastructure management needed.
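For the managed route, the training file is plain JSONL, one chat-format record per line. A minimal sketch of what preparing such a file might look like — the field layout follows OpenAI's chat format, but the transaction examples themselves are invented for illustration:

```python
import json

# Hypothetical transaction-categorization examples in chat-format JSONL
examples = [
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "STARBUCKS #1234 SEATTLE WA $6.40"},
        {"role": "assistant", "content": "Food & Drink"},
    ]},
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "SHELL OIL 5551 AUSTIN TX $48.12"},
        {"role": "assistant", "content": "Transportation"},
    ]},
]

# One JSON object per line — this is the file you upload for training
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Because you pay per training token, short, consistent system prompts and terse assistant answers keep the training bill down.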
### Data Quality Decides Everything

| Model size | Minimum examples | Recommended | Quality bar |
|---|---|---|---|
| 7B (Mistral, Llama) | 100-500 | 1,000-5,000 | Consistent format, correct labels |
| 13B-70B | 500-1,000 | 5,000-10,000 | Domain expert validated |
| Managed API (GPT-4o) | 10 (minimum) | 50-100 | High-quality input-output pairs |
The golden rule: 500 expert-curated examples outperform 50,000 noisy ones — especially in specialized domains. For more general tasks, more data can still help. The most common failure mode in fine-tuning isn’t insufficient data — it’s bad data.
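Because bad data is the main failure mode, a cheap validation pass before training is worth the effort. A minimal sketch — the record schema and the allowed-label set here are assumptions for illustration, not a fixed standard:

```python
# Assumed category taxonomy for illustration
ALLOWED_LABELS = {"Food & Drink", "Transportation", "Housing", "Other"}

def validate(example):
    """Return a list of problems found in one training example."""
    problems = []
    text, label = example.get("input", ""), example.get("label", "")
    if not text.strip():
        problems.append("empty input")
    if label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label!r}")
    return problems

dataset = [
    {"input": "STARBUCKS #1234 $6.40", "label": "Food & Drink"},
    {"input": "", "label": "Food & Drink"},
    {"input": "SHELL OIL $48.12", "label": "Gasoline"},  # outside the taxonomy
]

# Index every example that fails at least one check
bad = {i: validate(ex) for i, ex in enumerate(dataset) if validate(ex)}
print(bad)  # {1: ['empty input'], 2: ["unknown label: 'Gasoline'"]}
```

Running checks like these — plus deduplication and a spot-check by a domain expert — is usually far cheaper than a wasted training run.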
### The Decision Hierarchy: Prompt, Then RAG, Then Fine-Tune

The IBM framework (widely adopted):
1. Prompt engineering (hours, $0-$100): If the model produces acceptable results with the right instructions — stop here
2. RAG ($70-$1,000/month ongoing): If the model needs access to current/proprietary data
3. Fine-tuning ($5,000-$50,000+ upfront, ongoing maintenance): Only when behavioral change is needed AND prompt engineering can’t achieve it
## Framework

Fine-tuning decision matrix — answer these questions in order:
| Question | Yes | No |
|---|---|---|
| Can prompt engineering solve this? | Stop. No fine-tuning needed | Continue |
| Does the model need current/proprietary data? | Add RAG, then reassess | Continue |
| Do you need consistent behavioral change? | Continue evaluating | No fine-tuning needed |
| Do you have 500+ high-quality labeled examples? | Continue evaluating | Invest in data first |
| Will you process 50,000+ queries/month? | Continue evaluating | ROI unlikely |
| Can you maintain the fine-tuned model over time? | Start fine-tuning | Budget for retraining |
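The matrix above is mechanical enough to encode directly. A sketch, with the thresholds taken from the table:

```python
def fine_tuning_recommendation(prompting_solves_it, needs_fresh_data,
                               needs_behavior_change, labeled_examples,
                               monthly_queries, can_maintain):
    """Walk the decision matrix top to bottom; thresholds mirror the table."""
    if prompting_solves_it:
        return "Stop: prompt engineering is enough"
    if needs_fresh_data:
        return "Add RAG first, then reassess"
    if not needs_behavior_change:
        return "No fine-tuning needed"
    if labeled_examples < 500:
        return "Invest in data first"
    if monthly_queries < 50_000:
        return "ROI unlikely below 50k queries/month"
    if not can_maintain:
        return "Budget for retraining before starting"
    return "Start fine-tuning"

# The fintech scenario later in this section: 12,000 labeled examples,
# 3M queries/month, a narrow behavioral task prompting can't fully solve
print(fine_tuning_recommendation(False, False, True, 12_000, 3_000_000, True))
# -> Start fine-tuning
```

The point of writing it out is that every branch before the last one is an exit ramp — fine-tuning is the answer only when all six checks pass.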
Cost and timeline (2026):
| Method | Compute cost | Engineering cost | Total timeline |
|---|---|---|---|
| LoRA (7B) | $500-$3,000 | $4,000-$12,000 (data + eval) | 2-4 weeks |
| LoRA (13B) | $2,000-$5,000 | $4,000-$12,000 | 3-6 weeks |
| Full fine-tuning (7B) | $10,000-$30,000 | $8,000-$20,000 | 4-8 weeks |
| Managed API (OpenAI) | $0.80-$3.00/1M training tokens | Minimal | 1-3 days |
ROI benchmark: Fine-tuning typically pays back in 4-8 months for companies processing 50,000+ queries/month. Below that volume, prompt engineering is usually more cost-effective.
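The payback arithmetic behind that benchmark is simple. A hedged sketch: it assumes the main ongoing saving is a shorter prompt (behavior baked into the weights), and the per-query saving plugged in below is an illustrative assumption, not a measured benchmark.

```python
def payback_months(upfront_cost, monthly_queries, saving_per_query):
    """Months until a one-time fine-tuning cost is recovered by per-query savings."""
    monthly_saving = monthly_queries * saving_per_query
    return upfront_cost / monthly_saving

# Illustrative: $10,000 all-in fine-tuning cost, 50,000 queries/month,
# an assumed $0.03 saved per query from a much shorter prompt
print(round(payback_months(10_000, 50_000, 0.03), 1))  # 6.7
```

At those assumed numbers the break-even lands inside the 4-8 month range; at 5,000 queries/month the same math gives 67 months, which is why low-volume products rarely justify the upfront spend.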
## Scenario

You’re a PM at a fintech startup (B2C, 150,000 MAU). Your AI feature: automatic bank transaction categorization. The current system uses GPT-4o-mini with few-shot prompting and achieves 82% accuracy.
The situation:
- Target accuracy: 92%+ (users complain about wrong categories)
- Volume: 3 million transactions/month
- Current costs: $4,200/month (GPT-4o-mini API)
- Your data team has collected 12,000 manually categorized transactions
- Competition: two rivals recently launched “AI categorization”
Options:
- Better prompts: An enriched prompt with more context (transaction history, merchant database). Estimated improvement: 82% to 87%. Effort: 1 week
- Fine-tuning (LoRA): Fine-tune GPT-4o-mini on 12,000 examples. Estimated improvement: 92%+. Effort: 3 weeks, $6,000
- Model switch: Switch to Claude Sonnet 4.6 with optimized prompt. Estimated improvement: 88%. Cost: $18,000/month (4x higher)
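A 12-month cost comparison makes the trade-off visible. The figures come from the scenario above; treating the fine-tuned option's monthly API spend as unchanged at $4,200 is a simplifying assumption (shorter prompts should actually reduce it).

```python
MONTHS = 12

options = {
    # option: (one-time cost, monthly cost, expected accuracy)
    "Better prompts":     (0,     4_200,  0.87),
    "Fine-tuning (LoRA)": (6_000, 4_200,  0.92),
    "Model switch":       (0,     18_000, 0.88),
}

for name, (upfront, monthly, acc) in options.items():
    total = upfront + monthly * MONTHS
    print(f"{name}: ${total:,} over {MONTHS} months at {acc:.0%} accuracy")
```

Over a year, fine-tuning costs $6,000 more than prompting alone for 5 extra accuracy points, while the model switch costs roughly $160,000 more for a worse result — which is the core of the argument below.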
## Decide

How would you decide?
The best decision: Option 1 first, then Option 2.
Why:
- Follow the hierarchy: Prompt optimization first (1 week). If 87% isn’t enough, you have the baseline for fine-tuning
- Fine-tuning is justified here: 3 million transactions/month far exceeds the ROI threshold (50,000+). 12,000 labeled examples are sufficient. The task (categorization) is narrowly defined — exactly what fine-tuning is designed for
- LoRA on GPT-4o-mini is cost-optimal: Fine-tuned models need shorter prompts (behavior is baked in rather than prompted), which can reduce ongoing API costs
- Option 3 is a cost problem: $18,000 vs. $4,200/month for 6 percentage points of improvement. Fine-tuning costs $6,000 once and saves long-term
- Expected path: Week 1: optimize prompt to 87%. Weeks 2-4: fine-tune to 92%+. Fine-tuning costs pay back in under 2 months through shorter prompts
Common mistake: Fine-tuning a weaker model instead of better prompting on a stronger one. In this case, GPT-4o-mini with fine-tuning is the right call because the volume justifies the ROI and the task is narrow enough.
## Reflect

- Fine-tuning changes behavior, not knowledge. It doesn’t make the model “smarter” — it adjusts style, format, and domain-specific patterns. For new knowledge, you need RAG.
- The hierarchy of prompt, then RAG, then fine-tune isn’t a suggestion — it’s cost protection. Each step is an order of magnitude more expensive and less flexible.
- Data quality beats data quantity. 500 expert examples are worth more than 50,000 noisy ones.
- Fine-tuning isn’t a one-time cost. Models need retraining when data changes or base models update. Budget for ongoing maintenance.
Sources: IBM RAG vs Fine-Tuning vs Prompt Engineering, Stratagem Systems LoRA Fine-Tuning Cost Analysis (2026), Introl Fine-Tuning Infrastructure Guide (2025), Heavybit LLM Fine-Tuning Guide, Stratagem Systems LLM Fine-Tuning Business Guide (2026)