# Fine-Tuning
## Context

Your customer success team is complaining: the AI assistant sounds “too generic.” It doesn’t use your industry’s terminology, responds too formally, and formats answers differently than your style guide dictates. Your CTO proposes fine-tuning. Your VP of Engineering asks: “How much does it cost, and when will we see results?”
Fine-tuning is the third lever in the AI optimization hierarchy — after prompt engineering and RAG. It changes the model’s weights, meaning its internal behavior. That’s powerful, but expensive and inflexible. PMs need to understand when fine-tuning is justified and when it’s an expensive substitute for better prompting.
## Concept

### What Fine-Tuning Actually Does

Fine-tuning takes a pre-trained LLM and trains it further on a domain-specific dataset. It changes the weights — the internal behavior — rather than just providing context at runtime.
What fine-tuning is good for:
- Consistently changing tone, style, or format
- Embedding domain-specific terminology and reasoning patterns
- Reducing prompt length (baked-in behavior doesn’t need prompt instructions)
- Improving performance on narrow, well-defined tasks
What fine-tuning is NOT good for:
- Adding new factual knowledge (use RAG — fine-tuned knowledge goes stale)
- One-off customization (use prompting)
- Tasks where requirements change frequently (retraining is expensive)
### Fine-Tuning Methods

LoRA (Low-Rank Adaptation): The dominant method since 2025. Freezes original weights and trains small adapter matrices. Achieves 95% of full fine-tuning performance at 10% of the cost. Adapters are small (10-100 MB) and swappable — multiple specializations from one base model. Cost: $500-$5,000 per run.
QLoRA: Combines LoRA with quantization (16-bit to 4-bit). Enables fine-tuning a 7B model on a consumer GPU (RTX 4090, ~$1,500). Slight quality trade-off vs. standard LoRA. Ideal for experimentation and proof of concept.
Full Fine-Tuning: Updates all parameters. Highest quality ceiling, but prohibitively expensive: $10,000-$30,000+ per run for 7B+ models. Justified only when the marginal quality gain matters more than cost.
Managed Services (2026): OpenAI offers fine-tuning API for GPT-4o and GPT-4o-mini. Anthropic for Claude (enterprise tier). Google Vertex AI for Gemini. Upload JSONL, pay per training token — no infrastructure management needed.
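For the managed route, the training file is plain JSONL, one chat-format record per line. A minimal sketch of what preparing such a file might look like — the field layout follows OpenAI's chat format, but the transaction examples themselves are invented for illustration:

```python
import json

# Hypothetical transaction-categorization examples in chat-format JSONL
examples = [
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "STARBUCKS #1234 SEATTLE WA $6.40"},
        {"role": "assistant", "content": "Food & Drink"},
    ]},
    {"messages": [
        {"role": "system", "content": "Categorize the bank transaction."},
        {"role": "user", "content": "SHELL OIL 5551 AUSTIN TX $48.12"},
        {"role": "assistant", "content": "Transportation"},
    ]},
]

# One JSON object per line — this is the file you upload for training
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Because you pay per training token, short, consistent system prompts and terse assistant answers keep the training bill down.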
### Data Quality Decides Everything

| Model size | Minimum examples | Recommended | Quality bar |
|---|---|---|---|
| 7B (Mistral, Llama) | 100-500 | 1,000-5,000 | Consistent format, correct labels |
| 13B-70B | 500-1,000 | 5,000-10,000 | Domain expert validated |
| Managed API (GPT-4o) | 10 (minimum) | 50-100 | High-quality input-output pairs |
The golden rule: 500 expert-curated examples outperform 50,000 noisy ones — especially in specialized domains. For more general tasks, more data can still help. The most common failure mode in fine-tuning isn’t insufficient data — it’s bad data.
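Because bad data is the main failure mode, a cheap validation pass before training is worth the effort. A minimal sketch — the record schema and the allowed-label set here are assumptions for illustration, not a fixed standard:

```python
# Assumed category taxonomy for illustration
ALLOWED_LABELS = {"Food & Drink", "Transportation", "Housing", "Other"}

def validate(example):
    """Return a list of problems found in one training example."""
    problems = []
    text, label = example.get("input", ""), example.get("label", "")
    if not text.strip():
        problems.append("empty input")
    if label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label!r}")
    return problems

dataset = [
    {"input": "STARBUCKS #1234 $6.40", "label": "Food & Drink"},
    {"input": "", "label": "Food & Drink"},
    {"input": "SHELL OIL $48.12", "label": "Gasoline"},  # outside the taxonomy
]

# Index every example that fails at least one check
bad = {i: validate(ex) for i, ex in enumerate(dataset) if validate(ex)}
print(bad)  # {1: ['empty input'], 2: ["unknown label: 'Gasoline'"]}
```

Running checks like these — plus deduplication and a spot-check by a domain expert — is usually far cheaper than a wasted training run.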
### The Decision Hierarchy: Prompt, Then RAG, Then Fine-Tune

The IBM framework (widely adopted):
1. Prompt engineering (hours, $0-$100): If the model produces acceptable results with the right instructions — stop here
2. RAG ($70-$1,000/month ongoing): If the model needs access to current/proprietary data
3. Fine-tuning ($5,000-$50,000+ upfront, ongoing maintenance): Only when behavioral change is needed AND prompt engineering can’t achieve it
## Framework

Fine-tuning decision matrix — answer these questions in order:
| Question | Yes | No |
|---|---|---|
| Can prompt engineering solve this? | Stop. No fine-tuning needed | Continue |
| Does the model need current/proprietary data? | Add RAG, then reassess | Continue |
| Do you need consistent behavioral change? | Continue evaluating | No fine-tuning needed |
| Do you have 500+ high-quality labeled examples? | Continue evaluating | Invest in data first |
| Will you process 50,000+ queries/month? | Continue evaluating | ROI unlikely |
| Can you maintain the fine-tuned model over time? | Start fine-tuning | Budget for retraining |
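The matrix above is mechanical enough to encode directly. A sketch, with the thresholds taken from the table:

```python
def fine_tuning_recommendation(prompting_solves_it, needs_fresh_data,
                               needs_behavior_change, labeled_examples,
                               monthly_queries, can_maintain):
    """Walk the decision matrix top to bottom; thresholds mirror the table."""
    if prompting_solves_it:
        return "Stop: prompt engineering is enough"
    if needs_fresh_data:
        return "Add RAG first, then reassess"
    if not needs_behavior_change:
        return "No fine-tuning needed"
    if labeled_examples < 500:
        return "Invest in data first"
    if monthly_queries < 50_000:
        return "ROI unlikely below 50k queries/month"
    if not can_maintain:
        return "Budget for retraining before starting"
    return "Start fine-tuning"

# The fintech scenario later in this section: 12,000 labeled examples,
# 3M queries/month, a narrow behavioral task prompting can't fully solve
print(fine_tuning_recommendation(False, False, True, 12_000, 3_000_000, True))
# -> Start fine-tuning
```

The point of writing it out is that every branch before the last one is an exit ramp — fine-tuning is the answer only when all six checks pass.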
Cost and timeline (2026):
| Method | Compute cost | Engineering cost | Total timeline |
|---|---|---|---|
| LoRA (7B) | $500-$3,000 | $4,000-$12,000 (data + eval) | 2-4 weeks |
| LoRA (13B) | $2,000-$5,000 | $4,000-$12,000 | 3-6 weeks |
| Full fine-tuning (7B) | $10,000-$30,000 | $8,000-$20,000 | 4-8 weeks |
| Managed API (OpenAI) | $0.80-$3.00/1M training tokens | Minimal | 1-3 days |
ROI benchmark: Fine-tuning typically pays back in 4-8 months for companies processing 50,000+ queries/month. Below that volume, prompt engineering is usually more cost-effective.
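The payback arithmetic behind that benchmark is simple. A hedged sketch: it assumes the main ongoing saving is a shorter prompt (behavior baked into the weights), and the per-query saving plugged in below is an illustrative assumption, not a measured benchmark.

```python
def payback_months(upfront_cost, monthly_queries, saving_per_query):
    """Months until a one-time fine-tuning cost is recovered by per-query savings."""
    monthly_saving = monthly_queries * saving_per_query
    return upfront_cost / monthly_saving

# Illustrative: $10,000 all-in fine-tuning cost, 50,000 queries/month,
# an assumed $0.03 saved per query from a much shorter prompt
print(round(payback_months(10_000, 50_000, 0.03), 1))  # 6.7
```

At those assumed numbers the break-even lands inside the 4-8 month range; at 5,000 queries/month the same math gives 67 months, which is why low-volume products rarely justify the upfront spend.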
## Scenario

You’re a PM at a fintech startup (B2C, 150,000 MAU). Your AI feature: automatic bank transaction categorization. The current system uses GPT-4o-mini with few-shot prompting and achieves 82% accuracy.
The situation:
- Target accuracy: 92%+ (users complain about wrong categories)
- Volume: 3 million transactions/month
- Current costs: $4,200/month (GPT-4o-mini API)
- Your data team has collected 12,000 manually categorized transactions
- Competition: two rivals recently launched “AI categorization”
Options:
- Better prompts: An enriched prompt with more context (transaction history, merchant database). Estimated improvement: 82% to 87%. Effort: 1 week
- Fine-tuning (LoRA): Fine-tune GPT-4o-mini on 12,000 examples. Estimated improvement: 92%+. Effort: 3 weeks, $6,000
- Model switch: Switch to Claude Sonnet 4.6 with optimized prompt. Estimated improvement: 88%. Cost: $18,000/month (4x higher)
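A 12-month cost comparison makes the trade-off visible. The figures come from the scenario above; treating the fine-tuned option's monthly API spend as unchanged at $4,200 is a simplifying assumption (shorter prompts should actually reduce it).

```python
MONTHS = 12

options = {
    # option: (one-time cost, monthly cost, expected accuracy)
    "Better prompts":     (0,     4_200,  0.87),
    "Fine-tuning (LoRA)": (6_000, 4_200,  0.92),
    "Model switch":       (0,     18_000, 0.88),
}

for name, (upfront, monthly, acc) in options.items():
    total = upfront + monthly * MONTHS
    print(f"{name}: ${total:,} over {MONTHS} months at {acc:.0%} accuracy")
```

Over a year, fine-tuning costs $6,000 more than prompting alone for 5 extra accuracy points, while the model switch costs roughly $160,000 more for a worse result — which is the core of the argument below.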
## Decide

How would you decide?
The best decision: Option 1 first, then Option 2.
Why:
- Follow the hierarchy: Prompt optimization first (1 week). If 87% isn’t enough, you have the baseline for fine-tuning
- Fine-tuning is justified here: 3 million transactions/month far exceeds the ROI threshold (50,000+). 12,000 labeled examples are sufficient. The task (categorization) is narrowly defined — exactly what fine-tuning is designed for
- LoRA on GPT-4o-mini is cost-optimal: Fine-tuned models need shorter prompts (behavior is baked in rather than prompted), which can reduce ongoing API costs
- Option 3 is a cost problem: $18,000 vs. $4,200/month for 6 percentage points of improvement. Fine-tuning costs $6,000 once and saves long-term
- Expected path: Week 1: optimize prompt to 87%. Weeks 2-4: fine-tune to 92%+. Fine-tuning costs pay back in under 2 months through shorter prompts
Common mistake: Fine-tuning a weaker model instead of better prompting on a stronger one. In this case, GPT-4o-mini with fine-tuning is the right call because the volume justifies the ROI and the task is narrow enough.
## Reflect

- Fine-tuning changes behavior, not knowledge. It doesn’t make the model “smarter” — it adjusts style, format, and domain-specific patterns. For new knowledge, you need RAG.
- The hierarchy of prompt, then RAG, then fine-tune isn’t a suggestion — it’s cost protection. Each step is an order of magnitude more expensive and less flexible.
- Data quality beats data quantity. 500 expert examples are worth more than 50,000 noisy ones.
- Fine-tuning isn’t a one-time cost. Models need retraining when data changes or base models update. Budget for ongoing maintenance.
Sources: IBM RAG vs Fine-Tuning vs Prompt Engineering, Stratagem Systems LoRA Fine-Tuning Cost Analysis (2026), Introl Fine-Tuning Infrastructure Guide (2025), Heavybit LLM Fine-Tuning Guide, Stratagem Systems LLM Fine-Tuning Business Guide (2026)