Fine-Tuning LLMs
From dataset curation to LoRA adapters, SFT, DPO, and production deployment: a practitioner's complete guide to adapting large language models.
Fine-tuning is not a silver bullet; it is a precision tool. Used correctly, it produces models that outperform general-purpose models on specific tasks. Used carelessly, it wastes GPU budget and makes models worse. This guide teaches you the difference.
Most teams that think they need fine-tuning don't. Fine-tuning is expensive, slow, and introduces new failure modes. Before training a single step, prove that prompting and RAG cannot solve your problem. Fine-tuning wins when you need capabilities the model doesn't have, not when you need it to do something it already can.
Fine-tuning sits at the top of a capability ladder. Each rung is more expensive and time-consuming than the last. The rule: never climb to the next rung unless you've exhausted the current one.
For ~80% of LLM use cases, prompt engineering + RAG is sufficient. For ~15%, better model routing or a larger model solves the problem. Only ~5% truly require fine-tuning, usually for domain-specific tasks, format control beyond what prompting achieves, or latency/cost requirements that demand a smaller, specialized model.
You need structured outputs (JSON, XML, code) with 99%+ reliability, and even few-shot + JSON mode doesn't achieve it consistently on your edge cases.
- Medical coding with complex schemas
- Domain-specific DSLs
- Highly structured report generation
You need the model to behave fundamentally differently from its base training: different tone, different reasoning style, different refusal boundaries.
- Brand voice consistency at scale
- Domain-specific communication norms
- Reducing over-refusal for valid use cases
The knowledge you need is too large or specialized for RAG context, so the model must internalize domain expertise into its weights.
- Medical/legal terminology usage
- Proprietary codebase understanding
- Specialized scientific reasoning
You've proven a large model works, but need a smaller model to match quality at 10× lower cost / 5× lower latency for production scale.
- Distillation to smaller model
- Edge deployment requirements
- High-volume cost reduction (1M+ queries/day)
Fine-tuning does not turn a model into a database. A common misconception is that you can "teach" a model facts by including them in training data. This fundamentally misunderstands how fine-tuning works.
- Pattern learning: The model learns associations and response patterns, not retrievable facts
- Implicit encoding: Knowledge is encoded implicitly in weights, not stored explicitly
- Non-deterministic recall: The model may or may not surface specific facts depending on context
- Blending: Fine-tuned knowledge blends with pre-training knowledge and can't be isolated
- Fine-tuned models can still hallucinate, even about domain facts they were trained on
- Updates require retraining, not a data refresh
- Correctness cannot be guaranteed because outputs are probabilistic
- Citation and verification are impossible because there is no source to reference
If correctness depends on up-to-date or verifiable knowledge, use RAG, not fine-tuning. Fine-tuning is for teaching behavior, style, and format, not for injecting facts. A fine-tuned model that "knows" your product documentation can still confidently hallucinate details that were never in the training data.
| Anti-Pattern | What Happens | The Real Solution |
|---|---|---|
| "Our prompts are too long" | Fine-tuning doesn't reliably shorten prompts; behavior may become inconsistent | Prompt compression, RAG optimization, or prompt caching |
| "The model doesn't know X" | Fine-tuning doesn't add knowledge reliably; the result is brittle and prone to hallucination | RAG with verified source documents |
| "We have proprietary data" | You also need to maintain, version, and update that data, and fine-tuning freezes it | RAG allows live updates without retraining |
| "We want better quality" | If quality is vaguely defined, fine-tuning won't improve it: garbage in, garbage out | Define quality → measure → improve prompts → then consider fine-tuning |
| "We need to differentiate" | Fine-tuning is not a competitive moat; others can fine-tune too. Data and product are the moat. | Focus on data quality and product experience, not the model itself |
Every fine-tuning run risks catastrophic forgetting, where the model loses capabilities it had before. A model fine-tuned heavily on medical text may become worse at general conversation or code. You're not adding capabilities; you're trading general capabilities for specialized ones. This trade-off must be intentional and measured.
Fine-tuning improves whatever your dataset encodes. If your dataset is flawed, the model will learn incorrect behavior, and quality may appear improved while actually degrading in production.
Task-specific metrics: Accuracy, F1, format compliance, latency
Failure cases: What bad outputs look like β examples you'll reject
Acceptance thresholds: Numbers that must be hit before shipping
Regression tests: General capabilities that must not degrade
- You can't tell if training helped or hurt
- You can't compare model versions meaningfully
- You can't catch regressions before production
- You're optimizing blindly toward an undefined target
Build your golden test set before writing a single training example. If you can't measure improvement, you're not doing engineering; you're hoping. The eval set defines what "better" means for your specific use case.
Fine-tuning cannot fix a fundamentally weak base model. If the base model doesn't understand your domain at all, fine-tuning won't magically create that capability; you'll just teach it to confidently produce low-quality outputs.
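The acceptance thresholds described above can be encoded as a small release gate. This is a minimal sketch: the metric names, threshold values, and the `check_release` helper are hypothetical placeholders, not values recommended by this guide.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str      # name of a metric computed on the golden test set
    minimum: float   # value that must be reached before shipping


def check_release(metrics, thresholds):
    """Return the names of metrics that are missing or below their threshold."""
    failures = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None or value < t.minimum:
            failures.append(t.metric)
    return failures


# Hypothetical thresholds: task quality, format compliance, and one regression check.
THRESHOLDS = [
    Threshold("task_accuracy", 0.90),
    Threshold("format_compliance", 0.99),
    Threshold("general_benchmark", 0.60),
]

# Hypothetical eval results for one candidate model.
candidate = {"task_accuracy": 0.93, "format_compliance": 0.97, "general_benchmark": 0.62}
failed = check_release(candidate, THRESHOLDS)
print(failed)
```

Treating a missing metric as a failure keeps a model from shipping just because part of the eval was skipped.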
| Consideration | What to Check | Red Flag |
|---|---|---|
| Baseline capability | Zero-shot performance on your task | Model produces nonsense without prompting |
| Reasoning quality | Chain-of-thought coherence, logic | Model can't follow multi-step reasoning |
| Context length | Max tokens vs your typical input size | Inputs truncated during training/inference |
| Language coverage | Fluency in your target languages | Model struggles with non-English content |
| License & deployment | Commercial use, modification rights | License prohibits your use case |
Choose the smallest model that already performs reasonably well on your task with prompting alone. Fine-tune to specialize and improve consistency, not to compensate for fundamental capability gaps. A 7B model that handles your domain well will outperform a poorly-matched 70B model.
| Cost Category | LoRA (7B model) | Full Fine-Tune (7B) | Full Fine-Tune (70B) |
|---|---|---|---|
| GPU requirement | 1× A100 40GB or 1× RTX 4090 | 4× A100 80GB (FSDP) | 8–16× A100 80GB |
| Training time (10K samples) | 2–4 hours | 8–16 hours | 24–72 hours |
| Cloud compute cost | $10–$50 | $200–$500 | $2,000–$10,000 |
| Data prep time | Days to weeks | Days to weeks | Weeks to months |
| Iteration cycle | Hours per experiment | 1–2 days per experiment | Days per experiment |
| Risk of degradation | Lower (fewer params modified) | Moderate; easy to overfit | Higher; harder to debug |
Fine-tuning makes economic sense when: (1) you've proven prompting doesn't work, (2) you have 5K–50K+ quality training examples, (3) you need the specialized capability for high-volume production, and (4) you're prepared to maintain the fine-tuned model over time (updates when the base model changes, dataset maintenance, eval pipeline). If any of these are missing, the ROI is negative.
| Dimension | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Best for | Behavior control, format, task guidance | External knowledge, up-to-date info | Capability modification, style, internalized expertise |
| Setup time | Hours | Days to weeks | Weeks to months |
| Iteration speed | Instant: edit and deploy | Fast: update documents | Slow: retrain and evaluate |
| Knowledge updates | Edit prompt (limited) | Update index anytime | Retrain required |
| Failure mode | Instruction not followed | Wrong docs retrieved | Catastrophic forgetting + overfitting |
| Maintenance burden | Low: prompt version control | Medium: index maintenance | High: data pipeline, retraining, eval |
1. Prompt engineering (few-shot, CoT, system prompt)
2. Add RAG if knowledge is required
3. Try a larger/better model
4. Only then fine-tune, if it is still not working
- You need a smaller model at lower cost/latency
- Task requires fundamentally different behavior
- You have abundant high-quality labeled data
- You've already proven prompting + RAG insufficient
A single training run rarely produces a production-ready model. Expect multiple iterations: the first model reveals what's wrong with your data, the second fixes some issues, the third gets closer to acceptable quality.
Compute cost for a single training run is small. The real cost is iteration time: reviewing failures, curating fixes, retraining, re-evaluating. Budget for 3–10 iteration cycles. A team that plans for one training run will be surprised; a team that plans for ten will ship a good model.
Unlike prompts (edit instantly) or RAG (update documents), fine-tuned models are frozen artifacts. You commit to maintaining them over time, or to accepting degradation.
When the base model releases a new version (Llama 3.2 → 3.3), you must:
- Retrain on new base
- Re-evaluate for regressions
- Update deployment infra
Production usage changes over time:
- New query patterns emerge
- Domain knowledge evolves
- Model performance degrades
The infrastructure requires upkeep:
- Dataset versioning
- Eval pipeline updates
- Retraining automation
Before fine-tuning, ask: "Who will maintain this model 6 months from now?" If the answer is unclear, reconsider. Unmaintained fine-tuned models become legacy debt: increasingly out of sync with reality, impossible to update quickly, and risky to replace.
This is the fundamental insight that should guide every fine-tuning decision: you are not making the model universally better. You are making a trade.
What you gain:
- Better performance on your specific task
- More consistent format and style
- Reduced prompt engineering complexity
- Potentially lower inference cost (smaller model)
What you give up:
- Performance on tasks outside your training distribution
- Flexibility to handle unexpected inputs
- Ability to update quickly (must retrain)
- General reasoning capability (potentially)
Before fine-tuning, explicitly document: (1) What tasks must improve, (2) What tasks can degrade, (3) How you'll measure both. If you can't answer these questions, you're not ready to fine-tune. A fine-tuned model without measured trade-offs is a model you don't understand.
Chapter 01: Key Takeaways
- Fine-tuning is the last rung on the capability ladder; exhaust prompting, RAG, and model routing first
- ~80% of LLM use cases do not require fine-tuning; most are solved by prompt engineering + RAG
- Fine-tuning wins for: format control, behavior modification, internalized domain knowledge, and cost/latency optimization
- Fine-tuning fails for: "longer prompts," "more knowledge," vague quality improvements, and differentiation alone
- Catastrophic forgetting is real: fine-tuning trades general capability for specialized capability
- Cost: LoRA = $10–$50, Full FT (7B) = $200–$500, Full FT (70B) = $2K–$10K+; data prep time matters more
Fine-tuning performance is 90% data, 10% hyperparameters. A mediocre model trained on excellent data will outperform an excellent model trained on mediocre data. Most fine-tuning failures trace back to dataset problems: low quality, insufficient diversity, wrong format, or insufficient volume. Data engineering is the real work of fine-tuning.
Every fine-tuning dataset is ultimately a collection of (input, target output) pairs. The format depends on your training objective: completion, instruction following, or preference learning.
Single text sequence; the model learns to predict the continuation. Used for continued pre-training or simple generation tasks.
Multi-turn conversations with role markers. Standard for instruction tuning. Most common format.
Pairs of (chosen, rejected) responses to the same input. Used for preference alignment.
| Training Goal | Format | Fields Required | Example Use |
|---|---|---|---|
| Continued pre-training | Completion | text | Domain adaptation (legal corpus, codebase) |
| Instruction following | Chat/Instruction | messages with role/content | General assistant, task completion |
| Preference alignment | Preference pairs | prompt, chosen, rejected | DPO training, RLHF reward modeling |
| Structured output | Chat + schema | messages with JSON in assistant turn | Extraction, classification, code generation |
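To make the three formats in the table concrete, here is an illustrative sketch of one record per format, serialized as JSON Lines. The field names follow the table above; the example contents and the `to_jsonl` helper are invented for illustration, and specific training libraries may expect slightly different keys.

```python
import json

# Completion format: a single text field (continued pre-training).
completion_record = {"text": "Section 4.2: The lessee shall maintain the premises in good repair."}

# Chat/instruction format: a list of messages with role and content.
chat_record = {
    "messages": [
        {"role": "system", "content": "You extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice 1042, total due $310.50 by March 3."},
        {"role": "assistant", "content": json.dumps({"invoice_id": "1042", "total": 310.50})},
    ]
}

# Preference format: one prompt with a chosen and a rejected response.
preference_record = {
    "prompt": "Summarize the refund policy in one sentence.",
    "chosen": "Refunds are available within 30 days with a receipt.",
    "rejected": "We have a policy about refunds that you can read somewhere.",
}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


jsonl_text = to_jsonl([completion_record, chat_record, preference_record])
print(jsonl_text)
```

The structured-output row of the table corresponds to `chat_record`: the assistant turn carries the JSON the model should learn to emit.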
A model performs well only when training data matches real usage. The most common cause of fine-tuning failure isn't bad hyperparameters; it's training on data that doesn't represent production.
- Training on clean, idealized examples
- Deploying on noisy, real-world inputs
- Performance collapse in production
- Team surprised: "It worked in testing!"
- Real user queries (sampled from logs)
- Edge cases and unusual inputs
- Typos, grammar errors, incomplete inputs
- Adversarial and out-of-scope inputs
Your dataset should look like production logs, not curated examples. If your training data is cleaner than your production traffic, you're training a model for a world that doesn't exist. Resist the urge to "clean up" training examples too much. Real users don't write perfect queries.
More data is only better if the data is consistently high quality. A 10K sample dataset with 30% low-quality examples will produce worse results than a 3K sample dataset where every example is excellent.
- Correct: The target output is factually accurate and appropriate
- Complete: No truncation, no partial responses
- Consistent: Format matches other samples; no random variations
- Representative: Covers the distribution of real inputs
- Clear: Unambiguous instruction → response mapping
- Noisy labels: Wrong, inconsistent, or ambiguous targets
- Duplicates: Same examples repeated → overfit to those patterns
- Length bias: All short or all long → model learns length, not content
- Format inconsistency: Mixed JSON styles, varying delimiters
- Contamination: Test examples in training set → fake good results
For most task-specific fine-tuning, 1,000–5,000 high-quality examples is the sweet spot. Below 500, you risk underfitting. Above 10K, you get diminishing returns unless data quality remains excellent. A team that spends 80% of effort on data curation and 20% on training will outperform a team that does the opposite.
Fine-tuned models often appear to improve during training but are actually memorizing. When train loss decreases but validation loss stalls or increases, the model is overfitting to training examples rather than learning generalizable patterns.
- Loss divergence: Training loss decreases, validation does not
- Output similarity: Outputs become overly similar to training examples
- Brittleness: Performance drops on slightly different inputs
- Memorization: Model reproduces training text verbatim
- Narrow behavior: Model only handles exact patterns from training
- Strong validation set: 10–20% of data, held out strictly
- Early stopping: Monitor validation loss, stop when it stalls
- Fewer epochs: 1–3 epochs is often sufficient
- More data diversity: Increase variety, not just volume
- Out-of-distribution eval: Test on inputs unlike training
After training, show the model inputs that are similar but not identical to training examples. If performance is significantly worse than on training-like inputs, you're overfitting. A well-generalized model should handle reasonable variations without degradation.
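The early-stopping rule listed above (monitor validation loss, stop when it stalls) can be sketched as follows. The `patience` and `min_delta` parameters and the loss values are illustrative assumptions, not settings prescribed by this guide.

```python
def early_stop_index(val_losses, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which to stop, or None.

    Training stops once validation loss has failed to improve on the best
    value seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


# Hypothetical validation losses: improvement stalls after the third evaluation.
history = [1.20, 0.95, 0.88, 0.90, 0.93, 0.97]
stop_at = early_stop_index(history)
print(stop_at)
```

In practice you would restore the checkpoint with the best validation loss (index 2 here), not the one at the stopping point.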
| Source | Quality | Cost | Best For | Watch Out |
|---|---|---|---|---|
| Production logs | Real distribution | Free (you have it) | Domain adaptation, improving existing models | Needs labeling; may contain PII |
| Expert annotation | Highest quality | $50–$500+ per hour | Small high-quality datasets (500–2K) | Expensive at scale; expert availability |
| Crowd annotation | Variable | $0.10–$2 per sample | Scaling up with quality controls | Needs rigorous QA; inter-annotator agreement |
| Synthetic (LLM-generated) | Good if filtered | $0.001–$0.01 per sample | Bootstrapping, format examples, augmentation | Model collapse risk; echo chamber |
| Public datasets | Variable | Free | Pre-training, general instruction tuning | May be contaminated; license issues |
A powerful technique: use a larger, more capable model (GPT-4o, Claude Sonnet) to generate training examples for a smaller model. Several well-known open-source instruction-tuned models (Alpaca, Vicuna, etc.) were trained on data produced by larger models.
Generate diverse (instruction, response) pairs from a seed set. Use the LLM to create variations and new tasks.
- Start with 100–200 seed examples
- Generate 5K–50K synthetic examples
- Filter aggressively for quality
Take simple instructions and evolve them into harder, more complex versions. Creates curriculum progression.
- Simple → Complex → Multi-step
- Add constraints, edge cases
- Reject trivial generations
Run your production queries through a large model; use outputs as training signal for a smaller model.
- Deploy large model first
- Log (input, output) pairs
- Fine-tune small model on logs
Model collapse: Training on LLM outputs can amplify quirks and reduce diversity. Echo chamber: Synthetic data reflects the biases of the generating model. Format homogeneity: LLMs generate in patterns, so synthetic datasets often lack the noise and variation of real human data. Mitigation: Always filter synthetic data, mix with real data (20–50% real), and validate on held-out human-generated test sets.
Duplicate or near-duplicate examples cause the model to memorize rather than generalize. Deduplication is not optional; it is a required preprocessing step for any fine-tuning dataset.
| Dedup Method | What It Catches | Implementation | When to Use |
|---|---|---|---|
| Exact hash | Identical text strings | MD5/SHA256 of normalized text | Always (baseline dedup) |
| N-gram overlap | High textual similarity (70%+ overlap) | MinHash, n-gram Jaccard | Medium-sized datasets (<100K) |
| Embedding similarity | Semantic duplicates (same meaning, different words) | Embed + ANN search (FAISS) + threshold | When paraphrases are a concern |
| Train/test leakage check | Test examples leaked into training | Hash match between splits | Always (critical for valid eval) |
Step 1: Normalize text (lowercase, strip whitespace, remove special chars) → Step 2: Exact dedup by hash → Step 3: Near-dedup by MinHash (threshold 0.7) → Step 4: Check for train/test/val leakage → Step 5: Log dedup stats (how many removed, from which sources). A 10% dedup rate is normal; >30% suggests data collection problems.
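Steps 1 to 3 and 5 of the pipeline above can be sketched with the standard library alone. Note the substitution: this sketch computes exact n-gram Jaccard similarity by brute force instead of MinHash, which is fine for small datasets but scales quadratically; MinHash approximates the same quantity efficiently. The normalization rule (ASCII letters and digits only) and the three-word shingle size are assumptions for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop non-alphanumeric characters, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Set of word n-grams; short texts become a single shingle."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(samples, threshold=0.7):
    """Remove exact duplicates by hash, then near duplicates by Jaccard."""
    seen_hashes = set()
    kept = []  # (original text, shingle set) for every sample we keep
    stats = {"exact": 0, "near": 0}
    for sample in samples:
        norm = normalize(sample)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            stats["exact"] += 1
            continue
        seen_hashes.add(digest)
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            stats["near"] += 1
            continue
        kept.append((sample, sh))
    return [text for text, _ in kept], stats


samples = [
    "How do I reset my password?",
    "how do i reset my password",        # exact duplicate after normalization
    "How do I reset my password now?",   # near duplicate (Jaccard 0.8)
    "What is your refund policy?",
]
unique, stats = dedup(samples)
print(unique, stats)
```

The same hash set can serve the leakage check in step 4: hash every normalized test example and confirm none of those hashes appears among the training hashes.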
Training: 80–90% (the model learns from this)
Validation: 5–10% (used during training to tune hyperparameters and detect overfitting)
Test: 5–10% (held out until final evaluation; never touched during training)
Rule: Validation and test sets must be representative of production distribution
- No validation set → can't detect overfitting during training
- Test set too small → results have high variance
- Leakage between splits → fake good metrics
- Random split when data has structure → split by time, user, or document instead
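When data has structure, one way to avoid leakage is to assign whole groups (users, documents) to a split rather than individual rows. The sketch below hashes a group key into a bucket so the assignment is deterministic across runs; the `user_id` field name and the 10%/10% fractions are assumptions for illustration.

```python
import hashlib


def assign_split(group_id, val_frac=0.1, test_frac=0.1):
    """Map a group key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-ish value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_by_group(records, key="user_id"):
    """Split records so that all rows sharing a group key land in one split."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        splits[assign_split(str(record[key]))].append(record)
    return splits


# Hypothetical dataset: 50 users with 3 queries each.
records = [{"user_id": f"user-{u}", "query": f"question {q}"} for u in range(50) for q in range(3)]
splits = split_by_group(records)
print({name: len(rows) for name, rows in splits.items()})
```

Because assignment depends only on the key, new records from an existing user always join that user's split, so the test set stays clean as the dataset grows. For time-based splits, sort by timestamp and cut at a date instead.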
| Check | What to Verify | Red Flags |
|---|---|---|
| Format validation | All samples parse correctly; required fields present | JSON parse errors, missing messages, empty content |
| Length distribution | Reasonable spread of input/output lengths | All samples same length, or extreme outliers |
| Label balance | Classes roughly balanced (or intentionally weighted) | 90% one class, 10% others → model ignores minority |
| Quality audit | Random sample of 50–100 manually reviewed | >5% have errors, inconsistencies, or low quality |
| Deduplication | Exact + near duplicates removed | >10% duplicates; any train/test leakage |
| Split validation | No overlap between train/val/test; splits representative | Same examples in train and test |
| Tokenization check | Samples don't exceed context window after tokenization | Samples truncated silently during training |
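The format-validation row of the checklist can be automated with a small per-sample check. This is a sketch for chat-format data only: the allowed roles, the rule that a sample must end with an assistant turn, and the character budget (a crude stand-in for a real tokenizer-based length check) are all assumptions.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_sample(sample, max_chars=8000):
    """Return a list of problems found in one chat-format sample (empty if clean)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty messages"]
    problems = []
    total_chars = 0
    for i, msg in enumerate(messages):
        role = msg.get("role")
        content = msg.get("content")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
        else:
            total_chars += len(content)
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    if total_chars > max_chars:
        # Character count only approximates token count; use the model's tokenizer for the real check.
        problems.append("sample exceeds length budget")
    return problems


good = {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}
bad = {"messages": [{"role": "user", "content": ""}]}
print(validate_chat_sample(good), validate_chat_sample(bad))
```

Running this over the full dataset and counting problems by type gives a quick picture of which red flags from the table apply.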
The true value of fine-tuning is not the model itself; it's the data pipeline you build. The model is a snapshot; the pipeline is an asset that compounds over time.
Data collection: Log production queries and model outputs
Failure identification: Find where model underperforms
Dataset improvement: Add examples for failure cases
Automated retraining: Regular model updates
Month 1: 1,000 examples, 70% quality
Month 3: 3,000 examples, 80% quality
Month 6: 8,000 examples, 90% quality
Month 12: Competitors can't catch up
A fine-tuned model is a commodity; anyone can train one. A data pipeline that continuously improves from production usage is a moat. Teams that build the flywheel pull ahead over time; teams that treat fine-tuning as a one-time event get left behind.
Chapter 02: Key Takeaways
- Fine-tuning performance is 90% data, 10% hyperparameters; data engineering is the real work
- Three main formats: completion (text), chat/instruction (messages), preference (chosen/rejected pairs)
- 1,000–5,000 high-quality samples is the sweet spot; more data only helps if quality stays high
- Synthetic data is powerful but risky: filter aggressively, mix with real data, validate on human test sets
- Deduplication is not optional: exact hash → MinHash → embedding similarity → train/test leakage check
- Before training: format validation, length distribution, label balance, quality audit, dedup, split validation, tokenization check
Parameter-efficient fine-tuning (PEFT) is the practical way to fine-tune large models. Instead of updating all 7B–70B parameters, you train small adapter layers that modify model behavior. This can reduce GPU memory by roughly 80% and training cost by roughly 90%, and it makes experimentation fast. For most use cases, PEFT matches full fine-tuning quality while being dramatically easier to run.
LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable matrices to specific layers. Instead of updating a 4096×4096 weight matrix, you learn two small matrices (4096×r and r×4096, where r=8–64) that together approximate the weight update. The original model is unchanged; you're just learning a delta.
The key insight: fine-tuning updates live in a low-rank subspace. You don't need to modify all ~16.8M parameters of that matrix; the behavioral changes can be captured by two small matrices totaling roughly 65K–524K parameters (2 × 4096 × r). The base model provides general intelligence; LoRA provides task-specific steering. At inference, A×B is merged into W, so there is zero inference overhead.
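The parameter arithmetic and the merge step can be checked with a few lines of plain Python. This sketch follows the naming used above (A is d×r, B is r×d) and assumes the standard LoRA scaling of the update by alpha/r; the tiny 2×2 example is only there to make the merge visible.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is d_in x r, B is r x d_out."""
    return r * (d_in + d_out)


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def merge_lora(w, a, b, alpha, r):
    """Return W' = W + (alpha / r) * (A @ B), the merged weight used at inference."""
    scale = alpha / r
    delta = matmul(a, b)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


full = 4096 * 4096
counts = {r: lora_param_count(4096, 4096, r) for r in (8, 16, 64)}
print(full, counts)

# Tiny rank-1 example: a 2x2 identity weight with a low-rank update merged in.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [2.0]]        # 2 x 1
b = [[3.0, 4.0]]          # 1 x 2
merged = merge_lora(w, a, b, alpha=2, r=1)
print(merged)
```

At rank 16 the adapter holds under 1% of the parameters of the matrix it modifies, which is why the merged model has the same shape and inference cost as the base model.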
| Hyperparameter | What It Controls | Recommended Range | Trade-off |
|---|---|---|---|
| Rank (r) | Capacity of the adaptation: how much information the LoRA can encode | 8–64 (start with 16) | Higher = more capacity but more params and risk of overfitting |
| Alpha (α) | Scaling factor for the LoRA update; controls the magnitude of behavior change | α = r or α = 2r (default: same as rank) | Higher = stronger adaptation; too high = instability |
| Target modules | Which layers get LoRA adapters | q_proj, v_proj (minimum) or all attention + MLP | More modules = more capacity but slower training |
| Dropout | Regularization during training | 0.05–0.1 (often 0) | Helps prevent overfitting on small datasets |
Classification, sentiment, simple extraction. Low capacity needed β small LoRA prevents overfitting.
- Params: ~50K
- Memory: +2% vs base
- Risk: May underfit complex tasks
Instruction following, domain adaptation, style transfer. Good balance of capacity and efficiency.
- Params: ~100K–250K
- Memory: +5% vs base
- Most common production choice
Major behavior changes, multi-task, domain pre-training. Higher capacity when you have data to support it.
- Params: ~500K–2M
- Memory: +10% vs base
- Risk: Overfitting on small datasets
QLoRA combines 4-bit quantization of the base model with LoRA training. The base model is loaded in 4-bit (NF4) format, reducing weight memory by about 4× relative to 16-bit. LoRA adapters are still trained in higher precision. This allows fine-tuning a 7B model on a single 16GB GPU or a 70B model on a single A100 80GB.
| Model Size | Full FT Memory | LoRA Memory | QLoRA (4-bit) Memory | Consumer GPU? |
|---|---|---|---|---|
| 7B | ~56GB (4× A100) | ~28GB (A100 40GB) | ~8GB (RTX 4090 / 3090) | Yes |
| 13B | ~104GB (8× A100) | ~52GB (A100 80GB) | ~14GB (RTX 4090) | Yes |
| 70B | ~560GB (16× A100) | ~280GB (8× A100) | ~48GB (A100 80GB) | No |
QLoRA reduces memory by 4× but introduces quantization error. For most tasks, quality is within 1–2% of full-precision LoRA. However: (1) very small datasets may overfit more easily with QLoRA, (2) complex reasoning tasks sometimes benefit from full precision, (3) merging QLoRA adapters to full precision for serving requires dequantization. Start with QLoRA; switch to full-precision LoRA only if quality is noticeably worse.
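The 4× figure comes straight from bits per weight, which a short calculation makes explicit. This sketch estimates memory for the base weights only, using decimal gigabytes; it is an assumption-laden lower bound, and the totals in the table above are higher because training also needs room for adapters, optimizer state, activations, and quantization metadata.

```python
def weights_gb(params_billion, bits_per_weight):
    """Memory for the model weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = weights_gb(size, 16)
    nf4 = weights_gb(size, 4)
    print(f"{size}B: 16-bit weights ~{fp16:.1f} GB, 4-bit weights ~{nf4:.1f} GB ({fp16 / nf4:.0f}x smaller)")
```

For a 70B model the 4-bit weights alone come to about 35 GB, which is consistent with the table placing 70B QLoRA on an 80GB card rather than a consumer GPU.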
| Method | Key Idea | When to Use | Status |
|---|---|---|---|
| LoRA | Low-rank matrices A, B added to frozen weights | Default choice: well-understood, broadly supported | Production-ready |
| QLoRA | LoRA + 4-bit base model quantization | When GPU memory is constrained | Production-ready |
| DoRA | Decomposes weights into magnitude + direction; learns direction | Small quality improvement over LoRA in some tasks | Emerging; worth testing |
| AdaLoRA | Adaptive rank: automatically allocates rank per layer | When optimal rank varies by layer | Experimental |
| LoRA+ | Different learning rates for A, B matrices | Minor optimization; easy to add | Experimental |
| Prefix Tuning | Prepend trainable "virtual tokens" to input | Legacy approach; LoRA is generally better | Largely superseded |
Start with LoRA or QLoRA: they're the most tested, best supported, and work for 90%+ of use cases. Try DoRA if you need an extra 1–2% quality improvement and are willing to experiment. Everything else is research-stage or niche; don't use it unless you have a specific reason and can validate the improvement on your eval set.
During inference, you can either (1) load adapters separately and apply at runtime, or (2) merge adapters into base weights for a single merged model. Merging is preferred for production: zero inference overhead, simpler deployment.
Pros: Can swap adapters at runtime; multiple adapters per base model
Cons: Slight inference overhead; more complex serving
Use when: Multi-tenant serving, A/B testing adapters
Pros: Zero overhead; standard model format; simple deployment
Cons: Can't swap at runtime; produces a full model copy
Use when: Single-purpose deployment (most cases)
Fine-tuning changes how a model behaves at inference time β not just on your target task, but across all tasks. These trade-offs affect production system design.
- Task consistency: More reliable outputs on trained patterns
- Format compliance: Better adherence to target structure
- Latency (if distilled): Smaller model can match larger model
- Reduced prompting: Less instruction needed in prompt
- Flexibility: Less adaptable to unexpected inputs
- General capability: Worse on tasks outside training distribution
- Creativity: More constrained, less diverse outputs
- Instruction following: May ignore prompts that conflict with training
Production systems often route between base and fine-tuned models: use the fine-tuned model for in-domain queries where it excels, fall back to the base model for out-of-domain or uncertain cases. This preserves flexibility while gaining specialization. Implement confidence-based routing or query classification to decide which model handles each request.
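A query-classification router of the kind described above can be sketched in a few lines. Everything here is a placeholder: the keyword-overlap score stands in for a real classifier or confidence estimate, and the domain terms, threshold, and model labels are hypothetical.

```python
def keyword_score(query, domain_terms):
    """Fraction of distinct query words that are known domain terms (toy in-domain score)."""
    words = set(query.lower().split())
    if not words:
        return 0.0
    return len(words & domain_terms) / len(words)


def route(query, domain_terms, threshold=0.3):
    """Send in-domain queries to the fine-tuned model, everything else to the base model."""
    if keyword_score(query, domain_terms) >= threshold:
        return "fine_tuned"
    return "base"


# Hypothetical billing-support domain.
BILLING_TERMS = {"refund", "invoice", "billing", "charge"}

print(route("refund status for invoice 1042", BILLING_TERMS))
print(route("write a haiku about autumn", BILLING_TERMS))
```

Defaulting to the base model when the score is low or the query is empty keeps the fallback path conservative, which matches the goal of preserving flexibility for out-of-domain inputs.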
Chapter 03: Key Takeaways
- LoRA trains 0.1–1% of parameters by adding small low-rank matrices to frozen base weights
- Key hyperparams: rank r (start 16), alpha (= r or 2r), target modules (q, v minimum)
- QLoRA combines 4-bit quantization + LoRA; fine-tune 7B on a 16GB GPU
- DoRA offers small quality gains; everything else is experimental, so stick with LoRA/QLoRA
- Merge adapters for production: zero inference overhead, simpler deployment
- LoRA matches full fine-tuning quality for 90%+ of tasks at 10× lower cost
Full fine-tuning updates every parameter in the model. It's the most powerful form of adaptation, and the most dangerous. Use it when LoRA isn't enough, you have abundant high-quality data, and you're prepared to invest in compute and validation. Most teams never need it; some absolutely do.
| Scenario | Why Full FT? | Typical Data Volume |
|---|---|---|
| Continued pre-training | Adding domain knowledge to the base model (code, legal, medical corpus) | 10M–1B+ tokens |
| Language adaptation | Adapting to a new language the base model doesn't handle well | 1B+ tokens |
| LoRA ceiling reached | LoRA quality plateaus; ablation shows more capacity needed | 50K–500K samples |
| Model distillation | Training a smaller model to mimic a larger model's outputs | 100K–1M samples |
| Safety fine-tuning | Deep behavioral changes that touch many capabilities | 10K–100K samples |
Try LoRA first. Measure quality. If quality is not sufficient and you have evidence that more capacity is needed (ablation with higher rank doesn't help, or quality improves with more data but plateaus), then consider full fine-tuning. Full fine-tuning is a last resort, not a default.
| Model Size | Min GPUs (FSDP/ZeRO-3) | Memory per GPU | Training Time (10K samples) | Cloud Cost |
|---|---|---|---|---|
| 7B | 4× A100 40GB | ~28GB per GPU | 8–16 hours | $200–$500 |
| 13B | 8× A100 40GB | ~32GB per GPU | 16–32 hours | $500–$1,500 |
| 70B | 16× A100 80GB | ~70GB per GPU | 48–120 hours | $5,000–$20,000 |
Fully Sharded Data Parallel (FSDP): shards model parameters, optimizer states, and gradients across GPUs. First choice for PyTorch users.
- Built into PyTorch ≥2.0
- Good Hugging Face integration
- Requires homogeneous GPU cluster
Microsoft's distributed training library. ZeRO-3 shards everything; ZeRO-Offload uses CPU memory.
- More memory-efficient than FSDP
- Better for heterogeneous setups
- Slightly more complex config
Managed training: Lambda Labs, RunPod, AWS SageMaker, Google Vertex AI.
- Pre-configured multi-GPU
- Spot instances for cost savings
- Pay per hour; no capital expense
Full fine-tuning requires learning rates 10–100× smaller than pre-training. Too high → catastrophic forgetting and instability. Too low → no learning. The optimal range is narrow and model-dependent.
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-6 to 5e-5 (start: 2e-5) | 10–100× lower than pre-training LR |
| LR schedule | Cosine decay or linear decay | Warm up for first 3–10% of steps |
| Batch size | 32–256 (effective, after gradient accumulation) | Larger = more stable; limited by memory |
| Epochs | 1–3 (often just 1) | More epochs → overfitting on small datasets |
| Weight decay | 0.01–0.1 | Regularization; helps prevent overfitting |
| Gradient clipping | 1.0 | Prevents gradient explosion |
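Two rows of the table lend themselves to a quick sketch: the warmup-plus-cosine schedule and the effective batch size. The function names are hypothetical, and the defaults (peak 2e-5, 5% linear warmup, decay to zero) are one choice within the ranges above, not the only valid one.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.05, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


total = 1000
for step in (0, 49, 50, 525, 1000):
    print(step, lr_at(step, total))
print(effective_batch_size(per_device_batch=4, grad_accum_steps=8, num_gpus=4))
```

Gradient accumulation is what lets a memory-limited GPU reach the 32–256 effective batch range: here 4 samples per device become 128 per optimizer step.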
With full fine-tuning, the model can forget everything it knew. Symptoms: degraded general conversation, broken instruction following on unrelated tasks, loss of chain-of-thought ability. Prevention: (1) low learning rate, (2) short training (1β3 epochs), (3) mix in general instruction data (10β20%), (4) evaluate on general benchmarks during training, not just your task.
- Loss decreases smoothly: no spikes or plateaus after warmup
- Validation loss tracks training loss: gap stays small
- Gradient norm stable: no explosions (the pre-clipping norm mostly stays near or below the 1.0 clip threshold)
- Eval metrics improve on task: accuracy/quality on hold-out set
- General benchmarks stable: MMLU, HumanEval don't drop significantly
- Loss spikes → learning rate too high or data corruption
- Validation loss increases → overfitting; stop training
- NaN loss → numerical instability; reduce LR, check data
- General capability drops → catastrophic forgetting; mix in general data
- Repetitive/degenerate outputs → model collapsed; restart with lower LR
Save checkpoints every 500-1000 steps. Keep the last 3 + best validation loss + best task metric. If training degrades, you can revert to an earlier checkpoint. For full fine-tuning, checkpoints are large (the weights plus roughly 2× the model size in optimizer states), so budget storage accordingly. Use save_only_model=True to save just weights if storage is tight; checkpoints saved this way cannot be used to resume training.
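The retention rule above (last 3, best validation loss, best task metric) can be expressed as a small selection function. This is a sketch under assumed names: `checkpoints_to_keep` and the dictionary keys `step`, `val_loss`, and `task_metric` are hypothetical, and higher `task_metric` is assumed to be better.

```python
def checkpoints_to_keep(checkpoints, keep_last=3):
    """Return the set of checkpoint steps to retain.

    checkpoints: list of dicts with 'step', 'val_loss', 'task_metric' keys.
    """
    if not checkpoints:
        return set()
    by_step = sorted(checkpoints, key=lambda c: c["step"])
    # Most recent checkpoints, so training can be resumed or rolled back.
    keep = {c["step"] for c in by_step[-keep_last:]}
    # Best validation loss (lower is better).
    keep.add(min(by_step, key=lambda c: c["val_loss"])["step"])
    # Best task metric (assumed: higher is better).
    keep.add(max(by_step, key=lambda c: c["task_metric"])["step"])
    return keep

if __name__ == "__main__":
    history = [
        {"step": 500, "val_loss": 1.9, "task_metric": 0.61},
        {"step": 1000, "val_loss": 1.2, "task_metric": 0.70},
        {"step": 1500, "val_loss": 1.3, "task_metric": 0.78},
        {"step": 2000, "val_loss": 1.4, "task_metric": 0.75},
    ]
    print(sorted(checkpoints_to_keep(history)))
```

Everything not in the returned set is a candidate for deletion, which keeps storage bounded at five checkpoints at most.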
∑ Chapter 04 — Key Takeaways
- Full fine-tuning updates all parameters; use only when LoRA isn't enough and you have abundant data
- Use cases: continued pre-training, language adaptation, LoRA ceiling reached, distillation
- Compute: 7B needs 4× A100 40GB, 70B needs 16× A100 80GB; cloud cost $200-$20K
- Learning rate: 1e-6 to 5e-5 (start 2e-5); 10-100× lower than pre-training
- Catastrophic forgetting: low LR, 1-3 epochs, mix in general data, monitor general benchmarks
- Save checkpoints every 500-1000 steps; keep best validation + best task metric + last 3
Not all fine-tuning is the same. The training objective determines what the model learns: imitate examples (SFT), prefer better responses (DPO), or optimize a reward signal (RLHF). Each has different data requirements, complexity, and outcomes. Most teams should start with SFT; add DPO when preference data is available; avoid RLHF unless you have specific alignment needs.
SFT is the simplest and most common fine-tuning objective. You show the model (input, target output) pairs, and it learns to maximize the probability of producing the target. This is what most people mean when they say "fine-tuning."
- You have examples of the exact output you want
- Task has clear right/wrong answers
- You're doing instruction tuning from scratch
- You need format control (JSON, specific templates)
- First fine-tuning attempt: start here
- Model learns to imitate, even bad patterns in data
- Can't express "this is better than that" directly
- Quality ceiling is your data quality
- Quantity-sensitive: needs 1K-50K examples
- Doesn't optimize for human preferences explicitly
Data format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Volume: 1K-5K high-quality examples for task-specific; 10K-100K for general instruction tuning. Quality bar: Every example should be one you'd be happy to ship in production. 1K excellent examples beat 10K mediocre ones.
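A quick structural check on each JSONL line catches malformed records before they reach the trainer. The sketch below assumes the messages format shown above; the function name `validate_sft_record`, the allowed role set, and the rule that the last message must be the assistant target are illustrative assumptions, not requirements of any specific library.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary

def validate_sft_record(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")
    # The training target is the final assistant turn.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant target")
    return problems

if __name__ == "__main__":
    sample = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
    print(validate_sft_record(sample))
```

Structural validity says nothing about quality; it only ensures that the examples you reviewed by hand are the ones the trainer actually sees.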
DPO learns from preference pairs: given the same prompt, which response is better? This is more directly aligned with how humans judge quality: we often know "A is better than B" even when we can't write the perfect response ourselves.
| Aspect | SFT | DPO |
|---|---|---|
| Data format | (prompt, response) | (prompt, chosen, rejected) |
| What it learns | Maximize probability of target | Increase prob(chosen) relative to prob(rejected) |
| Data collection | Write ideal responses | Generate two responses, pick better one |
| Typical volume | 1K-50K examples | 5K-50K preference pairs |
| Training stability | Very stable | Moderately stable (watch KL divergence) |
| Complexity | Simple | Requires reference model + DPO loss |
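The "increase prob(chosen) relative to prob(rejected)" row corresponds to the standard DPO loss, which compares policy and reference log-probabilities for one preference pair. The sketch below computes it for scalar sequence log-probabilities; the function name and the `beta=0.1` default are assumptions for illustration, and real training code operates on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

if __name__ == "__main__":
    # Policy identical to reference: loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference terms are what make the "requires reference model" row in the table necessary: the loss rewards movement relative to the starting model, not absolute likelihood.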
The typical pipeline: SFT first → then DPO. SFT teaches the model what kind of responses to produce; DPO refines which responses are better. Add DPO when: (1) you have clear preference signals (human feedback, A/B test results), (2) SFT quality plateaus but you can still rank responses, (3) you need to reduce specific failure modes (safety, tone, verbosity) that are easier to express as "not this" than "do this."
RLHF is the full alignment approach used by OpenAI, Anthropic, and others to train frontier models. It involves training a separate reward model on human preferences, then using RL (PPO) to optimize the LLM against that reward signal.
Step 1: Collect human preference data (A vs B rankings)
Step 2: Train a reward model to predict human preferences
Step 3: Use PPO to optimize LLM to maximize reward model score
Step 4: Add KL penalty to prevent reward hacking
Result: Model optimizes for human preferences end-to-end
- Requires training a separate reward model (expensive)
- RL training (PPO) is unstable and hard to tune
- Reward hacking is a real failure mode
- Requires 100K+ human preference annotations
- As a rule of thumb, DPO achieves roughly 90% of the benefit at 10% of the complexity
DPO was designed as a simpler alternative to RLHF that doesn't require a separate reward model or RL training. In practice, DPO matches RLHF quality for most alignment tasks at a fraction of the complexity. Unless you're a frontier lab with dedicated alignment researchers, use DPO instead of RLHF. The infrastructure and expertise required for stable RLHF training is not worth it for most production use cases.
| Your Situation | Recommended Objective | Rationale |
|---|---|---|
| First fine-tuning attempt | SFT | Simplest, fastest iteration, establishes baseline |
| You have ideal target outputs | SFT | Directly teach the model what to produce |
| Quality plateau + preference data available | SFT → DPO | SFT establishes capability; DPO refines quality |
| Reducing specific failure modes | DPO (after SFT) | Easier to express "not this" than "do this" |
| Safety / alignment at frontier scale | RLHF | Only if you have dedicated alignment team and resources |
| General instruction following | SFT (then optional DPO) | Proven pipeline from Alpaca → Vicuna → etc. |
Roughly 90% of production fine-tuning is SFT-only. It's simple, it works, and it's what you should start with. DPO typically adds a 5-15% quality improvement when you have good preference data and SFT has plateaued. RLHF is for frontier labs. Don't over-engineer your training objective: get SFT right first, add DPO if needed, and skip RLHF unless you have a very specific reason and resources.
∑ Chapter 05 — Key Takeaways
- Three training objectives: SFT (learn from examples), DPO (learn from preferences), RLHF (optimize a reward model)
- Start with SFT: simplest, fastest, works for 90% of use cases
- SFT data: (prompt, response) pairs; DPO data: (prompt, chosen, rejected) triplets
- Add DPO after SFT when: quality plateaus, you have preference data, reducing specific failures
- DPO ≈ RLHF quality at 10% complexity; use DPO instead unless you're a frontier lab
- Modern pipeline: Base → SFT → Eval → (optional) DPO → Final Eval → Ship
Fine-tuning without evaluation is guessing. You must measure both your task performance AND general capability retention. A model that aces your custom task but forgets how to reason is not a success. Build an eval suite before training, run it continuously, and never ship a model that hasn't passed your quality gates.
Evaluation for fine-tuning is different from evaluating base models. You need to measure both task-specific improvement and general capability preservation. The hierarchy prioritizes what matters most.
| Eval Level | What It Measures | When to Run | Failure Outcome |
|---|---|---|---|
| ① Task-specific | Does the model do YOUR job better? | Every checkpoint, every experiment | Model isn't useful → retrain or adjust data |
| ② Format compliance | Does output match required structure? | Every checkpoint | Downstream parsing fails → adjust training data |
| ③ Safety/regression | Did we break safety or introduce new failures? | Before shipping, major changes | Ship blocker: model produces harmful content |
| ④ General capability | Did we lose general intelligence? | Before shipping, weekly during iteration | Catastrophic forgetting → adjust LR, add general data |
A golden test set is a curated collection of examples with known-correct answers that you use to evaluate every model iteration. It's your source of truth and should be treated as sacred: never train on it, never modify it casually.
- Size: 200-500 examples minimum; 1000+ for high-stakes
- Diversity: Cover all task subtypes, edge cases, difficulty levels
- Expert-verified: Every answer reviewed by domain expert
- Version-controlled: Git track with change history
- Never contaminated: Must not appear in training data
- Too small: <100 examples → results have high variance
- Homogeneous: All easy examples → misses edge cases
- Stale: Not updated as task evolves
- Leaked: Examples also in training data → inflated scores
- Ambiguous: Multiple correct answers without accounting for them
Allocate at least 10% of your data curation effort to building and maintaining your golden test set. This is non-negotiable. A team that spends all effort on training data but has a weak eval set will ship bad models and not know it. The golden set is how you know if your fine-tuning is working.
| Task Type | Primary Metric | Secondary Metrics | Implementation |
|---|---|---|---|
| Classification | Accuracy, Macro-F1 | Per-class precision/recall, confusion matrix | Exact match against labels |
| Extraction (NER, slots) | Entity-level F1 | Partial match rate, span accuracy | Compare extracted entities to gold |
| Generation (summaries, content) | ROUGE-L, BERTScore | Human preference, factual accuracy | Automated + sampling for human review |
| Structured output (JSON) | Schema validity + field accuracy | Parse success rate, field-level F1 | JSON parse test + field extraction check |
| Code generation | Pass@1, Pass@5 | Syntax validity, test case pass rate | Execute against test cases |
| QA / reasoning | Exact match, LLM-as-judge | Chain-of-thought quality | String match + GPT-4 evaluation |
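For the structured-output row, parse success rate and field accuracy can be computed together in one pass. This is a minimal sketch assuming flat JSON objects and exact-match field comparison; `score_json_outputs` is a hypothetical name, and nested schemas would need a recursive comparison.

```python
import json

def score_json_outputs(outputs, gold_records):
    """Return (parse_success_rate, field_accuracy).

    outputs: raw model output strings.
    gold_records: list of flat dicts with the expected fields and values.
    """
    parsed_ok = 0
    fields_total = 0
    fields_correct = 0
    for raw, gold in zip(outputs, gold_records):
        fields_total += len(gold)
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output: all of its fields count as wrong
        if not isinstance(pred, dict):
            continue  # valid JSON but not an object: treated as a failure
        parsed_ok += 1
        fields_correct += sum(1 for k, v in gold.items() if k in pred and pred[k] == v)
    n = len(gold_records)
    parse_rate = parsed_ok / n if n else 0.0
    field_acc = fields_correct / fields_total if fields_total else 0.0
    return parse_rate, field_acc

if __name__ == "__main__":
    outs = ['{"code": "A10", "qty": 2}', "not json", '{"code": "B20", "qty": 5}']
    gold = [{"code": "A10", "qty": 2}, {"code": "C30", "qty": 1}, {"code": "B20", "qty": 4}]
    print(score_json_outputs(outs, gold))
```

Reporting the two numbers separately matters: a model can parse 100% of the time and still fill fields incorrectly, and the fixes for each failure are different.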
BLEU, ROUGE, and even BERTScore correlate poorly with human preferences for open-ended generation. They're useful for directional signals but not final quality judgment. For generation tasks, always supplement automated metrics with LLM-as-judge evaluation and periodic human spot-checks (review 20-50 random examples each iteration).
For open-ended tasks where exact matching fails, use a stronger model (GPT-4o, Claude Sonnet) to judge the quality of your fine-tuned model's outputs. This correlates better with human preferences than traditional metrics.
Scalable: Evaluate 1000s of examples at $0.01-$0.05 each
Consistent: No annotator fatigue or mood variation
Nuanced: Can evaluate subtle quality differences
Fast: Results in minutes, not days
Position bias: May prefer first or second response
Verbosity bias: May prefer longer responses
Self-preference: GPT-4 may prefer GPT-4-style outputs
Mitigation: Randomize order, calibrate with human baseline
Before trusting LLM-as-judge, validate against 50-100 human-labeled examples. Compute agreement rate (should be >80% for binary better/worse). If agreement is low, refine your evaluation prompt or criteria. Also test for position bias by running each comparison twice with swapped order; the disagreement rate should be <10%.
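Both calibration checks described above reduce to simple counting once the verdicts are collected. The sketch below assumes you have already obtained judge and human labels as lists; the function names are hypothetical, and verdicts in the position check are assumed to name the winning response (not the slot it appeared in), so a consistent judge gives the same winner in both orderings.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

def position_flip_rate(winners_original, winners_swapped):
    """Fraction of comparisons whose winner changes when order is swapped."""
    if len(winners_original) != len(winners_swapped) or not winners_original:
        raise ValueError("verdict lists must be non-empty and the same length")
    flips = sum(a != b for a, b in zip(winners_original, winners_swapped))
    return flips / len(winners_original)

if __name__ == "__main__":
    judge = ["better", "worse", "better", "better"]
    human = ["better", "worse", "worse", "better"]
    print("agreement:", agreement_rate(judge, human))  # target: > 0.80
    first = ["tuned", "base", "tuned", "tuned"]
    second = ["tuned", "base", "base", "tuned"]
    print("flip rate:", position_flip_rate(first, second))  # target: < 0.10
```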
Benchmark contamination occurs when your test data appears in training data. The model memorizes answers rather than learning to reason. This is the #1 cause of inflated evaluation scores that don't reflect production performance.
| Contamination Type | How It Happens | Detection Method | Prevention |
|---|---|---|---|
| Direct leakage | Test set examples in training set | Hash matching, n-gram overlap | Strict train/test split management |
| Paraphrase leakage | Same question, different wording in train | Embedding similarity search | Semantic dedup across splits |
| Public benchmark contamination | Test set is public (MMLU, HumanEval) and in web scrapes | Hard to detect | Use held-out custom evals |
| Synthetic data feedback | Generated training data includes benchmark patterns | Manual audit of synthetic data | Exclude benchmark topics from generation |
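The first detection method in the table (hash matching plus n-gram overlap) can be sketched with the standard library alone. The normalization rules, the 8-gram window, and the 0.5 overlap threshold below are assumptions to tune for your data; paraphrase leakage needs embedding similarity and is not covered here.

```python
import hashlib
import re

def _normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def _ngrams(text, n):
    tokens = _normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def find_contaminated(test_examples, train_examples, n=8, overlap_threshold=0.5):
    """Return (test_index, reason) pairs for test examples that leak into train."""
    train_hashes = {
        hashlib.sha256(_normalize(t).encode("utf-8")).hexdigest()
        for t in train_examples
    }
    train_ngrams = set()
    for t in train_examples:
        train_ngrams |= _ngrams(t, n)

    flagged = []
    for idx, example in enumerate(test_examples):
        digest = hashlib.sha256(_normalize(example).encode("utf-8")).hexdigest()
        if digest in train_hashes:
            flagged.append((idx, "exact"))
            continue
        grams = _ngrams(example, n)
        # Share of this test example's n-grams that also occur in training data.
        if grams and len(grams & train_ngrams) / len(grams) >= overlap_threshold:
            flagged.append((idx, "ngram"))
    return flagged

if __name__ == "__main__":
    train = ["What is the capital of France? The capital of France is Paris."]
    test = ["what is the capital of france?  The capital of France is Paris!"]
    print(find_contaminated(test, train))
```

Running this as a gate in the data pipeline, before every training run, is cheaper than discovering leaked examples after scores look suspiciously good.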
MMLU, HumanEval, GSM8K, and other popular benchmarks exist in web scrapes that went into pre-training data. A fine-tuned model that scores well on these may be memorizing, not reasoning. For production decisions, always maintain a private golden test set that has never been published. Public benchmarks are useful for comparing to literature but not for shipping decisions.
Fine-tuning can break capabilities the base model had. Regression testing detects this by comparing your fine-tuned model against the base model on a held-out general capability set.
- MMLU subset: 200-500 questions across domains
- Instruction following: Can it still follow basic prompts?
- Conversation quality: Multi-turn coherence
- Reasoning: Simple chain-of-thought problems
- <2% drop: Acceptable; normal fine-tuning cost
- 2-5% drop: Warning; may need to adjust
- >5% drop: Problem; likely catastrophic forgetting
- Always compare to base model as reference
- Refusal rate: Should be similar to base model
- Jailbreak resistance: Common attack prompts
- Harmful content: Violence, bias, PII leakage
- Run dedicated red-team eval before shipping
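The regression thresholds above translate directly into an automated gate. One assumption to flag: the sketch treats "drop" as absolute percentage points on a 0-100 score, since the text does not say whether drops are absolute or relative; `regression_status` is a hypothetical helper name.

```python
def regression_status(base_score, tuned_score):
    """Classify the capability drop of a fine-tuned model versus its base.

    Scores are percentages (0-100); the drop is measured in percentage points.
    """
    drop = base_score - tuned_score
    if drop < 2.0:
        return "acceptable"
    if drop <= 5.0:
        return "warning"
    return "problem"

if __name__ == "__main__":
    # Hypothetical benchmark results: base model vs. fine-tuned model.
    for name, base, tuned in [("mmlu_subset", 65.0, 64.0), ("reasoning", 58.0, 51.5)]:
        print(name, regression_status(base, tuned))
```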
∑ Chapter 06 — Key Takeaways
- Evaluation hierarchy: task-specific → format compliance → safety/regression → general capability
- Build a golden test set of 200-500+ expert-verified examples before training; never train on it
- For open-ended tasks, use LLM-as-judge (GPT-4); calibrate with 50-100 human labels first
- Benchmark contamination causes inflated scores; rely on private eval sets for shipping decisions
- Regression testing: <2% drop acceptable, 2-5% warning, >5% catastrophic forgetting
- Automate eval pipeline: quick eval every checkpoint, full eval on best checkpoints + before ship
Instruction tuning transforms a raw language model into an assistant. It's what makes the difference between a model that continues text and one that follows instructions. The key insight: task diversity matters more than task volume. A model trained on 1,000 diverse instructions often outperforms one trained on 100,000 similar instructions.
Base language models predict the next token. They're excellent at completing text but terrible at following instructions. Instruction tuning teaches the model to interpret an instruction and produce the requested output rather than just continuing the text pattern.
Input: "Translate to French: Hello, how are you?"
Output: "Translate to Spanish: Hola, cómo estás?"
→ Continues the pattern, doesn't follow the instruction
Input: "Translate to French: Hello, how are you?"
Output: "Bonjour, comment allez-vous?"
→ Understands and executes the instruction
Before instruction tuning (pre-InstructGPT era), LLMs required careful prompt engineering to produce useful outputs. Instruction tuning (SFT on instruction-response pairs) made models that try to help by default. This is the foundation of ChatGPT, Claude, and every modern assistant. If you're fine-tuning a base model, instruction tuning is almost always the first step.
The LIMA paper showed that a 65B model fine-tuned on just 1,000 carefully curated examples can match models trained on 50K+ examples. The secret: diversity and quality over volume. Each example should teach something different.
- Multiple task types (QA, summarization, code, creative)
- Varied instruction styles (direct, conversational, formal)
- Different response lengths (one-liner to multi-paragraph)
- Diverse domains (science, arts, business, technical)
- Edge cases and difficult examples
- Same task repeated with variations
- All examples same format/length
- Single domain focus
- Template-generated similar examples
- Missing difficulty spectrum
- At least 10 distinct task categories
- Both short and long responses represented
- Single-turn and multi-turn conversations
- Factual and creative tasks
- Easy, medium, and hard difficulty
1,000 high-quality, diverse examples often beat 50,000 similar examples. Why? Large models already have the capability: instruction tuning is about eliciting existing capability, not teaching new knowledge. Diversity shows the model the range of behaviors expected; quality shows it the standard to meet. Volume alone just overfits to a narrow distribution.
Modern models use specific chat templates with special tokens to mark turns and roles. Using the wrong template causes the model to ignore your instructions or produce garbled output. Always use the exact template the base model was trained with.
| Model Family | Template Style | Example |
|---|---|---|
| Llama 3 / 3.1 | Special tokens: <|start_header_id|> etc. | <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|> |
| Mistral / Mixtral | [INST] and [/INST] tokens | [INST] Hello, how are you? [/INST] I'm doing well! |
| ChatML (OpenAI style) | <|im_start|> and <|im_end|> | <|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n |
| Vicuna / Alpaca | Plain text markers | USER: Hello\nASSISTANT: |
Using the wrong chat template is a common cause of "my fine-tuned model is worse than base." The model has been trained to recognize specific token patterns; if your data uses different patterns, the model treats it as noise. Always verify your training data uses exactly the same template as the tokenizer's apply_chat_template() method.
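To make the template mechanics concrete, here is a hand-rolled formatter for the ChatML row of the table. It is an illustration only: in practice the tokenizer's apply_chat_template() should be the source of truth, and `format_chatml` is a hypothetical helper useful mainly for comparing against what your data pipeline actually emits.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in ChatML style."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should respond next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

if __name__ == "__main__":
    rendered = format_chatml([{"role": "user", "content": "Hello"}])
    print(repr(rendered))
```

A useful sanity check is to render a few training examples this way (or with the real tokenizer) and diff them against the raw strings in your dataset; any mismatch in special tokens or newlines is a likely source of silent degradation.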
| Dataset | Size | Quality | Best For | Notes |
|---|---|---|---|---|
| LIMA | 1,000 | Excellent (curated) | Proving quality > quantity | Research dataset; not for commercial use |
| OpenAssistant | 160K | Variable (crowdsourced) | General assistant, multi-turn | Apache 2.0 license; filter for quality |
| Alpaca (Stanford) | 52K | Moderate (synthetic) | Quick bootstrapping | GPT-3.5 generated; format reference only |
| Dolly (Databricks) | 15K | Good (employee-written) | Commercial use base | CC-BY-SA license; human-written |
| UltraChat | 1.5M | Variable (synthetic) | Large-scale instruction tuning | Filter heavily; mix with human data |
| ShareGPT (user convos) | Variable | Variable | Real user distribution | Legal gray area; quality varies wildly |
Don't use any single dataset as-is. Combine: (1) High-quality curated examples (200-500, manual or filtered) + (2) Task-diverse public dataset (5K-20K, filtered) + (3) Your domain-specific examples (as many as you have). Filter for quality, deduplicate, and shuffle. The blend matters more than any single source.
Single-turn (one instruction, one response) training produces models that answer questions but struggle with context over multiple turns. Include multi-turn conversations in your training data to teach the model to maintain coherence.
- 50-70%: Single-turn (clear instruction → response)
- 20-30%: 2-3 turn conversations
- 10-20%: 4+ turn conversations
- Include clarification, follow-up, topic shift patterns
- Vary conversation styles (casual, professional, technical)
System prompts set the context for a conversation. If you want your model to respect system prompts in production, you must include them in training. But don't over-rely on system prompts: behavior should be robust without them too.
Production use: You'll use system prompts to control behavior
Persona switching: Model needs to adopt different roles
Constraint instruction: Format/safety rules in system prompt
Mix: ~50% with system prompt, ~50% without
Over-reliance: Model only works with specific system prompt
Inconsistency: Different system prompts in train vs production
Brittleness: Small changes in system prompt break behavior
Solution: Test behavior both with and without system prompts
∑ Chapter 07 — Key Takeaways
- Instruction tuning transforms base models into assistants; it elicits existing capability
- LIMA insight: 1,000 diverse, high-quality examples can match 50K similar examples
- Diversity checklist: multiple task types, varied lengths, different domains, difficulty spectrum
- Use the exact chat template your base model expects; template mismatch = silent failure
- Data mix: curated quality (200-500) + filtered public (5K-20K) + domain-specific
- Include multi-turn conversations (20-30%) and system prompts (~50%) in training data
Domain adaptation specializes a general model for specific fields. The approach differs by domain: some need continued pre-training on raw text; others just need task-specific SFT. The key is understanding what your domain requires and avoiding the trap of "more training = better." Domain expertise must be balanced against general capability.
Domain adaptation typically happens in two stages, though not all domains need both. The decision depends on how specialized the language and concepts are.
| Stage | What It Does | Data Required | When Needed |
|---|---|---|---|
| Continued Pre-Training | Teaches domain vocabulary, patterns, writing style | 1M-1B tokens of raw domain text | Highly specialized domains (medical, legal, scientific) |
| Domain SFT | Teaches how to apply knowledge to specific tasks | 1K-50K instruction-response pairs | Almost always; this is where task capability comes from |
Modern base models already have substantial domain knowledge from their training corpus. Try domain SFT first: it's faster and cheaper. Add continued pre-training only if: (1) the model misuses domain terminology, (2) domain text is highly specialized and underrepresented in base model, (3) you have access to 100M+ tokens of domain text. For most use cases, SFT alone is sufficient.
Challenges: Specialized terminology (ICD codes, drug names), high-stakes accuracy, regulatory requirements (HIPAA), need for citation.
- Pre-training: Often needed (PubMed, clinical notes)
- SFT: Q&A with citations, differential diagnosis, report generation
- Key risk: Hallucinated medical advice is dangerous
- Eval: MedQA, expert review, citation verification
Challenges: Precise language matters, jurisdiction-specific, precedent citation, long documents.
- Pre-training: Often needed (case law, statutes, contracts)
- SFT: Contract review, clause extraction, legal Q&A
- Key risk: Unhelpful if answer is "consult a lawyer"
- Eval: Clause identification accuracy, citation correctness
Challenges: Syntax correctness is binary, multiple languages, execution context matters.
- Pre-training: Sometimes (proprietary codebases)
- SFT: Code completion, debugging, code review, translation
- Key risk: Syntactically correct but semantically wrong
- Eval: Pass@k on test suites, human review
Challenges: Numerical precision, temporal awareness (data freshness), regulatory compliance.
- Pre-training: Rarely needed (finance language is less specialized)
- SFT: Report analysis, sentiment analysis, numerical reasoning
- Key risk: Hallucinated numbers, outdated information
- Eval: Numerical accuracy, fact verification against sources
Continued pre-training (CPT) extends the base model's pre-training on domain-specific text. It's computationally expensive but can significantly improve domain understanding for highly specialized fields.
| Parameter | Typical Value | Notes |
|---|---|---|
| Data volume | 100M-10B tokens | More is better, but quality matters |
| Learning rate | 1e-5 to 5e-5 | Lower than initial pre-training |
| Training objective | Causal LM (next token prediction) | Same as base pre-training |
| Mix with general data | 10-30% general data | Prevents catastrophic forgetting |
| Compute cost | $1K-$100K+ (depending on volume) | Full fine-tuning required; no LoRA |
Continued pre-training on only domain text causes severe catastrophic forgetting. The model becomes excellent at domain language but loses general instruction-following ability. Always mix 10-30% general data (C4, RedPajama, or high-quality web text) with your domain corpus. After CPT, always run a full regression test on general capabilities before proceeding to SFT.
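The mixing ratio is easy to get wrong because the percentage refers to the final mix, not to the domain corpus. If general data should make up a fraction f of the total, you need domain_tokens × f / (1 − f) general tokens. A small sketch, with a hypothetical helper name and a 20% default chosen from the 10-30% range above:

```python
def general_tokens_needed(domain_tokens, general_fraction=0.2):
    """Tokens of general data to add so it forms general_fraction of the mix."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    return round(domain_tokens * general_fraction / (1.0 - general_fraction))

if __name__ == "__main__":
    domain = 800_000_000  # hypothetical 800M-token domain corpus
    extra = general_tokens_needed(domain, 0.2)
    print(extra, extra / (domain + extra))
```

For an 800M-token domain corpus and a 20% target, this gives 200M general tokens, for a 1B-token mix.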
Domain SFT teaches the model to apply domain knowledge to specific tasks. This is where most value is created and where most teams should focus. SFT data should be task-specific rather than general domain text.
Quality over quantity: 500-5,000 expert-curated examples beat 50K noisy ones. Task diversity: Cover all task types you'll use in production. Realistic inputs: Use real examples from your domain, not synthetic simplifications. Include edge cases: Hard examples, errors, ambiguous cases. Expert review: Every response should be verified by a domain expert.
Domain adaptation is a trade-off: the more you specialize, the more you risk losing general capabilities. Managing this trade-off requires deliberate strategies.
| Strategy | How It Works | When to Use |
|---|---|---|
| Data mixing | Include 10-30% general instruction data in domain SFT | Always; minimal cost, significant benefit |
| Low learning rate | Use 1e-5 to 5e-5 instead of standard SFT rates | Always for domain adaptation |
| Fewer epochs | Train for 1-2 epochs instead of 3+ | Standard practice |
| LoRA instead of full FT | Freezes base model; learns adapter weights only | Default choice; preserves base capability |
| Elastic Weight Consolidation (EWC) | Penalizes changes to important weights | Experimental; not widely used |
| Separate adapter per domain | Train different LoRA adapters for different domains | Multi-domain serving scenario |
- Use LoRA: Preserves base model weights entirely
- Mix data: 70-80% domain + 20-30% general instruction
- Low LR: 1e-5 to 2e-5 for domain SFT
- Few epochs: 1-2 epochs, early stopping on validation
- Monitor regression: Check MMLU/general benchmarks
- General conversation quality drops noticeably
- Model struggles with simple reasoning tasks
- Multi-turn coherence degrades
- Instruction following becomes brittle
- MMLU score drops >5% from base model
| Domain | Standard Benchmarks | Custom Eval Needed |
|---|---|---|
| Medical | MedQA, PubMedQA, MedMCQA | Expert review of clinical recommendations; citation verification |
| Legal | LegalBench, CaseHOLD | Contract analysis accuracy; jurisdiction-specific tests |
| Code | HumanEval, MBPP, SWE-Bench | Proprietary codebase tests; style compliance |
| Finance | FinBench, TAT-QA | Numerical accuracy checks; regulatory compliance |
Standard benchmarks give you a baseline, but custom evaluation on your actual production tasks is essential. Create a golden test set of 200-500 examples representing real use cases in your domain. Include expert review for high-stakes domains (medical, legal, finance). Compare to both base model AND to human expert or GPT-4 baseline to understand where your fine-tuned model wins and loses.
∑ Chapter 08 — Key Takeaways
- Two-stage approach: Continued Pre-Training (domain language) → Domain SFT (task capability)
- Try SFT first; continued pre-training is expensive and often not needed
- Domain differences: Medical/Legal often need CPT; Code/Finance usually just SFT
- Continued pre-training requires 100M-10B tokens + 10-30% general data mixing
- Prevent catastrophic forgetting: LoRA, low LR (1e-5), 1-2 epochs, data mixing, regression tests
- Domain eval: standard benchmarks + custom production task golden set + expert review
A fine-tuned model in a notebook is worthless. The real work begins when you deploy it for production inference. This chapter covers the full path: merging adapters into base weights, quantizing for efficiency, and deploying with vLLM, Ollama, or cloud endpoints. The goal: maximize throughput, minimize latency, and keep costs under control.
Before diving into technical details, decide where your fine-tuned model will run. The choice depends on scale, latency requirements, cost constraints, and data privacy needs.
| Deployment Option | Best For | Latency | Cost at Scale | Setup Complexity |
|---|---|---|---|---|
| Cloud API (OpenAI, Anthropic) | When provider offers fine-tuning (GPT-4o, Claude) | ~200-500ms | High ($15-60/1M tokens) | Simple |
| Self-hosted vLLM | High throughput, production scale, custom models | 50-200ms | Medium (GPU cost) | Moderate |
| Ollama (local) | Development, privacy, edge deployment | 100-500ms | Low (own hardware) | Simple |
| Serverless (Modal, Replicate) | Variable load, pay-per-use | Cold start: 5-30s | Low for variable load | Simple |
| Dedicated cloud (SageMaker, Vertex) | Enterprise, compliance, managed infrastructure | 100-300ms | High (always-on) | Complex |
Privacy/compliance required? → Self-host or on-prem. Variable/bursty traffic? → Serverless. High constant throughput? → Dedicated vLLM cluster. Just need it to work? → Cloud API with fine-tuning (if available). Development/testing? → Ollama locally.
LoRA adapters are small (~10-100MB) but require the base model at inference. For production, you typically merge the adapter into the base weights to create a standalone model. This eliminates adapter overhead and simplifies deployment.
How: Load base model + adapter at runtime
Pros: Swap adapters dynamically; multiple adapters per base
Cons: Slight latency overhead; more complex serving
Use when: Multi-tenant with different adapters per customer
How: Merge adapter into base → single model file
Pros: Zero overhead; standard model format; simple deployment
Cons: Full model size; can't swap adapters at runtime
Use when: Single-purpose deployment (most cases)
If you trained with QLoRA (4-bit base model), you cannot merge directly into the quantized weights. You must: (1) load the base model in full precision (fp16/bf16), (2) load the LoRA adapter, (3) merge, (4) re-quantize if needed. This means you need enough GPU memory to hold the full-precision model temporarily (~14GB for 7B, ~140GB for 70B).
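Steps (1) to (3) above can be sketched with the Hugging Face transformers and peft libraries. Treat this as an outline rather than a tested recipe: the function name and its three arguments are hypothetical, bf16 is assumed as the full-precision dtype, and the re-quantization in step (4) is left to whichever tool matches your target format.

```python
def merge_lora_adapter(base_model_id, adapter_dir, output_dir):
    """Merge a (Q)LoRA adapter into full-precision base weights and save them."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # (1) Load the base model in bf16, not in 4-bit.
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16
    )
    # (2) Attach the trained adapter to the full-precision base.
    model = PeftModel.from_pretrained(base, adapter_dir)
    # (3) Fold the adapter deltas into the base weights and drop the wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)
    # Save the tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir
```

After merging, compare a handful of generations from the merged model against the adapter-loaded model; small numerical differences are expected, but behavior should match.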
Quantization reduces model precision (fp16 → int8 → int4) to shrink memory footprint and increase inference speed. A 7B model in fp16 is ~14GB; in 4-bit it's ~4GB. The trade-off: some quality loss, though modern quantization methods minimize this.
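The size figures follow from simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. The sketch below covers weights only; real 4-bit files are somewhat larger than the raw result because they also store quantization metadata such as scales, which is presumably why the 7B figure above is ~4GB rather than 3.5GB.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Serving needs additional memory for the KV cache and activations, so treat these numbers as a lower bound when sizing a GPU.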
| Format | Precision | Size (7B) | Quality Loss | Hardware Support | Best For |
|---|---|---|---|---|---|
| FP16 / BF16 | 16-bit | ~14GB | None (baseline) | All modern GPUs | Production where quality is critical |
| INT8 | 8-bit | ~7GB | Minimal (<1%) | Most GPUs, some CPUs | Good balance of size/quality |
| GPTQ | 4-bit | ~4GB | Small (1-3%) | NVIDIA GPUs | GPU inference, vLLM |
| AWQ | 4-bit | ~4GB | Minimal (best 4-bit) | NVIDIA GPUs | Production 4-bit, vLLM preferred |
| GGUF | 2-8 bit | ~3-7GB | Depends on quant level | CPU, Apple Silicon, GPU | Ollama, llama.cpp, local deployment |
| EXL2 | 2-8 bit (mixed) | ~3-7GB | Very good (adaptive) | NVIDIA GPUs | ExLlamaV2, high-quality 4-bit |
Universal format for CPU and cross-platform inference. Supports Q4_K_M, Q5_K_M, Q8_0, etc.
- Create: llama.cpp/convert.py
- Quantize: llama.cpp/quantize
- Best quant: Q4_K_M (balanced) or Q5_K_M (higher quality)
Activation-aware quantization. Best quality for 4-bit GPU inference.
- Create: autoawq library
- Serve: vLLM with --quantization awq
- Quality: Near-lossless for most tasks
GPU-only, well-established 4-bit format. Slightly lower quality than AWQ.
- Create: auto-gptq library
- Serve: vLLM, text-generation-inference
- Note: Being superseded by AWQ
Production GPU (vLLM): AWQ 4-bit, the best quality/speed tradeoff. Local/Edge (Ollama): GGUF Q4_K_M or Q5_K_M. Quality-critical: FP16 or INT8; avoid more aggressive quantization. Memory-constrained: Q3_K_M or Q2, with significant quality loss; use as a last resort. Rule of thumb: Test your specific task at each quantization level; losses vary by task.
vLLM is the gold standard for production LLM serving. It implements PagedAttention for efficient memory management, continuous batching for high throughput, and supports all major model formats. If you're serving a fine-tuned model at scale, vLLM is likely your best option.
- PagedAttention: 2-4× higher throughput than naive serving
- Continuous batching: Dynamic batch size for variable load
- Quantization support: AWQ, GPTQ, INT8 out of the box
- OpenAI-compatible API: Drop-in replacement
- LoRA serving: Hot-swap adapters at runtime
- GPU-centric: Best supported on NVIDIA GPUs; other backends (such as AMD or CPU) are less mature
- Memory: Loads full model into GPU memory
- Cold start: Model loading takes 30-120s
- Complexity: More setup than Ollama
- Resource: Needs dedicated GPU server
| vLLM Configuration | Default | Recommendation | Impact |
|---|---|---|---|
| --tensor-parallel-size | 1 | Match your GPU count | Enables multi-GPU serving |
| --gpu-memory-utilization | 0.9 | 0.85-0.95 | Higher = more KV cache, higher throughput |
| --max-model-len | Model default | Set to your actual max | Lower = less memory, faster startup |
| --quantization | None | awq for 4-bit | Roughly 4× smaller weights than FP16, slight quality loss |
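The four flags in the table can be assembled into a launch command programmatically, which keeps serving configuration in version control alongside the model. This sketch assumes the `vllm serve` entry point; the helper name and the `./merged-model` path are hypothetical, and the function only builds the argument list without starting a server.

```python
import shlex

def build_vllm_command(model_path, num_gpus=1, gpu_memory_utilization=0.9,
                       max_model_len=None, quantization=None):
    """Build the argument list for launching a vLLM server."""
    cmd = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(num_gpus),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # Optional flags are only added when explicitly set.
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd

if __name__ == "__main__":
    cmd = build_vllm_command("./merged-model", num_gpus=2,
                             max_model_len=4096, quantization="awq")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` or pasted into a container entrypoint.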
Ollama makes running LLMs locally as easy as Docker. It handles model management, quantization, and provides an API. Perfect for development, privacy-sensitive applications, or edge deployment. Supports macOS, Linux, and Windows.
- Simplicity: One command to run any model
- Cross-platform: macOS, Linux, Windows, Docker
- Apple Silicon: Excellent Metal GPU support
- Privacy: Everything runs locally
- Model management: Pull, list, delete models easily
- Throughput: Not designed for high-concurrency
- No continuous batching: Sequential requests
- Limited scaling: Single-machine only
- GGUF only: Must convert from HF format
- Production: Better for dev/edge than high-scale
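For a sense of what calling Ollama's local API looks like, here is a minimal standard-library sketch. It assumes an Ollama server on its default local port and a model that has already been imported; the model name `my-finetune` is a placeholder. Building the request is kept separate from sending it, so the payload can be inspected without a running server.

```python
import json
import urllib.request

# Ollama's default local generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request; needs a running Ollama server with the model available."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # "my-finetune" is a hypothetical model name.
    print(generate("my-finetune", "Summarize this ticket in one sentence."))
```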
| Platform | Type | Pros | Cons | Best For |
|---|---|---|---|---|
| Modal | Serverless GPU | Pay-per-second, auto-scaling, easy deploy | Cold starts (5–30s) | Variable workloads, prototypes |
| Replicate | Serverless GPU | Simple API, model hosting | Cost at scale, cold starts | Quick deployment, demos |
| RunPod | GPU rental | Cheap GPUs, serverless option | Less managed, variable availability | Cost-sensitive production |
| AWS SageMaker | Managed ML | Enterprise features, integration | Complex, expensive | Enterprise, existing AWS stack |
| GCP Vertex AI | Managed ML | Good Gemini integration | Complex pricing | GCP-native applications |
| Together AI | Inference API | Fast, supports custom models | Per-token pricing | Custom model serving |
- Quantization: 4-bit reduces memory, increases speed
- Speculative decoding: Use draft model for speedup
- KV cache: Don't recompute for same prefix
- Streaming: Return tokens as generated
- Shorter prompts: Less prefill time
- Continuous batching: vLLM, TGI default
- Dynamic batching: Group requests together
- Tensor parallelism: Multi-GPU for large models
- Prefix caching: Cache common prompts
- Right-size GPU: Match model to memory
- Spot/preemptible: 60–80% savings
- Right-size model: Smaller if quality allows
- Caching: Cache frequent responses
- Auto-scaling: Scale to zero when idle
- Batching: Higher util = lower cost/token
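The link between utilization and cost per token is simple arithmetic, sketched below. The GPU price and throughput figures in the example are made-up inputs for illustration, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers: the same GPU at two sustained throughputs.
    for tps in (500, 2000):
        usd = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=tps)
        print(f"{tps} tok/s -> ${usd:.2f} per 1M tokens")
```

Because the GPU costs the same per hour whether it is busy or idle, every throughput gain from batching or caching lowers the per-token cost in direct proportion.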
| Optimization | Latency Impact | Throughput Impact | Complexity |
|---|---|---|---|
| 4-bit quantization (AWQ) | -20–40% | +50–100% | Low |
| Continuous batching | Slight increase | +200–500% | Free (vLLM) |
| Tensor parallelism (2+ GPU) | -30–50% | +80–180% | Medium |
| Speculative decoding | -30–50% | Variable | Medium |
| KV cache / prefix caching | -50–80% for repeat prefixes | +20–50% | Low (vLLM) |
Chapter 09: Key Takeaways
- Deployment options: vLLM (production), Ollama (local/edge), serverless (variable load), cloud (enterprise)
- Merge LoRA adapters into base weights for simplified deployment (unless you need multi-tenant adapter swapping)
- Quantization: AWQ for GPU (vLLM), GGUF for CPU/local (Ollama); Q4_K_M is a good balance
- vLLM = gold standard: PagedAttention, continuous batching, OpenAI-compatible API
- Ollama = simplest local deployment: one command, cross-platform, Apple Silicon support
- Optimize: quantize (2×), continuous batching (4×), tensor parallelism (multi-GPU), caching
Fine-tuning is not a one-time event; it's a continuous process. Production MLOps for fine-tuned models means tracking experiments, versioning models, monitoring quality, and iterating on feedback. This chapter covers the infrastructure and practices that turn one-off fine-tuning into a sustainable competitive advantage.
Every fine-tuning run should be tracked: hyperparameters, dataset version, base model, training metrics, and evaluation results. Without tracking, you can't reproduce results, compare runs, or understand what worked.
| Tool | Type | Strengths | Best For |
|---|---|---|---|
| Weights & Biases | SaaS / self-hosted | Best UX, automatic logging, reports | Most teams, production use |
| MLflow | Self-hosted / Databricks | Open source, model registry built-in | On-prem, Databricks users |
| Comet | SaaS | Good comparison views | Alternative to W&B |
| Neptune | SaaS | Good for large experiments | Large-scale experimentation |
| TensorBoard | Self-hosted | Free, basic, no model registry | Simple projects, learning |
- Config: All hyperparameters (LR, rank, epochs, etc.)
- Data: Dataset name, version, size, hash
- Model: Base model name and version
- Metrics: Loss curve, eval metrics per checkpoint
- Artifacts: Final model weights, adapter files
- All minimum items, plus:
- Code version: Git commit hash
- Environment: Package versions, GPU type
- Sample outputs: Example generations per checkpoint
- Regression metrics: MMLU, general benchmarks
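If you are not yet using a tracking tool, the items above can be captured in a plain record. The sketch below is tool-agnostic and uses only the standard library; the `RunRecord` fields and the `dataset_hash` helper are hypothetical names chosen for illustration, and a real setup would more likely log the same fields to W&B or MLflow.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def dataset_hash(rows: list) -> str:
    """SHA-256 over a canonical JSON encoding of the dataset rows."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dict key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass
class RunRecord:
    base_model: str
    dataset_name: str
    dataset_version: str
    dataset_sha256: str
    hyperparameters: dict
    git_commit: str = "unknown"
    eval_metrics: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


if __name__ == "__main__":
    rows = [{"prompt": "example prompt", "completion": "example answer"}]
    record = RunRecord(
        base_model="example-base-model",  # placeholder name
        dataset_name="support-tickets",   # placeholder name
        dataset_version="v3",
        dataset_sha256=dataset_hash(rows),
        hyperparameters={"lr": 2e-4, "rank": 16, "epochs": 3},
    )
    print(record.to_json())
```

Hashing the data rather than trusting a version label catches the common case where a dataset file changes but its name does not.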
A model registry stores model versions with metadata, enables promotion through stages (dev → staging → production), and provides lineage tracking. It's the single source of truth for which model is deployed where.
| Registry Option | Integration | Strengths | Best For |
|---|---|---|---|
| MLflow Model Registry | MLflow, Databricks | Open source, full lifecycle | Self-hosted, Databricks |
| Hugging Face Hub | HF ecosystem | Easy sharing, versioning, spaces | Open models, collaboration |
| W&B Model Registry | W&B | Linked to experiments, good UX | W&B users |
| SageMaker Model Registry | AWS | AWS integration, approval workflows | AWS-native teams |
| DVC + Git | Git | Version control for models | Simple projects, git-native |
If you don't have a formal registry, at minimum: (1) Store models in versioned cloud storage (S3, GCS) with naming convention (e.g., medical-llama-v2.3-2024-04-15/), (2) Keep a YAML/JSON manifest with model version → storage path → training run ID → eval metrics, (3) Document which version is in production. This beats scattered files and forgotten experiments.
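A minimal manifest along these lines might look like the sketch below. The field names, the bucket name, and the run ID are hypothetical; the only structure taken from the text is the mapping from version to storage path, training run ID, and eval metrics, plus a pointer to the production version.

```python
import json


def add_model(manifest: dict, version: str, storage_path: str,
              training_run_id: str, eval_metrics: dict) -> None:
    """Register a model version; refuses to overwrite an existing one."""
    if version in manifest["models"]:
        raise ValueError(f"version {version!r} is already registered")
    manifest["models"][version] = {
        "storage_path": storage_path,
        "training_run_id": training_run_id,
        "eval_metrics": eval_metrics,
    }


def set_production(manifest: dict, version: str) -> None:
    """Record which registered version is currently deployed."""
    if version not in manifest["models"]:
        raise KeyError(version)
    manifest["production"] = version


if __name__ == "__main__":
    manifest = {"models": {}, "production": None}
    add_model(
        manifest,
        version="v2.3",
        # Bucket name and run ID are placeholders.
        storage_path="s3://example-bucket/medical-llama-v2.3-2024-04-15/",
        training_run_id="run-0001",
        eval_metrics={"task_accuracy": 0.91},
    )
    set_production(manifest, "v2.3")
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest in Git next to the training code gives you a reviewable history of every promotion.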
Offline eval is not production eval. A model that scores well on your test set may perform differently with real users and real queries. A/B testing compares model versions on live traffic to validate that improvements are real.
- Random assignment: Users randomly see A or B
- Same conditions: Same prompts, same post-processing
- Sufficient sample: Fix the sample size in advance and run until you reach it; stopping as soon as a result looks significant inflates false positives
- Multiple metrics: Quality, latency, cost, user behavior
- Holdout group: Always keep baseline for comparison
- Quality: Task accuracy, user ratings, thumbs up/down
- Latency: P50, P95, P99 response time
- Cost: Tokens used per request, GPU utilization
- Engagement: Completion rate, follow-up questions
- Errors: Failure rate, refusal rate, format errors
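For a binary quality metric such as thumbs-up rate or format success, one common way to check significance is a two-proportion z-test. The sketch below uses only the standard library and assumes independent requests and samples large enough for the normal approximation; the counts in the example are invented.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Invented counts: thumbs-up out of total rated responses per variant.
    z, p = two_proportion_z_test(success_a=400, n_a=1000, success_b=460, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A significant quality gain still has to be weighed against the latency and cost metrics collected in the same test.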
Before exposing users to a new model, run it in shadow mode: send real traffic to both models, but only show the control model's output. Log the new model's outputs for offline comparison. This catches catastrophic failures (crashes, very bad outputs) before users see them. Only promote to A/B once shadow mode looks good.
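The shadow-mode pattern can be sketched as a thin wrapper around two model callables. Everything here is illustrative: `control` and `candidate` stand in for whatever client functions call your models, and the in-memory list stands in for a real log sink. The sketch calls the candidate inline for clarity; in production you would run it asynchronously so it cannot add latency to the user's request.

```python
from typing import Callable, List


def shadow_call(
    prompt: str,
    control: Callable[[str], str],
    candidate: Callable[[str], str],
    log: List[dict],
) -> str:
    """Serve the control output; record the candidate output for offline comparison."""
    control_output = control(prompt)
    try:
        candidate_output = candidate(prompt)
        error = None
    except Exception as exc:  # a candidate failure must never reach the user
        candidate_output = None
        error = repr(exc)
    log.append({
        "prompt": prompt,
        "control_output": control_output,
        "candidate_output": candidate_output,
        "candidate_error": error,
    })
    return control_output


if __name__ == "__main__":
    def control_model(prompt: str) -> str:
        return "control answer"

    def crashing_candidate(prompt: str) -> str:
        raise RuntimeError("simulated candidate failure")

    records: List[dict] = []
    print(shadow_call("hello", control_model, crashing_candidate, records))
    print(records[0]["candidate_error"])
```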
A fine-tuned model can degrade over time: the input distribution shifts, edge cases emerge, or the world changes. Continuous monitoring detects these issues before users notice.
- Latency: P50, P95, P99 per endpoint
- Throughput: Requests/sec, tokens/sec
- Error rate: 5xx, timeouts, OOM
- GPU utilization: Memory, compute
- Queue depth: Request backlog
- User feedback: Thumbs up/down, ratings
- Format compliance: JSON parse success rate
- Refusal rate: How often model refuses
- Output length: Avg tokens, anomalies
- LLM-as-judge sample: Periodic quality scoring
- Input drift: Embedding distance from training
- Output drift: Response distribution changes
- Concept drift: Same inputs, different correct answers
- Alert: When metrics deviate >2σ from baseline
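The 2σ alert rule can be expressed in a few lines. This sketch assumes you keep a window of recent baseline values for the metric and that they are roughly stable; the function name and the sample numbers are illustrative.

```python
import statistics
from typing import Sequence


def deviates_from_baseline(baseline: Sequence[float], value: float,
                           n_sigma: float = 2.0) -> bool:
    """True if `value` lies more than n_sigma standard deviations from the baseline mean."""
    mean = statistics.fmean(baseline)
    std = statistics.stdev(baseline)  # sample standard deviation; needs >= 2 points
    if std == 0:
        return value != mean
    return abs(value - mean) > n_sigma * std


if __name__ == "__main__":
    # Invented daily refusal rates (percent) used as the baseline window.
    baseline = [10, 12, 11, 9, 13, 10, 11, 12]
    print(deviates_from_baseline(baseline, 11.5))
    print(deviates_from_baseline(baseline, 15))
```

For noisy metrics, requiring the deviation to persist over several consecutive readings reduces false alarms.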
| Metric | Alerting Threshold (Example) | Action When Triggered |
|---|---|---|
| P95 latency | >2× baseline for 5 min | Check GPU load, model, batch size |
| Error rate | >1% for 5 min | Page on-call, check logs |
| Format compliance | <95% for 1 hour | Review failing examples, consider rollback |
| User thumbs down rate | >2× baseline for 1 day | Sample and review bad responses |
| Input embedding drift | >0.2 cosine distance shift | Investigate new input patterns; may need new data |
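One simple way to compute the input embedding drift metric is to compare the centroid of training-set embeddings with the centroid of recent production embeddings. The sketch below assumes you already have embeddings from some embedding model of your choice (not specified here) as lists of floats; comparing centroids is an illustrative choice, and other drift measures exist.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def input_drift(training_embeddings, production_embeddings) -> float:
    """Cosine distance between the training centroid and the production centroid."""
    return cosine_distance(centroid(training_embeddings), centroid(production_embeddings))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, for illustration only.
    train = [[1.0, 0.0], [0.9, 0.1]]
    prod = [[0.5, 0.5], [0.4, 0.6]]
    drift = input_drift(train, prod)
    print(f"drift = {drift:.3f}, alert = {drift > 0.2}")
```

The 0.2 threshold comes from the example table above; calibrate it against the normal week-to-week variation in your own traffic.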
Metrics: Prometheus + Grafana or Datadog. Logging: Structured logs → aggregator (Loki, CloudWatch, Datadog). Tracing: OpenTelemetry → Jaeger or vendor. LLM-specific: LangSmith, Langfuse, or custom sampled eval. Alerting: PagerDuty, Slack, email for critical metrics.
The best fine-tuning teams don't ship one model; they build a flywheel. Production feedback generates training data, which improves the model, which generates better feedback. This loop compounds over time.
- User corrections: When users edit model output
- Thumbs up/down: Explicit quality signals
- Support escalations: Cases model couldn't handle
- A/B test losers: Examples where new model failed
- Edge cases: Unusual inputs from production logs
- Auto-label: Use current model to bootstrap labels
- Human-in-the-loop: Flag uncertain cases for review
- Scheduled retraining: Weekly/monthly fine-tuning runs
- Auto-eval: CI pipeline runs eval on new checkpoints
- Auto-promote: If eval passes, deploy to staging
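The first step of the flywheel, turning logged feedback into training data, can be sketched as follows. The event schema (`prompt`, `model_output`, `corrected_output`, `rating`) is a hypothetical logging format, not a standard. The sketch treats user corrections as both SFT examples and preference pairs (corrected output preferred over the original), and thumbs-up responses as SFT examples; thumbs-down events without a correction are left for human review.

```python
from typing import List, Tuple


def feedback_to_examples(events: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split feedback events into SFT examples and preference pairs."""
    sft_examples: List[dict] = []
    preference_pairs: List[dict] = []
    for event in events:
        corrected = event.get("corrected_output")
        if corrected:
            # A user edit gives a target completion and a chosen/rejected pair.
            sft_examples.append({"prompt": event["prompt"], "completion": corrected})
            preference_pairs.append({
                "prompt": event["prompt"],
                "chosen": corrected,
                "rejected": event["model_output"],
            })
        elif event.get("rating") == "up":
            sft_examples.append({
                "prompt": event["prompt"],
                "completion": event["model_output"],
            })
    return sft_examples, preference_pairs


if __name__ == "__main__":
    events = [
        {"prompt": "p1", "model_output": "draft", "corrected_output": "fixed"},
        {"prompt": "p2", "model_output": "good answer", "rating": "up"},
        {"prompt": "p3", "model_output": "poor answer", "rating": "down"},
    ]
    sft, prefs = feedback_to_examples(events)
    print(len(sft), "SFT examples,", len(prefs), "preference pairs")
```

Examples produced this way should still pass through the same deduplication and quality audit as any other training data before a retraining run.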
Month 1: You ship a fine-tuned model. Month 2: You've collected 1,000 examples of user corrections; you retrain and quality improves 5%. Month 3: The better model gets more usage, generating more feedback. Month 6: You have 10,000 curated examples and 20% better quality than month 1. Teams that build the flywheel pull ahead of teams that treat fine-tuning as a one-time event.
| Phase | Checkpoint | Done? |
|---|---|---|
| Data | Dataset is versioned and reproducible | ☐ |
| | Deduplication completed (exact + near) | ☐ |
| | Train/val/test splits verified (no leakage) | ☐ |
| | Quality audit passed (random sample reviewed) | ☐ |
| Training | Experiment tracked (hyperparams, metrics, artifacts) | ☐ |
| | Multiple checkpoints saved | ☐ |
| | Training loss and val loss look healthy | ☐ |
| Evaluation | Task-specific eval passed (golden set) | ☐ |
| | Regression tests passed (<2% drop on general benchmarks) | ☐ |
| | Safety eval passed (refusal rate, harmful content) | ☐ |
| | Format compliance verified (JSON parse rate, etc.) | ☐ |
| Deployment | Model registered with version and metadata | ☐ |
| | Quantization tested (if using) | ☐ |
| | Inference latency acceptable in staging | ☐ |
| Operations | Monitoring and alerting configured | ☐ |
| | Rollback plan documented and tested | ☐ |
| | Shadow mode or A/B test plan in place | ☐ |
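The evaluation rows of the checklist lend themselves to an automated gate. The sketch below is one possible encoding: the metric names are hypothetical, scores are assumed to be on a 0 to 1 scale, and the "<2% drop" rule is interpreted as an absolute drop of less than 0.02, which is an assumption about the checklist's intent. The 95% format-compliance floor is borrowed from the example alert thresholds earlier in this chapter.

```python
from typing import List, Tuple


def passes_promotion_gate(
    baseline: dict,
    candidate: dict,
    max_benchmark_drop: float = 0.02,
    min_format_compliance: float = 0.95,
) -> Tuple[bool, List[str]]:
    """Return (passed, failure reasons) for a candidate model's eval metrics."""
    failures: List[str] = []
    if candidate["task_accuracy"] < baseline["task_accuracy"]:
        failures.append("task accuracy is below the baseline")
    drop = baseline["general_benchmark"] - candidate["general_benchmark"]
    if drop >= max_benchmark_drop:
        failures.append(f"general benchmark dropped by {drop:.3f}")
    if candidate["format_compliance"] < min_format_compliance:
        failures.append("format compliance is below the minimum")
    return (not failures, failures)


if __name__ == "__main__":
    # Invented scores for illustration.
    baseline = {"task_accuracy": 0.80, "general_benchmark": 0.65, "format_compliance": 0.99}
    candidate = {"task_accuracy": 0.86, "general_benchmark": 0.64, "format_compliance": 0.98}
    passed, reasons = passes_promotion_gate(baseline, candidate)
    print("promote to staging" if passed else f"blocked: {reasons}")
```

Returning the reasons alongside the verdict makes a blocked promotion easy to diagnose from CI logs.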
Fine-tuned models can fail in production. Have a runbook ready.
- Quality regression: Model suddenly worse
- Format failures: JSON/structured output breaks
- Refusal spike: Model refuses valid requests
- Harmful output: Model generates bad content
- Latency spike: Inference slows dramatically
- OOM: Out of memory crashes
- Detect: Alerting triggers on anomaly
- Assess: Severity? Scope? Cause hypothesis?
- Mitigate: Rollback to previous model version
- Investigate: Root cause analysis with logs/traces
- Fix: Address underlying issue
- Postmortem: Document and prevent recurrence
Before deploying any new model version: (1) Keep the previous version running in staging, (2) Document the exact rollback command/process, (3) Test the rollback procedure in staging, (4) Have a "big red button" that can revert in <5 minutes. The ability to roll back quickly matters more than the ability to deploy quickly.
Chapter 10: Key Takeaways
- Track every experiment: hyperparameters, data version, metrics, artifacts (W&B, MLflow)
- Use a model registry: version models, promote through stages (experiment → staging → production → archived)
- A/B test in production: shadow mode first, then gradual rollout, measure quality + latency + cost
- Monitor continuously: operational metrics (latency, errors) + quality metrics (user feedback, format compliance) + drift detection
- Build the flywheel: production feedback → curated data → fine-tune → deploy → more feedback
- Always have a rollback plan: test it, document it, be able to execute in <5 minutes