Fine-Tuning LLMs

From dataset curation to LoRA adapters, SFT, DPO, and production deployment — a practitioner's complete guide to adapting large language models.

Fine-tuning is not a silver bullet — it is a precision tool. Used correctly it produces models that outperform general-purpose models on specific tasks. Used carelessly it wastes GPU budget and makes models worse. This guide teaches you the difference.

Chapter 01 · Decision Framework

Why Fine-Tune? — When It Wins and When It Doesn't

Most teams that think they need fine-tuning don't. Fine-tuning is expensive, slow, and introduces new failure modes. Before training a single step, prove that prompting and RAG cannot solve your problem. Fine-tuning wins when you need capabilities the model doesn't have — not when you need it to do something it already can.

The Decision Ladder — Try Simpler Things First Foundation

Fine-tuning sits at the top of a capability ladder. Each rung is more expensive and time-consuming than the last. The rule: never climb to the next rung unless you've exhausted the current one.

The capability ladder — climb only when necessary

The 80/20 Rule of LLM Capabilities

As a rough rule of thumb, prompt engineering + RAG is sufficient for about 80% of LLM use cases. For about 15%, better model routing or a larger model solves the problem. Only about 5% truly require fine-tuning: usually domain-specific tasks, format control beyond what prompting achieves, or latency/cost requirements that demand a smaller, specialized model.

When Fine-Tuning Wins — The Valid Use Cases Core

✅

Format Control Beyond Prompting

You need structured outputs (JSON, XML, code) with 99%+ reliability, and even few-shot + JSON mode doesn't achieve it consistently on your edge cases.

Medical coding with complex schemas
Domain-specific DSLs
Highly structured report generation

✅

Behavior Modification

You need the model to behave fundamentally differently from its base training — different tone, different reasoning style, different refusal boundaries.

Brand voice consistency at scale
Domain-specific communication norms
Reducing over-refusal for valid use cases

✅

Domain Knowledge Integration

The knowledge you need is too large or specialized for RAG context — the model must internalize domain expertise into its weights.

Medical/legal terminology usage
Proprietary codebase understanding
Specialized scientific reasoning

✅

Latency / Cost Optimization

You've proven a large model works, but need a smaller model to match quality at 10× lower cost / 5× lower latency for production scale.

Distillation to smaller model
Edge deployment requirements
High-volume cost reduction (1M+ queries/day)

Fine-Tuning Does Not Reliably Add Knowledge Critical

Fine-tuning does not turn a model into a database. A common misconception is that you can "teach" a model facts by including them in training data. This fundamentally misunderstands how fine-tuning works.

🧠

What Actually Happens

Pattern learning: The model learns associations and response patterns, not retrievable facts
Implicit encoding: Knowledge is encoded implicitly in weights, not stored explicitly
Non-deterministic recall: The model may or may not surface specific facts depending on context
Blending: Fine-tuned knowledge blends with pre-training knowledge — can't isolate it

⚠️

Production Implications

Fine-tuned models can still hallucinate domain facts, including ones they were trained on
Updates require full retraining, not data refresh
Correctness cannot be guaranteed — outputs are probabilistic
Citation and verification are impossible — no source to reference

The Knowledge Rule

If correctness depends on up-to-date or verifiable knowledge → use RAG, not fine-tuning. Fine-tuning is for teaching behavior, style, and format — not for injecting facts. A fine-tuned model that "knows" your product documentation will confidently hallucinate details that were never in the training data.

When Fine-Tuning Fails — The Anti-Patterns In-depth

Anti-Pattern	What Happens	The Real Solution
"Our prompts are too long"	Fine-tuning doesn't reliably shorten prompts — behavior may become inconsistent	Prompt compression, RAG optimization, or prompt caching
"The model doesn't know X"	Fine-tuning doesn't add knowledge reliably — it's brittle and hallucinates	RAG with verified source documents
"We have proprietary data"	You also need to maintain, version, and update that data — fine-tuning freezes it	RAG allows live updates without retraining
"We want better quality"	If quality is vaguely defined, fine-tuning won't improve it — garbage in, garbage out	Define quality → measure → improve prompts → then consider fine-tuning
"We need to differentiate"	Fine-tuning is not a competitive moat — others can fine-tune too. Data and product are the moat.	Focus on data quality and product experience, not the model itself

Fine-Tuning Makes Models Worse at Other Things

Every fine-tuning run risks catastrophic forgetting — the model loses capabilities it had before. A model fine-tuned heavily on medical text may become worse at general conversation or code. You're not adding capabilities — you're trading general capabilities for specialized ones. This trade-off must be intentional and measured.

Fine-Tuning Without Evaluation Is Blind Optimization Critical

Fine-tuning improves whatever your dataset encodes. If your dataset is flawed, the model will learn incorrect behavior — and quality may appear improved while actually degrading in production.

Before You Train, Define:

Task-specific metrics: Accuracy, F1, format compliance, latency

Failure cases: What bad outputs look like — examples you'll reject

Acceptance thresholds: Numbers that must be hit before shipping

Regression tests: General capabilities that must not degrade

Without Evaluation:

❌ You can't tell if training helped or hurt

❌ You can't compare model versions meaningfully

❌ You can't catch regressions before production

❌ You're optimizing blindly toward an undefined target

Evaluation-First Mindset

Build your golden test set before writing a single training example. If you can't measure improvement, you're not doing engineering — you're hoping. The eval set defines what "better" means for your specific use case.
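As a minimal sketch of what an evaluation-first gate might look like, the snippet below scores raw model outputs against a golden set on two of the metrics listed above (format compliance and accuracy) and checks them against acceptance thresholds. The golden-set layout, the `label` field, and the threshold values are illustrative assumptions, not a standard.

```python
import json

# Hypothetical acceptance thresholds; pick values that match your own task.
THRESHOLDS = {"format_compliance": 0.99, "accuracy": 0.90}


def score_outputs(golden, outputs):
    """Score raw model outputs (JSON strings) against a golden set.

    golden  -- list of dicts with an expected "label" (assumed layout)
    outputs -- list of raw model output strings, same order as golden
    """
    parsed_ok = 0
    correct = 0
    for expected, raw in zip(golden, outputs):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output fails both metrics
        if not isinstance(obj, dict) or "label" not in obj:
            continue
        parsed_ok += 1
        if obj["label"] == expected["label"]:
            correct += 1
    n = len(golden)
    return {"format_compliance": parsed_ok / n, "accuracy": correct / n}


def passes_gate(metrics, thresholds=THRESHOLDS):
    """Return True only if every metric meets its threshold."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())


golden = [{"label": "BILLING"}, {"label": "TECHNICAL"}, {"label": "BILLING"}, {"label": "OTHER"}]
outputs = ['{"label": "BILLING"}', '{"label": "BILLING"}', "not json", '{"label": "OTHER"}']
metrics = score_outputs(golden, outputs)
print(metrics, "ship" if passes_gate(metrics) else "do not ship")
```

Running the same gate on the base model before any training gives you the baseline that later runs have to beat.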

Choosing the Right Base Model Core

Fine-tuning cannot fix a fundamentally weak base model. If the base model doesn't understand your domain at all, fine-tuning won't magically create that capability — you'll just teach it to confidently produce low-quality outputs.

Consideration	What to Check	Red Flag
Baseline capability	Zero-shot performance on your task	Model produces nonsense even with a clear zero-shot prompt
Reasoning quality	Chain-of-thought coherence, logic	Model can't follow multi-step reasoning
Context length	Max tokens vs your typical input size	Inputs truncated during training/inference
Language coverage	Fluency in your target languages	Model struggles with non-English content
License & deployment	Commercial use, modification rights	License prohibits your use case

The Base Model Rule

Choose the smallest model that already performs reasonably well on your task with prompting alone. Fine-tune to specialize and improve consistency, not to compensate for fundamental capability gaps. A 7B model that handles your domain well can outperform a poorly matched 70B model.

Cost-Benefit Analysis — The Real Numbers Core

Cost Category	LoRA (7B model)	Full Fine-Tune (7B)	Full Fine-Tune (70B)
GPU requirement	1× A100 40GB or 1× RTX 4090	4× A100 80GB (FSDP)	8–16× A100 80GB
Training time (10K samples)	2–4 hours	8–16 hours	24–72 hours
Cloud compute cost	$10–$50	$200–$500	$2,000–$10,000
Data prep time	Days to weeks	Days to weeks	Weeks to months
Iteration cycle	Hours per experiment	1–2 days per experiment	Days per experiment
Risk of degradation	Lower (fewer params modified)	Moderate — easy to overfit	Higher — harder to debug

When the Math Works Out

Fine-tuning makes economic sense when: (1) you've proven prompting doesn't work, (2) you have enough high-quality training examples (typically 1K or more; Chapter 02 puts the sweet spot for task-specific tuning at 1,000–5,000), (3) you need the specialized capability for high-volume production, and (4) you're prepared to maintain the fine-tuned model over time (updates when the base model changes, dataset maintenance, eval pipeline). If any of these is missing, the ROI is likely negative.
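The economic condition can be made concrete with a back-of-the-envelope break-even calculation. The sketch below assumes a one-time project cost (compute plus engineering time) and a per-query saving from serving a smaller fine-tuned model. Every number in it is a made-up placeholder.

```python
import math


def break_even_queries(one_time_cost, cost_per_query_before, cost_per_query_after):
    """Queries needed before per-query savings repay the one-time cost.

    Returns None when the fine-tuned model is not cheaper per query,
    since the investment then never pays back on cost alone.
    """
    saving = cost_per_query_before - cost_per_query_after
    if saving <= 0:
        return None
    return math.ceil(one_time_cost / saving)


# Placeholder numbers: $20,000 total project cost (mostly people time),
# $0.010 per query on a large model vs $0.001 on a fine-tuned small model.
queries = break_even_queries(20_000, 0.010, 0.001)
days_at_1m_per_day = queries / 1_000_000
print(f"Break-even after {queries:,} queries (~{days_at_1m_per_day:.1f} days at 1M queries/day)")
```

At low volume the same arithmetic gives a payback period of months or years, which is why high-volume production is one of the conditions.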

Fine-Tuning vs Prompting vs RAG — When to Use Each In-depth

Dimension	Prompting	RAG	Fine-Tuning
Best for	Behavior control, format, task guidance	External knowledge, up-to-date info	Capability modification, style, internalized expertise
Setup time	Hours	Days to weeks	Weeks to months
Iteration speed	Instant — edit and deploy	Fast — update documents	Slow — retrain and evaluate
Knowledge updates	Edit prompt (limited)	Update index anytime	Retrain required
Failure mode	Instruction not followed	Wrong docs retrieved	Catastrophic forgetting + overfitting
Maintenance burden	Low — prompt version control	Medium — index maintenance	High — data pipeline, retraining, eval

Start Here (default path)

1. Prompt engineering (few-shot, CoT, system prompt)

2. Add RAG if knowledge is required

3. Try a larger/better model

4. Only then → fine-tune if still not working

When to Skip to Fine-Tuning

✅ You need a smaller model at lower cost/latency

✅ Task requires fundamentally different behavior

✅ You have abundant high-quality labeled data

✅ You've already proven prompting + RAG insufficient

Fine-Tuning Is an Iterative Process High Impact

A single training run rarely produces a production-ready model. Expect multiple iterations — the first model reveals what's wrong with your data, the second fixes some issues, the third gets closer to acceptable quality.

The real fine-tuning workflow — iteration is the norm

The True Cost Is Iteration

Compute cost for a single training run is small. The real cost is iteration time: reviewing failures, curating fixes, retraining, re-evaluating. Budget for 3–10 iteration cycles. A team that plans for one training run will be surprised; a team that plans for ten will ship a good model.

Fine-Tuned Models Require Ongoing Maintenance Critical

Unlike prompts (edit instantly) or RAG (update documents), fine-tuned models are frozen artifacts. You commit to maintaining them over time — or accepting degradation.

🔄

Base Model Updates

When the base model releases a new version (for example, Llama 3 → Llama 3.1), you must:

Retrain on new base
Re-evaluate for regressions
Update deployment infra

📊

Data Drift

Production usage changes over time:

New query patterns emerge
Domain knowledge evolves
Model performance degrades

🔧

Pipeline Maintenance

The infrastructure requires upkeep:

Dataset versioning
Eval pipeline updates
Retraining automation

The Maintenance Question

Before fine-tuning, ask: "Who will maintain this model 6 months from now?" If the answer is unclear, reconsider. Unmaintained fine-tuned models become legacy debt — increasingly out-of-sync with reality, impossible to update quickly, and risky to replace.

Fine-Tuning Trades Generality for Specialization Golden Insight

This is the fundamental insight that should guide every fine-tuning decision: you are not making the model universally better. You are making a trade.

What You Gain

✅ Better performance on your specific task

✅ More consistent format and style

✅ Reduced prompt engineering complexity

✅ Potentially lower inference cost (smaller model)

What You Lose

❌ Performance on tasks outside your training distribution

❌ Flexibility to handle unexpected inputs

❌ Ability to update quickly (must retrain)

❌ General reasoning capability (potentially)

The Trade-off Must Be Intentional

Before fine-tuning, explicitly document: (1) What tasks must improve, (2) What tasks can degrade, (3) How you'll measure both. If you can't answer these questions, you're not ready to fine-tune. A fine-tuned model without measured trade-offs is a model you don't understand.
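One way to make the documented trade-off checkable is to write it down as data and compare base and fine-tuned scores against it. The sketch below assumes you already have per-task scores for both models. The task names, scores, and budgets are hypothetical.

```python
def check_tradeoffs(base, tuned, must_improve, may_degrade):
    """Compare fine-tuned vs base scores against a documented trade-off.

    must_improve -- {task: minimum required gain}
    may_degrade  -- {task: maximum tolerated drop}
    Returns a list of human-readable violations (empty = trade-off holds).
    """
    violations = []
    for task, min_gain in must_improve.items():
        gain = tuned[task] - base[task]
        if gain < min_gain:
            violations.append(f"{task}: gained {gain:+.2f}, needed {min_gain:+.2f}")
    for task, max_drop in may_degrade.items():
        drop = base[task] - tuned[task]
        if drop > max_drop:
            violations.append(f"{task}: dropped {drop:.2f}, budget {max_drop:.2f}")
    return violations


# Hypothetical scores on three eval suites.
base = {"ticket_classification": 0.82, "general_qa": 0.75, "code": 0.60}
tuned = {"ticket_classification": 0.94, "general_qa": 0.72, "code": 0.48}

violations = check_tradeoffs(
    base,
    tuned,
    must_improve={"ticket_classification": 0.10},
    may_degrade={"general_qa": 0.05, "code": 0.05},
)
print(violations)
```

Here the target task improves enough and general QA stays within budget, but the drop on code exceeds what was agreed, so the run would be flagged.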

∑ Chapter 01 — Key Takeaways

Fine-tuning is the last rung on the capability ladder — exhaust prompting, RAG, and model routing first
~80% of LLM use cases do not require fine-tuning — most are solved by prompt engineering + RAG
Fine-tuning wins for: format control, behavior modification, internalized domain knowledge, and cost/latency optimization
Fine-tuning fails for: "longer prompts," "more knowledge," vague quality improvements, and differentiation alone
Catastrophic forgetting is real — fine-tuning trades general capability for specialized capability
Cost: LoRA = $10–$50, Full FT (7B) = $200–$500, Full FT (70B) = $2K–$10K+ — data prep time matters more

Chapter 02 · Dataset Engineering

Data Preparation — The Make-or-Break Step

Fine-tuning performance is 90% data, 10% hyperparameters. A mediocre model trained on excellent data will outperform an excellent model trained on mediocre data. Most fine-tuning failures trace back to dataset problems — low quality, insufficient diversity, wrong format, or insufficient volume. Data engineering is the real work of fine-tuning.

Dataset Formats — The Building Blocks Foundation

Every fine-tuning dataset is ultimately a collection of (input, target output) pairs. The format depends on your training objective — completion, instruction following, or preference learning.

📄

Format: Completion (SFT)

Single text sequence — model learns to predict continuation. Used for continued pre-training or simple generation tasks.

{ "text": "User: What is photosynthesis?\nAssistant: Photosynthesis is..." }

💬

Format: Chat / Instruction (SFT)

Multi-turn conversations with role markers. Standard for instruction tuning. Most common format.

{ "messages": [ {"role": "system", "content": "You are..."}, {"role": "user", "content": "Explain X"}, {"role": "assistant", "content": "X is..."} ] }

⚖️

Format: Preference (DPO/RLHF)

Pairs of (chosen, rejected) responses to the same input. Used for preference alignment.

{ "prompt": "Explain X", "chosen": "X is a concept that...", "rejected": "I don't know what X is." }

Training Goal	Format	Fields Required	Example Use
Continued pre-training	Completion	`text`	Domain adaptation (legal corpus, codebase)
Instruction following	Chat/Instruction	`messages` with `role`/`content`	General assistant, task completion
Preference alignment	Preference pairs	`prompt`, `chosen`, `rejected`	DPO training, RLHF reward modeling
Structured output	Chat + schema	`messages` with JSON in assistant turn	Extraction, classification, code generation
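A small validator can catch malformed records in any of the three formats before they reach the trainer. This is a minimal sketch that assumes one JSON object per line (JSONL) and the field names shown above. The specific rules, such as requiring the last message to come from the assistant, are illustrative choices.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty = OK)."""
    problems = []
    if "messages" in record:  # chat / instruction format
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            return ["messages must be a non-empty list"]
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"message {i}: not an object")
                continue
            if msg.get("role") not in VALID_ROLES:
                problems.append(f"message {i}: unknown role {msg.get('role')!r}")
            content = msg.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append(f"message {i}: empty content")
        if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
            problems.append("last message should be the assistant target")
    elif {"prompt", "chosen", "rejected"} <= record.keys():  # preference format
        if record["chosen"] == record["rejected"]:
            problems.append("chosen and rejected are identical")
    elif "text" in record:  # completion format
        if not isinstance(record["text"], str) or not record["text"].strip():
            problems.append("empty text")
    else:
        problems.append("unrecognized format")
    return problems


def validate_jsonl(lines):
    """Map 1-based line number -> problems, for lines that fail."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            report[lineno] = ["invalid JSON"]
            continue
        if not isinstance(record, dict):
            report[lineno] = ["record is not a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            report[lineno] = problems
    return report
```

Calling `validate_jsonl(open("train.jsonl"))` (the file name is a placeholder) returns an empty dict when every line passes.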

Train–Production Distribution Alignment Very Important

A model performs well only when training data matches real usage. The most common cause of fine-tuning failure isn't bad hyperparameters — it's training on data that doesn't represent production.

Common Failure Pattern

❌ Training on clean, idealized examples

❌ Deploying on noisy, real-world inputs

❌ Performance collapse in production

❌ Team surprised: "It worked in testing!"

Training Data Must Include:

✅ Real user queries (sampled from logs)

✅ Edge cases and unusual inputs

✅ Typos, grammar errors, incomplete inputs

✅ Adversarial and out-of-scope inputs

The Golden Rule of Training Data

Your dataset should look like production logs — not curated examples. If your training data is cleaner than your production traffic, you're training a model for a world that doesn't exist. Resist the urge to "clean up" training examples too much. Real users don't write perfect queries.

Quality Over Quantity — The 1,000 Sample Rule In-depth

More data is only better if the data is consistently high quality. A 10K sample dataset with 30% low-quality examples will produce worse results than a 3K sample dataset where every example is excellent.

Quality vs quantity — typical fine-tuning performance curve

✅

High-Quality Sample Criteria

Correct: The target output is factually accurate and appropriate
Complete: No truncation, no partial responses
Consistent: Format matches other samples; no random variations
Representative: Covers the distribution of real inputs
Clear: Unambiguous instruction → response mapping

❌

Quality Problems That Ruin Training

Noisy labels: Wrong, inconsistent, or ambiguous targets
Duplicates: Same examples repeated — overfit to those patterns
Length bias: All short or all long — model learns length, not content
Format inconsistency: Mixed JSON styles, varying delimiters
Contamination: Test examples in training set — fake good results

The 1,000 High-Quality Sample Rule

For most task-specific fine-tuning, 1,000–5,000 high-quality examples is the sweet spot. Below 500, you risk underfitting. Above 10K, you get diminishing returns unless data quality remains excellent. A team that spends 80% of effort on data curation and 20% on training will outperform a team that does the opposite.

Detecting Overfitting Early Critical

Fine-tuned models often appear to improve during training — but are actually memorizing. When train loss decreases but validation loss stalls or increases, the model is overfitting to training examples rather than learning generalizable patterns.

🚨

Warning Signs of Overfitting

Loss divergence: Training loss decreases, validation does not
Output similarity: Outputs become overly similar to training examples
Brittleness: Performance drops on slightly different inputs
Memorization: Model reproduces training text verbatim
Narrow behavior: Model only handles exact patterns from training

🛡️

Mitigation Strategies

Strong held-out data: 10–20% of the dataset in total (validation plus test), held out strictly
Early stopping: Monitor validation loss, stop when it stalls
Fewer epochs: 1–3 epochs is often sufficient
More data diversity: Increase variety, not just volume
Out-of-distribution eval: Test on inputs unlike training

The Overfitting Test

After training, show the model inputs that are similar but not identical to training examples. If performance is significantly worse than on training-like inputs, you're overfitting. A well-generalized model should handle reasonable variations without degradation.
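The early-stopping strategy above can be sketched in a few lines. The function below replays a list of validation losses (one per evaluation step) and reports where patience-based early stopping would halt and which checkpoint it would keep. The `patience` and `min_delta` defaults are arbitrary assumptions.

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Simulate patience-based early stopping over validation losses.

    Returns (stop_index, best_index): the evaluation at which training would
    stop, and the evaluation whose checkpoint should be kept.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_index = 0
    best_loss = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_index = loss, i
            bad_evals = 0  # improvement resets the patience counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index


# Validation loss bottoms out at index 2, then climbs: classic overfitting.
val_losses = [1.20, 0.95, 0.90, 0.92, 0.97, 1.05]
stop_at, keep = best_checkpoint(val_losses)
print(f"stop at eval {stop_at}, keep checkpoint from eval {keep}")
```

If you train with the Hugging Face `Trainer`, its `EarlyStoppingCallback` applies the same patience idea during training, so you would not need to hand-roll this loop.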

Data Sources — Where Good Training Data Comes From Core

Source	Quality	Cost	Best For	Watch Out
Production logs	Real distribution	Free (you have it)	Domain adaptation, improving existing models	Needs labeling; may contain PII
Expert annotation	Highest quality	$50–$500+ per hour	Small high-quality datasets (500–2K)	Expensive at scale; expert availability
Crowd annotation	Variable	$0.10–$2 per sample	Scaling up with quality controls	Needs rigorous QA; inter-annotator agreement
Synthetic (LLM-generated)	Good if filtered	$0.001–$0.01 per sample	Bootstrapping, format examples, augmentation	Model collapse risk; echo chamber
Public datasets	Variable	Free	Pre-training, general instruction tuning	May be contaminated; license issues

Synthetic Data Generation — Using LLMs to Generate Training Data In-depth

A powerful technique: use a larger, more capable model (GPT-4o, Claude Sonnet) to generate training examples for a smaller model. Several early open-source instruction-tuned models were built this way: Alpaca was trained on Self-Instruct data generated by an OpenAI model, and Vicuna on user-shared ChatGPT conversations.

🔄

Self-Instruct Pattern

Generate diverse (instruction, response) pairs from a seed set. Use the LLM to create variations and new tasks.

Start with 100–200 seed examples
Generate 5K–50K synthetic examples
Filter aggressively for quality

🎯

Evol-Instruct Pattern

Take simple instructions and evolve them into harder, more complex versions. Creates curriculum progression.

Simple → Complex → Multi-step
Add constraints, edge cases
Reject trivial generations

⚖️

Distillation Pattern

Run your production queries through a large model; use outputs as training signal for a smaller model.

Deploy large model first
Log (input, output) pairs
Fine-tune small model on logs
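For the distillation pattern, the logged pairs have to be converted into the chat format before training. A minimal sketch, assuming each log row is a dict with `input` and `output` keys (hypothetical field names):

```python
import json


def logs_to_chat_jsonl(log_rows, system_prompt=None):
    """Turn logged (input, output) pairs into chat-format JSONL lines."""
    lines = []
    for row in log_rows:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": row["input"]})
        messages.append({"role": "assistant", "content": row["output"]})
        lines.append(json.dumps({"messages": messages}, ensure_ascii=False))
    return lines


rows = [{"input": "Explain X", "output": "X is..."}]
jsonl_lines = logs_to_chat_jsonl(rows, system_prompt="You are...")
print(jsonl_lines[0])
```

Logs usually need PII scrubbing and quality filtering before this step. The conversion itself is the easy part.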

🔧

Synthetic data generation prompt (example)

System: You are generating training data for a customer support classification model.

Generate 10 diverse customer support tickets that should be classified as "BILLING". Each ticket should be realistic, varied in tone (frustrated, neutral, polite), length (1-3 sentences), and phrasing.

Format each as:
Ticket: [ticket text]
Label: BILLING

Do not repeat patterns. Make each ticket distinct.

User: Generate 10 billing-related support tickets.
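Generated text in the `Ticket:` / `Label:` layout requested by the prompt still has to be parsed and filtered before it becomes training data. The sketch below assumes that layout is followed. The minimum-length rule and the in-batch duplicate check are simple illustrative filters, not a complete quality pipeline.

```python
import re

TICKET_PATTERN = re.compile(r"Ticket:\s*(.+?)\s*Label:\s*(\w+)", re.DOTALL)


def parse_tickets(generated_text, expected_label="BILLING", min_words=3):
    """Extract (ticket, label) pairs and drop obviously unusable ones."""
    seen = set()
    records = []
    for text, label in TICKET_PATTERN.findall(generated_text):
        text = " ".join(text.split())  # collapse whitespace and newlines
        key = text.lower()
        if label != expected_label:  # generator drifted off-task
            continue
        if len(text.split()) < min_words:  # too short to be realistic
            continue
        if key in seen:  # case-insensitive repeat within the batch
            continue
        seen.add(key)
        records.append({"text": text, "label": label})
    return records


sample = """Ticket: I was charged twice for my subscription this month.
Label: BILLING
Ticket: Refund?
Label: BILLING
Ticket: i was charged twice for my subscription this month.
Label: BILLING
Ticket: The app crashes when I open settings.
Label: TECHNICAL
"""
records = parse_tickets(sample)
print(records)
```

Of the four generated tickets, only the first survives: one is too short, one is a repeat, and one carries the wrong label.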

Synthetic Data Risks

Model collapse: Training on LLM outputs can amplify quirks and reduce diversity
Echo chamber: Synthetic data reflects the biases of the generating model
Format homogeneity: LLMs generate in patterns, so synthetic datasets often lack the noise and variation of real human data
Mitigation: Always filter synthetic data, mix with real data (20–50% real), and validate on held-out human-generated test sets

Deduplication — Preventing Overfitting to Repeated Patterns Core

Duplicate or near-duplicate examples cause the model to memorize rather than generalize. Deduplication is not optional — it is a required preprocessing step for any fine-tuning dataset.

Dedup Method	What It Catches	Implementation	When to Use
Exact hash	Identical text strings	MD5/SHA256 of normalized text	Always — baseline dedup
N-gram overlap	High textual similarity (70%+ overlap)	MinHash, n-gram Jaccard	Medium-sized datasets (<100K)
Embedding similarity	Semantic duplicates (same meaning, different words)	Embed + ANN search (FAISS) + threshold	When paraphrases are a concern
Train/test leakage check	Test examples leaked into training	Hash match between splits	Always — critical for valid eval

Dedup Pipeline (Recommended)

Step 1: Normalize text (lowercase, strip whitespace, remove special chars)
Step 2: Exact dedup by hash
Step 3: Near-dedup by MinHash (threshold 0.7)
Step 4: Check for train/test/val leakage
Step 5: Log dedup stats (how many removed, from which sources)

A 10% dedup rate is normal; >30% suggests data collection problems.
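The pipeline steps can be sketched with the standard library alone. One substitution to note: for near-duplicates this sketch computes exact Jaccard similarity over word 3-grams with a pairwise comparison, which is the quantity MinHash approximates. It is fine for small datasets, but at scale you would swap in a real MinHash/LSH implementation. The function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop non-alphanumeric characters, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 1.0


def dedup(texts, threshold=0.7):
    """Exact dedup by hash, then near-dedup by n-gram Jaccard (O(n^2))."""
    seen_hashes = set()
    kept, kept_shingles = [], []
    stats = {"exact": 0, "near": 0}
    for text in texts:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            stats["exact"] += 1
            continue
        seen_hashes.add(digest)
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            stats["near"] += 1
            continue
        kept.append(text)
        kept_shingles.append(sh)
    return kept, stats


def leaked(train_texts, test_texts):
    """Test examples whose normalized form also appears in the training set."""
    train_norm = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_norm]


texts = [
    "How do I reset my password?",
    "how do i reset my password",            # exact duplicate after normalization
    "How do I reset my password please?",    # near duplicate
    "Where can I download my invoice?",
]
kept, stats = dedup(texts)
print(kept, stats)
```

The returned `stats` dict covers the logging step. Tracking it per data source makes it easier to spot a source that contributes mostly duplicates.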

Dataset Splits — Train / Validation / Test Core

Recommended Splits

Training: 80–90% — model learns from this

Validation: 5–10% — used during training to tune hyperparams, detect overfitting

Test: 5–10% — held out until final evaluation; never touched during training

Rule: Validation and test sets must be representative of production distribution

Common Mistakes

❌ No validation set — can't detect overfitting during training

❌ Test set too small — results have high variance

❌ Leakage between splits — fake good metrics

❌ Random split when data has structure — should split by time, user, or document
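When the data has structure, one common way to avoid leakage is to assign whole groups to a split by hashing a group key. The sketch below assumes each example carries a `user_id` field (a hypothetical name) and uses a 80/10/10 split. The actual proportions will only approximate those targets when there are few groups.

```python
import hashlib


def assign_split(group_key, val_pct=10, test_pct=10):
    """Deterministically map a group key (e.g. a user ID) to a split.

    Hashing the key means every example from the same group lands in the
    same split, which prevents leakage across train/val/test.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_by_group(examples, key="user_id"):
    """Partition example dicts into splits based on their group key."""
    splits = {"train": [], "val": [], "test": []}
    for example in examples:
        splits[assign_split(str(example[key]))].append(example)
    return splits


examples = [{"user_id": f"u{i % 50}", "text": f"q{i}"} for i in range(500)]
splits = split_by_group(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the key, adding new data later never moves an existing user from one split to another. For time-based splits, a cutoff date plays the same role as the hash.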

Data Preparation Checklist — Before You Train Reference

Check	What to Verify	Red Flags
Format validation	All samples parse correctly; required fields present	JSON parse errors, missing `messages`, empty `content`
Length distribution	Reasonable spread of input/output lengths	All samples same length, or extreme outliers
Label balance	Classes roughly balanced (or intentionally weighted)	90% one class, 10% others → model ignores minority
Quality audit	Random sample of 50–100 manually reviewed	>5% have errors, inconsistencies, or low quality
Deduplication	Exact + near duplicates removed	>10% duplicates; any train/test leakage
Split validation	No overlap between train/val/test; splits representative	Same examples in train and test
Tokenization check	Samples don't exceed context window after tokenization	Samples truncated silently during training
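Two of the checklist rows, label balance and the length/tokenization check, reduce to a few lines of counting. The sketch below assumes token counts were already computed with the tokenizer of the model you plan to train. The 80% share limit and the 4096-token window are placeholder values.

```python
from collections import Counter
import statistics


def label_balance(labels, max_share=0.8):
    """Return (shares, flagged) where flagged lists classes above max_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    flagged = [label for label, share in shares.items() if share > max_share]
    return shares, flagged


def length_report(token_counts, context_window):
    """Summarize lengths and count samples that would be truncated."""
    return {
        "min": min(token_counts),
        "median": statistics.median(token_counts),
        "max": max(token_counts),
        "over_limit": sum(1 for n in token_counts if n > context_window),
    }


shares, flagged = label_balance(["BILLING"] * 9 + ["OTHER"])
report = length_report([100, 200, 300, 5000], context_window=4096)
print(shares, flagged, report)
```

A nonzero `over_limit` is the silent-truncation red flag from the table: those samples need to be shortened, split, or dropped before training.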

Data Flywheel — The Real Advantage High Impact

The true value of fine-tuning is not the model itself — it's the data pipeline you build. The model is a snapshot; the pipeline is an asset that compounds over time.

The data flywheel — continuous improvement engine

Flywheel Components

Data collection: Log production queries and model outputs

Failure identification: Find where model underperforms

Dataset improvement: Add examples for failure cases

Automated retraining: Regular model updates

Flywheel Compounding

Month 1: 1,000 examples, 70% quality

Month 3: 3,000 examples, 80% quality

Month 6: 8,000 examples, 90% quality

Month 12: Competitors can't catch up

The Flywheel Is the Moat

A fine-tuned model is a commodity — anyone can train one. A data pipeline that continuously improves from production usage is a moat. Teams that build the flywheel pull ahead over time; teams that treat fine-tuning as a one-time event get left behind.

∑ Chapter 02 — Key Takeaways

Fine-tuning performance is 90% data, 10% hyperparameters — data engineering is the real work
Three main formats: completion (text), chat/instruction (messages), preference (chosen/rejected pairs)
1,000–5,000 high-quality samples is the sweet spot — more data only helps if quality stays high
Synthetic data is powerful but risky: filter aggressively, mix with real data, validate on human test sets
Deduplication is not optional — exact hash → MinHash → embedding similarity → train/test leakage check
Before training: format validation, length distribution, label balance, quality audit, dedup, split validation, tokenization check

Chapter 03 · Parameter-Efficient

LoRA & PEFT — Train 1% of Parameters, Get 95% of the Gains

Parameter-efficient fine-tuning (PEFT) is the practical way to fine-tune large models. Instead of updating all 7B–70B parameters, you train small adapter layers that modify model behavior. By the estimates in this guide, that cuts GPU memory by roughly 50% with LoRA and 80%+ with QLoRA, lowers compute cost by about 10×, and makes experimentation fast. For most use cases, PEFT comes close to full fine-tuning quality while being dramatically easier to run.

What Is LoRA — Low-Rank Adaptation Explained Foundation

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable matrices to specific layers. Instead of updating a 4096×4096 weight matrix, you learn two small matrices (4096×r and r×4096, where r=8–64) that together approximate the weight update. The original model is unchanged — you're just learning a delta.

LoRA adds low-rank matrices to frozen base weights

Why LoRA Works

The key insight: fine-tuning updates tend to live in a low-rank subspace. You don't need to modify all ~16.8M parameters of a 4096×4096 matrix; the behavioral changes can be captured by two small matrices totaling roughly 65K–520K parameters (r=8–64). The base model provides general intelligence; LoRA provides task-specific steering. At inference, the product A×B can be merged into W, so a merged model has zero inference overhead.
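The parameter savings follow from simple arithmetic: a rank-r pair of matrices for a d_out × d_in weight has r × (d_in + d_out) entries. A short sketch for the 4096×4096 case discussed above:

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


full = 4096 * 4096  # parameters in the frozen 4096x4096 matrix
for r in (8, 16, 64):
    added = lora_param_count(4096, 4096, r)
    print(f"r={r:>2}: {added:>7,} trainable params ({added / full:.2%} of the full matrix)")
```

These counts are per adapted matrix. The total for a model is this figure multiplied by the number of matrices you attach adapters to, across all layers.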

Rank and Alpha — The Two Key Hyperparameters Core

Hyperparameter	What It Controls	Recommended Range	Trade-off
Rank (r)	Capacity of the adaptation — how much information the LoRA can encode	8–64 (start with 16)	Higher = more capacity but more params and risk of overfitting
Alpha (α)	Scaling factor for the LoRA update — controls magnitude of behavior change	α = r or α = 2r (default: same as rank)	Higher = stronger adaptation; too high = instability
Target modules	Which layers get LoRA adapters	q_proj, v_proj (minimum) or all attention + MLP	More modules = more capacity but slower training
Dropout	Regularization during training	0.05–0.1 (often 0)	Helps prevent overfitting on small datasets
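To see how rank and alpha interact, it helps to write the update out. In the usual formulation the low-rank product is scaled by alpha / r before being added to the frozen weight, so the same scale applies whether the adapter runs separately or is merged. The pure-Python sketch below uses tiny matrices and follows this chapter's A×B ordering (A is d×r, B is r×k). The LoRA paper writes the same product with the names swapped.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_forward(W, A, B, x, alpha, r):
    """Unmerged path: base output plus the scaled low-rank update."""
    base = matmul(W, x)
    delta = matmul(A, matmul(B, x))  # A (d x r) @ B (r x k) @ x
    scale = alpha / r
    return [[b[0] + scale * d[0]] for b, d in zip(base, delta)]


def merge(W, A, B, alpha, r):
    """Merged weights: W + (alpha / r) * A @ B."""
    AB = matmul(A, B)
    scale = alpha / r
    return [[w + scale * ab for w, ab in zip(w_row, ab_row)] for w_row, ab_row in zip(W, AB)]


# Tiny example: d = k = 3, rank r = 1, alpha = 2 (so the scale is 2.0).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]   # d x r
B = [[0.5, 0.0, -0.5]]      # r x k
x = [[2.0], [4.0], [6.0]]   # column vector

unmerged = lora_forward(W, A, B, x, alpha=2, r=1)
merged = matmul(merge(W, A, B, alpha=2, r=1), x)
print(unmerged, merged)
```

Both paths give the same output, which is why merging adds no inference cost. Doubling alpha at a fixed rank doubles the size of the update, which matches the "stronger adaptation" trade-off in the table.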

🎯

Simple Task (r=8)

Classification, sentiment, simple extraction. Low capacity needed — small LoRA prevents overfitting.

Params: ~65K per adapted 4096×4096 matrix
Memory: +2% vs base
Risk: May underfit complex tasks

⚖️

Standard Task (r=16–32)

Instruction following, domain adaptation, style transfer. Good balance of capacity and efficiency.

Params: ~130K–260K per adapted 4096×4096 matrix
Memory: +5% vs base
Most common production choice

🚀

Complex Task (r=64+)

Major behavior changes, multi-task, domain pre-training. Higher capacity when you have data to support it.

Params: ~500K–2M per adapted 4096×4096 matrix (r=64–256)
Memory: +10% vs base
Risk: Overfitting on small datasets

QLoRA — 4-bit Quantization + LoRA In-depth

QLoRA combines 4-bit quantization of the base model with LoRA training. The base model is loaded in 4-bit (NF4) format, reducing weight memory by about 4× relative to 16-bit. The LoRA adapters are still trained in higher precision (typically bf16). This allows fine-tuning a 7B model on a single 16GB GPU or a 70B model on a single A100 80GB.

Model Size	Full FT Memory	LoRA Memory	QLoRA (4-bit) Memory	Consumer GPU?
7B	~56GB (4× A100)	~28GB (A100 40GB)	~8GB (RTX 4090 / 3090)	✅ Yes
13B	~104GB (8× A100)	~52GB (A100 80GB)	~14GB (RTX 4090)	✅ Yes
70B	~560GB (16× A100)	~280GB (8× A100)	~48GB (A100 80GB)	❌ No
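The weight-only part of these figures is easy to check by hand: parameters times bits per parameter, divided by 8 to get bytes. The sketch below computes only that part. The totals in the table are higher because training also needs memory for gradients, optimizer state, activations, and the adapters themselves, none of which is modeled here.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    nf4 = weight_memory_gb(size, 4)
    print(f"{size}B: {fp16:.1f} GB in 16-bit, {nf4:.1f} GB in 4-bit")
```

The gap between the 4-bit weight footprint and the QLoRA column is the training overhead, which grows with batch size and sequence length.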

🔧

QLoRA training config (Hugging Face PEFT)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load base model in 4-bit (NF4 weights, bf16 compute)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

QLoRA Quality vs Memory Trade-off

QLoRA reduces memory by 4× but introduces quantization error. For most tasks, quality is within 1–2% of full-precision LoRA. However: (1) very small datasets may overfit more easily with QLoRA, (2) complex reasoning tasks sometimes benefit from full precision, (3) merging QLoRA adapters to full precision for serving requires dequantization. Start with QLoRA; switch to full-precision LoRA only if quality is noticeably worse.

Beyond LoRA — DoRA, AdaLoRA, and Variants Core

Method	Key Idea	When to Use	Status
LoRA	Low-rank matrices A, B added to frozen weights	Default choice — well-understood, broadly supported	Production-ready
QLoRA	LoRA + 4-bit base model quantization	When GPU memory is constrained	Production-ready
DoRA	Decomposes weights into magnitude + direction; learns direction	Small quality improvement over LoRA in some tasks	Emerging — worth testing
AdaLoRA	Adaptive rank — automatically allocates rank per layer	When optimal rank varies by layer	Experimental
LoRA+	Different learning rates for A, B matrices	Minor optimization; easy to add	Experimental
Prefix Tuning	Prepend trainable "virtual tokens" to input	Legacy approach — LoRA generally better	Largely superseded

Practical Recommendation

Start with LoRA or QLoRA — they're the most tested, best supported, and work for 90%+ of use cases. Try DoRA if you need an extra 1–2% quality improvement and are willing to experiment. Everything else is research-stage or niche — don't use unless you have a specific reason and can validate the improvement on your eval set.

Adapter Merging — From Training to Deployment In-depth

During inference, you can either (1) load adapters separately and apply at runtime, or (2) merge adapters into base weights for a single merged model. Merging is preferred for production — zero inference overhead, simpler deployment.

Separate Adapters

Pros: Can swap adapters at runtime; multiple adapters per base model

Cons: Slight inference overhead; more complex serving

Use when: Multi-tenant serving, A/B testing adapters

Merged Model

Pros: Zero overhead; standard model format; simple deployment

Cons: Can't swap at runtime; produces a full model copy

Use when: Single-purpose deployment (most cases)

🔧

Merging LoRA adapters into base model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load base model, tokenizer, and adapter
base_model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge adapter into base weights
merged_model = model.merge_and_unload()

# Save as standard model (no adapter dependency)
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")

Inference Trade-offs of Fine-Tuned Models Important

Fine-tuning changes how a model behaves at inference time — not just on your target task, but across all tasks. These trade-offs affect production system design.

✅

What Improves

Task consistency: More reliable outputs on trained patterns
Format compliance: Better adherence to target structure
Latency (if distilled): Smaller model can match larger model
Reduced prompting: Less instruction needed in prompt

⚠️

What May Degrade

Flexibility: Less adaptable to unexpected inputs
General capability: Worse on tasks outside training distribution
Creativity: More constrained, less diverse outputs
Instruction following: May ignore prompts that conflict with training

Production System Pattern: Model Routing

Production systems often route between base and fine-tuned models: use the fine-tuned model for in-domain queries where it excels, fall back to the base model for out-of-domain or uncertain cases. This preserves flexibility while gaining specialization. Implement confidence-based routing or query classification to decide which model handles each request.
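A minimal sketch of confidence-based routing, assuming you already have a query classifier that returns an in-domain probability. The `RoutingDecision` type, the `route_request` function, and the 0.8 threshold are hypothetical and should be tuned on your own traffic.

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    model: str    # "fine_tuned" or "base"
    reason: str

def route_request(in_domain_score: float, threshold: float = 0.8) -> RoutingDecision:
    """Pick a model from a classifier's in-domain probability (hypothetical score)."""
    if not 0.0 <= in_domain_score <= 1.0:
        raise ValueError("in_domain_score must be a probability in [0, 1]")
    if in_domain_score >= threshold:
        return RoutingDecision("fine_tuned", "confident in-domain query")
    # Anything uncertain falls back to the more flexible base model.
    return RoutingDecision("base", "out-of-domain or uncertain query")

decision = route_request(0.93)
```

Defaulting to the base model on uncertainty is a deliberate choice here: a wrong route to the specialized model tends to fail harder than a wrong route to the general one.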

∑ Chapter 03 — Key Takeaways

LoRA trains 0.1–1% of parameters by adding small low-rank matrices to frozen base weights
Key hyperparams: rank r (start 16), alpha (= r or 2r), target modules (q, v minimum)
QLoRA combines 4-bit quantization + LoRA — fine-tune 7B on a 16GB GPU
DoRA offers small quality gains; everything else is experimental — stick with LoRA/QLoRA
Merge adapters for production — zero inference overhead, simpler deployment
LoRA matches full fine-tuning quality for 90%+ of tasks at 10× lower cost

Chapter 04 · Training at Scale

Full Fine-Tuning — When to Update All Weights

Full fine-tuning updates every parameter in the model. It's the most powerful form of adaptation — and the most dangerous. Use it when LoRA isn't enough, you have abundant high-quality data, and you're prepared to invest in compute and validation. Most teams never need it; some absolutely do.

When Full Fine-Tuning Makes Sense Core

Scenario	Why Full FT?	Typical Data Volume
Continued pre-training	Adding domain knowledge to the base model (code, legal, medical corpus)	10M–1B+ tokens
Language adaptation	Adapting to a new language the base model doesn't handle well	1B+ tokens
LoRA ceiling reached	LoRA quality plateaus; ablation shows more capacity needed	50K–500K samples
Model distillation	Training a smaller model to mimic a larger model's outputs	100K–1M samples
Safety fine-tuning	Deep behavioral changes that touch many capabilities	10K–100K samples

The Decision Rule

Try LoRA first. Measure quality. If quality is not sufficient and you have evidence that more capacity is needed (ablation with higher rank doesn't help, or quality improves with more data but plateaus), then consider full fine-tuning. Full fine-tuning is a last resort, not a default.

Compute Requirements — What You Actually Need In-depth

Model Size	Min GPUs (FSDP/ZeRO-3)	Memory per GPU	Training Time (10K samples)	Cloud Cost
7B	4× A100 40GB	~28GB per GPU	8–16 hours	$200–$500
13B	8× A100 40GB	~32GB per GPU	16–32 hours	$500–$1,500
70B	16× A100 80GB	~70GB per GPU	48–120 hours	$5,000–$20,000
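The per-GPU figures above can be roughly sanity-checked with a common rule of thumb: mixed-precision Adam training needs about 16 bytes per parameter for weights, gradients, and optimizer states. The sketch below applies that rule under full sharding; it ignores activations and framework overhead, so treat it as a lower bound rather than a sizing tool. The function name is illustrative.

```python
def sharded_state_gb(params_billion: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Per-GPU memory (GB) for weights, gradients, and Adam states under full sharding.

    Assumes roughly 16 bytes per parameter for mixed-precision Adam:
    2 (bf16 weights) + 2 (bf16 gradients) + 12 (fp32 master weights, momentum, variance).
    Activations, buffers, and fragmentation are NOT included.
    """
    total_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return total_gb / num_gpus

for size, gpus in [(7, 4), (13, 8), (70, 16)]:
    estimate = sharded_state_gb(size, gpus)
    print(f"{size}B on {gpus} GPUs: ~{estimate:.0f} GB per GPU before activations")
```

Under these assumptions the estimate comes out at 28 GB, 26 GB, and 70 GB per GPU, which is in the same range as the table. Whatever is left on each card has to hold activations, so batch size and sequence length decide whether a configuration actually fits.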

⚙️

FSDP (PyTorch Native)

Fully Sharded Data Parallel — shards model, optimizer, gradients across GPUs. First choice for PyTorch users.

Built into PyTorch ≥2.0
Good Hugging Face integration
Requires homogeneous GPU cluster

🚀

DeepSpeed ZeRO

Microsoft's distributed training library. ZeRO-3 shards everything; ZeRO-Offload uses CPU memory.

More memory-efficient than FSDP
Better for heterogeneous setups
Slightly more complex config

📦

Cloud Platforms

Managed training: Lambda Labs, RunPod, AWS SageMaker, Google Vertex AI.

Pre-configured multi-GPU
Spot instances for cost savings
Pay per hour — no capital expense

Learning Rate — The Most Dangerous Hyperparameter Core

Full fine-tuning requires learning rates 10–100× smaller than pre-training. Too high → catastrophic forgetting and instability. Too low → no learning. The optimal range is narrow and model-dependent.

Hyperparameter	Typical Range	Notes
Learning rate	1e-6 to 5e-5 (start: 2e-5)	10–100× lower than pre-training LR
LR schedule	Cosine decay or linear decay	Warm up for first 3–10% of steps
Batch size	32–256 (effective, after gradient accumulation)	Larger = more stable; limited by memory
Epochs	1–3 (often just 1)	More epochs → overfitting on small datasets
Weight decay	0.01–0.1	Regularization — prevents overfitting
Gradient clipping	1.0	Prevents gradient explosion
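The schedule and batch-size rows in the table can be made concrete with a small sketch. It assumes linear warmup followed by cosine decay and the usual definition of effective batch size; the function names and the 5% warmup default are illustrative choices, not a specific library's API.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.05, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def effective_batch_size(per_device: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * num_gpus * grad_accum_steps

# Example: 4 samples per GPU, 4 GPUs, 8 accumulation steps
batch = effective_batch_size(per_device=4, num_gpus=4, grad_accum_steps=8)
```

With these example numbers the effective batch size is 128, inside the 32–256 range from the table, while each GPU only ever holds 4 samples at a time.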

Catastrophic Forgetting Is Real

With full fine-tuning, the model can lose much of what it learned in pre-training. Symptoms: degraded general conversation, broken instruction following on unrelated tasks, loss of chain-of-thought ability. Prevention: (1) low learning rate, (2) short training (1–3 epochs), (3) mix in general instruction data (10–20%), (4) evaluate on general benchmarks during training, not just your task.

Monitoring Training — What to Watch Core

📈

Healthy Training Signs

Loss decreases smoothly — no spikes or plateaus after warmup
Validation loss tracks training loss — gap stays small
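Gradient norm stable: no sudden explosions (clipping at 1.0 caps the norm that is applied; many trainers log the pre-clip norm, which can sit above 1.0 but should not spike)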
Eval metrics improve on task — accuracy/quality on hold-out set
General benchmarks stable — MMLU, HumanEval don't drop significantly

🚨

Warning Signs — Stop and Investigate

Loss spikes — learning rate too high or data corruption
Validation loss increases — overfitting; stop training
NaN loss — numerical instability; reduce LR, check data
General capability drops — catastrophic forgetting; mix in general data
Repetitive/degenerate outputs — model collapsed; restart with lower LR
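Several of these warning signs can be checked automatically from logged losses. The sketch below is one possible heuristic, assuming you have lists of training and validation losses in logging order; the function name, the 2× spike factor, and the patience of 3 evaluations are arbitrary starting points.

```python
import math

def training_warnings(train_losses, val_losses, spike_factor=2.0, patience=3):
    """Return a list of warning strings derived from logged loss histories."""
    warnings = []
    if any(math.isnan(x) or math.isinf(x) for x in train_losses):
        warnings.append("nan_or_inf_loss")
    finite = [x for x in train_losses if math.isfinite(x)]
    # Spike: a loss more than spike_factor times the previous logged loss.
    if any(cur > spike_factor * prev for prev, cur in zip(finite, finite[1:])):
        warnings.append("loss_spike")
    # Overfitting: validation loss rose for `patience` consecutive evaluations.
    recent = val_losses[-(patience + 1):]
    if len(recent) == patience + 1 and all(b > a for a, b in zip(recent, recent[1:])):
        warnings.append("val_loss_increasing")
    return warnings
```

General capability drops and degenerate outputs cannot be seen in the loss curve, so they still need benchmark runs and sampled generations.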

Checkpoint Strategy

Save checkpoints every 500–1000 steps. Keep the last 3, plus the checkpoints with the best validation loss and the best task metric, so you can revert if training degrades. Full fine-tuning checkpoints are large: Adam keeps two extra states per parameter, so a resumable checkpoint is several times the size of the weights alone. Budget storage accordingly. If storage is tight, set save_only_model=True to save just the weights, with the trade-off that you cannot resume training from that checkpoint.
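The retention rule above (last 3, best validation loss, best task metric) can be sketched as a small selection function. The dictionary keys and the example history are hypothetical; adapt them to whatever your training logs record.

```python
def checkpoints_to_keep(checkpoints, keep_last=3):
    """Select checkpoint steps to retain.

    `checkpoints` is a list of dicts with hypothetical keys:
    "step", "val_loss" (lower is better), "task_metric" (higher is better).
    """
    if not checkpoints:
        return set()
    by_step = sorted(checkpoints, key=lambda c: c["step"])
    keep = {c["step"] for c in by_step[-keep_last:]}
    keep.add(min(checkpoints, key=lambda c: c["val_loss"])["step"])
    keep.add(max(checkpoints, key=lambda c: c["task_metric"])["step"])
    return keep

history = [
    {"step": 500, "val_loss": 1.40, "task_metric": 0.71},
    {"step": 1000, "val_loss": 1.22, "task_metric": 0.78},
    {"step": 1500, "val_loss": 1.18, "task_metric": 0.80},
    {"step": 2000, "val_loss": 1.25, "task_metric": 0.83},
    {"step": 2500, "val_loss": 1.31, "task_metric": 0.82},
    {"step": 3000, "val_loss": 1.37, "task_metric": 0.81},
]
keep = checkpoints_to_keep(history)
```

In this made-up history the best validation loss (step 1500) is older than the last three checkpoints, which is exactly the case a "keep last N" policy alone would lose.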

∑ Chapter 04 — Key Takeaways

Full fine-tuning updates all parameters — use only when LoRA isn't enough and you have abundant data
Use cases: continued pre-training, language adaptation, LoRA ceiling reached, distillation
Compute: 7B needs 4× A100 40GB, 70B needs 16× A100 80GB — cloud cost $200–$20K
Learning rate: 1e-6 to 5e-5 (start 2e-5); 10–100× lower than pre-training
Catastrophic forgetting: low LR, 1–3 epochs, mix in general data, monitor general benchmarks
Save checkpoints every 500–1000 steps; keep best validation + best task metric + last 3

Chapter 05 · Alignment Training

SFT vs DPO vs RLHF — Choosing Your Training Objective

Not all fine-tuning is the same. The training objective determines what the model learns: imitate examples (SFT), prefer better responses (DPO), or optimize a reward signal (RLHF). Each has different data requirements, complexity, and outcomes. Most teams should start with SFT; add DPO when preference data is available; avoid RLHF unless you have specific alignment needs.

Training Objectives — The Three Approaches Foundation

Training objective progression — complexity vs capability

Supervised Fine-Tuning (SFT) — Learning from Examples Core

SFT is the simplest and most common fine-tuning objective. You show the model (input, target output) pairs, and it learns to maximize the probability of producing the target. This is what most people mean when they say "fine-tuning."

✅

When to Use SFT

You have examples of the exact output you want
Task has clear right/wrong answers
You're doing instruction tuning from scratch
You need format control (JSON, specific templates)
First fine-tuning attempt — start here

⚠️

SFT Limitations

Model learns to imitate — even bad patterns in data
Can't express "this is better than that" directly
Quality ceiling is your data quality
Quantity-sensitive — needs 1K–50K examples
Doesn't optimize for human preferences explicitly

The SFT Data Formula

Data format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Volume: 1K–5K high-quality examples for task-specific tuning; 10K–100K for general instruction tuning. Quality bar: every example should be one you'd be happy to ship in production. 1K excellent examples beat 10K mediocre ones.
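Because the quality bar applies to every record, a cheap structural check before training is worthwhile. The following sketch validates one JSONL line against the messages format described above; the specific rules (allowed roles, final assistant turn) are assumptions about a typical SFT setup, not requirements of any particular trainer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_sft_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown or missing role")
        elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems
```

Structural checks like this catch broken records, not mediocre ones; content quality still needs human or model-assisted review.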

Direct Preference Optimization (DPO) — Learning from Comparisons In-depth

DPO learns from preference pairs: given the same prompt, which response is better? This is more directly aligned with how humans judge quality — we often know "A is better than B" even when we can't write the perfect response ourselves.

Aspect	SFT	DPO
Data format	(prompt, response)	(prompt, chosen, rejected)
What it learns	Maximize probability of target	Increase prob(chosen) relative to prob(rejected)
Data collection	Write ideal responses	Generate two responses, pick better one
Typical volume	1K–50K examples	5K–50K preference pairs
Training stability	Very stable	Moderately stable (watch KL divergence)
Complexity	Simple	Requires reference model + DPO loss

📄

DPO data format

{
  "prompt": "Explain photosynthesis in one sentence.",
  "chosen": "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen.",
  "rejected": "It's when plants make food from the sun or something."
}
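For intuition about what "increase prob(chosen) relative to prob(rejected)" means numerically, here is the DPO loss for a single preference pair written in plain Python. It assumes you already have summed sequence log-probabilities from the policy and from the frozen reference model; the function name and the beta value of 0.1 are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: loss is log(2), the "no preference learned" value.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has raised the chosen response and lowered the rejected one: loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The reference log-probabilities are what keep the policy anchored: the loss rewards moving relative to the reference, not raw likelihood.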

When to Add DPO After SFT

The typical pipeline: SFT first → then DPO. SFT teaches the model what kind of responses to produce; DPO refines which responses are better. Add DPO when: (1) you have clear preference signals (human feedback, A/B test results), (2) SFT quality plateaus but you can still rank responses, (3) you need to reduce specific failure modes (safety, tone, verbosity) that are easier to express as "not this" than "do this."

RLHF — Reinforcement Learning from Human Feedback Core

RLHF is the full alignment approach used by OpenAI, Anthropic, and others to train frontier models. It involves training a separate reward model on human preferences, then using RL (PPO) to optimize the LLM against that reward signal.

RLHF Pipeline

Step 1: Collect human preference data (A vs B rankings)

Step 2: Train a reward model to predict human preferences

Step 3: Use PPO to optimize LLM to maximize reward model score

Step 4: Include a KL penalty against the reference (SFT) model in the PPO objective to limit drift and reward hacking

Result: Model optimizes for human preferences end-to-end
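The KL penalty in the pipeline above is commonly folded into the reward that the RL step optimizes. A minimal sketch, assuming a per-sample log-ratio estimate of the KL term; the function name and coefficient are hypothetical.

```python
def kl_shaped_reward(reward_model_score, policy_logp, ref_logp, kl_coef=0.1):
    """Reward used for the RL update: reward model score minus a KL penalty estimate.

    policy_logp - ref_logp is a single-sample estimate of how far the policy
    has moved from the reference model on this response.
    """
    return reward_model_score - kl_coef * (policy_logp - ref_logp)

# Same reward model score, but the second response has drifted from the reference.
on_reference = kl_shaped_reward(1.0, policy_logp=-5.0, ref_logp=-5.0)
drifted = kl_shaped_reward(1.0, policy_logp=-2.0, ref_logp=-5.0)
```

A response that the policy has made much more likely than the reference would pays a penalty, so the policy cannot chase reward model score without limit.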

Why Most Teams Shouldn't Use RLHF

❌ Requires training a separate reward model (expensive)

❌ RL training (PPO) is unstable and hard to tune

❌ Reward hacking is a real failure mode

❌ Requires large volumes of human preference annotations (often tens of thousands to 100K+)

✅ DPO achieves 90% of the benefit at 10% of the complexity

DPO Is Usually Enough

DPO was designed as a simpler alternative to RLHF that doesn't require a separate reward model or RL training. In practice, DPO matches RLHF quality for most alignment tasks at a fraction of the complexity. Unless you're a frontier lab with dedicated alignment researchers, use DPO instead of RLHF. The infrastructure and expertise required for stable RLHF training is not worth it for most production use cases.

Decision Guide — Which Objective to Use Reference

Your Situation	Recommended Objective	Rationale
First fine-tuning attempt	SFT	Simplest, fastest iteration, establishes baseline
You have ideal target outputs	SFT	Directly teach the model what to produce
Quality plateau + preference data available	SFT → DPO	SFT establishes capability; DPO refines quality
Reducing specific failure modes	DPO (after SFT)	Easier to express "not this" than "do this"
Safety / alignment at frontier scale	RLHF	Only if you have dedicated alignment team and resources
General instruction following	SFT (then optional DPO)	Proven pipeline from Alpaca → Vicuna → etc.

The Modern Training Pipeline — SFT + DPO Core

1️⃣ Base Model: Llama, Mistral, etc.

2️⃣ SFT: 10K–50K examples

3️⃣ Evaluate: task metrics + general

4️⃣ DPO (optional): 5K–20K pref pairs

5️⃣ Final Eval: ship if passing

The Practical Truth

90% of production fine-tuning is SFT-only. It's simple, it works, and it's what you should start with. DPO adds 5–15% quality improvement when you have good preference data and SFT has plateaued. RLHF is for frontier labs. Don't over-engineer your training objective — get SFT right first, add DPO if needed, and skip RLHF unless you have a very specific reason and resources.

∑ Chapter 05 — Key Takeaways

Three training objectives: SFT (learn from examples), DPO (learn from preferences), RLHF (optimize a reward model)
Start with SFT — simplest, fastest, works for 90% of use cases
SFT data: (prompt, response) pairs; DPO data: (prompt, chosen, rejected) triplets
Add DPO after SFT when: quality plateaus, you have preference data, reducing specific failures
DPO ≈ RLHF quality at 10% complexity — use DPO instead unless you're a frontier lab
Modern pipeline: Base → SFT → Eval → (optional) DPO → Final Eval → Ship

Chapter 06 · Quality Assurance

Evaluation — Measuring What Your Fine-Tuned Model Actually Does

Fine-tuning without evaluation is guessing. You must measure both your task performance AND general capability retention. A model that aces your custom task but forgets how to reason is not a success. Build an eval suite before training, run it continuously, and never ship a model that hasn't passed your quality gates.

The Evaluation Hierarchy — What to Measure and When Foundation

Evaluation for fine-tuning is different from evaluating base models. You need to measure both task-specific improvement and general capability preservation. The hierarchy prioritizes what matters most.

Evaluation priorities — from specific to general

Eval Level	What It Measures	When to Run	Failure Outcome
① Task-specific	Does the model do YOUR job better?	Every checkpoint, every experiment	Model isn't useful — retrain or adjust data
② Format compliance	Does output match required structure?	Every checkpoint	Downstream parsing fails — adjust training data
③ Safety/regression	Did we break safety or introduce new failures?	Before shipping, major changes	Ship blocker — model produces harmful content
④ General capability	Did we lose general intelligence?	Before shipping, weekly during iteration	Catastrophic forgetting — adjust LR, add general data

Golden Test Sets — Your Ground Truth Core

A golden test set is a curated collection of examples with known-correct answers that you use to evaluate every model iteration. It's your source of truth and should be treated as sacred — never train on it, never modify it casually.

✅

Golden Set Best Practices

Size: 200–500 examples minimum; 1000+ for high-stakes
Diversity: Cover all task subtypes, edge cases, difficulty levels
Expert-verified: Every answer reviewed by domain expert
Version-controlled: Git track with change history
Never contaminated: Must not appear in training data

❌

Golden Set Anti-Patterns

Too small: <100 examples — results have high variance
Homogeneous: All easy examples — misses edge cases
Stale: Not updated as task evolves
Leaked: Examples also in training data — inflated scores
Ambiguous: Multiple correct answers without accounting for them

The 10% Rule for Golden Sets

Allocate at least 10% of your data curation effort to building and maintaining your golden test set. This is non-negotiable. A team that spends all effort on training data but has a weak eval set will ship bad models and not know it. The golden set is how you know if your fine-tuning is working.

Task-Specific Metrics — Measuring What Matters In-depth

Task Type	Primary Metric	Secondary Metrics	Implementation
Classification	Accuracy, Macro-F1	Per-class precision/recall, confusion matrix	Exact match against labels
Extraction (NER, slots)	Entity-level F1	Partial match rate, span accuracy	Compare extracted entities to gold
Generation (summaries, content)	ROUGE-L, BERTScore	Human preference, factual accuracy	Automated + sampling for human review
Structured output (JSON)	Schema validity + field accuracy	Parse success rate, field-level F1	JSON parse test + field extraction check
Code generation	Pass@1, Pass@5	Syntax validity, test case pass rate	Execute against test cases
QA / reasoning	Exact match, LLM-as-judge	Chain-of-thought quality	String match + GPT-4 evaluation
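As one concrete instance of the table, the structured-output row can be implemented with the standard library alone. The sketch assumes flat JSON objects and exact-match field comparison; the function name, field names, and example records are hypothetical.

```python
import json

def structured_output_metrics(outputs, gold_records, required_fields):
    """Parse success rate, schema validity rate, and field accuracy for JSON outputs."""
    parsed = valid = correct_fields = 0
    for raw, gold in zip(outputs, gold_records):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against every metric
        parsed += 1
        if isinstance(obj, dict) and all(f in obj for f in required_fields):
            valid += 1
            correct_fields += sum(obj[f] == gold[f] for f in required_fields)
    n = len(gold_records)
    return {
        "parse_rate": parsed / n,
        "schema_valid_rate": valid / n,
        "field_accuracy": correct_fields / (n * len(required_fields)),
    }

outputs = [
    '{"code": "A1", "severity": "high"}',
    '{"code": "B2", "severity": "low"}',
    'Sure! Here is the JSON you asked for.',
]
gold = [
    {"code": "A1", "severity": "high"},
    {"code": "B2", "severity": "moderate"},
    {"code": "C3", "severity": "low"},
]
metrics = structured_output_metrics(outputs, gold, ["code", "severity"])
```

Reporting the three numbers separately matters: a model can parse perfectly and still fill fields with wrong values, and the fix for each failure is different.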

Automated Metrics Have Limits

BLEU, ROUGE, and even BERTScore correlate poorly with human preferences for open-ended generation. They're useful for directional signals but not final quality judgment. For generation tasks, always supplement automated metrics with LLM-as-judge evaluation and periodic human spot-checks (review 20–50 random examples each iteration).

LLM-as-Judge — Using GPT-4 to Evaluate Your Fine-Tuned Model Core

For open-ended tasks where exact matching fails, use a stronger model (GPT-4o, Claude Sonnet) to judge the quality of your fine-tuned model's outputs. This correlates better with human preferences than traditional metrics.

🔧

LLM-as-Judge prompt template (pairwise comparison)

System: You are an expert evaluator comparing two AI assistant responses.
Compare Response A and Response B for the given task. Consider:
1. Correctness: Is the information accurate?
2. Completeness: Does it fully address the query?
3. Clarity: Is the response well-structured and clear?
4. Relevance: Does it stay on topic?
Respond with ONLY one of: "A" (A is better), "B" (B is better), or "TIE" (roughly equal).

User:
Task: {task_description}
Input: {user_input}
Response A: {response_a}
Response B: {response_b}
Which response is better?

LLM-as-Judge Benefits

Scalable: Evaluate 1000s of examples at $0.01–$0.05 each

Consistent: No annotator fatigue or mood variation

Nuanced: Can evaluate subtle quality differences

Fast: Results in minutes, not days

LLM-as-Judge Limitations

Position bias: May prefer first or second response

Verbosity bias: May prefer longer responses

Self-preference: GPT-4 may prefer GPT-4-style outputs

Mitigation: Randomize order, calibrate with human baseline

LLM-as-Judge Calibration

Before trusting LLM-as-judge, validate against 50–100 human-labeled examples. Compute agreement rate (should be >80% for binary better/worse). If agreement is low, refine your evaluation prompt or criteria. Also test for position bias by running each comparison twice with swapped order — disagreement rate should be <10%.
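Both calibration checks reduce to simple counting once the verdicts are collected. A minimal sketch, assuming verdicts are the strings "A", "B", or "TIE" as in the prompt template; the function names are illustrative.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge's verdict matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

def position_flip_rate(original_order, swapped_order):
    """Fraction of comparisons whose verdict changes when A and B are swapped.

    Verdicts from the swapped run are mapped back first: "A" <-> "B", "TIE" stays.
    """
    unswap = {"A": "B", "B": "A", "TIE": "TIE"}
    flips = sum(o != unswap[s] for o, s in zip(original_order, swapped_order))
    return flips / len(original_order)

agreement = agreement_rate(["A", "B", "TIE", "A"], ["A", "B", "A", "A"])
flip_rate = position_flip_rate(["A", "B", "A"], ["B", "A", "A"])
```

In the second call the third comparison returns "A" in both orders, meaning the judge picked whichever response came first; that is the position bias the <10% threshold is meant to catch.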

Benchmark Contamination — The Silent Killer of Valid Evals In-depth

Benchmark contamination occurs when your test data appears in training data. The model memorizes answers rather than learning to reason. This is the #1 cause of inflated evaluation scores that don't reflect production performance.

Contamination Type	How It Happens	Detection Method	Prevention
Direct leakage	Test set examples in training set	Hash matching, n-gram overlap	Strict train/test split management
Paraphrase leakage	Same question, different wording in train	Embedding similarity search	Semantic dedup across splits
Public benchmark contamination	Test set is public (MMLU, HumanEval) and in web scrapes	Hard to detect	Use held-out custom evals
Synthetic data feedback	Generated training data includes benchmark patterns	Manual audit of synthetic data	Exclude benchmark topics from generation
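The first row's detection methods (hash matching and n-gram overlap) fit in a few lines of standard-library Python. This is a sketch for small datasets; the normalization rules and the 5-gram window are assumptions, and paraphrase leakage still needs embedding-based search.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def exact_leaks(train_texts, test_texts):
    """Test examples whose normalized text also appears in the training set."""
    def digest(t):
        return hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
    train_hashes = {digest(t) for t in train_texts}
    return [t for t in test_texts if digest(t) in train_hashes]

def ngram_overlap(a, b, n=5):
    """Fraction of word n-grams in `a` that also occur in `b`."""
    def grams(text):
        words = normalize(text).split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    grams_a = grams(a)
    return len(grams_a & grams(b)) / len(grams_a) if grams_a else 0.0

leaks = exact_leaks(
    ["What is the capital of France?"],
    ["what is the capital of  france", "Name a prime number."],
)
```

Run the exact check on every train/test pair of splits, and reserve the n-gram check for flagging near-duplicates that deserve a manual look.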

Public Benchmarks Are Likely Contaminated

MMLU, HumanEval, GSM8K, and other popular benchmarks exist in web scrapes that went into pre-training data. A fine-tuned model that scores well on these may be memorizing, not reasoning. For production decisions, always maintain a private golden test set that has never been published. Public benchmarks are useful for comparing to literature but not for shipping decisions.

Regression Testing — Did Fine-Tuning Break Anything? Core

Fine-tuning can break capabilities the base model had. Regression testing detects this by comparing your fine-tuned model against the base model on a held-out general capability set.

📊

General Capability Tests

MMLU subset: 200–500 questions across domains
Instruction following: Can it still follow basic prompts?
Conversation quality: Multi-turn coherence
Reasoning: Simple chain-of-thought problems

⚠️

Regression Thresholds

<2% drop: Acceptable — normal fine-tuning cost
2–5% drop: Warning — may need to adjust
>5% drop: Problem — likely catastrophic forgetting
Always compare to base model as reference

🛡️

Safety Regression Tests

Refusal rate: Should be similar to base model
Jailbreak resistance: Common attack prompts
Harmful content: Violence, bias, PII leakage
Run dedicated red-team eval before shipping
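The regression thresholds above translate directly into a gate function. The sketch assumes scores are accuracies in [0, 1] and that "drop" means absolute percentage points relative to the base model; if you prefer relative drops, change the computation accordingly.

```python
def regression_status(base_score: float, tuned_score: float) -> str:
    """Classify the drop from base to fine-tuned model, in absolute percentage points."""
    drop = (base_score - tuned_score) * 100  # scores are fractions in [0, 1]
    if drop < 2:
        return "acceptable"
    if drop <= 5:
        return "warning"
    return "problem"

# Example: base model scores 65% on a general-capability subset
statuses = [regression_status(0.65, tuned) for tuned in (0.64, 0.62, 0.58)]
```

An improvement over the base model yields a negative drop and is classified as acceptable.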

The Eval Pipeline — Automation for Every Checkpoint Reference

1️⃣ Checkpoint Saved: every 500–1000 steps

2️⃣ Quick Eval: task metric on 200 samples

3️⃣ Log to Dashboard: W&B, MLflow

4️⃣ Full Eval (best ckpt): golden set + regression

5️⃣ Ship Decision: passes all gates?

🔧

Eval automation script structure

# Structure only: load_model, evaluate_*, log_metrics, get_step, all_gates_pass,
# quick_eval_set, and EvalResults are placeholders for your own implementations.
def run_eval_pipeline(checkpoint_path: str, base_model, best_score: float,
                      is_final_checkpoint: bool = False) -> EvalResults:
    # 1. Load model
    model = load_model(checkpoint_path)

    # 2. Quick task eval (always run)
    task_score = evaluate_on_task(model, quick_eval_set)

    # 3. Log to dashboard
    log_metrics({"task_score": task_score, "step": get_step(checkpoint_path)})

    # 4. Full eval (only on best checkpoints or before ship)
    golden_score = regression_score = safety_score = None
    ship_ready = False
    if task_score > best_score or is_final_checkpoint:
        golden_score = evaluate_on_golden_set(model)
        regression_score = evaluate_regression(model, base_model)
        safety_score = evaluate_safety(model)
        ship_ready = all_gates_pass(golden_score, regression_score, safety_score)

    return EvalResults(
        task=task_score,
        golden=golden_score,
        regression=regression_score,
        safety=safety_score,
        ship_ready=ship_ready,
    )

∑ Chapter 06 — Key Takeaways

Evaluation hierarchy: task-specific → format compliance → safety/regression → general capability
Build a golden test set of 200–500+ expert-verified examples before training; never train on it
For open-ended tasks, use LLM-as-judge (GPT-4); calibrate with 50–100 human labels first
Benchmark contamination causes fake good scores — rely on private eval sets for shipping decisions
Regression testing: <2% drop acceptable, 2–5% warning, >5% catastrophic forgetting
Automate eval pipeline: quick eval every checkpoint, full eval on best checkpoints + before ship

Chapter 07 · Task Generalization

Instruction Fine-Tuning — Teaching a Model to Follow Instructions

Instruction tuning transforms a raw language model into an assistant. It's what makes the difference between a model that continues text and one that follows instructions. The key insight: task diversity matters more than task volume. A model trained on 1,000 diverse instructions often outperforms one trained on 100,000 similar instructions.

What Is Instruction Tuning — From Completion to Assistant Foundation

Base language models predict the next token. They're excellent at completing text but terrible at following instructions. Instruction tuning teaches the model to interpret an instruction and produce the requested output rather than just continuing the text pattern.

Base Model (Pre-Instruction Tuning)

Input: "Translate to French: Hello, how are you?"

Output: "Translate to Spanish: Hola, cómo estás?"

→ Continues the pattern, doesn't follow the instruction

Instruction-Tuned Model

Input: "Translate to French: Hello, how are you?"

Output: "Bonjour, comment allez-vous?"

→ Understands and executes the instruction

The Instruction Tuning Revolution

Before instruction tuning (pre-InstructGPT era), LLMs required careful prompt engineering to produce useful outputs. Instruction tuning (SFT on instruction-response pairs) made models that try to help by default. This is the foundation of ChatGPT, Claude, and every modern assistant. If you're fine-tuning a base model, instruction tuning is almost always the first step.

Task Diversity — The LIMA Insight Core

The LIMA paper showed that a 65B model fine-tuned on just 1,000 carefully curated examples can match models trained on 50K+ examples. The secret: diversity and quality over volume. Each example should teach something different.

✅

High-Diversity Dataset

Multiple task types (QA, summarization, code, creative)
Varied instruction styles (direct, conversational, formal)
Different response lengths (one-liner to multi-paragraph)
Diverse domains (science, arts, business, technical)
Edge cases and difficult examples

❌

Low-Diversity Dataset

Same task repeated with variations
All examples same format/length
Single domain focus
Template-generated similar examples
Missing difficulty spectrum

📊

Diversity Checklist

☐ At least 10 distinct task categories
☐ Both short and long responses represented
☐ Single-turn and multi-turn conversations
☐ Factual and creative tasks
☐ Easy, medium, and hard difficulty

The 1K vs 50K Trade-off

1,000 high-quality, diverse examples often beat 50,000 similar examples. Why? Large models already have the capability; instruction tuning is about eliciting it, not teaching new knowledge. Diversity shows the model the range of behaviors expected; quality shows it the standard to meet. Volume alone just overfits to a narrow distribution.

Chat Templates — Format Matters In-depth

Modern models use specific chat templates with special tokens to mark turns and roles. Using the wrong template causes the model to ignore your instructions or produce garbled output. Always use the exact template the base model was trained with.

Model Family	Template Style	Example
Llama 3 / 3.1	Special tokens: `<|start_header_id|>` etc.	`<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>`
Mistral / Mixtral	`[INST]` and `[/INST]` tokens	`[INST] Hello, how are you? [/INST] I'm doing well!`
ChatML (OpenAI style)	`<|im_start|>` and `<|im_end|>`	`<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n`
Vicuna / Alpaca	Plain text markers	`USER: Hello\nASSISTANT:`

🔧

Applying chat template (Hugging Face)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain photosynthesis."},
    {"role": "assistant", "content": "Photosynthesis is..."},
]

# Apply model's chat template automatically
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

Template Mismatch = Silent Failure

Using the wrong chat template is a common cause of "my fine-tuned model is worse than base." The model has been trained to recognize specific token patterns — if your data uses different patterns, the model treats it as noise. Always verify your training data uses exactly the same template as the tokenizer's apply_chat_template() method.
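To make concrete what a template does, here is a hand-written renderer for the ChatML style from the table. It is only an illustration of the token pattern; in real training code, rely on the tokenizer's own template rather than a hand-rolled function like this hypothetical `render_chatml`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in ChatML: <|im_start|>role, newline, content, <|im_end|>, newline."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Leave the assistant turn open so the model generates the reply.
        text += "<|im_start|>assistant\n"
    return text

prompt = render_chatml([{"role": "user", "content": "Hello"}], add_generation_prompt=True)

# A mismatch check is then a plain string comparison against your stored training text.
stored_training_text = "USER: Hello\nASSISTANT:"
template_matches = stored_training_text == prompt
```

Here the stored text uses Vicuna-style markers, so the comparison fails; that kind of silent divergence is what the verification step is meant to surface.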

Instruction Dataset Sources — Starting Points Core

Dataset	Size	Quality	Best For	Notes
LIMA	1,000	Excellent (curated)	Proving quality > quantity	Research dataset; not for commercial use
OpenAssistant	160K	Variable (crowdsourced)	General assistant, multi-turn	Apache 2.0 license; filter for quality
Alpaca (Stanford)	52K	Moderate (synthetic)	Quick bootstrapping	GPT-3.5 generated; format reference only
Dolly (Databricks)	15K	Good (employee-written)	Commercial use base	CC-BY-SA license; human-written
UltraChat	1.5M	Variable (synthetic)	Large-scale instruction tuning	Filter heavily; mix with human data
ShareGPT (user convos)	Variable	Variable	Real user distribution	Legal gray area; quality varies wildly

The Data Assembly Strategy

Don't use any single dataset as-is. Combine: (1) High-quality curated examples (200–500, manual or filtered) + (2) Task-diverse public dataset (5K–20K, filtered) + (3) Your domain-specific examples (as many as you have). Filter for quality, deduplicate, and shuffle. The blend matters more than any single source.
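The combine, deduplicate, and shuffle steps can be sketched as follows. The example assumes each record is a dict with hypothetical "prompt" and "response" keys and deduplicates on a normalized prompt only; semantic deduplication and quality filtering would be separate steps.

```python
import random

def assemble_dataset(curated, public, domain, seed=0):
    """Blend three sources, drop duplicate prompts, and shuffle reproducibly.

    Sources are visited in priority order (curated, domain, public), so the
    highest-quality copy of a duplicated prompt is the one that survives.
    """
    seen, blended = set(), []
    for source in (curated, domain, public):
        for example in source:
            key = " ".join(example["prompt"].lower().split())
            if key not in seen:
                seen.add(key)
                blended.append(example)
    random.Random(seed).shuffle(blended)
    return blended

curated = [{"prompt": "Summarize this memo.", "response": "curated answer"}]
public = [{"prompt": "summarize  this memo.", "response": "public answer"},
          {"prompt": "Write a haiku.", "response": "public haiku"}]
domain = [{"prompt": "Classify this ticket.", "response": "billing"}]
dataset = assemble_dataset(curated, public, domain)
```

A fixed seed keeps the blend reproducible, which makes it possible to attribute a quality change to the data rather than to shuffle order.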

Multi-Turn Conversations — Teaching Coherent Dialogue In-depth

Single-turn (one instruction, one response) training produces models that answer questions but struggle with context over multiple turns. Include multi-turn conversations in your training data to teach the model to maintain coherence.

💬

Multi-Turn Training Example

{
  "messages": [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "How do I install it?"},
    {"role": "assistant", "content": "To install Python..."},
    {"role": "user", "content": "What about on Mac?"},
    {"role": "assistant", "content": "On macOS, you can..."}
  ]
}

📊

Multi-Turn Mix Guidelines

50–70%: Single-turn (clear instruction → response)
20–30%: 2–3 turn conversations
10–20%: 4+ turn conversations
Include clarification, follow-up, topic shift patterns
Vary conversation styles (casual, professional, technical)
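To check a dataset against the mix guidelines above, count user turns per conversation and bucket them. A minimal sketch, assuming the messages format used throughout this guide; the bucket names are illustrative.

```python
from collections import Counter

def turn_mix(conversations):
    """Share of conversations by number of user turns: 1, 2-3, or 4+."""
    def bucket(conv):
        user_turns = sum(1 for m in conv["messages"] if m["role"] == "user")
        if user_turns <= 1:
            return "single"
        return "2-3" if user_turns <= 3 else "4+"
    counts = Counter(bucket(c) for c in conversations)
    total = len(conversations)
    return {name: counts[name] / total for name in ("single", "2-3", "4+")}

def make_conversation(turns):
    """Build a placeholder conversation with the given number of user turns."""
    messages = []
    for _ in range(turns):
        messages.append({"role": "user", "content": "question"})
        messages.append({"role": "assistant", "content": "answer"})
    return {"messages": messages}

sample = [make_conversation(1)] * 6 + [make_conversation(2)] * 3 + [make_conversation(5)]
mix = turn_mix(sample)
```

The placeholder sample comes out at 60% single-turn, 30% two-to-three turn, and 10% four-plus, which sits inside the guideline ranges.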

System Prompts in Training — When and How Core

System prompts set the context for a conversation. If you want your model to respect system prompts in production, you must include them in training. But don't over-rely on system prompts — behavior should be robust without them too.

Include System Prompts When

Production use: You'll use system prompts to control behavior

Persona switching: Model needs to adopt different roles

Constraint instruction: Format/safety rules in system prompt

Mix: ~50% with system prompt, ~50% without

System Prompt Pitfalls

Over-reliance: Model only works with specific system prompt

Inconsistency: Different system prompts in train vs production

Brittleness: Small changes in system prompt break behavior

Solution: Test behavior both with and without system prompts

∑ Chapter 07 — Key Takeaways

Instruction tuning transforms base models into assistants — it elicits existing capability
LIMA insight: 1,000 diverse, high-quality examples can match 50K similar examples
Diversity checklist: multiple task types, varied lengths, different domains, difficulty spectrum
Use the exact chat template your base model expects — template mismatch = silent failure
Data mix: curated quality (200–500) + filtered public (5K–20K) + domain-specific
Include multi-turn conversations (20–30%) and system prompts (~50%) in training data

Chapter 08 · Specialization

Domain Adaptation — Medical, Legal, Code, and Finance

Domain adaptation specializes a general model for specific fields. The approach differs by domain: some need continued pre-training on raw text; others just need task-specific SFT. The key is understanding what your domain requires and avoiding the trap of "more training = better." Domain expertise must be balanced against general capability.

The Two-Stage Approach — Pre-Training vs Fine-Tuning Foundation

Domain adaptation typically happens in two stages, though not all domains need both. The decision depends on how specialized the language and concepts are.

Two-stage domain adaptation pipeline

Stage	What It Does	Data Required	When Needed
Continued Pre-Training	Teaches domain vocabulary, patterns, writing style	1M–1B tokens of raw domain text	Highly specialized domains (medical, legal, scientific)
Domain SFT	Teaches how to apply knowledge to specific tasks	1K–50K instruction-response pairs	Almost always — this is where task capability comes from

When to Skip Continued Pre-Training

Modern base models already have substantial domain knowledge from their training corpus. Try domain SFT first — it's faster and cheaper. Add continued pre-training only if: (1) the model misuses domain terminology, (2) domain text is highly specialized and underrepresented in base model, (3) you have access to 100M+ tokens of domain text. For most use cases, SFT alone is sufficient.

Domain-Specific Patterns — Medical, Legal, Code, Finance Core

🏥

Medical / Clinical

Challenges: Specialized terminology (ICD codes, drug names), high-stakes accuracy, regulatory requirements (HIPAA), need for citation.

Pre-training: Often needed (PubMed, clinical notes)
SFT: Q&A with citations, differential diagnosis, report generation
Key risk: Hallucinated medical advice is dangerous
Eval: MedQA, expert review, citation verification

⚖️

Legal

Challenges: Precise language matters, jurisdiction-specific, precedent citation, long documents.

Pre-training: Often needed (case law, statutes, contracts)
SFT: Contract review, clause extraction, legal Q&A
Key risk: Unhelpful if answer is "consult a lawyer"
Eval: Clause identification accuracy, citation correctness

💻

Code / Software

Challenges: Syntax correctness is binary, multiple languages, execution context matters.

Pre-training: Sometimes (proprietary codebases)
SFT: Code completion, debugging, code review, translation
Key risk: Syntactically correct but semantically wrong
Eval: Pass@k on test suites, human review

💰

Finance

Challenges: Numerical precision, temporal awareness (data freshness), regulatory compliance.

Pre-training: Rarely needed (finance language is less specialized)
SFT: Report analysis, sentiment analysis, numerical reasoning
Key risk: Hallucinated numbers, outdated information
Eval: Numerical accuracy, fact verification against sources

Continued Pre-Training — When and How In-depth

Continued pre-training (CPT) extends the base model's pre-training on domain-specific text. It's computationally expensive but can significantly improve domain understanding for highly specialized fields.

Parameter	Typical Value	Notes
Data volume	100M–10B tokens	More is better, but quality matters
Learning rate	1e-5 to 5e-5	Lower than initial pre-training
Training objective	Causal LM (next token prediction)	Same as base pre-training
Mix with general data	10–30% general data	Prevents catastrophic forgetting
Compute cost	$1K–$100K+ (depending on volume)	Full fine-tuning is the usual choice; LoRA tends to absorb new domain knowledge less effectively

🔧

Domain CPT data format

# Simple completion format: just raw text
{
  "text": "CLINICAL NOTE\nPatient presents with acute onset chest pain, radiating to left arm. ECG shows ST elevation in leads V1-V4, consistent with anterior STEMI. Troponin I elevated at 2.5 ng/mL. Cardiology consulted for emergent PCI..."
}

# Training processes text as next-token prediction
# Pack multiple documents into context window for efficiency
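The packing step mentioned in the format notes can be sketched as follows. This is a minimal illustration, assuming documents are already tokenized into lists of integer IDs; the function name `pack_documents`, the `eos_id` value, and the toy token IDs are hypothetical and not tied to any particular tokenizer or training library.

```python
from typing import Iterable, List


def pack_documents(
    token_docs: Iterable[List[int]], block_size: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-size blocks."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for doc in token_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    # The trailing partial block is dropped here; padding it is another option
    return blocks


# Toy token IDs, for illustration only
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_documents(docs, block_size=4, eos_id=0)
print(blocks)
```

Packing avoids spending compute on padding tokens. Whether attention should be masked across document boundaries inside a block is a separate design decision that this sketch does not address.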

CPT Forgetting Risk Is High

Continued pre-training on only domain text causes severe catastrophic forgetting. The model becomes excellent at domain language but loses general instruction-following ability. Always mix 10–30% general data (C4, RedPajama, or high-quality web text) with your domain corpus. After CPT, always run a full regression test on general capabilities before proceeding to SFT.
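The mixing rule above can be sketched in a few lines. This is a simplified illustration that mixes by document count rather than token count (token-level mixing is more precise); `mix_corpora` and the placeholder document strings are hypothetical.

```python
import random
from typing import List, Sequence


def mix_corpora(
    domain_docs: Sequence[str],
    general_docs: Sequence[str],
    general_fraction: float = 0.2,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled corpus where about `general_fraction` of documents are general."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general
    n_general = round(len(domain_docs) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_docs))  # cannot sample more than we have
    mixed = list(domain_docs) + rng.sample(list(general_docs), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder documents, for illustration only
domain = [f"domain-{i}" for i in range(80)]
general = [f"general-{i}" for i in range(100)]
mixed = mix_corpora(domain, general, general_fraction=0.2)
print(len(mixed), sum(doc.startswith("general-") for doc in mixed))
```

Fixing the seed keeps the mix reproducible, which matters when you later compare runs trained on different mixing ratios.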

Domain SFT — Teaching Domain-Specific Tasks Core

Domain SFT teaches the model to apply domain knowledge to specific tasks. This is where most value is created and where most teams should focus. SFT data should be task-specific rather than general domain text.

📝

Medical SFT Example

{
  "messages": [
    {
      "role": "user",
      "content": "Patient: 55M, chest pain, ST elevation. What is the likely diagnosis?"
    },
    {
      "role": "assistant",
      "content": "Based on the presentation of chest pain with ST elevation in a 55-year-old male, the likely diagnosis is **acute myocardial infarction (STEMI)**..."
    }
  ]
}

⚖️

Legal SFT Example

{
  "messages": [
    {
      "role": "user",
      "content": "Identify the indemnification clause in this contract: [contract text]"
    },
    {
      "role": "assistant",
      "content": "The indemnification clause is in Section 8.2: 'Party A shall indemnify...' This clause is **mutual**, covering both parties..."
    }
  ]
}

💻

Code SFT Example

{
  "messages": [
    {
      "role": "user",
      "content": "Fix the bug in this function:\n```python\ndef calc(x):\n    return x / 0\n```"
    },
    {
      "role": "assistant",
      "content": "The bug is division by zero. Here's the fix:\n```python\ndef calc(x, divisor=1):\n    if divisor == 0:\n        raise ValueError()\n    return x / divisor\n```"
    }
  ]
}

Domain SFT Best Practices

Quality over quantity: 500–5,000 expert-curated examples beat 50K noisy ones.
Task diversity: Cover every task type you'll use in production.
Realistic inputs: Use real examples from your domain, not synthetic simplifications.
Include edge cases: Hard examples, errors, ambiguous cases.
Expert review: Every response should be verified by a domain expert.
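Before expert review, a cheap structural check catches malformed records in the chat format used by the examples above. The sketch below is one possible validator; `validate_sft_record` and the set of allowed roles are assumptions for illustration, not part of any library.

```python
from typing import List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sft_record(record: dict) -> List[str]:
    """Return a list of problems found in one chat-format record (empty means OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems: List[str] = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    # A training example needs a target, so the final turn should be the assistant's
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


good = {"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]}
bad = {"messages": [{"role": "user", "content": "   "}]}
print(validate_sft_record(good))
print(validate_sft_record(bad))
```

A check like this only verifies shape. It says nothing about whether the assistant's answer is correct, which is still the domain expert's job.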

Avoiding Catastrophic Forgetting — The Balancing Act In-depth

Domain adaptation is a trade-off: the more you specialize, the more you risk losing general capabilities. Managing this trade-off requires deliberate strategies.

Strategy	How It Works	When to Use
Data mixing	Include 10–30% general instruction data in domain SFT	Always — minimal cost, significant benefit
Low learning rate	Use 1e-5 to 5e-5 instead of standard SFT rates	Always for domain adaptation
Fewer epochs	Train for 1–2 epochs instead of 3+	Standard practice
LoRA instead of full FT	Freezes base model; learns adapter weights only	Default choice — preserves base capability
Elastic Weight Consolidation (EWC)	Penalizes changes to important weights	Experimental — not widely used
Separate adapter per domain	Train different LoRA adapters for different domains	Multi-domain serving scenario

✅

Recommended Forgetting Prevention

Use LoRA: Preserves base model weights entirely
Mix data: 70–80% domain + 20–30% general instruction
Low LR: 1e-5 to 5e-5 for domain SFT (start at the low end)
Few epochs: 1–2 epochs, early stopping on validation
Monitor regression: Check MMLU/general benchmarks

⚠️

Warning Signs of Forgetting

General conversation quality drops noticeably
Model struggles with simple reasoning tasks
Multi-turn coherence degrades
Instruction following becomes brittle
MMLU score drops >5% from base model
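The benchmark warning sign can be automated as a simple gate. The sketch below assumes the ">5%" threshold means a relative drop against the base model's score; the function name `find_regressions` and all scores are made-up illustrative values.

```python
from typing import Dict


def find_regressions(
    base_scores: Dict[str, float],
    tuned_scores: Dict[str, float],
    max_relative_drop: float = 0.05,
) -> Dict[str, float]:
    """Return benchmarks whose score fell by more than `max_relative_drop` (relative)."""
    regressions: Dict[str, float] = {}
    for name, base in base_scores.items():
        if name not in tuned_scores or base <= 0:
            continue  # cannot compare without both scores
        drop = (base - tuned_scores[name]) / base
        if drop > max_relative_drop:
            regressions[name] = drop
    return regressions


# Illustrative scores only, not measurements
base = {"mmlu": 0.68, "gsm8k": 0.50}
tuned = {"mmlu": 0.60, "gsm8k": 0.49}
regressions = find_regressions(base, tuned)
print(regressions)
```

If you prefer an absolute threshold (for example, percentage points), replace the division by `base` with a plain difference and adjust the threshold accordingly.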

Domain-Specific Evaluation — Measuring Success Core

Domain	Standard Benchmarks	Custom Eval Needed
Medical	MedQA, PubMedQA, MedMCQA	Expert review of clinical recommendations; citation verification
Legal	LegalBench, CaseHOLD	Contract analysis accuracy; jurisdiction-specific tests
Code	HumanEval, MBPP, SWE-Bench	Proprietary codebase tests; style compliance
Finance	FinBench, TAT-QA	Numerical accuracy checks; regulatory compliance

Domain Eval Best Practice

Standard benchmarks give you a baseline, but custom evaluation on your actual production tasks is essential. Create a golden test set of 200–500 examples representing real use cases in your domain. Include expert review for high-stakes domains (medical, legal, finance). Compare to both base model AND to human expert or GPT-4 baseline to understand where your fine-tuned model wins and loses.
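A golden-set comparison can start as a very small harness. The sketch below scores any `predict` callable by exact match, overall and per category; the record fields (`input`, `expected`, `category`), the toy examples, and the stub models are all hypothetical. Real tasks usually need a task-specific scorer instead of exact match.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def evaluate_golden_set(
    examples: List[dict], predict: Callable[[str], str]
) -> Dict[str, float]:
    """Exact-match accuracy, overall and per category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        hit = predict(ex["input"]).strip() == ex["expected"].strip()
        for key in ("overall", ex.get("category", "uncategorized")):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


# Toy golden set and stub "models", for illustration only
golden = [
    {"input": "2+2", "expected": "4", "category": "math"},
    {"input": "capital of France", "expected": "Paris", "category": "facts"},
]
answers = {"2+2": "4", "capital of France": "Paris"}


def base_model(prompt: str) -> str:
    return "4" if prompt == "2+2" else "unknown"


def tuned_model(prompt: str) -> str:
    return answers.get(prompt, "unknown")


base_report = evaluate_golden_set(golden, base_model)
tuned_report = evaluate_golden_set(golden, tuned_model)
print(base_report)
print(tuned_report)
```

Reporting per category is what shows where the fine-tuned model wins and where it loses, rather than a single averaged number.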

∑ Chapter 08 — Key Takeaways

Two-stage approach: Continued Pre-Training (domain language) → Domain SFT (task capability)
Try SFT first — continued pre-training is expensive and often not needed
Domain differences: Medical/Legal often need CPT; Code/Finance usually just SFT
Continued pre-training requires 100M–10B tokens + 10–30% general data mixing
Prevent catastrophic forgetting: LoRA, low LR (1e-5), 1–2 epochs, data mixing, regression tests
Domain eval: standard benchmarks + custom production task golden set + expert review

Chapter 09 · Deployment

Serving Fine-Tuned Models — Merge, Quantize, Deploy

A fine-tuned model in a notebook is worthless. The real work begins when you deploy it for production inference. This chapter covers the full path: merging adapters into base weights, quantizing for efficiency, and deploying with vLLM, Ollama, or cloud endpoints. The goal: maximize throughput, minimize latency, and keep costs under control.

Deployment Options Overview — Where Will Your Model Run? Foundation

Before diving into technical details, decide where your fine-tuned model will run. The choice depends on scale, latency requirements, cost constraints, and data privacy needs.

Deployment Option	Best For	Latency	Cost at Scale	Setup Complexity
Cloud API (OpenAI, Anthropic)	When the provider offers fine-tuning for the model you need (availability varies by provider and model)	~200–500ms	High (roughly $15–60/1M tokens; check current pricing)	Simple
Self-hosted vLLM	High throughput, production scale, custom models	50–200ms	Medium (GPU cost)	Moderate
Ollama (local)	Development, privacy, edge deployment	100–500ms	Low (own hardware)	Simple
Serverless (Modal, Replicate)	Variable load, pay-per-use	Cold start: 5–30s	Low for variable load	Simple
Dedicated cloud (SageMaker, Vertex)	Enterprise, compliance, managed infrastructure	100–300ms	High (always-on)	Complex

The Deployment Decision Tree

Privacy/compliance required? → Self-host or on-prem.
Variable/bursty traffic? → Serverless.
High constant throughput? → Dedicated vLLM cluster.
Just need it to work? → Cloud API with fine-tuning (if available).
Development/testing? → Ollama locally.

Adapter Merging — From LoRA to Standalone Model Core

LoRA adapters are small (~10–100MB) but require the base model at inference. For production, you typically merge the adapter into the base weights to create a standalone model. This eliminates adapter overhead and simplifies deployment.

Separate Adapter Serving

How: Load base model + adapter at runtime

Pros: Swap adapters dynamically; multiple adapters per base

Cons: Slight latency overhead; more complex serving

Use when: Multi-tenant with different adapters per customer

Merged Model Serving

How: Merge adapter into base → single model file

Pros: Zero overhead; standard model format; simple deployment

Cons: Full model size; can't swap adapters at runtime

Use when: Single-purpose deployment (most cases)
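To see why merged serving has zero adapter overhead, it helps to look at the arithmetic. In the standard LoRA formulation the merged weight is W + (alpha / r) · B · A, so the adapter folds into an ordinary weight matrix. The sketch below demonstrates this with tiny pure-Python matrices; the values are toy numbers, and some LoRA variants use a different scaling factor.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, r: int) -> Matrix:
    """Fold a LoRA update into W. Shapes: W is d_out x d_in, B is d_out x r, A is r x d_in."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy values: rank-1 adapter on a 2x2 weight
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [0.0]]
x = [[3.0], [4.0]]

merged = merge_lora(W, A, B, alpha=2, r=1)

# Unmerged path: base output plus the scaled adapter output
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)
print(matmul(merged, x), unmerged_out)
```

Both paths give the same output; the merged path simply does it in one matrix multiply, which is why it deploys like any standard model.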

🔧

Merging LoRA adapters (PEFT)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model unquantized, in 16-bit precision
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load adapter on top of base
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge adapter weights into base model
merged_model = model.merge_and_unload()

# Save as standard HF model (ready for vLLM, Ollama, etc.)
merged_model.save_pretrained("./my-merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./my-merged-model")

QLoRA Merging Requires Dequantization

If you trained with QLoRA (4-bit base model), you cannot merge directly into the quantized weights. You must: (1) load the base model unquantized in 16-bit precision (fp16/bf16), (2) load the LoRA adapter, (3) merge, (4) re-quantize if needed. This means you need enough memory to hold the 16-bit model temporarily (~14GB for 7B, ~140GB for 70B); the merge can run in CPU RAM if GPU memory is short.

Quantization — Shrinking Models for Deployment In-depth

Quantization reduces model precision (fp16 → int8 → int4) to shrink memory footprint and increase inference speed. A 7B model in fp16 is ~14GB; in 4-bit it's ~4GB. The trade-off: some quality loss, though modern quantization methods minimize this.
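The size figures follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below reproduces them; `weight_memory_gb` is a hypothetical helper, and it counts weights only. Real 4-bit files are somewhat larger because they also store scales and other metadata, which is why a 7B model lands near 4GB rather than 3.5GB.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
print(f"70B at 16-bit: {weight_memory_gb(70, 16):.1f} GB")
```

Serving needs more than this number: the KV cache and activations also consume GPU memory, so leave headroom when choosing hardware.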

Format	Precision	Size (7B)	Quality Loss	Hardware Support	Best For
FP16 / BF16	16-bit	~14GB	None (baseline)	All modern GPUs	Production where quality is critical
INT8	8-bit	~7GB	Minimal (<1%)	Most GPUs, some CPUs	Good balance of size/quality
GPTQ	4-bit	~4GB	Small (1–3%)	NVIDIA GPUs	GPU inference, vLLM
AWQ	4-bit	~4GB	Minimal (best 4-bit)	NVIDIA GPUs	Production 4-bit, vLLM preferred
GGUF	2–8 bit	~3–7GB	Depends on quant level	CPU, Apple Silicon, GPU	Ollama, llama.cpp, local deployment
EXL2	2–8 bit (mixed)	~3–7GB	Very good (adaptive)	NVIDIA GPUs	ExLlamaV2, high-quality 4-bit

📦

GGUF (llama.cpp / Ollama)

Universal format for CPU and cross-platform inference. Supports Q4_K_M, Q5_K_M, Q8_0, etc.

Create: llama.cpp convert_hf_to_gguf.py
Quantize: llama.cpp llama-quantize
Best quant: Q4_K_M (balanced) or Q5_K_M (higher quality)

⚡

AWQ (vLLM preferred)

Activation-aware quantization. Best quality for 4-bit GPU inference.

Create: autoawq library
Serve: vLLM with --quantization awq
Quality: Near-lossless for most tasks

🎯

GPTQ (widely supported)

GPU-only, well-established 4-bit format. Slightly lower quality than AWQ.

Create: auto-gptq library
Serve: vLLM, text-generation-inference
Note: Being superseded by AWQ

🔧

Converting to GGUF for Ollama

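# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build the tools (recent versions build with CMake; older checkouts used `make`)
cmake -B build
cmake --build build --config Release

# Install the Python dependencies for the conversion script
pip install -r requirements.txt

# Convert HF model to GGUF (fp16 first)
python convert_hf_to_gguf.py ../my-merged-model --outtype f16 --outfile model-f16.gguf

# Quantize to 4-bit (Q4_K_M is a good balance)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Or Q5_K_M for higher quality
./build/bin/llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M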

Quantization Recommendation by Use Case

Production GPU (vLLM): AWQ 4-bit, the best quality/speed tradeoff.
Local/Edge (Ollama): GGUF Q4_K_M or Q5_K_M.
Quality-critical: FP16 or INT8; avoid 4-bit quantization.
Memory-constrained: Q3_K_M or Q2, with significant quality loss; a last resort.
Rule of thumb: Test your specific task at each quantization level; losses vary by task.

vLLM Deployment — Production-Grade Serving Core

vLLM is the gold standard for production LLM serving. It implements PagedAttention for efficient memory management and continuous batching for high throughput, and it supports most popular Hugging Face model architectures along with several quantization formats. If you're serving a fine-tuned model at scale, vLLM is likely your best option.

✅

vLLM Advantages

PagedAttention: 2–4× higher throughput than naive serving
Continuous batching: Dynamic batch size for variable load
Quantization support: AWQ, GPTQ, INT8 out of the box
OpenAI-compatible API: Drop-in replacement
LoRA serving: Hot-swap adapters at runtime

⚠️

vLLM Considerations

GPU-first: Best supported on NVIDIA GPUs; other backends (such as AMD ROCm and CPU) exist but are less mature
Memory: Loads full model into GPU memory
Cold start: Model loading takes 30–120s
Complexity: More setup than Ollama
Resource: Needs dedicated GPU server

🔧

vLLM serving (OpenAI-compatible API)

# Install vLLM
pip install vllm

# Serve merged model (FP16)
# --served-model-name sets the name clients pass as "model"
python -m vllm.entrypoints.openai.api_server \
    --model ./my-merged-model \
    --served-model-name my-merged-model \
    --port 8000

# Serve with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model ./my-merged-model-awq \
    --quantization awq \
    --port 8000

# Serve with LoRA adapters (hot-swappable)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-modules my-adapter=./my-lora-adapter \
    --port 8000

🐍

Calling vLLM from Python (OpenAI SDK)

from openai import OpenAI

# Point to local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    # Must match the server's served model name (defaults to the --model value)
    model="my-merged-model",
    messages=[
        {"role": "user", "content": "Explain photosynthesis."}
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

vLLM Configuration	Default	Recommendation	Impact
`--tensor-parallel-size`	1	Match your GPU count	Enables multi-GPU serving
`--gpu-memory-utilization`	0.9	0.85–0.95	Higher = more KV cache, higher throughput
`--max-model-len`	Model default	Set to your actual max	Lower = less memory, faster startup
`--quantization`	None	awq for 4-bit	~3–4× smaller weights than FP16, slight quality loss
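To get a feel for how `--gpu-memory-utilization` and `--max-model-len` interact, a rough KV cache budget helps. The sketch below is back-of-the-envelope arithmetic, not how vLLM computes its own budget: it ignores activations and runtime overhead. The layer count, KV head count, and head dimension are assumed values matching the published Llama 3.1 8B configuration; the 24GB GPU and 16GB weight size are hypothetical inputs.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(
    gpu_mem_gb: float, utilization: float, weights_gb: float, per_token_bytes: int
) -> int:
    """Rough count of tokens that fit in the memory left after loading weights."""
    budget_gb = gpu_mem_gb * utilization - weights_gb
    return max(0, int(budget_gb * 1e9 // per_token_bytes))


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128, fp16 cache
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(
    gpu_mem_gb=24, utilization=0.9, weights_gb=16, per_token_bytes=per_token
)
print(per_token, tokens)
```

The token budget is shared by all concurrent requests, so a lower `--max-model-len` or quantized weights translate directly into more requests served at once.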

Ollama & Local Deployment — Simple Self-Hosting Core

Ollama makes running LLMs locally as easy as Docker. It handles model management, quantization, and provides an API. Perfect for development, privacy-sensitive applications, or edge deployment. Supports macOS, Linux, and Windows.

🔧

Deploying fine-tuned model with Ollama

# 1. Create a Modelfile
cat << 'EOF' > Modelfile
FROM ./model-q4_k_m.gguf

SYSTEM """You are a helpful assistant specialized in medical Q&A.
Always cite sources and recommend consulting a healthcare professional
for medical decisions."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF

# 2. Create the Ollama model
ollama create my-medical-model -f Modelfile

# 3. Run it
ollama run my-medical-model

# 4. Serve via API (skip if the Ollama server is already running)
ollama serve  # Runs on localhost:11434

🐍

Calling Ollama from Python

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-medical-model",
        "prompt": "What are the symptoms of diabetes?",
        "stream": False,
    }
)
print(response.json()["response"])

# Or use the official Ollama Python library
import ollama

response = ollama.chat(
    model="my-medical-model",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}]
)
print(response["message"]["content"])

Ollama Strengths

Simplicity: One command to run any model

Cross-platform: macOS, Linux, Windows, Docker

Apple Silicon: Excellent Metal GPU support

Privacy: Everything runs locally

Model management: Pull, list, delete models easily

Ollama Limitations

Throughput: Not designed for high-concurrency

Limited batching: Concurrency is modest compared with vLLM's continuous batching

Limited scaling: Single-machine only

GGUF only: Must convert from HF format

Production: Better for dev/edge than high-scale

Cloud Deployment Options — Managed & Serverless In-depth

Platform	Type	Pros	Cons	Best For
Modal	Serverless GPU	Pay-per-second, auto-scaling, easy deploy	Cold starts (5–30s)	Variable workloads, prototypes
Replicate	Serverless GPU	Simple API, model hosting	Cost at scale, cold starts	Quick deployment, demos
RunPod	GPU rental	Cheap GPUs, serverless option	Less managed, variable availability	Cost-sensitive production
AWS SageMaker	Managed ML	Enterprise features, integration	Complex, expensive	Enterprise, existing AWS stack
GCP Vertex AI	Managed ML	Good Gemini integration	Complex pricing	GCP-native applications
Together AI	Inference API	Fast, supports custom models	Per-token pricing	Custom model serving

🔧

Modal deployment example

import modal

app = modal.App("my-fine-tuned-model")

# Define the inference function
@app.function(
    gpu="A100",
    image=modal.Image.debian_slim().pip_install("vllm", "torch"),
    secrets=[modal.Secret.from_name("huggingface")],
)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    # The model path must exist inside the container (for example, baked into
    # the image or mounted from a volume). Loading the model on every call is
    # slow; for real traffic, load it once per container instead.
    llm = LLM(model="./my-merged-model")
    params = SamplingParams(temperature=0.7, max_tokens=512)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

# Deploy
# modal deploy my_app.py

Inference Optimization — Latency & Throughput Tuning Core

⚡

Latency Optimization

Quantization: 4-bit reduces memory, increases speed
Speculative decoding: Use draft model for speedup
KV cache: Don't recompute for same prefix
Streaming: Return tokens as generated
Shorter prompts: Less prefill time

📈

Throughput Optimization

Continuous batching: vLLM, TGI default
Dynamic batching: Group requests together
Tensor parallelism: Multi-GPU for large models
Prefix caching: Cache common prompts
Right-size GPU: Match model to memory

💰

Cost Optimization

Spot/preemptible: 60–80% savings
Right-size model: Smaller if quality allows
Caching: Cache frequent responses
Auto-scaling: Scale to zero when idle
Batching: Higher util = lower cost/token

Optimization	Latency Impact	Throughput Impact	Complexity
4-bit quantization (AWQ)	-20–40%	+50–100%	Low
Continuous batching	Slight increase	+200–500%	Free (vLLM)
Tensor parallelism (2+ GPU)	-30–50%	+80–180%	Medium
Speculative decoding	-30–50%	Variable	Medium
KV cache / prefix caching	-50–80% for repeat prefixes	+20–50%	Low (vLLM)
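Throughput gains matter mostly because they lower cost per token on self-hosted GPUs. The arithmetic can be sketched as below; `cost_per_million_tokens` is a hypothetical helper, and the $2/hour price and 1,000 tokens/second figure are illustrative inputs, not benchmarks.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, tokens_per_second: float, utilization: float = 1.0
) -> float:
    """USD per 1M generated tokens for a GPU billed by the hour."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("tokens_per_second must be > 0 and utilization in (0, 1]")
    # Only the utilized share of each hour produces tokens
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Illustrative inputs only
full = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000)
half = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000, utilization=0.5)
print(f"${full:.2f} per 1M tokens at full utilization")
print(f"${half:.2f} per 1M tokens at 50% utilization")
```

Halving utilization doubles cost per token, which is why batching and scale-to-zero appear on the cost list alongside cheaper hardware.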

∑ Chapter 09 — Key Takeaways

Deployment options: vLLM (production), Ollama (local/edge), serverless (variable load), cloud (enterprise)
Merge LoRA adapters into base weights for simplified deployment (unless you need multi-tenant adapter swapping)
Quantization: AWQ for GPU (vLLM), GGUF for CPU/local (Ollama) — Q4_K_M is good balance
vLLM = gold standard: PagedAttention, continuous batching, OpenAI-compatible API
Ollama = simplest local deployment: one command, cross-platform, Apple Silicon support
Optimize: quantize (up to ~2× throughput), continuous batching (roughly 3–6× throughput), tensor parallelism (multi-GPU), caching

Chapter 10 · Production Systems

Production MLOps — Versioning, Monitoring & Iteration

Fine-tuning is not a one-time event — it's a continuous process. Production MLOps for fine-tuned models means tracking experiments, versioning models, monitoring quality, and iterating on feedback. This chapter covers the infrastructure and practices that turn one-off fine-tuning into a sustainable competitive advantage.

Experiment Tracking — Never Lose a Training Run Core

Every fine-tuning run should be tracked: hyperparameters, dataset version, base model, training metrics, and evaluation results. Without tracking, you can't reproduce results, compare runs, or understand what worked.

Tool	Type	Strengths	Best For
Weights & Biases	SaaS / self-hosted	Best UX, automatic logging, reports	Most teams, production use
MLflow	Self-hosted / Databricks	Open source, model registry built-in	On-prem, Databricks users
Comet	SaaS	Good comparison views	Alternative to W&B
Neptune	SaaS	Good for large experiments	Large-scale experimentation
TensorBoard	Self-hosted	Free, basic, no model registry	Simple projects, learning

🔧

W&B integration with Hugging Face Trainer

import wandb
from transformers import TrainingArguments, Trainer

# Initialize W&B
wandb.init(
    project="medical-fine-tuning",
    name="llama3-8b-lora-r16",
    config={
        "base_model": "meta-llama/Llama-3.1-8B-Instruct",
        "lora_rank": 16,
        "lora_alpha": 32,
        "dataset_version": "v2.3",
        "num_examples": 5000,
    }
)

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",  # Automatic logging
    logging_steps=10,
    # ... other args
)

trainer = Trainer(model=model, args=training_args, ...)
trainer.train()

# Log final eval metrics (placeholder values; log your real results)
wandb.log({"eval/accuracy": 0.92, "eval/f1": 0.89})
wandb.finish()

✅

What to Track (Minimum)

Config: All hyperparameters (LR, rank, epochs, etc.)
Data: Dataset name, version, size, hash
Model: Base model name and version
Metrics: Loss curve, eval metrics per checkpoint
Artifacts: Final model weights, adapter files

⭐

What to Track (Best Practice)

All minimum items, plus:
Code version: Git commit hash
Environment: Package versions, GPU type
Sample outputs: Example generations per checkpoint
Regression metrics: MMLU, general benchmarks

Model Registry — Version and Promote Models In-depth

A model registry stores model versions with metadata, enables promotion through stages (experiment → staging → production → archived), and provides lineage tracking. It's the single source of truth for which model is deployed where.

Model promotion lifecycle

Registry Option	Integration	Strengths	Best For
MLflow Model Registry	MLflow, Databricks	Open source, full lifecycle	Self-hosted, Databricks
Hugging Face Hub	HF ecosystem	Easy sharing, versioning, spaces	Open models, collaboration
W&B Model Registry	W&B	Linked to experiments, good UX	W&B users
SageMaker Model Registry	AWS	AWS integration, approval workflows	AWS-native teams
DVC + Git	Git	Version control for models	Simple projects, git-native

Minimum Viable Model Registry

If you don't have a formal registry, at minimum: (1) Store models in versioned cloud storage (S3, GCS) with naming convention (e.g., medical-llama-v2.3-2024-04-15/), (2) Keep a YAML/JSON manifest with model version → storage path → training run ID → eval metrics, (3) Document which version is in production. This beats scattered files and forgotten experiments.
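The manifest described above can be a single JSON file. The sketch below is one possible layout, assuming a file that maps each version to its storage path, training run ID, and eval metrics, plus a pointer to the production version. The function names, bucket path, and run ID are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Read the manifest, or start an empty one if the file does not exist."""
    if path.exists():
        return json.loads(path.read_text())
    return {"production": None, "models": {}}


def register_model(
    path: Path, version: str, storage_path: str, run_id: str, eval_metrics: dict
) -> dict:
    """Record a model version with its lineage and eval results."""
    manifest = load_manifest(path)
    manifest["models"][version] = {
        "storage_path": storage_path,
        "training_run_id": run_id,
        "eval_metrics": eval_metrics,
    }
    path.write_text(json.dumps(manifest, indent=2))
    return manifest


def promote_to_production(path: Path, version: str) -> dict:
    """Point 'production' at a registered version."""
    manifest = load_manifest(path)
    if version not in manifest["models"]:
        raise KeyError(f"unknown model version: {version}")
    manifest["production"] = version
    path.write_text(json.dumps(manifest, indent=2))
    return manifest


# Demo in a temporary directory; the path and run ID are made up
with tempfile.TemporaryDirectory() as tmp:
    manifest_file = Path(tmp) / "models.json"
    register_model(
        manifest_file,
        "v2.3",
        "s3://example-bucket/medical-llama-v2.3-2024-04-15/",
        "run-123",
        {"accuracy": 0.92},
    )
    final_manifest = promote_to_production(manifest_file, "v2.3")

print(final_manifest["production"])
```

Keeping the manifest in version control gives you an audit trail of promotions for free, at the cost of manual discipline that a real registry would enforce.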

A/B Testing — Validating Improvements in Production Core

Offline eval is not production eval. A model that scores well on your test set may perform differently with real users and real queries. A/B testing compares model versions on live traffic to validate that improvements are real.

✅

A/B Testing Best Practices

Random assignment: Users randomly see A or B
Same conditions: Same prompts, same post-processing
Sufficient sample: Run until statistically significant
Multiple metrics: Quality, latency, cost, user behavior
Holdout group: Always keep baseline for comparison

📊

Metrics to Compare

Quality: Task accuracy, user ratings, thumbs up/down
Latency: P50, P95, P99 response time
Cost: Tokens used per request, GPU utilization
Engagement: Completion rate, follow-up questions
Errors: Failure rate, refusal rate, format errors
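"Run until statistically significant" can be made concrete for rate metrics such as thumbs-up rate. The sketch below uses a two-proportion z-test, which is one common choice and assumes reasonably large samples; `two_proportion_z_test` is a hypothetical helper and the counts are illustrative.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: control 800/1000 thumbs up, treatment 840/1000
z, p = two_proportion_z_test(800, 1000, 840, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Decide the sample size and significance threshold before the test starts. Checking the p-value repeatedly and stopping at the first significant result inflates false positives.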

🔧

Simple A/B routing implementation

import hashlib
from datetime import datetime, timezone

def get_model_for_request(user_id: str, experiment_config: dict) -> str:
    """Deterministic assignment based on user_id."""
    # Hash user_id for consistent assignment
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_val % 100

    # Route based on traffic split
    if bucket < experiment_config["control_pct"]:  # e.g., 80 (percent)
        return experiment_config["control_model"]  # v2.2
    else:
        return experiment_config["treatment_model"]  # v2.3

# Log which variant was served for analysis
# `analytics` is a placeholder for your own logging or analytics client
def log_request(user_id: str, model_version: str, response: str, metrics: dict):
    analytics.log({
        "user_id": user_id,
        "model_version": model_version,
        "latency_ms": metrics["latency"],
        "tokens": metrics["tokens"],
        "timestamp": datetime.now(timezone.utc),
    })

Shadow Mode Before A/B

Before exposing users to a new model, run it in shadow mode: send real traffic to both models, but only show the control model's output. Log the new model's outputs for offline comparison. This catches catastrophic failures (crashes, very bad outputs) before users see them. Only promote to A/B once shadow mode looks good.
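The shadow-mode pattern can be sketched as a thin wrapper around the two models. This is a simplified, synchronous illustration: `serve_with_shadow`, the stub models, and the in-memory log are hypothetical, and a production version would typically run the candidate off the request path so it cannot add user-facing latency.

```python
from typing import Callable, List


def serve_with_shadow(
    prompt: str,
    control: Callable[[str], str],
    candidate: Callable[[str], str],
    shadow_log: List[dict],
) -> str:
    """Return the control model's output; record the candidate's output for offline review."""
    response = control(prompt)
    entry = {"prompt": prompt, "control": response, "candidate": None, "error": None}
    try:
        entry["candidate"] = candidate(prompt)
    except Exception as exc:
        # Candidate failures are logged but must never reach the user
        entry["error"] = repr(exc)
    shadow_log.append(entry)
    return response


# Stub models, for illustration only
def control_model(prompt: str) -> str:
    return f"control answer to: {prompt}"


def crashing_candidate(prompt: str) -> str:
    raise RuntimeError("candidate crashed")


log: List[dict] = []
shown = serve_with_shadow("What is LoRA?", control_model, crashing_candidate, log)
print(shown)
print(log[0]["error"])
```

The log of paired outputs is what you review offline: crashes, format failures, and large disagreements with the control are the signals to check before starting an A/B test.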

Monitoring & Drift Detection — Knowing When Quality Drops In-depth

A fine-tuned model can degrade over time — the input distribution shifts, edge cases emerge, or the world changes. Continuous monitoring detects these issues before users notice.

📊

Operational Metrics

Latency: P50, P95, P99 per endpoint
Throughput: Requests/sec, tokens/sec
Error rate: 5xx, timeouts, OOM
GPU utilization: Memory, compute
Queue depth: Request backlog

🎯

Quality Metrics

User feedback: Thumbs up/down, ratings
Format compliance: JSON parse success rate
Refusal rate: How often model refuses
Output length: Avg tokens, anomalies
LLM-as-judge sample: Periodic quality scoring
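Format compliance is the easiest quality metric to automate. The sketch below computes a JSON parse success rate over a batch of raw model outputs, optionally requiring certain top-level keys; `json_parse_rate` and the sample outputs are hypothetical.

```python
import json
from typing import Iterable, Sequence


def json_parse_rate(outputs: Sequence[str], required_keys: Iterable[str] = ()) -> float:
    """Fraction of outputs that parse as a JSON object containing all required keys."""
    if not outputs:
        raise ValueError("no outputs to score")
    required = set(required_keys)
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON
        # Require a JSON object, not a bare number, string, or list
        if isinstance(parsed, dict) and required.issubset(parsed):
            ok += 1
    return ok / len(outputs)


# Sample outputs, for illustration only
samples = ['{"code": "E11.9"}', "Sure! Here is the JSON you asked for", '{"note": "n/a"}']
print(json_parse_rate(samples))
print(json_parse_rate(samples, required_keys=["code"]))
```

Tracking this rate per model version makes format regressions visible immediately after a deployment, before users report them.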

🔍

Drift Detection

Input drift: Embedding distance from training
Output drift: Response distribution changes
Concept drift: Same inputs, different correct answers
Alert: When metrics deviate >2σ from baseline

Metric	Alerting Threshold (Example)	Action When Triggered
P95 latency	>2× baseline for 5 min	Check GPU load, model, batch size
Error rate	>1% for 5 min	Page on-call, check logs
Format compliance	<95% for 1 hour	Review failing examples, consider rollback
User thumbs down rate	>2× baseline for 1 day	Sample and review bad responses
Input embedding drift	>0.2 cosine distance shift	Investigate new input patterns; may need new data
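Two of the checks in the table reduce to short calculations: cosine distance between embedding centroids for input drift, and a deviation test against a baseline for scalar metrics. The sketch below shows both under simplifying assumptions (tiny 2-D embeddings, a short latency history); the helper names and all numbers are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a set of embedding vectors."""
    return [mean(col) for col in zip(*vectors)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def deviates(value: float, history: Sequence[float], n_sigma: float = 2.0) -> bool:
    """True when `value` is more than `n_sigma` standard deviations from the baseline mean."""
    return abs(value - mean(history)) > n_sigma * stdev(history)


# Toy embeddings: live traffic points in a very different direction from training data
train_embeddings = [[1.0, 0.0], [0.9, 0.1]]
live_embeddings = [[0.0, 1.0], [0.1, 0.9]]
drift = cosine_distance(centroid(train_embeddings), centroid(live_embeddings))
input_drift_alert = drift > 0.2

# Toy P95 latency history in milliseconds
latency_history = [410.0, 395.0, 420.0, 405.0, 400.0]
latency_alert = deviates(900.0, latency_history)
print(round(drift, 3), input_drift_alert, latency_alert)
```

Centroid distance is a coarse signal: it misses shifts that change the spread of inputs without moving the mean, so treat an alert as a prompt to inspect samples rather than as proof of drift.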

The Monitoring Stack

Metrics: Prometheus + Grafana or Datadog.
Logging: Structured logs → aggregator (Loki, CloudWatch, Datadog).
Tracing: OpenTelemetry → Jaeger or vendor.
LLM-specific: LangSmith, Langfuse, or custom sampled eval.
Alerting: PagerDuty, Slack, email for critical metrics.

The Fine-Tuning Flywheel — Continuous Improvement Core

The best fine-tuning teams don't ship one model — they build a flywheel. Production feedback generates training data, which improves the model, which generates better feedback. This loop compounds over time.

The fine-tuning flywheel — continuous improvement loop

🔄

Flywheel Data Sources

User corrections: When users edit model output
Thumbs up/down: Explicit quality signals
Support escalations: Cases model couldn't handle
A/B test losers: Examples where new model failed
Edge cases: Unusual inputs from production logs

⚡

Flywheel Automation

Auto-label: Use current model to bootstrap labels
Human-in-the-loop: Flag uncertain cases for review
Scheduled retraining: Weekly/monthly fine-tuning runs
Auto-eval: CI pipeline runs eval on new checkpoints
Auto-promote: If eval passes, deploy to staging
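The first flywheel step, turning user corrections into training data, can be sketched as a small conversion function. It assumes each feedback event carries `prompt`, `model_output`, and `corrected_output` fields, which are hypothetical names; the output uses the same chat format as the SFT examples earlier in this guide.

```python
import hashlib
from typing import List


def corrections_to_sft(feedback_events: List[dict]) -> List[dict]:
    """Convert user corrections into deduplicated chat-format SFT examples."""
    seen = set()
    examples: List[dict] = []
    for event in feedback_events:
        prompt = (event.get("prompt") or "").strip()
        original = (event.get("model_output") or "").strip()
        corrected = (event.get("corrected_output") or "").strip()
        # Skip events with no usable correction
        if not prompt or not corrected or corrected == original:
            continue
        key = hashlib.sha256(f"{prompt}\x00{corrected}".encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate of an example we already kept
        seen.add(key)
        examples.append(
            {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": corrected},
                ]
            }
        )
    return examples


# Made-up feedback events, for illustration only
events = [
    {"prompt": "Code for type 2 diabetes?", "model_output": "E10", "corrected_output": "E11.9"},
    {"prompt": "Code for type 2 diabetes?", "model_output": "E10", "corrected_output": "E11.9"},
    {"prompt": "Summarize the note", "model_output": "Fine.", "corrected_output": "Fine."},
]
sft_examples = corrections_to_sft(events)
print(len(sft_examples))
```

User corrections are not automatically right. Route the converted examples through the same expert review as any other training data before they reach a fine-tuning run.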

The Flywheel Compounds

An illustrative trajectory (the numbers are examples, not benchmarks): Month 1: You ship a fine-tuned model. Month 2: You've collected 1,000 examples of user corrections; you retrain and quality improves 5%. Month 3: The better model gets more usage, generating more feedback. Month 6: You have 10,000 curated examples and 20% better quality than month 1. Teams that build the flywheel pull ahead of teams that treat fine-tuning as a one-time event.

Production Fine-Tuning Checklist — Before You Ship Reference

Phase	Checkpoint	Done?
Data	Dataset is versioned and reproducible	☐
	Deduplication completed (exact + near)	☐
	Train/val/test splits verified — no leakage	☐
	Quality audit passed (random sample reviewed)	☐
Training	Experiment tracked (hyperparams, metrics, artifacts)	☐
	Multiple checkpoints saved	☐
	Training loss and val loss look healthy	☐
Evaluation	Task-specific eval passed (golden set)	☐
	Regression tests passed (<2% drop on general benchmarks)	☐
	Safety eval passed (refusal rate, harmful content)	☐
	Format compliance verified (JSON parse rate, etc.)	☐
Deployment	Model registered with version and metadata	☐
	Quantization tested (if using)	☐
	Inference latency acceptable in staging	☐
Operations	Monitoring and alerting configured	☐
	Rollback plan documented and tested	☐
	Shadow mode or A/B test plan in place	☐
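The "no leakage" item in the Data phase can be partly automated. The sketch below flags test inputs that also appear in the training set after light normalization (lowercasing and whitespace collapsing). It only catches exact matches; near-duplicate detection needs a fuzzier technique such as MinHash. The helper names and sample strings are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leakage(train_inputs: Iterable[str], test_inputs: Iterable[str]) -> List[str]:
    """Return test inputs whose normalized form also appears in the training set."""
    train_hashes = {fingerprint(t) for t in train_inputs}
    return [t for t in test_inputs if fingerprint(t) in train_hashes]


# Sample inputs, for illustration only
train = ["Identify the indemnification clause.", "Summarize this contract."]
test = ["identify  the indemnification clause.", "List the termination conditions."]
leaked = find_leakage(train, test)
print(leaked)
```

Run the check whenever the dataset version changes, since new training data collected from production can silently overlap with an older test set.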

Incident Response — When Things Go Wrong In-depth

Fine-tuned models can fail in production. Have a runbook ready.

🚨

Common Failure Modes

Quality regression: Model suddenly worse
Format failures: JSON/structured output breaks
Refusal spike: Model refuses valid requests
Harmful output: Model generates bad content
Latency spike: Inference slows dramatically
OOM: Out of memory crashes

🛡️

Incident Response Steps

Detect: Alerting triggers on anomaly
Assess: Severity? Scope? Cause hypothesis?
Mitigate: Rollback to previous model version
Investigate: Root cause analysis with logs/traces
Fix: Address underlying issue
Postmortem: Document and prevent recurrence

Always Have a Rollback Plan

Before deploying any new model version: (1) Keep the previous version running in staging, (2) Document the exact rollback command/process, (3) Test the rollback procedure in staging, (4) Have a "big red button" that can revert in <5 minutes. The ability to roll back quickly is more important than the ability to deploy quickly.

∑ Chapter 10 — Key Takeaways

Track every experiment: hyperparameters, data version, metrics, artifacts (W&B, MLflow)
Use a model registry: version models, promote through stages (experiment → staging → production → archived)
A/B test in production: shadow mode first, then gradual rollout, measure quality + latency + cost
Monitor continuously: operational metrics (latency, errors) + quality metrics (user feedback, format compliance) + drift detection
Build the flywheel: production feedback → curated data → fine-tune → deploy → more feedback
Always have a rollback plan: test it, document it, be able to execute in <5 minutes
