
Fine-Tuning LLMs

From dataset curation to LoRA adapters, SFT, DPO, and production deployment: a practitioner's complete guide to adapting large language models.

Fine-tuning is not a silver bullet; it is a precision tool. Used correctly, it produces models that outperform general-purpose models on specific tasks. Used carelessly, it wastes GPU budget and makes models worse. This guide teaches you the difference.

Chapter 01 · Decision Framework
Why Fine-Tune? When It Wins and When It Doesn't

Most teams that think they need fine-tuning don't. Fine-tuning is expensive, slow, and introduces new failure modes. Before training a single step, prove that prompting and RAG cannot solve your problem. Fine-tuning wins when you need capabilities the model doesn't have, not when you need it to do something it already can.

Fine-tuning sits at the top of a capability ladder. Each rung is more expensive and time-consuming than the last. The rule: never climb to the next rung unless you've exhausted the current one.

The capability ladder: climb only when necessary (cost rises with each rung; start at rung ①)
  • ① Prompt Engineering: zero-shot → few-shot → CoT → system prompt. Hours · $0 · instant deploy
  • ② RAG / Context Injection: add external knowledge at inference time. Days · $100s · infra cost
  • ③ Model Routing / Ensemble: use different models for different tasks. Weeks · $1Ks · multiple models
  • ④ Fine-Tuning: modify model weights (LoRA → full). Weeks–months · $1K–$100K · GPU required
The 80/20 Rule of LLM Capabilities

For ~80% of LLM use cases, prompt engineering + RAG is sufficient. For ~15%, better model routing or a larger model solves the problem. Only ~5% truly require fine-tuning: usually domain-specific tasks, format control beyond what prompting achieves, or latency/cost requirements that demand a smaller, specialized model.

✅
Format Control Beyond Prompting

You need structured outputs (JSON, XML, code) with 99%+ reliability, and even few-shot + JSON mode doesn't achieve it consistently on your edge cases.

  • Medical coding with complex schemas
  • Domain-specific DSLs
  • Highly structured report generation
✅
Behavior Modification

You need the model to behave fundamentally differently from its base training: different tone, different reasoning style, different refusal boundaries.

  • Brand voice consistency at scale
  • Domain-specific communication norms
  • Reducing over-refusal for valid use cases
✅
Domain Knowledge Integration

The knowledge you need is too large or specialized for RAG context, so the model must internalize domain expertise into its weights.

  • Medical/legal terminology usage
  • Proprietary codebase understanding
  • Specialized scientific reasoning
✅
Latency / Cost Optimization

You've proven a large model works, but need a smaller model to match its quality at 10× lower cost / 5× lower latency for production scale.

  • Distillation to smaller model
  • Edge deployment requirements
  • High-volume cost reduction (1M+ queries/day)

Fine-tuning does not turn a model into a database. A common misconception is that you can "teach" a model facts by including them in training data. This fundamentally misunderstands how fine-tuning works.

🧠
What Actually Happens
  • Pattern learning: The model learns associations and response patterns, not retrievable facts
  • Implicit encoding: Knowledge is encoded implicitly in weights, not stored explicitly
  • Non-deterministic recall: The model may or may not surface specific facts depending on context
  • Blending: Fine-tuned knowledge blends with pre-training knowledge and can't be isolated
⚠️
Production Implications
  • Fine-tuned models can still hallucinate about the very domain facts they were trained on
  • Updates require retraining, not a data refresh
  • Correctness cannot be guaranteed because outputs are probabilistic
  • Citation and verification are impossible because there is no source to reference
The Knowledge Rule

If correctness depends on up-to-date or verifiable knowledge → use RAG, not fine-tuning. Fine-tuning is for teaching behavior, style, and format, not for injecting facts. A fine-tuned model that "knows" your product documentation will confidently hallucinate details that were never in the training data.

Anti-Pattern | What Happens | The Real Solution
"Our prompts are too long" | Fine-tuning doesn't reliably shorten prompts; behavior may become inconsistent | Prompt compression, RAG optimization, or prompt caching
"The model doesn't know X" | Fine-tuning doesn't add knowledge reliably; the result is brittle and prone to hallucination | RAG with verified source documents
"We have proprietary data" | You also need to maintain, version, and update that data, and fine-tuning freezes it | RAG allows live updates without retraining
"We want better quality" | If quality is vaguely defined, fine-tuning won't improve it (garbage in, garbage out) | Define quality → measure → improve prompts → then consider fine-tuning
"We need to differentiate" | Fine-tuning is not a competitive moat because others can fine-tune too. Data and product are the moat. | Focus on data quality and product experience, not the model itself
Fine-Tuning Makes Models Worse at Other Things

Every fine-tuning run risks catastrophic forgetting: the model loses capabilities it had before. A model fine-tuned heavily on medical text may become worse at general conversation or code. You're not adding capabilities; you're trading general capabilities for specialized ones. This trade-off must be intentional and measured.

Fine-tuning improves whatever your dataset encodes. If your dataset is flawed, the model will learn incorrect behavior, and quality may appear to improve on your own metrics while actually degrading in production.

Before You Train, Define:

Task-specific metrics: Accuracy, F1, format compliance, latency

Failure cases: What bad outputs look like (examples you'll reject)

Acceptance thresholds: Numbers that must be hit before shipping

Regression tests: General capabilities that must not degrade

Without Evaluation:

❌ You can't tell if training helped or hurt

❌ You can't compare model versions meaningfully

❌ You can't catch regressions before production

❌ You're optimizing blindly toward an undefined target

Evaluation-First Mindset

Build your golden test set before writing a single training example. If you can't measure improvement, you're not doing engineering; you're hoping. The eval set defines what "better" means for your specific use case.
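As a minimal sketch of what an evaluation-first setup might look like, the snippet below scores a batch of model outputs for JSON format compliance and compares the result to an acceptance threshold. The function names, the required keys, the sample outputs, and the 0.99 threshold are all illustrative assumptions, not part of any particular framework.

```python
import json

def is_valid_output(raw: str, required_keys: set) -> bool:
    """Return True if raw parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)

def format_compliance(outputs: list, required_keys: set) -> float:
    """Fraction of outputs that pass the format check (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    passed = sum(is_valid_output(o, required_keys) for o in outputs)
    return passed / len(outputs)

def meets_threshold(score: float, threshold: float = 0.99) -> bool:
    """Acceptance gate; the threshold is a hypothetical value fixed before training."""
    return score >= threshold

# Hypothetical golden-set outputs from a candidate model.
sample_outputs = [
    '{"code": "E11.9", "confidence": 0.93}',
    '{"code": "I10"}',              # missing "confidence"
    'Sure! Here is the JSON: {}',   # not parseable as JSON
    '{"code": "J45.909", "confidence": 0.88}',
]
score = format_compliance(sample_outputs, {"code", "confidence"})
print(f"format compliance: {score:.2f}, ship: {meets_threshold(score)}")
```

The same pattern extends to accuracy or F1: compute the metric on a fixed golden set, then compare it with a number agreed on before training began.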

Fine-tuning cannot fix a fundamentally weak base model. If the base model doesn't understand your domain at all, fine-tuning won't magically create that capability; you'll just teach it to confidently produce low-quality outputs.

Consideration | What to Check | Red Flag
Baseline capability | Zero-shot performance on your task | Model produces nonsense without prompting
Reasoning quality | Chain-of-thought coherence, logic | Model can't follow multi-step reasoning
Context length | Max tokens vs your typical input size | Inputs truncated during training/inference
Language coverage | Fluency in your target languages | Model struggles with non-English content
License & deployment | Commercial use, modification rights | License prohibits your use case
The Base Model Rule

Choose the smallest model that already performs reasonably well on your task with prompting alone. Fine-tune to specialize and improve consistency, not to compensate for fundamental capability gaps. A 7B model that handles your domain well can outperform a poorly matched 70B model.

Cost Category | LoRA (7B model) | Full Fine-Tune (7B) | Full Fine-Tune (70B)
GPU requirement | 1× A100 40GB or 1× RTX 4090 | 4× A100 80GB (FSDP) | 8–16× A100 80GB
Training time (10K samples) | 2–4 hours | 8–16 hours | 24–72 hours
Cloud compute cost | $10–$50 | $200–$500 | $2,000–$10,000
Data prep time | Days to weeks | Days to weeks | Weeks to months
Iteration cycle | Hours per experiment | 1–2 days per experiment | Days per experiment
Risk of degradation | Lower (fewer params modified) | Moderate (easy to overfit) | Higher (harder to debug)
When the Math Works Out

Fine-tuning makes economic sense when: (1) you've proven prompting doesn't work, (2) you have 5K–50K+ quality training examples, (3) you need the specialized capability for high-volume production, and (4) you're prepared to maintain the fine-tuned model over time (updates when the base model changes, dataset maintenance, eval pipeline). If any of these are missing, the ROI is negative.
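One way to make the ROI condition concrete is a break-even estimate: how many days of inference savings it takes to pay back the fine-tuning investment. The sketch below assumes a simple linear cost model, and every figure in the example (per-query costs, maintenance budget, volumes) is hypothetical.

```python
from typing import Optional

def break_even_days(
    one_time_cost: float,
    monthly_maintenance: float,
    cost_per_query_large: float,
    cost_per_query_small: float,
    queries_per_day: int,
) -> Optional[float]:
    """Days until cumulative inference savings cover the fine-tuning investment.

    Returns None when daily savings never exceed the daily maintenance cost,
    i.e. when the project cannot pay for itself under these assumptions.
    """
    daily_savings = (cost_per_query_large - cost_per_query_small) * queries_per_day
    daily_net = daily_savings - monthly_maintenance / 30
    if daily_net <= 0:
        return None
    return one_time_cost / daily_net

# Hypothetical scenario: $20K for data prep and iterations, $3K/month upkeep,
# and a fine-tuned small model that is 4x cheaper per query than the large one.
high_volume = break_even_days(20_000, 3_000, 0.002, 0.0005, 1_000_000)
low_volume = break_even_days(20_000, 3_000, 0.002, 0.0005, 10_000)
print(f"1M queries/day:  {high_volume:.1f} days to break even")
print(f"10K queries/day: {low_volume}")
```

At low volume the function returns None, which mirrors the point above: without high-volume production use, maintenance costs outweigh the savings.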

Dimension | Prompting | RAG | Fine-Tuning
Best for | Behavior control, format, task guidance | External knowledge, up-to-date info | Capability modification, style, internalized expertise
Setup time | Hours | Days to weeks | Weeks to months
Iteration speed | Instant (edit and deploy) | Fast (update documents) | Slow (retrain and evaluate)
Knowledge updates | Edit prompt (limited) | Update index anytime | Retrain required
Failure mode | Instruction not followed | Wrong docs retrieved | Catastrophic forgetting + overfitting
Maintenance burden | Low (prompt version control) | Medium (index maintenance) | High (data pipeline, retraining, eval)
Start Here (default path)

1. Prompt engineering (few-shot, CoT, system prompt)

2. Add RAG if knowledge is required

3. Try a larger/better model

4. Only then → fine-tune if still not working

When to Skip to Fine-Tuning

✅ You need a smaller model at lower cost/latency

✅ Task requires fundamentally different behavior

✅ You have abundant high-quality labeled data

✅ You've already proven prompting + RAG insufficient

A single training run rarely produces a production-ready model. Expect multiple iterations: the first model reveals what's wrong with your data, the second fixes some issues, the third gets closer to acceptable quality.

The real fine-tuning workflow: iteration is the norm
Prepare Data → Train → Evaluate → Identify Failures → Update Dataset → repeat 3–10× until the quality threshold is met
The True Cost Is Iteration

Compute cost for a single training run is small. The real cost is iteration time: reviewing failures, curating fixes, retraining, re-evaluating. Budget for 3–10 iteration cycles. A team that plans for one training run will be surprised; a team that plans for ten will ship a good model.

Unlike prompts (edit instantly) or RAG (update documents), fine-tuned models are frozen artifacts. You commit to maintaining them over time, or to accepting degradation.

🔄
Base Model Updates

When the base model releases a new version (Llama 3.2 → 3.3), you must:

  • Retrain on new base
  • Re-evaluate for regressions
  • Update deployment infra
📊
Data Drift

Production usage changes over time:

  • New query patterns emerge
  • Domain knowledge evolves
  • Model performance degrades
🔧
Pipeline Maintenance

The infrastructure requires upkeep:

  • Dataset versioning
  • Eval pipeline updates
  • Retraining automation
The Maintenance Question

Before fine-tuning, ask: "Who will maintain this model 6 months from now?" If the answer is unclear, reconsider. Unmaintained fine-tuned models become legacy debt: increasingly out of sync with reality, impossible to update quickly, and risky to replace.

This is the fundamental insight that should guide every fine-tuning decision: you are not making the model universally better. You are making a trade.

What You Gain

✅ Better performance on your specific task

✅ More consistent format and style

✅ Reduced prompt engineering complexity

✅ Potentially lower inference cost (smaller model)

What You Lose

❌ Performance on tasks outside your training distribution

❌ Flexibility to handle unexpected inputs

❌ Ability to update quickly (must retrain)

❌ General reasoning capability (potentially)

The Trade-off Must Be Intentional

Before fine-tuning, explicitly document: (1) What tasks must improve, (2) What tasks can degrade, (3) How you'll measure both. If you can't answer these questions, you're not ready to fine-tune. A fine-tuned model without measured trade-offs is a model you don't understand.
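The documented trade-off can be turned into an automated gate. The following is a minimal sketch, assuming you already have per-task scores for the base model and the fine-tuned candidate; the task names, scores, and tolerances are hypothetical.

```python
def check_tradeoffs(baseline, candidate, must_improve, may_degrade_by):
    """Compare candidate metrics with baseline under a documented trade-off.

    must_improve:   task -> minimum required gain over the baseline
    may_degrade_by: task -> maximum tolerated drop from the baseline
    Returns a list of human-readable violations (an empty list means pass).
    """
    violations = []
    for task, min_gain in must_improve.items():
        gain = candidate[task] - baseline[task]
        if gain < min_gain:
            violations.append(f"{task}: gained {gain:+.3f}, needed {min_gain:+.3f}")
    for task, max_drop in may_degrade_by.items():
        drop = baseline[task] - candidate[task]
        if drop > max_drop:
            violations.append(f"{task}: dropped {drop:.3f}, limit {max_drop:.3f}")
    return violations

# Hypothetical scores on a fixed eval suite (higher is better).
baseline = {"ticket_classification": 0.81, "general_qa": 0.74, "code": 0.62}
candidate = {"ticket_classification": 0.93, "general_qa": 0.72, "code": 0.51}

violations = check_tradeoffs(
    baseline,
    candidate,
    must_improve={"ticket_classification": 0.05},
    may_degrade_by={"general_qa": 0.03, "code": 0.05},
)
for v in violations:
    print("FAIL", v)
```

Here the target task improves and general QA stays within tolerance, but the code regression exceeds its limit, so the candidate would be rejected or the tolerance explicitly renegotiated.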

∑ Chapter 01 — Key Takeaways

  • Fine-tuning is the last rung on the capability ladder: exhaust prompting, RAG, and model routing first
  • ~95% of LLM use cases do not require fine-tuning: ~80% are solved by prompt engineering + RAG, and another ~15% by routing or a larger model
  • Fine-tuning wins for: format control, behavior modification, internalized domain knowledge, and cost/latency optimization
  • Fine-tuning fails for: shortening prompts, adding factual knowledge, vague quality improvements, and differentiation alone
  • Catastrophic forgetting is real: fine-tuning trades general capability for specialized capability
  • Cost: LoRA = $10–$50, Full FT (7B) = $200–$500, Full FT (70B) = $2K–$10K+, but data prep time matters more
Chapter 02 · Dataset Engineering
Data Preparation: The Make-or-Break Step

As a rule of thumb, fine-tuning performance is roughly 90% data, 10% hyperparameters. A mediocre model trained on excellent data will usually outperform an excellent model trained on mediocre data. Most fine-tuning failures trace back to dataset problems: low quality, insufficient diversity, wrong format, or insufficient volume. Data engineering is the real work of fine-tuning.

Every fine-tuning dataset is ultimately a collection of (input, target output) pairs. The format depends on your training objective: completion, instruction following, or preference learning.

📄
Format: Completion (SFT)

Single text sequence: the model learns to predict the continuation. Used for continued pre-training or simple generation tasks.

{ "text": "User: What is photosynthesis?\nAssistant: Photosynthesis is..." }
💬
Format: Chat / Instruction (SFT)

Multi-turn conversations with role markers. Standard for instruction tuning. Most common format.

{ "messages": [ {"role": "system", "content": "You are..."}, {"role": "user", "content": "Explain X"}, {"role": "assistant", "content": "X is..."} ] }
⚖️
Format: Preference (DPO/RLHF)

Pairs of (chosen, rejected) responses to the same input. Used for preference alignment.

{ "prompt": "Explain X", "chosen": "X is a concept that...", "rejected": "I don't know what X is." }
Training Goal | Format | Fields Required | Example Use
Continued pre-training | Completion | text | Domain adaptation (legal corpus, codebase)
Instruction following | Chat/Instruction | messages with role/content | General assistant, task completion
Preference alignment | Preference pairs | prompt, chosen, rejected | DPO training, RLHF reward modeling
Structured output | Chat + schema | messages with JSON in assistant turn | Extraction, classification, code generation

A model performs well only when training data matches real usage. The most common cause of fine-tuning failure isn't bad hyperparameters; it's training on data that doesn't represent production.

Common Failure Pattern

❌ Training on clean, idealized examples

❌ Deploying on noisy, real-world inputs

❌ Performance collapse in production

❌ Team surprised: "It worked in testing!"

Training Data Must Include:

✅ Real user queries (sampled from logs)

✅ Edge cases and unusual inputs

✅ Typos, grammar errors, incomplete inputs

✅ Adversarial and out-of-scope inputs

The Golden Rule of Training Data

Your dataset should look like production logs, not curated examples. If your training data is cleaner than your production traffic, you're training a model for a world that doesn't exist. Resist the urge to "clean up" training examples too much. Real users don't write perfect queries.

More data is only better if the data is consistently high quality. A 10K sample dataset with 30% low-quality examples will produce worse results than a 3K sample dataset where every example is excellent.

Quality vs quantity: typical fine-tuning performance curve
[Figure: model quality (0–100) versus dataset size (500 to 100K+ samples), with one curve for high-quality data and one for mixed-quality data; the marked sweet spot is 1K–10K samples.]
✅
High-Quality Sample Criteria
  • Correct: The target output is factually accurate and appropriate
  • Complete: No truncation, no partial responses
  • Consistent: Format matches other samples; no random variations
  • Representative: Covers the distribution of real inputs
  • Clear: Unambiguous instruction → response mapping
❌
Quality Problems That Ruin Training
  • Noisy labels: Wrong, inconsistent, or ambiguous targets
  • Duplicates: Same examples repeated, so the model overfits to those patterns
  • Length bias: All short or all long, so the model learns length, not content
  • Format inconsistency: Mixed JSON styles, varying delimiters
  • Contamination: Test examples in the training set, producing misleadingly good results
The 1,000 High-Quality Sample Rule

For most task-specific fine-tuning, 1,000–5,000 high-quality examples is the sweet spot. Below 500, there is rarely enough coverage for the model to learn the task reliably. Above 10K, returns diminish, and extra data helps only if its quality remains excellent. A team that spends 80% of effort on data curation and 20% on training will outperform a team that does the opposite.

Fine-tuned models often appear to improve during training but are actually memorizing. When train loss decreases but validation loss stalls or increases, the model is overfitting to training examples rather than learning generalizable patterns.

🚨
Warning Signs of Overfitting
  • Loss divergence: Training loss decreases, validation does not
  • Output similarity: Outputs become overly similar to training examples
  • Brittleness: Performance drops on slightly different inputs
  • Memorization: Model reproduces training text verbatim
  • Narrow behavior: Model only handles exact patterns from training
🛡️
Mitigation Strategies
  • Strong validation set: 10–20% of data, held out strictly
  • Early stopping: Monitor validation loss, stop when it stalls
  • Fewer epochs: 1–3 epochs is often sufficient
  • More data diversity: Increase variety, not just volume
  • Out-of-distribution eval: Test on inputs unlike training
The Overfitting Test

After training, show the model inputs that are similar but not identical to training examples. If performance is significantly worse than on training-like inputs, you're overfitting. A well-generalized model should handle reasonable variations without degradation.
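The early-stopping rule from the mitigation list can be sketched in a few lines. This is an illustrative, framework-free version; the function name, the patience value, and the loss values are assumptions. Training libraries such as Hugging Face Transformers ship their own early-stopping callbacks, which would normally be used instead.

```python
def find_stop_step(val_losses, patience=2, min_delta=0.0):
    """Return (stop_step, best_step) for a sequence of validation losses.

    stop_step is the evaluation index at which training would halt, or None
    if the patience budget is never exhausted. best_step is the index of the
    checkpoint worth keeping.
    """
    best_loss = float("inf")
    best_step = None
    evals_without_gain = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_step = loss, step
            evals_without_gain = 0
        else:
            evals_without_gain += 1
            if evals_without_gain >= patience:
                return step, best_step
    return None, best_step

# Hypothetical validation losses, one per evaluation interval.
val_losses = [1.20, 0.95, 0.88, 0.90, 0.93, 0.97]
stop_step, best_step = find_stop_step(val_losses, patience=2)
print(f"stop at eval {stop_step}, keep checkpoint from eval {best_step}")
```

Validation loss bottoms out at evaluation 2 and rises afterwards, so with a patience of 2 training halts at evaluation 4 and the checkpoint from evaluation 2 is the one to keep.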

Source | Quality | Cost | Best For | Watch Out
Production logs | Real distribution | Free (you have it) | Domain adaptation, improving existing models | Needs labeling; may contain PII
Expert annotation | Highest quality | $50–$500+ per hour | Small high-quality datasets (500–2K) | Expensive at scale; expert availability
Crowd annotation | Variable | $0.10–$2 per sample | Scaling up with quality controls | Needs rigorous QA; inter-annotator agreement
Synthetic (LLM-generated) | Good if filtered | $0.001–$0.01 per sample | Bootstrapping, format examples, augmentation | Model collapse risk; echo chamber
Public datasets | Variable | Free | Pre-training, general instruction tuning | May be contaminated; license issues

A powerful technique: use a larger, more capable model (GPT-4o, Claude Sonnet) to generate training examples for a smaller model. Several early open-source instruction-tuned models, such as Alpaca and Vicuna, were trained on data produced by stronger models in this way.

🔄
Self-Instruct Pattern

Generate diverse (instruction, response) pairs from a seed set. Use the LLM to create variations and new tasks.

  • Start with 100–200 seed examples
  • Generate 5K–50K synthetic examples
  • Filter aggressively for quality
🎯
Evol-Instruct Pattern

Take simple instructions and evolve them into harder, more complex versions. Creates curriculum progression.

  • Simple → Complex → Multi-step
  • Add constraints, edge cases
  • Reject trivial generations
⚖️
Distillation Pattern

Run your production queries through a large model; use outputs as training signal for a smaller model.

  • Deploy large model first
  • Log (input, output) pairs
  • Fine-tune small model on logs
🔧
Synthetic data generation prompt (example)
System:
You are generating training data for a customer support classification model.
Generate 10 diverse customer support tickets that should be classified as "BILLING".
Each ticket should be realistic, varied in tone (frustrated, neutral, polite), length (1-3 sentences), and phrasing.
Format each as:
Ticket: [ticket text]
Label: BILLING
Do not repeat patterns. Make each ticket distinct.

User:
Generate 10 billing-related support tickets.
Synthetic Data Risks

  • Model collapse: Training on LLM outputs can amplify quirks and reduce diversity.
  • Echo chamber: Synthetic data reflects the biases of the generating model.
  • Format homogeneity: LLMs generate in patterns, so synthetic datasets often lack the noise and variation of real human data.
  • Mitigation: Always filter synthetic data, mix it with real data (20–50% real), and validate on held-out human-generated test sets.

Duplicate or near-duplicate examples cause the model to memorize rather than generalize. Deduplication is not optional; it is a required preprocessing step for any fine-tuning dataset.

Dedup Method | What It Catches | Implementation | When to Use
Exact hash | Identical text strings | MD5/SHA256 of normalized text | Always (baseline dedup)
N-gram overlap | High textual similarity (70%+ overlap) | MinHash, n-gram Jaccard | Medium-sized datasets (<100K)
Embedding similarity | Semantic duplicates (same meaning, different words) | Embed + ANN search (FAISS) + threshold | When paraphrases are a concern
Train/test leakage check | Test examples leaked into training | Hash match between splits | Always (critical for valid eval)
Dedup Pipeline (Recommended)

Step 1: Normalize text (lowercase, strip whitespace, remove special chars)
Step 2: Exact dedup by hash
Step 3: Near-dedup by MinHash (threshold 0.7)
Step 4: Check for train/test/val leakage
Step 5: Log dedup stats (how many removed, from which sources)

A 10% dedup rate is normal; >30% suggests data collection problems.
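The pipeline steps above can be sketched with the standard library alone. Note the substitution: this sketch computes exact Jaccard similarity over word trigrams, which is quadratic in dataset size, as a stand-in for MinHash, which approximates the same similarity at scale. Function names and sample texts are illustrative assumptions.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase, drop non-alphanumeric characters, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts become a single shingle."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(samples, threshold=0.7):
    """Exact dedup by hash, then near-dedup by n-gram Jaccard. Returns (kept, stats)."""
    seen_hashes = set()
    kept = []  # pairs of (original text, shingle set)
    stats = {"exact": 0, "near": 0}
    for sample in samples:
        norm = normalize(sample)
        digest = sha(norm)
        if digest in seen_hashes:
            stats["exact"] += 1
            continue
        seen_hashes.add(digest)
        sample_shingles = shingles(norm)
        if any(jaccard(sample_shingles, other) >= threshold for _, other in kept):
            stats["near"] += 1
            continue
        kept.append((sample, sample_shingles))
    return [text for text, _ in kept], stats

def leaked(train, test):
    """Test examples whose normalized text also appears in the training set."""
    train_hashes = {sha(normalize(t)) for t in train}
    return [t for t in test if sha(normalize(t)) in train_hashes]

samples = [
    "How do I reset my password?",
    "how do i reset my password",          # exact duplicate after normalization
    "How do I reset my password today?",   # near duplicate (Jaccard 0.8)
    "Where is my invoice for March?",
]
train, stats = deduplicate(samples)
print(train, stats)
print(leaked(train, ["HOW DO I RESET MY PASSWORD??", "Cancel my plan"]))
```

For datasets beyond a few thousand samples, the pairwise comparison should be replaced by a real MinHash/LSH index, but the stages and the logged statistics stay the same.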

Recommended Splits

Training: 80–90%. The model learns from this

Validation: 5–10%. Used during training to tune hyperparams and detect overfitting

Test: 5–10%. Held out until final evaluation; never touched during training

Rule: Validation and test sets must be representative of production distribution

Common Mistakes

❌ No validation set: can't detect overfitting during training

❌ Test set too small: results have high variance

❌ Leakage between splits: misleadingly good metrics

❌ Random split when data has structure: split by time, user, or document instead
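When data has structure, one option is to assign whole groups to a split so that no user or document straddles train and test. The sketch below hashes a group identifier into a stable bucket; the `user_id` field name and the split fractions are assumptions for illustration.

```python
import hashlib

def assign_split(group_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a group (user, document, session) to one split."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def split_by_group(records, group_key="user_id"):
    """Place every record of a group in the same split."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        splits[assign_split(str(record[group_key]))].append(record)
    return splits

# Hypothetical logs: 50 users with 3 queries each.
records = [{"user_id": f"u{i}", "text": f"query {j}"} for i in range(50) for j in range(3)]
splits = split_by_group(records)
print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the group identifier, it stays stable as new records arrive, which also prevents examples from drifting between splits across dataset versions. Split sizes are only approximately the requested fractions, especially with few groups.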

Check | What to Verify | Red Flags
Format validation | All samples parse correctly; required fields present | JSON parse errors, missing messages, empty content
Length distribution | Reasonable spread of input/output lengths | All samples same length, or extreme outliers
Label balance | Classes roughly balanced (or intentionally weighted) | 90% one class, 10% others → model ignores minority
Quality audit | Random sample of 50–100 manually reviewed | >5% have errors, inconsistencies, or low quality
Deduplication | Exact + near duplicates removed | >10% duplicates; any train/test leakage
Split validation | No overlap between train/val/test; splits representative | Same examples in train and test
Tokenization check | Samples don't exceed context window after tokenization | Samples truncated silently during training
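Two of the checks above, format validation and label balance, are easy to automate. The following sketch assumes chat-format JSONL records with a `messages` list, as in the format examples earlier in this chapter; the validation rules and helper names are illustrative choices, and the tokenization check is omitted because it depends on the specific tokenizer.

```python
import json
from collections import Counter

VALID_ROLES = {"system", "user", "assistant"}

def validate_chat_record(line: str) -> list:
    """Return the problems found in one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages'"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    return problems

def majority_share(labels) -> float:
    """Share of the most common label; values near 1.0 signal imbalance."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())

lines = [
    '{"messages": [{"role": "user", "content": "Explain X"}, {"role": "assistant", "content": "X is..."}]}',
    '{"messages": [{"role": "user", "content": "Explain Y"}, {"role": "assistant", "content": " "}]}',
    '{"text": "no messages field"}',
    'not json at all',
]
for n, line in enumerate(lines):
    print(n, validate_chat_record(line) or "ok")

print(majority_share(["BILLING"] * 9 + ["SHIPPING"]))
```

Running checks like these as a gate before every training job makes silent format errors and class imbalance visible while they are still cheap to fix.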

The true value of fine-tuning is not the model itself; it's the data pipeline you build. The model is a snapshot; the pipeline is an asset that compounds over time.

The data flywheel: a continuous improvement engine
Collect Production Data → Identify Failures → Improve Dataset → Retrain Model → Deploy → repeat
Better data → better model → better data collection
Flywheel Components

Data collection: Log production queries and model outputs

Failure identification: Find where model underperforms

Dataset improvement: Add examples for failure cases

Automated retraining: Regular model updates

Flywheel Compounding

Month 1: 1,000 examples, 70% quality

Month 3: 3,000 examples, 80% quality

Month 6: 8,000 examples, 90% quality

Month 12: Competitors can't catch up

The Flywheel Is the Moat

A fine-tuned model is a commodity: anyone can train one. A data pipeline that continuously improves from production usage is a moat. Teams that build the flywheel pull ahead over time; teams that treat fine-tuning as a one-time event get left behind.

∑ Chapter 02 — Key Takeaways

  • Fine-tuning performance is roughly 90% data, 10% hyperparameters: data engineering is the real work
  • Three main formats: completion (text), chat/instruction (messages), preference (chosen/rejected pairs)
  • 1,000–5,000 high-quality samples is the sweet spot; more data only helps if quality stays high
  • Synthetic data is powerful but risky: filter aggressively, mix with real data, validate on human test sets
  • Deduplication is not optional: exact hash → MinHash → embedding similarity → train/test leakage check
  • Before training: format validation, length distribution, label balance, quality audit, dedup, split validation, tokenization check
Chapter 03 · Parameter-Efficient
LoRA & PEFT: Train 1% of Parameters, Get 95% of the Gains

Parameter-efficient fine-tuning (PEFT) is the practical way to fine-tune large models. Instead of updating all 7B–70B parameters, you train small adapter layers that modify model behavior. Depending on the setup, this can reduce GPU memory by around 80% and training cost by around 90%, and it makes experimentation fast. For most use cases, PEFT matches full fine-tuning quality while being dramatically easier to run.

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable matrices to specific layers. Instead of updating a 4096×4096 weight matrix, you learn two small matrices (4096×r and r×4096, where r=8–64) whose product approximates the weight update. The original model is unchanged; you're just learning a delta.

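LoRA adds low-rank matrices to frozen base weights
[Figure: input x passes through the frozen weight W (4096 × 4096, ~16M params) and, in parallel, through the LoRA delta A (4096 × r) × B (r × 4096), which has 131K params at r=16. The two paths are summed, equivalent to W + AB, to give the output. Only A and B are trained (0.8% of params).]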
Why LoRA Works

The key insight: fine-tuning updates tend to live in a low-rank subspace. You don't need to modify all 16M parameters of a weight matrix; the behavioral changes can be captured by two small matrices totaling 100K–500K parameters. The base model provides general intelligence; LoRA provides task-specific steering. At inference, A×B can be merged into W, which means zero inference overhead.
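A toy numerical check can make the "delta" and "merge" ideas concrete. The sketch below uses plain Python lists and tiny dimensions (6 instead of 4096) purely for illustration, and it follows this section's naming (A is d×r, B is r×d). It leaves out details of real implementations, such as the alpha / r scaling factor that libraries like Hugging Face PEFT apply to the update.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one adapter: A is d_in x r, B is r x d_out."""
    return r * (d_in + d_out)

rng = random.Random(0)
d, r = 6, 2                    # toy sizes; the text uses d=4096 and r=8-64
x = rand_matrix(1, d, rng)     # one input row vector
W = rand_matrix(d, d, rng)     # frozen base weight
A = rand_matrix(d, r, rng)     # trainable
B = rand_matrix(r, d, rng)     # trainable

# Adapter kept separate at runtime: y = xW + (xA)B
y_separate = add(matmul(x, W), matmul(matmul(x, A), B))
# Adapter merged into the weight: y = x(W + AB)
y_merged = matmul(x, add(W, matmul(A, B)))

max_diff = max(abs(p - q) for p, q in zip(y_separate[0], y_merged[0]))
print(f"max difference between separate and merged outputs: {max_diff:.2e}")
print(lora_param_count(4096, 4096, 16), "trainable vs", 4096 * 4096, "frozen")
```

The two outputs agree up to floating-point rounding, which is why merging adds no inference cost, and the parameter count for r=16 reproduces the 131K figure quoted for a single 4096×4096 matrix.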

Hyperparameter | What It Controls | Recommended Range | Trade-off
Rank (r) | Capacity of the adaptation: how much information the LoRA can encode | 8–64 (start with 16) | Higher = more capacity but more params and risk of overfitting
Alpha (α) | Scaling factor for the LoRA update; controls magnitude of behavior change | α = r or α = 2r (default: same as rank) | Higher = stronger adaptation; too high = instability
Target modules | Which layers get LoRA adapters | q_proj, v_proj (minimum) or all attention + MLP | More modules = more capacity but slower training
Dropout | Regularization during training | 0.05–0.1 (often 0) | Helps prevent overfitting on small datasets
🎯
Simple Task (r=8)

Classification, sentiment, simple extraction. Low capacity needed, and a small LoRA helps prevent overfitting.

  • Params: ~65K per adapted 4096×4096 matrix
  • Memory: +2% vs base
  • Risk: May underfit complex tasks
⚖️
Standard Task (r=16–32)

Instruction following, domain adaptation, style transfer. Good balance of capacity and efficiency.

  • Params: ~130K–260K per adapted 4096×4096 matrix
  • Memory: +5% vs base
  • Most common production choice
🚀
Complex Task (r=64+)

Major behavior changes, multi-task, domain pre-training. Higher capacity when you have data to support it.

  • Params: ~500K–2M per adapted 4096×4096 matrix
  • Memory: +10% vs base
  • Risk: Overfitting on small datasets

QLoRA combines 4-bit quantization of the base model with LoRA training. The base model is loaded in 4-bit (NF4) format, cutting weight memory roughly 4× relative to 16-bit. The LoRA adapters are still trained in higher precision (typically 16-bit). This allows fine-tuning a 7B model on a single 16GB GPU or a 70B model on a single 80GB A100.

Model Size | Full FT Memory | LoRA Memory | QLoRA (4-bit) Memory | Consumer GPU?
7B | ~56GB (4× A100) | ~28GB (A100 40GB) | ~8GB (RTX 4090 / 3090) | ✅ Yes
13B | ~104GB (8× A100) | ~52GB (A100 80GB) | ~14GB (RTX 4090) | ✅ Yes
70B | ~560GB (16× A100) | ~280GB (8× A100) | ~48GB (A100 80GB) | ❌ No
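The weight-only portion of these memory figures follows from simple arithmetic: parameter count times bits per parameter. The helper below is a rough sketch under that assumption. It deliberately ignores gradients, optimizer states, activations, and quantization metadata, all of which add to the real footprint, so the table's figures are expected to be higher than what it prints.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 16)
    nf4 = weight_memory_gb(params, 4)
    print(f"{name}: {fp16:.1f} GB in 16-bit, {nf4:.1f} GB in 4-bit ({fp16 / nf4:.0f}x smaller)")
```

The 4× ratio between the 16-bit and 4-bit columns is where QLoRA's memory saving on the base model comes from; the trainable adapters and their optimizer states are small by comparison.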
🔧
QLoRA training config (Hugging Face PEFT)
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load base model in 4-bit (NF4); matrix multiplications run in bfloat16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Prepare the quantized model for training
    model = prepare_model_for_kbit_training(model)

    # Add LoRA adapters to the attention projections
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # sanity check: only adapter weights are trainable
QLoRA Quality vs Memory Trade-off

QLoRA reduces memory by roughly 4× but introduces quantization error. For most tasks, quality is within 1–2% of full-precision LoRA. However: (1) very small datasets may overfit more easily with QLoRA, (2) complex reasoning tasks sometimes benefit from full precision, (3) merging QLoRA adapters for serving requires dequantizing the base model first. Start with QLoRA; switch to full-precision LoRA only if quality is noticeably worse.

Method | Key Idea | When to Use | Status
LoRA | Low-rank matrices A, B added to frozen weights | Default choice: well-understood, broadly supported | Production-ready
QLoRA | LoRA + 4-bit base model quantization | When GPU memory is constrained | Production-ready
DoRA | Decomposes weights into magnitude + direction; applies a LoRA update to the direction and trains the magnitude separately | Small quality improvement over LoRA in some tasks | Emerging, worth testing
AdaLoRA | Adaptive rank: automatically allocates rank per layer | When optimal rank varies by layer | Experimental
LoRA+ | Different learning rates for A, B matrices | Minor optimization; easy to add | Experimental
Prefix Tuning | Prepend trainable "virtual tokens" to input | Legacy approach; LoRA generally better | Largely superseded
Practical Recommendation

Start with LoRA or QLoRA: they're the most tested, best supported, and work for 90%+ of use cases. Try DoRA if you need an extra 1–2% quality improvement and are willing to experiment. Everything else is research-stage or niche; don't use it unless you have a specific reason and can validate the improvement on your eval set.

During inference, you can either (1) load adapters separately and apply at runtime, or (2) merge adapters into base weights for a single merged model. Merging is preferred for production: zero inference overhead, simpler deployment.

Separate Adapters

Pros: Can swap adapters at runtime; multiple adapters per base model

Cons: Slight inference overhead; more complex serving

Use when: Multi-tenant serving, A/B testing adapters

Merged Model

Pros: Zero overhead; standard model format; simple deployment

Cons: Can't swap at runtime; produces a full model copy

Use when: Single-purpose deployment (most cases)

🔧
Merging LoRA adapters into base model
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "meta-llama/Llama-3.1-8B-Instruct"

    # Load the unquantized base model in 16-bit, then attach the trained adapter
    base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

    # Merge adapter into base weights
    merged_model = model.merge_and_unload()

    # Save as standard model (no adapter dependency), together with the tokenizer
    merged_model.save_pretrained("./my-merged-model")
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    tokenizer.save_pretrained("./my-merged-model")

Fine-tuning changes how a model behaves at inference time, not just on your target task but across all tasks. These trade-offs affect production system design.

✅
What Improves
  • Task consistency: More reliable outputs on trained patterns
  • Format compliance: Better adherence to target structure
  • Latency (if distilled): Smaller model can match larger model
  • Reduced prompting: Less instruction needed in prompt
⚠️
What May Degrade
  • Flexibility: Less adaptable to unexpected inputs
  • General capability: Worse on tasks outside training distribution
  • Creativity: More constrained, less diverse outputs
  • Instruction following: May ignore prompts that conflict with training
Production System Pattern: Model Routing

Production systems often route between base and fine-tuned models: use the fine-tuned model for in-domain queries where it excels, fall back to the base model for out-of-domain or uncertain cases. This preserves flexibility while gaining specialization. Implement confidence-based routing or query classification to decide which model handles each request.
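A minimal sketch of the confidence-based routing described above. The in-domain score is assumed to come from a separate query classifier that is not shown; `route_request`, `RoutingDecision`, and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    model: str   # "fine_tuned" or "base"
    reason: str


def route_request(in_domain_score: float, threshold: float = 0.8) -> RoutingDecision:
    """Pick a model from a hypothetical in-domain classifier score in [0, 1]."""
    if not 0.0 <= in_domain_score <= 1.0:
        raise ValueError("in_domain_score must be in [0, 1]")
    if in_domain_score >= threshold:
        return RoutingDecision("fine_tuned", "confident in-domain query")
    # Anything uncertain falls back to the more flexible base model
    return RoutingDecision("base", "out-of-domain or uncertain query")


print(route_request(0.93).model)
print(route_request(0.41).model)
```

The threshold would normally be tuned on logged traffic so that the fine-tuned model only receives queries where it measurably beats the base model.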

∑ Chapter 03 — Key Takeaways

  • LoRA trains 0.1–1% of parameters by adding small low-rank matrices to frozen base weights
  • Key hyperparams: rank r (start 16), alpha (= r or 2r), target modules (q, v minimum)
  • QLoRA combines 4-bit quantization + LoRA: fine-tune 7B on a 16GB GPU
  • DoRA offers small quality gains; everything else is experimental, so stick with LoRA/QLoRA
  • Merge adapters for production: zero inference overhead, simpler deployment
  • LoRA matches full fine-tuning quality for 90%+ of tasks at 10× lower cost
04
Chapter 04 · Training at Scale
Full Fine-Tuning: When to Update All Weights

Full fine-tuning updates every parameter in the model. It's the most powerful form of adaptation, and the most dangerous. Use it when LoRA isn't enough, you have abundant high-quality data, and you're prepared to invest in compute and validation. Most teams never need it; some absolutely do.

Scenario | Why Full FT? | Typical Data Volume
Continued pre-training | Adding domain knowledge to the base model (code, legal, medical corpus) | 10M–1B+ tokens
Language adaptation | Adapting to a new language the base model doesn't handle well | 1B+ tokens
LoRA ceiling reached | LoRA quality plateaus; ablation shows more capacity needed | 50K–500K samples
Model distillation | Training a smaller model to mimic a larger model's outputs | 100K–1M samples
Safety fine-tuning | Deep behavioral changes that touch many capabilities | 10K–100K samples
The Decision Rule

Try LoRA first. Measure quality. If quality is not sufficient and you have evidence that more capacity is needed (ablation with higher rank doesn't help, or quality improves with more data but plateaus), then consider full fine-tuning. Full fine-tuning is a last resort, not a default.

Model Size | Min GPUs (FSDP/ZeRO-3) | Memory per GPU | Training Time (10K samples) | Cloud Cost
7B | 4× A100 40GB | ~28GB per GPU | 8–16 hours | $200–$500
13B | 8× A100 40GB | ~32GB per GPU | 16–32 hours | $500–$1,500
70B | 16× A100 80GB | ~70GB per GPU | 48–120 hours | $5,000–$20,000
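The per-GPU memory column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision Adam at about 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus two Adam moments), sharded evenly across GPUs; activations and framework buffers are deliberately left out. `full_ft_memory_gb` is a hypothetical helper.

```python
def full_ft_memory_gb(params_billion: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Per-GPU memory (GB) for weights, gradients, and Adam states when fully sharded.

    Assumes 16 bytes/param: 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master
    weights and two Adam moments). Activations and buffers are NOT included.
    """
    total_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return total_gb / num_gpus


for size, gpus in [(7, 4), (13, 8), (70, 16)]:
    print(f"{size}B on {gpus} GPUs: ~{full_ft_memory_gb(size, gpus):.0f} GB per GPU before activations")
```

Under these assumptions the 7B and 70B rows come out at about 28 GB and 70 GB per GPU, in line with the table. The 13B row comes out at 26 GB, below the table's ~32GB, so that figure may already include some activation or buffer overhead.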
⚙️
FSDP (PyTorch Native)

Fully Sharded Data Parallel: shards model, optimizer, gradients across GPUs. First choice for PyTorch users.

  • Built into PyTorch ≥2.0
  • Good Hugging Face integration
  • Requires homogeneous GPU cluster
🚀
DeepSpeed ZeRO

Microsoft's distributed training library. ZeRO-3 shards everything; ZeRO-Offload uses CPU memory.

  • More memory-efficient than FSDP
  • Better for heterogeneous setups
  • Slightly more complex config
📦
Cloud Platforms

Managed training: Lambda Labs, RunPod, AWS SageMaker, Google Vertex AI.

  • Pre-configured multi-GPU
  • Spot instances for cost savings
  • Pay per hour: no capital expense

Full fine-tuning requires learning rates 10–100× smaller than pre-training. Too high → catastrophic forgetting and instability. Too low → no learning. The optimal range is narrow and model-dependent.

Hyperparameter | Typical Range | Notes
Learning rate | 1e-6 to 5e-5 (start: 2e-5) | 10–100× lower than pre-training LR
LR schedule | Cosine decay or linear decay | Warm up for first 3–10% of steps
Batch size | 32–256 (effective, after gradient accumulation) | Larger = more stable; limited by memory
Epochs | 1–3 (often just 1) | More epochs → overfitting on small datasets
Weight decay | 0.01–0.1 | Regularization; prevents overfitting
Gradient clipping | 1.0 | Prevents gradient explosion
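To make the schedule row concrete, here is a small sketch of linear warmup followed by cosine decay, using the table's suggested starting learning rate of 2e-5 and a 5% warmup. `lr_at_step` is a hypothetical helper written for illustration; training frameworks ship their own schedulers.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.05, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 49, 50, 525, 999):
    print(s, f"{lr_at_step(s, total):.2e}")
```

The learning rate peaks at the end of warmup and then falls smoothly toward zero, which avoids a large step size late in training when the model is already close to a good solution.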
Catastrophic Forgetting Is Real

With full fine-tuning, the model can forget everything it knew. Symptoms: degraded general conversation, broken instruction following on unrelated tasks, loss of chain-of-thought ability. Prevention: (1) low learning rate, (2) short training (1–3 epochs), (3) mix in general instruction data (10–20%), (4) evaluate on general benchmarks during training, not just your task.
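Prevention step (3) can be sketched as a simple data-mixing helper. The example assumes both datasets are in-memory lists and targets a 15% general-data share, inside the 10–20% range above; `mix_with_general` and the toy records are hypothetical.

```python
import random


def mix_with_general(task_examples: list, general_examples: list,
                     general_frac: float = 0.15, seed: int = 0) -> list:
    """Return a shuffled mix where general_frac of the result is general data."""
    if not 0.0 <= general_frac < 1.0:
        raise ValueError("general_frac must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_frac for n_general
    n_general = round(len(task_examples) * general_frac / (1.0 - general_frac))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)
    return mixed


task = [{"source": "task", "id": i} for i in range(850)]
general = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_with_general(task, general, general_frac=0.15)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(len(mixed), round(share, 3))
```

Note that the fraction is defined over the final mix, not over the task set, so 850 task examples need 150 general examples to reach 15%.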

📈
Healthy Training Signs
  • Loss decreases smoothly: no spikes or plateaus after warmup
  • Validation loss tracks training loss: gap stays small
  • Gradient norm stable: no explosions (the clipped norm stays ≤1.0 when clipping at 1.0)
  • Eval metrics improve on task: accuracy/quality on hold-out set
  • General benchmarks stable: MMLU, HumanEval don't drop significantly
🚨
Warning Signs: Stop and Investigate
  • Loss spikes: learning rate too high or data corruption
  • Validation loss increases: overfitting; stop training
  • NaN loss: numerical instability; reduce LR, check data
  • General capability drops: catastrophic forgetting; mix in general data
  • Repetitive/degenerate outputs: model collapsed; restart with lower LR
Checkpoint Strategy

Save checkpoints every 500–1000 steps. Keep the last 3 + best validation loss + best task metric. If training degrades, you can revert to an earlier checkpoint. For full fine-tuning, checkpoints are large (Adam optimizer states add roughly 2× the model's parameter count on top of the weights), so budget storage accordingly. Use save_only_model=True to save just the weights if storage is tight; such checkpoints cannot be used to resume training.
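The retention rule above (last 3, best validation loss, best task metric) can be expressed as a small selection function. This is a sketch over an in-memory history; the record fields and `checkpoints_to_keep` are hypothetical, and deleting the files themselves is left out.

```python
def checkpoints_to_keep(history: list[dict], keep_last: int = 3) -> set[int]:
    """history items: {"step": int, "val_loss": float, "task_metric": float}."""
    if not history:
        return set()
    by_step = sorted(history, key=lambda c: c["step"])
    keep = {c["step"] for c in by_step[-keep_last:]}                 # most recent checkpoints
    keep.add(min(by_step, key=lambda c: c["val_loss"])["step"])      # best validation loss
    keep.add(max(by_step, key=lambda c: c["task_metric"])["step"])   # best task metric
    return keep


history = [
    {"step": 500, "val_loss": 1.20, "task_metric": 0.61},
    {"step": 1000, "val_loss": 0.95, "task_metric": 0.70},
    {"step": 1500, "val_loss": 0.90, "task_metric": 0.74},
    {"step": 2000, "val_loss": 0.93, "task_metric": 0.78},
    {"step": 2500, "val_loss": 0.99, "task_metric": 0.77},
    {"step": 3000, "val_loss": 1.05, "task_metric": 0.75},
]
print(sorted(checkpoints_to_keep(history)))
```

In this toy history the best validation loss (step 1500) is older than the last three checkpoints, which is exactly the case the rule is meant to protect.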

∑ Chapter 04 — Key Takeaways

  • Full fine-tuning updates all parameters: use only when LoRA isn't enough and you have abundant data
  • Use cases: continued pre-training, language adaptation, LoRA ceiling reached, distillation
  • Compute: 7B needs 4× A100 40GB, 70B needs 16× A100 80GB; cloud cost $200–$20K
  • Learning rate: 1e-6 to 5e-5 (start 2e-5); 10–100× lower than pre-training
  • Catastrophic forgetting: low LR, 1–3 epochs, mix in general data, monitor general benchmarks
  • Save checkpoints every 500–1000 steps; keep best validation + best task metric + last 3
05
Chapter 05 · Alignment Training
SFT vs DPO vs RLHF: Choosing Your Training Objective

Not all fine-tuning is the same. The training objective determines what the model learns: imitate examples (SFT), prefer better responses (DPO), or optimize a reward signal (RLHF). Each has different data requirements, complexity, and outcomes. Most teams should start with SFT; add DPO when preference data is available; avoid RLHF unless you have specific alignment needs.

Training objective progression: complexity vs capability
Figure: SFT (Supervised Fine-Tuning), "Learn to produce this output"; simplest, most used, start here; complexity: low → DPO (Direct Preference Optimization), "Learn to prefer A over B"; moderate, good results, needs pairs; complexity: medium → RLHF (RL from Human Feedback), "Optimize a reward model"; complex, unstable, frontier labs only; complexity: high.

SFT is the simplest and most common fine-tuning objective. You show the model (input, target output) pairs, and it learns to maximize the probability of producing the target. This is what most people mean when they say "fine-tuning."

✅
When to Use SFT
  • You have examples of the exact output you want
  • Task has clear right/wrong answers
  • You're doing instruction tuning from scratch
  • You need format control (JSON, specific templates)
  • First fine-tuning attempt: start here
⚠️
SFT Limitations
  • Model learns to imitate, even bad patterns in data
  • Can't express "this is better than that" directly
  • Quality ceiling is your data quality
  • Quantity-sensitive: needs 1K–50K examples
  • Doesn't optimize for human preferences explicitly
The SFT Data Formula

Data format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Volume: 1K–5K high-quality examples for task-specific tuning; 10K–100K for general instruction tuning. Quality bar: every example should be one you'd be happy to ship in production. 1K excellent examples beat 10K mediocre ones.

DPO learns from preference pairs: given the same prompt, which response is better? This is more directly aligned with how humans judge quality: we often know "A is better than B" even when we can't write the perfect response ourselves.

Aspect | SFT | DPO
Data format | (prompt, response) | (prompt, chosen, rejected)
What it learns | Maximize probability of target | Increase prob(chosen) relative to prob(rejected)
Data collection | Write ideal responses | Generate two responses, pick better one
Typical volume | 1K–50K examples | 5K–50K preference pairs
Training stability | Very stable | Moderately stable (watch KL divergence)
Complexity | Simple | Requires reference model + DPO loss
📄
DPO data format
{
  "prompt": "Explain photosynthesis in one sentence.",
  "chosen": "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen.",
  "rejected": "It's when plants make food from the sun or something."
}
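To show what "reference model + DPO loss" means for one such preference pair, here is a sketch of the standard DPO objective computed from sequence log-probabilities. The four inputs are assumed to be summed token log-probs of the chosen and rejected responses under the policy and the frozen reference model; `dpo_loss` and the value beta = 0.1 are illustrative assumptions, not settings taken from this guide.

```python
import math


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_lp - ref_rejected_lp  # log pi(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: loss equals log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
# Policy has shifted probability toward the chosen response: loss drops
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

The loss only depends on how far the policy has moved relative to the reference, which is why the reference model must stay frozen during training.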
When to Add DPO After SFT

The typical pipeline: SFT first → then DPO. SFT teaches the model what kind of responses to produce; DPO refines which responses are better. Add DPO when: (1) you have clear preference signals (human feedback, A/B test results), (2) SFT quality plateaus but you can still rank responses, (3) you need to reduce specific failure modes (safety, tone, verbosity) that are easier to express as "not this" than "do this."

RLHF is the full alignment approach used by OpenAI, Anthropic, and others to train frontier models. It involves training a separate reward model on human preferences, then using RL (PPO) to optimize the LLM against that reward signal.

RLHF Pipeline

Step 1: Collect human preference data (A vs B rankings)

Step 2: Train a reward model to predict human preferences

Step 3: Use PPO to optimize LLM to maximize reward model score

Step 4: Add KL penalty to prevent reward hacking

Result: Model optimizes for human preferences end-to-end

Why Most Teams Shouldn't Use RLHF

❌ Requires training a separate reward model (expensive)

❌ RL training (PPO) is unstable and hard to tune

❌ Reward hacking is a real failure mode

❌ Requires 100K+ human preference annotations

βœ… DPO achieves 90% of the benefit at 10% of the complexity

DPO Is Usually Enough

DPO was designed as a simpler alternative to RLHF that doesn't require a separate reward model or RL training. In practice, DPO matches RLHF quality for most alignment tasks at a fraction of the complexity. Unless you're a frontier lab with dedicated alignment researchers, use DPO instead of RLHF. The infrastructure and expertise required for stable RLHF training is not worth it for most production use cases.

Your Situation | Recommended Objective | Rationale
First fine-tuning attempt | SFT | Simplest, fastest iteration, establishes baseline
You have ideal target outputs | SFT | Directly teach the model what to produce
Quality plateau + preference data available | SFT → DPO | SFT establishes capability; DPO refines quality
Reducing specific failure modes | DPO (after SFT) | Easier to express "not this" than "do this"
Safety / alignment at frontier scale | RLHF | Only if you have dedicated alignment team and resources
General instruction following | SFT (then optional DPO) | Proven pipeline from Alpaca → Vicuna → etc.
1️⃣ Base Model: Llama, Mistral, etc.
2️⃣ SFT: 10K–50K examples
3️⃣ Evaluate: task metrics + general
4️⃣ DPO (optional): 5K–20K pref pairs
5️⃣ Final Eval: ship if passing
The Practical Truth

90% of production fine-tuning is SFT-only. It's simple, it works, and it's what you should start with. DPO adds 5–15% quality improvement when you have good preference data and SFT has plateaued. RLHF is for frontier labs. Don't over-engineer your training objective: get SFT right first, add DPO if needed, and skip RLHF unless you have a very specific reason and resources.

∑ Chapter 05 — Key Takeaways

  • Three training objectives: SFT (learn from examples), DPO (learn from preferences), RLHF (optimize a reward model)
  • Start with SFT: simplest, fastest, works for 90% of use cases
  • SFT data: (prompt, response) pairs; DPO data: (prompt, chosen, rejected) triplets
  • Add DPO after SFT when: quality plateaus, you have preference data, reducing specific failures
  • DPO ≈ RLHF quality at 10% complexity: use DPO instead unless you're a frontier lab
  • Modern pipeline: Base → SFT → Eval → (optional) DPO → Final Eval → Ship
06
Chapter 06 · Quality Assurance
Evaluation: Measuring What Your Fine-Tuned Model Actually Does

Fine-tuning without evaluation is guessing. You must measure both your task performance AND general capability retention. A model that aces your custom task but forgets how to reason is not a success. Build an eval suite before training, run it continuously, and never ship a model that hasn't passed your quality gates.

Evaluation for fine-tuning is different from evaluating base models. You need to measure both task-specific improvement and general capability preservation. The hierarchy prioritizes what matters most.

Evaluation priorities: from specific to general
Figure (highest priority first; always check all four): ① Task-Specific Metrics (accuracy, F1, BLEU on YOUR task) → ② Format Compliance (JSON validity, schema adherence) → ③ Safety & Regression (refusal rate, harmful output check) → ④ General Capability Retention (MMLU subset, HumanEval if relevant, general conversation quality).
Eval Level | What It Measures | When to Run | Failure Outcome
① Task-specific | Does the model do YOUR job better? | Every checkpoint, every experiment | Model isn't useful; retrain or adjust data
② Format compliance | Does output match required structure? | Every checkpoint | Downstream parsing fails; adjust training data
③ Safety/regression | Did we break safety or introduce new failures? | Before shipping, major changes | Ship blocker: model produces harmful content
④ General capability | Did we lose general intelligence? | Before shipping, weekly during iteration | Catastrophic forgetting; adjust LR, add general data

A golden test set is a curated collection of examples with known-correct answers that you use to evaluate every model iteration. It's your source of truth and should be treated as sacred: never train on it, never modify it casually.

✅
Golden Set Best Practices
  • Size: 200–500 examples minimum; 1000+ for high-stakes
  • Diversity: Cover all task subtypes, edge cases, difficulty levels
  • Expert-verified: Every answer reviewed by domain expert
  • Version-controlled: Git track with change history
  • Never contaminated: Must not appear in training data
❌
Golden Set Anti-Patterns
  • Too small: <100 examples; results have high variance
  • Homogeneous: All easy examples; misses edge cases
  • Stale: Not updated as task evolves
  • Leaked: Examples also in training data; inflated scores
  • Ambiguous: Multiple correct answers without accounting for them
The 10% Rule for Golden Sets

Allocate at least 10% of your data curation effort to building and maintaining your golden test set. This is non-negotiable. A team that spends all effort on training data but has a weak eval set will ship bad models and not know it. The golden set is how you know if your fine-tuning is working.

Task Type | Primary Metric | Secondary Metrics | Implementation
Classification | Accuracy, Macro-F1 | Per-class precision/recall, confusion matrix | Exact match against labels
Extraction (NER, slots) | Entity-level F1 | Partial match rate, span accuracy | Compare extracted entities to gold
Generation (summaries, content) | ROUGE-L, BERTScore | Human preference, factual accuracy | Automated + sampling for human review
Structured output (JSON) | Schema validity + field accuracy | Parse success rate, field-level F1 | JSON parse test + field extraction check
Code generation | Pass@1, Pass@5 | Syntax validity, test case pass rate | Execute against test cases
QA / reasoning | Exact match, LLM-as-judge | Chain-of-thought quality | String match + GPT-4 evaluation
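As one worked example of the table, the structured-output row (parse test plus field check) might look like the sketch below. It treats "schema validity" as "all required keys present", which is a simplifying assumption; a real schema check would also validate types. `score_json_outputs` and the sample records are hypothetical.

```python
import json


def score_json_outputs(outputs: list[str], gold: list[dict], required: list[str]) -> dict:
    """Parse rate, schema validity (required keys present), and field accuracy."""
    parsed_ok = schema_ok = fields_right = 0
    for raw, expected in zip(outputs, gold):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(obj, dict):
            continue  # valid JSON but not an object: counted as a parse failure here
        parsed_ok += 1
        if all(key in obj for key in required):
            schema_ok += 1
        fields_right += sum(obj.get(key) == expected.get(key) for key in required)
    n = len(gold)
    return {
        "parse_rate": parsed_ok / n,
        "schema_valid_rate": schema_ok / n,
        "field_accuracy": fields_right / (n * len(required)),
    }


outputs = ['{"name": "Ada", "age": 36}', '{"name": "Bob"}', "not json"]
gold = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 41}, {"name": "Cy", "age": 29}]
scores = score_json_outputs(outputs, gold, required=["name", "age"])
print(scores)
```

Reporting the three numbers separately shows whether failures come from malformed output, missing fields, or wrong values, which point to different fixes in the training data.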
Automated Metrics Have Limits

BLEU, ROUGE, and even BERTScore correlate poorly with human preferences for open-ended generation. They're useful for directional signals but not final quality judgment. For generation tasks, always supplement automated metrics with LLM-as-judge evaluation and periodic human spot-checks (review 20–50 random examples each iteration).

For open-ended tasks where exact matching fails, use a stronger model (GPT-4o, Claude Sonnet) to judge the quality of your fine-tuned model's outputs. This correlates better with human preferences than traditional metrics.

🔧
LLM-as-Judge prompt template (pairwise comparison)
System: You are an expert evaluator comparing two AI assistant responses.
Compare Response A and Response B for the given task. Consider:
1. Correctness: Is the information accurate?
2. Completeness: Does it fully address the query?
3. Clarity: Is the response well-structured and clear?
4. Relevance: Does it stay on topic?
Respond with ONLY one of: "A" (A is better), "B" (B is better), or "TIE" (roughly equal).

User:
Task: {task_description}
Input: {user_input}
Response A: {response_a}
Response B: {response_b}
Which response is better?
LLM-as-Judge Benefits

Scalable: Evaluate 1000s of examples at $0.01–$0.05 each

Consistent: No annotator fatigue or mood variation

Nuanced: Can evaluate subtle quality differences

Fast: Results in minutes, not days

LLM-as-Judge Limitations

Position bias: May prefer first or second response

Verbosity bias: May prefer longer responses

Self-preference: GPT-4 may prefer GPT-4-style outputs

Mitigation: Randomize order, calibrate with human baseline

LLM-as-Judge Calibration

Before trusting LLM-as-judge, validate against 50–100 human-labeled examples. Compute agreement rate (should be >80% for binary better/worse). If agreement is low, refine your evaluation prompt or criteria. Also test for position bias by running each comparison twice with swapped order; the disagreement rate should be <10%.
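Both calibration checks reduce to simple counting once the verdicts are collected. The sketch below assumes verdicts use the "A" / "B" / "TIE" labels from the prompt template and that `swapped` holds the judge's verdicts after Response A and Response B were exchanged; the function names and sample labels are hypothetical.

```python
def human_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge picked the same label as the human."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def position_disagreement(original: list[str], swapped: list[str]) -> float:
    """Fraction of pairs whose verdict changes when A and B are exchanged."""
    flip = {"A": "B", "B": "A", "TIE": "TIE"}
    # A consistent judge that said "A" originally should say "B" after the swap
    inconsistent = sum(o != flip[s] for o, s in zip(original, swapped))
    return inconsistent / len(original)


judge = ["A", "B", "A", "TIE", "B"]
judge_swapped = ["B", "A", "A", "TIE", "A"]
human = ["A", "B", "B", "TIE", "B"]
print(human_agreement(judge, human))
print(position_disagreement(judge, judge_swapped))
```

In this toy run the third pair gets "A" in both orders, so the judge is following position rather than content on that example.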

Benchmark contamination occurs when your test data appears in training data. The model memorizes answers rather than learning to reason. This is the #1 cause of inflated evaluation scores that don't reflect production performance.

Contamination Type | How It Happens | Detection Method | Prevention
Direct leakage | Test set examples in training set | Hash matching, n-gram overlap | Strict train/test split management
Paraphrase leakage | Same question, different wording in train | Embedding similarity search | Semantic dedup across splits
Public benchmark contamination | Test set is public (MMLU, HumanEval) and in web scrapes | Hard to detect | Use held-out custom evals
Synthetic data feedback | Generated training data includes benchmark patterns | Manual audit of synthetic data | Exclude benchmark topics from generation
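The direct-leakage row (hash matching and n-gram overlap) can be sketched with the standard library alone. The normalization rules and the 8-token window below are assumptions chosen for illustration; paraphrase leakage needs embedding similarity and is not covered here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", "", text.lower())).strip()


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train: list[str], test: list[str], n: int = 8) -> list[int]:
    """Indices of test examples that exactly match or share an n-gram with training data."""
    train_hashes = {hashlib.sha256(normalize(t).encode()).hexdigest() for t in train}
    train_ngrams = set().union(*(ngrams(t, n) for t in train)) if train else set()
    flagged = []
    for i, example in enumerate(test):
        exact = hashlib.sha256(normalize(example).encode()).hexdigest() in train_hashes
        if exact or ngrams(example, n) & train_ngrams:
            flagged.append(i)
    return flagged


train = [
    "What is the capital of France? Paris is the capital of France.",
    "Explain how photosynthesis works in green plants during the day.",
]
test = [
    "what is the capital of france?  Paris is the capital of France!",
    "Describe the water cycle.",
    "Please explain how photosynthesis works in green plants during the day in detail.",
]
print(find_contaminated(train, test))
```

Flagged examples should be removed from whichever split is easier to regenerate, which is usually the training set, since the golden set is expert-verified.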
Public Benchmarks Are Likely Contaminated

MMLU, HumanEval, GSM8K, and other popular benchmarks exist in web scrapes that went into pre-training data. A fine-tuned model that scores well on these may be memorizing, not reasoning. For production decisions, always maintain a private golden test set that has never been published. Public benchmarks are useful for comparing to literature but not for shipping decisions.

Fine-tuning can break capabilities the base model had. Regression testing detects this by comparing your fine-tuned model against the base model on a held-out general capability set.

📊
General Capability Tests
  • MMLU subset: 200–500 questions across domains
  • Instruction following: Can it still follow basic prompts?
  • Conversation quality: Multi-turn coherence
  • Reasoning: Simple chain-of-thought problems
⚠️
Regression Thresholds
  • <2% drop: Acceptable (normal fine-tuning cost)
  • 2–5% drop: Warning (may need to adjust)
  • >5% drop: Problem (likely catastrophic forgetting)
  • Always compare to base model as reference
🛡️
Safety Regression Tests
  • Refusal rate: Should be similar to base model
  • Jailbreak resistance: Common attack prompts
  • Harmful content: Violence, bias, PII leakage
  • Run dedicated red-team eval before shipping
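The regression thresholds above can be turned into an automated gate. The sketch assumes "drop" means the relative decrease versus the base model's score; the guide does not say whether relative or absolute points are intended, so treat that choice, and the `regression_status` helper, as assumptions.

```python
def regression_status(base_score: float, tuned_score: float) -> str:
    """Classify the relative drop (in %) of the fine-tuned model versus the base model."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    drop_pct = (base_score - tuned_score) / base_score * 100.0
    if drop_pct < 2.0:
        return "acceptable"   # includes improvements (negative drop)
    if drop_pct <= 5.0:
        return "warning"
    return "problem"


for tuned in (0.645, 0.630, 0.600):
    print(tuned, regression_status(0.650, tuned))
```

Running the same gate on every general-capability benchmark, and failing the release if any one reports "problem", keeps a single strong metric from hiding a regression elsewhere.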
1️⃣ Checkpoint Saved: every 500–1000 steps
2️⃣ Quick Eval: task metric on 200 samples
3️⃣ Log to Dashboard: W&B, MLflow
4️⃣ Full Eval (best ckpt): golden set + regression
5️⃣ Ship Decision: passes all gates?
🔧
Eval automation script structure
def run_eval_pipeline(checkpoint_path: str, best_score: float,
                      is_final_checkpoint: bool = False) -> EvalResults:
    # Structure only: load_model, evaluate_*, log_metrics, get_step, all_gates_pass,
    # EvalResults, quick_eval_set, and base_model are assumed to be defined elsewhere.

    # 1. Load model
    model = load_model(checkpoint_path)

    # 2. Quick task eval (always run)
    task_score = evaluate_on_task(model, quick_eval_set)

    # 3. Log to dashboard
    log_metrics({"task_score": task_score, "step": get_step(checkpoint_path)})

    # 4. Full eval (only on best checkpoints or before ship)
    if task_score > best_score or is_final_checkpoint:
        golden_score = evaluate_on_golden_set(model)
        regression_score = evaluate_regression(model, base_model)
        safety_score = evaluate_safety(model)
        return EvalResults(
            task=task_score,
            golden=golden_score,
            regression=regression_score,
            safety=safety_score,
            ship_ready=all_gates_pass(golden_score, regression_score, safety_score),
        )

    # Checkpoints that only got the quick eval are never ship candidates
    return EvalResults(task=task_score, golden=None, regression=None,
                       safety=None, ship_ready=False)

∑ Chapter 06 — Key Takeaways

  • Evaluation hierarchy: task-specific → format compliance → safety/regression → general capability
  • Build a golden test set of 200–500+ expert-verified examples before training; never train on it
  • For open-ended tasks, use LLM-as-judge (GPT-4); calibrate with 50–100 human labels first
  • Benchmark contamination causes fake good scores; rely on private eval sets for shipping decisions
  • Regression testing: <2% drop acceptable, 2–5% warning, >5% catastrophic forgetting
  • Automate eval pipeline: quick eval every checkpoint, full eval on best checkpoints + before ship
07
Chapter 07 · Task Generalization
Instruction Fine-Tuning: Teaching a Model to Follow Instructions

Instruction tuning transforms a raw language model into an assistant. It's what makes the difference between a model that continues text and one that follows instructions. The key insight: task diversity matters more than task volume. A model trained on 1,000 diverse instructions often outperforms one trained on 100,000 similar instructions.

Base language models predict the next token. They're excellent at completing text but terrible at following instructions. Instruction tuning teaches the model to interpret an instruction and produce the requested output rather than just continuing the text pattern.

Base Model (Pre-Instruction Tuning)

Input: "Translate to French: Hello, how are you?"

Output: "Translate to Spanish: Hola, cómo estás?"

→ Continues the pattern, doesn't follow the instruction

Instruction-Tuned Model

Input: "Translate to French: Hello, how are you?"

Output: "Bonjour, comment allez-vous?"

→ Understands and executes the instruction

The Instruction Tuning Revolution

Before instruction tuning (pre-InstructGPT era), LLMs required careful prompt engineering to produce useful outputs. Instruction tuning (SFT on instruction-response pairs) made models that try to help by default. This is the foundation of ChatGPT, Claude, and every modern assistant. If you're fine-tuning a base model, instruction tuning is almost always the first step.

The LIMA paper showed that a 65B model fine-tuned on just 1,000 carefully curated examples can match models trained on 50K+ examples. The secret: diversity and quality over volume. Each example should teach something different.

✅
High-Diversity Dataset
  • Multiple task types (QA, summarization, code, creative)
  • Varied instruction styles (direct, conversational, formal)
  • Different response lengths (one-liner to multi-paragraph)
  • Diverse domains (science, arts, business, technical)
  • Edge cases and difficult examples
❌
Low-Diversity Dataset
  • Same task repeated with variations
  • All examples same format/length
  • Single domain focus
  • Template-generated similar examples
  • Missing difficulty spectrum
📊
Diversity Checklist
  • ☐ At least 10 distinct task categories
  • ☐ Both short and long responses represented
  • ☐ Single-turn and multi-turn conversations
  • ☐ Factual and creative tasks
  • ☐ Easy, medium, and hard difficulty
The 1K vs 50K Trade-off

1,000 high-quality, diverse examples often beat 50,000 similar examples. Why? Large models already have the capability: instruction tuning is about eliciting existing capability, not teaching new knowledge. Diversity shows the model the range of behaviors expected; quality shows it the standard to meet. Volume alone just overfits to a narrow distribution.

Modern models use specific chat templates with special tokens to mark turns and roles. Using the wrong template causes the model to ignore your instructions or produce garbled output. Always use the exact template the base model was trained with.

Model Family: Llama 3 / 3.1
Template Style: Special tokens: <|start_header_id|> etc.
Example: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>

Model Family: Mistral / Mixtral
Template Style: [INST] and [/INST] tokens
Example: [INST] Hello, how are you? [/INST] I'm doing well!

Model Family: ChatML (OpenAI style)
Template Style: <|im_start|> and <|im_end|>
Example: <|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n

Model Family: Vicuna / Alpaca
Template Style: Plain text markers
Example: USER: Hello\nASSISTANT:
🔧
Applying chat template (Hugging Face)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain photosynthesis."},
    {"role": "assistant", "content": "Photosynthesis is..."},
]

# Apply model's chat template automatically
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Template Mismatch = Silent Failure

Using the wrong chat template is a common cause of "my fine-tuned model is worse than base." The model has been trained to recognize specific token patterns; if your data uses different patterns, the model treats it as noise. Always verify your training data uses exactly the same template as the tokenizer's apply_chat_template() method.
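To make concrete what a chat template does, the sketch below hand-renders messages in the ChatML style from the table above. It is for illustration and for spot-checking stored training strings only; `to_chatml` is a hypothetical helper, and real training code should rely on the tokenizer's own template.

```python
def to_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Render messages in ChatML style, one <|im_start|>...<|im_end|> block per turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should respond next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


rendered = to_chatml([{"role": "user", "content": "Hello"}], add_generation_prompt=True)
print(repr(rendered))
```

Comparing a few rendered strings like this against what is actually stored in the training set is a cheap way to catch a template mismatch before spending GPU time.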

Dataset | Size | Quality | Best For | Notes
LIMA | 1,000 | Excellent (curated) | Proving quality > quantity | Research dataset; not for commercial use
OpenAssistant | 160K | Variable (crowdsourced) | General assistant, multi-turn | Apache 2.0 license; filter for quality
Alpaca (Stanford) | 52K | Moderate (synthetic) | Quick bootstrapping | GPT-3.5 generated; format reference only
Dolly (Databricks) | 15K | Good (employee-written) | Commercial use base | CC-BY-SA license; human-written
UltraChat | 1.5M | Variable (synthetic) | Large-scale instruction tuning | Filter heavily; mix with human data
ShareGPT (user convos) | Variable | Variable | Real user distribution | Legal gray area; quality varies wildly
The Data Assembly Strategy

Don't use any single dataset as-is. Combine: (1) High-quality curated examples (200–500, manual or filtered) + (2) Task-diverse public dataset (5K–20K, filtered) + (3) Your domain-specific examples (as many as you have). Filter for quality, deduplicate, and shuffle. The blend matters more than any single source.

Single-turn (one instruction, one response) training produces models that answer questions but struggle with context over multiple turns. Include multi-turn conversations in your training data to teach the model to maintain coherence.

💬
Multi-Turn Training Example
{
  "messages": [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "How do I install it?"},
    {"role": "assistant", "content": "To install Python..."},
    {"role": "user", "content": "What about on Mac?"},
    {"role": "assistant", "content": "On macOS, you can..."}
  ]
}
📊
Multi-Turn Mix Guidelines
  • 50–70%: Single-turn (clear instruction → response)
  • 20–30%: 2–3 turn conversations
  • 10–20%: 4+ turn conversations
  • Include clarification, follow-up, topic shift patterns
  • Vary conversation styles (casual, professional, technical)

System prompts set the context for a conversation. If you want your model to respect system prompts in production, you must include them in training. But don't over-rely on system prompts; behavior should be robust without them too.

Include System Prompts When

Production use: You'll use system prompts to control behavior

Persona switching: Model needs to adopt different roles

Constraint instruction: Format/safety rules in system prompt

Mix: ~50% with system prompt, ~50% without

System Prompt Pitfalls

Over-reliance: Model only works with specific system prompt

Inconsistency: Different system prompts in train vs production

Brittleness: Small changes in system prompt break behavior

Solution: Test behavior both with and without system prompts

∑ Chapter 07 — Key Takeaways

  • Instruction tuning transforms base models into assistants: it elicits existing capability
  • LIMA insight: 1,000 diverse, high-quality examples can match 50K similar examples
  • Diversity checklist: multiple task types, varied lengths, different domains, difficulty spectrum
  • Use the exact chat template your base model expects: template mismatch = silent failure
  • Data mix: curated quality (200–500) + filtered public (5K–20K) + domain-specific
  • Include multi-turn conversations (30–50%) and system prompts (~50%) in training data
08
Chapter 08 · Specialization
Domain Adaptation: Medical, Legal, Code, and Finance

Domain adaptation specializes a general model for specific fields. The approach differs by domain: some need continued pre-training on raw text; others just need task-specific SFT. The key is understanding what your domain requires and avoiding the trap of "more training = better." Domain expertise must be balanced against general capability.

Domain adaptation typically happens in two stages, though not all domains need both. The decision depends on how specialized the language and concepts are.

Two-stage domain adaptation pipeline
Figure: Base Model (Llama, Mistral) → Stage 1: Continued Pre-Training (optional, only if domain language is specialized): raw domain text (1M–1B tokens); learns terminology, patterns, style → Stage 2: Task SFT: instruction-response pairs (1K–50K); learns how to use domain knowledge → Domain Model.
Stage | What It Does | Data Required | When Needed
Continued Pre-Training | Teaches domain vocabulary, patterns, writing style | 1M–1B tokens of raw domain text | Highly specialized domains (medical, legal, scientific)
Domain SFT | Teaches how to apply knowledge to specific tasks | 1K–50K instruction-response pairs | Almost always; this is where task capability comes from
When to Skip Continued Pre-Training

Modern base models already have substantial domain knowledge from their training corpus. Try domain SFT first: it's faster and cheaper. Add continued pre-training only if: (1) the model misuses domain terminology, (2) domain text is highly specialized and underrepresented in the base model, (3) you have access to 100M+ tokens of domain text. For most use cases, SFT alone is sufficient.

🏥
Medical / Clinical

Challenges: Specialized terminology (ICD codes, drug names), high-stakes accuracy, regulatory requirements (HIPAA), need for citation.

  • Pre-training: Often needed (PubMed, clinical notes)
  • SFT: Q&A with citations, differential diagnosis, report generation
  • Key risk: Hallucinated medical advice is dangerous
  • Eval: MedQA, expert review, citation verification
⚖️
Legal

Challenges: Precise language matters, jurisdiction-specific, precedent citation, long documents.

  • Pre-training: Often needed (case law, statutes, contracts)
  • SFT: Contract review, clause extraction, legal Q&A
  • Key risk: Unhelpful if answer is "consult a lawyer"
  • Eval: Clause identification accuracy, citation correctness
💻
Code / Software

Challenges: Syntax correctness is binary, multiple languages, execution context matters.

  • Pre-training: Sometimes (proprietary codebases)
  • SFT: Code completion, debugging, code review, translation
  • Key risk: Syntactically correct but semantically wrong
  • Eval: Pass@k on test suites, human review
💰
Finance

Challenges: Numerical precision, temporal awareness (data freshness), regulatory compliance.

  • Pre-training: Rarely needed (finance language is less specialized)
  • SFT: Report analysis, sentiment analysis, numerical reasoning
  • Key risk: Hallucinated numbers, outdated information
  • Eval: Numerical accuracy, fact verification against sources

Continued pre-training (CPT) extends the base model's pre-training on domain-specific text. It's computationally expensive but can significantly improve domain understanding for highly specialized fields.

Parameter | Typical Value | Notes
Data volume | 100M–10B tokens | More is better, but quality matters
Learning rate | 1e-5 to 5e-5 | Lower than initial pre-training
Training objective | Causal LM (next token prediction) | Same as base pre-training
Mix with general data | 10–30% general data | Prevents catastrophic forgetting
Compute cost | $1K–$100K+ (depending on volume) | Full fine-tuning required; no LoRA
🔧
Domain CPT data format
# Simple completion format: just raw text
{
  "text": "CLINICAL NOTE\nPatient presents with acute onset chest pain, radiating to left arm. ECG shows ST elevation in leads V1-V4, consistent with anterior STEMI. Troponin I elevated at 2.5 ng/mL. Cardiology consulted for emergent PCI..."
}
# Training processes text as next-token prediction
# Pack multiple documents into context window for efficiency
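The packing step mentioned in the comments above can be sketched as follows. This is a minimal illustration, assuming the documents are already tokenized into lists of integer IDs; `pack_documents`, `block_size`, and `eos_id` are hypothetical names and not part of any particular training library.

```python
from typing import Iterable, Iterator


def pack_documents(
    docs: Iterable[list[int]], block_size: int, eos_id: int
) -> Iterator[list[int]]:
    """Concatenate tokenized documents (separated by EOS) into fixed-size blocks."""
    buffer: list[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # The trailing partial block is dropped here; padding it is the other common choice.


# Toy token IDs; 0 stands in for the EOS token.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_documents(docs, block_size=4, eos_id=0))
print(blocks)
```

Packing avoids wasting context-window positions on padding when documents are much shorter than the window, at the cost of blocks that can span document boundaries.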
CPT Forgetting Risk Is High

Continued pre-training on only domain text causes severe catastrophic forgetting. The model becomes excellent at domain language but loses general instruction-following ability. Always mix 10–30% general data (C4, RedPajama, or high-quality web text) with your domain corpus. After CPT, always run a full regression test on general capabilities before proceeding to SFT.
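One way to build such a mixture is sketched below. It is a simplification, assuming the mix is measured in documents rather than tokens; `mix_corpora` and `general_fraction` are hypothetical names chosen for this example.

```python
import random


def mix_corpora(domain_docs, general_docs, general_fraction=0.2, seed=0):
    """Return a shuffled corpus in which about `general_fraction` of documents are general."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general.
    n_general = round(len(domain_docs) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_docs))
    mixed = list(domain_docs) + rng.sample(list(general_docs), n_general)
    rng.shuffle(mixed)
    return mixed


domain = [f"domain-{i}" for i in range(80)]
general = [f"general-{i}" for i in range(500)]
mixed = mix_corpora(domain, general, general_fraction=0.2)
general_share = sum(doc.startswith("general") for doc in mixed) / len(mixed)
print(len(mixed), general_share)
```

In a real pipeline the ratio is usually tracked in tokens, since document lengths can differ widely between the domain and general corpora.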

Domain SFT teaches the model to apply domain knowledge to specific tasks. This is where most value is created and where most teams should focus. SFT data should be task-specific rather than general domain text.

📝
Medical SFT Example
{
  "messages": [
    {
      "role": "user",
      "content": "Patient: 55M, chest pain, ST elevation. What is the likely diagnosis?"
    },
    {
      "role": "assistant",
      "content": "Based on the presentation of chest pain with ST elevation in a 55-year-old male, the likely diagnosis is **acute myocardial infarction (STEMI)**..."
    }
  ]
}
⚖️
Legal SFT Example
{
  "messages": [
    {
      "role": "user",
      "content": "Identify the indemnification clause in this contract: [contract text]"
    },
    {
      "role": "assistant",
      "content": "The indemnification clause is in Section 8.2: 'Party A shall indemnify...' This clause is **mutual**, covering both parties..."
    }
  ]
}
💻
Code SFT Example
{
  "messages": [
    {
      "role": "user",
      "content": "Fix the bug in this function:\n```python\ndef calc(x):\n    return x / 0\n```"
    },
    {
      "role": "assistant",
      "content": "The bug is division by zero. Here's the fix:\n```python\ndef calc(x, divisor=1):\n    if divisor == 0:\n        raise ValueError()\n    return x / divisor\n```"
    }
  ]
}
Domain SFT Best Practices

Quality over quantity: 500–5,000 expert-curated examples beat 50K noisy ones. Task diversity: Cover all task types you'll use in production. Realistic inputs: Use real examples from your domain, not synthetic simplifications. Include edge cases: Hard examples, errors, ambiguous cases. Expert review: Every response should be verified by a domain expert.
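Before expert review, a cheap structural check can catch malformed records in the chat format shown in the examples above. The sketch below is illustrative only; `validate_example` and the accepted role set are assumptions, not a standard API.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> list[str]:
    """Return a list of structural problems in one chat-format SFT example."""
    problems: list[str] = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        role = msg.get("role") if isinstance(msg, dict) else None
        content = msg.get("content") if isinstance(msg, dict) else None
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: missing or empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems


good = {
    "messages": [
        {"role": "user", "content": "Identify the indemnification clause."},
        {"role": "assistant", "content": "The indemnification clause is in Section 8.2."},
    ]
}
bad = {"messages": [{"role": "user", "content": "   "}]}
print(validate_example(good))
print(validate_example(bad))
```

A check like this only verifies structure; whether the response is correct still needs the expert review described above.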

Domain adaptation is a trade-off: the more you specialize, the more you risk losing general capabilities. Managing this trade-off requires deliberate strategies.

Strategy | How It Works | When to Use
Data mixing | Include 10–30% general instruction data in domain SFT | Always: minimal cost, significant benefit
Low learning rate | Use 1e-5 to 5e-5 instead of standard SFT rates | Always for domain adaptation
Fewer epochs | Train for 1–2 epochs instead of 3+ | Standard practice
LoRA instead of full FT | Freezes base model; learns adapter weights only | Default choice: preserves base capability
Elastic Weight Consolidation (EWC) | Penalizes changes to important weights | Experimental; not widely used
Separate adapter per domain | Train different LoRA adapters for different domains | Multi-domain serving scenario
✅
Recommended Forgetting Prevention
  • Use LoRA: Preserves base model weights entirely
  • Mix data: 70–80% domain + 20–30% general instruction
  • Low LR: 1e-5 to 2e-5 for domain SFT
  • Few epochs: 1–2 epochs, early stopping on validation
  • Monitor regression: Check MMLU/general benchmarks
⚠️
Warning Signs of Forgetting
  • General conversation quality drops noticeably
  • Model struggles with simple reasoning tasks
  • Multi-turn coherence degrades
  • Instruction following becomes brittle
  • MMLU score drops >5% from base model
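The benchmark check in the lists above can be automated with a small comparison. This sketch assumes the ">5%" threshold means a relative drop against the base model's score; the benchmark names and scores are made-up placeholders, and `regression_report` is a hypothetical helper.

```python
def regression_report(base_scores, tuned_scores, max_relative_drop=0.05):
    """Flag benchmarks where the tuned model dropped by more than the allowed fraction."""
    report = {}
    for name, base in base_scores.items():
        tuned = tuned_scores[name]
        drop = (base - tuned) / base if base > 0 else 0.0
        report[name] = {"relative_drop": drop, "flagged": drop > max_relative_drop}
    return report


# Placeholder scores for illustration only.
base_scores = {"mmlu": 0.68, "instruction_following": 0.50}
tuned_scores = {"mmlu": 0.63, "instruction_following": 0.49}
report = regression_report(base_scores, tuned_scores)
for name, result in report.items():
    print(name, f"{result['relative_drop']:.1%}", "FLAGGED" if result["flagged"] else "ok")
```

If the threshold is meant as absolute percentage points instead, replace the division by `base` with a plain difference.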
Domain | Standard Benchmarks | Custom Eval Needed
Medical | MedQA, PubMedQA, MedMCQA | Expert review of clinical recommendations; citation verification
Legal | LegalBench, CaseHOLD | Contract analysis accuracy; jurisdiction-specific tests
Code | HumanEval, MBPP, SWE-Bench | Proprietary codebase tests; style compliance
Finance | FinBench, TAT-QA | Numerical accuracy checks; regulatory compliance
Domain Eval Best Practice

Standard benchmarks give you a baseline, but custom evaluation on your actual production tasks is essential. Create a golden test set of 200–500 examples representing real use cases in your domain. Include expert review for high-stakes domains (medical, legal, finance). Compare against both the base model and a human-expert or GPT-4 baseline to understand where your fine-tuned model wins and loses.
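A golden-set comparison can start as simply as the sketch below. It assumes exact-match scoring, which is crude and only suits tasks with short, unambiguous answers; the toy examples and the two stub "models" are placeholders for real inference calls.

```python
def accuracy(examples, predict):
    """Fraction of golden examples whose prediction exactly matches the expected answer."""
    hits = sum(
        predict(ex["input"]).strip().lower() == ex["expected"].strip().lower()
        for ex in examples
    )
    return hits / len(examples)


# Toy golden set; a real one would hold 200-500 production-like examples.
golden_set = [
    {"input": "q1", "expected": "a1"},
    {"input": "q2", "expected": "a2"},
    {"input": "q3", "expected": "a3"},
    {"input": "q4", "expected": "a4"},
]


def base_model(prompt: str) -> str:
    # Stub standing in for the base model.
    return "unknown"


def tuned_model(prompt: str) -> str:
    # Stub standing in for the fine-tuned model; it gets one answer wrong.
    answers = {"q1": "a1", "q2": "a2", "q3": "A3", "q4": "wrong"}
    return answers[prompt]


base_acc = accuracy(golden_set, base_model)
tuned_acc = accuracy(golden_set, tuned_model)
print(f"base={base_acc:.2f} tuned={tuned_acc:.2f}")
```

For free-form outputs, the exact-match comparison would be replaced by expert review or another scoring method, as described above.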

∑ Chapter 08 — Key Takeaways

  • Two-stage approach: Continued Pre-Training (domain language) → Domain SFT (task capability)
  • Try SFT first: continued pre-training is expensive and often not needed
  • Domain differences: Medical/Legal often need CPT; Code/Finance usually just SFT
  • Continued pre-training requires 100M–10B tokens + 10–30% general data mixing
  • Prevent catastrophic forgetting: LoRA, low LR (1e-5), 1–2 epochs, data mixing, regression tests
  • Domain eval: standard benchmarks + custom production task golden set + expert review
09
Chapter 09 · Deployment
Serving Fine-Tuned Models: Merge, Quantize, Deploy

A fine-tuned model in a notebook is worthless. The real work begins when you deploy it for production inference. This chapter covers the full path: merging adapters into base weights, quantizing for efficiency, and deploying with vLLM, Ollama, or cloud endpoints. The goal: maximize throughput, minimize latency, and keep costs under control.

Before diving into technical details, decide where your fine-tuned model will run. The choice depends on scale, latency requirements, cost constraints, and data privacy needs.

Deployment Option | Best For | Latency | Cost at Scale | Setup Complexity
Cloud API (OpenAI, Anthropic) | When provider offers fine-tuning (GPT-4o, Claude) | ~200–500ms | High ($15–60/1M tokens) | Simple
Self-hosted vLLM | High throughput, production scale, custom models | 50–200ms | Medium (GPU cost) | Moderate
Ollama (local) | Development, privacy, edge deployment | 100–500ms | Low (own hardware) | Simple
Serverless (Modal, Replicate) | Variable load, pay-per-use | Cold start: 5–30s | Low for variable load | Simple
Dedicated cloud (SageMaker, Vertex) | Enterprise, compliance, managed infrastructure | 100–300ms | High (always-on) | Complex
The Deployment Decision Tree

  • Privacy/compliance required? → Self-host or on-prem.
  • Variable/bursty traffic? → Serverless.
  • High constant throughput? → Dedicated vLLM cluster.
  • Just need it to work? → Cloud API with fine-tuning (if available).
  • Development/testing? → Ollama locally.
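The decision tree above can be written down as a function that asks the questions in the same order. This is only a sketch: the parameter names are invented for the example, and the fallback branch is an assumption because the tree itself does not define one.

```python
def choose_deployment(
    *,
    privacy_required: bool = False,
    bursty_traffic: bool = False,
    high_constant_throughput: bool = False,
    provider_fine_tuning_available: bool = False,
    development_only: bool = False,
) -> str:
    """Walk the questions in order and return the first matching option."""
    if privacy_required:
        return "self-host or on-prem"
    if bursty_traffic:
        return "serverless"
    if high_constant_throughput:
        return "dedicated vLLM cluster"
    if provider_fine_tuning_available:
        return "cloud API with fine-tuning"
    if development_only:
        return "Ollama locally"
    # Assumption: no fallback is given in the text, so none is invented here.
    return "no clear match: revisit requirements"


print(choose_deployment(privacy_required=True, bursty_traffic=True))
print(choose_deployment(development_only=True))
```

Because the first matching question wins, privacy requirements take precedence over traffic shape in this reading of the tree.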

LoRA adapters are small (~10–100MB) but require the base model at inference. For production, you typically merge the adapter into the base weights to create a standalone model. This eliminates adapter overhead and simplifies deployment.

Separate Adapter Serving

How: Load base model + adapter at runtime

Pros: Swap adapters dynamically; multiple adapters per base

Cons: Slight latency overhead; more complex serving

Use when: Multi-tenant with different adapters per customer

Merged Model Serving

How: Merge adapter into base → single model file

Pros: Zero overhead; standard model format; simple deployment

Cons: Full model size; can't swap adapters at runtime

Use when: Single-purpose deployment (most cases)

🔧
Merging LoRA adapters (PEFT)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model unquantized (fp16) so the adapter can be merged into it
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load adapter on top of base
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge adapter weights into base model
merged_model = model.merge_and_unload()

# Save as standard HF model (ready for vLLM, or for GGUF conversion for Ollama)
merged_model.save_pretrained("./my-merged-model")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./my-merged-model")
QLoRA Merging Requires Dequantization

If you trained with QLoRA (4-bit base model), you cannot merge directly into the quantized weights. You must: (1) load the base model in full precision (fp16/bf16), (2) load the LoRA adapter, (3) merge, (4) re-quantize if needed. This means you need enough GPU memory to hold the full-precision model temporarily (~14GB for 7B, ~140GB for 70B).

Quantization reduces model precision (fp16 → int8 → int4) to shrink memory footprint and increase inference speed. A 7B model in fp16 is ~14GB; in 4-bit it's ~4GB. The trade-off: some quality loss, though modern quantization methods minimize this.
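The sizes quoted here follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below counts weights only and uses 1 GB = 1e9 bytes; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for label, n_params in [("7B", 7e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        print(f"{label} @ {bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

The raw 4-bit estimate for a 7B model is 3.5 GB rather than the ~4GB quoted above; the gap is plausibly explained by quantization metadata (scales, zero points) and by layers kept at higher precision, but that depends on the format. Activations and the KV cache come on top of the weights at inference time.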

Format | Precision | Size (7B) | Quality Loss | Hardware Support | Best For
FP16 / BF16 | 16-bit | ~14GB | None (baseline) | All modern GPUs | Production where quality is critical
INT8 | 8-bit | ~7GB | Minimal (<1%) | Most GPUs, some CPUs | Good balance of size/quality
GPTQ | 4-bit | ~4GB | Small (1–3%) | NVIDIA GPUs | GPU inference, vLLM
AWQ | 4-bit | ~4GB | Minimal (best 4-bit) | NVIDIA GPUs | Production 4-bit, vLLM preferred
GGUF | 2–8 bit | ~3–7GB | Depends on quant level | CPU, Apple Silicon, GPU | Ollama, llama.cpp, local deployment
EXL2 | 2–8 bit (mixed) | ~3–7GB | Very good (adaptive) | NVIDIA GPUs | ExLlamaV2, high-quality 4-bit
📦
GGUF (llama.cpp / Ollama)

Universal format for CPU and cross-platform inference. Supports Q4_K_M, Q5_K_M, Q8_0, etc.

  • Create: llama.cpp/convert_hf_to_gguf.py
  • Quantize: llama.cpp/llama-quantize
  • Best quant: Q4_K_M (balanced) or Q5_K_M (higher quality)
⚡
AWQ (vLLM preferred)

Activation-aware quantization. Best quality for 4-bit GPU inference.

  • Create: autoawq library
  • Serve: vLLM with --quantization awq
  • Quality: Near-lossless for most tasks
🎯
GPTQ (widely supported)

GPU-only, well-established 4-bit format. Slightly lower quality than AWQ.

  • Create: auto-gptq library
  • Serve: vLLM, text-generation-inference
  • Note: Being superseded by AWQ
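🔧
Converting to GGUF for Ollama
# Clone and build llama.cpp
# (recent releases build with CMake instead: cmake -B build && cmake --build build --config Release,
#  which places binaries such as llama-quantize under build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Install the Python dependencies of the conversion script
pip install -r requirements.txt

# Convert HF model to GGUF (fp16 first)
python convert_hf_to_gguf.py ../my-merged-model --outtype f16 --outfile model-f16.gguf

# Quantize to 4-bit (Q4_K_M is a good balance)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Or Q5_K_M for higher quality
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M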
Quantization Recommendation by Use Case

  • Production GPU (vLLM): AWQ 4-bit, the best quality/speed tradeoff.
  • Local/Edge (Ollama): GGUF Q4_K_M or Q5_K_M.
  • Quality-critical: FP16, or INT8 at most; avoid 4-bit quantization.
  • Memory-constrained: Q3_K_M or Q2; significant quality loss, last resort.
  • Rule of thumb: Test your specific task at each quantization level; losses vary by task.

vLLM is the gold standard for production LLM serving. It implements PagedAttention for efficient memory management, continuous batching for high throughput, and supports all major model formats. If you're serving a fine-tuned model at scale, vLLM is likely your best option.

✅
vLLM Advantages
  • PagedAttention: 2–4× higher throughput than naive serving
  • Continuous batching: Dynamic batch size for variable load
  • Quantization support: AWQ, GPTQ, INT8 out of the box
  • OpenAI-compatible API: Drop-in replacement
  • LoRA serving: Hot-swap adapters at runtime
⚠️
vLLM Considerations
  • GPU-centric: Best supported on NVIDIA GPUs; other backends (such as AMD GPUs or CPU) exist but are less mature
  • Memory: Loads full model into GPU memory
  • Cold start: Model loading takes 30–120s
  • Complexity: More setup than Ollama
  • Resource: Needs dedicated GPU server
🔧
vLLM serving (OpenAI-compatible API)
# Install vLLM
pip install vllm

# Serve merged model (FP16)
python -m vllm.entrypoints.openai.api_server \
    --model ./my-merged-model \
    --port 8000

# Serve with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model ./my-merged-model-awq \
    --quantization awq \
    --port 8000

# Serve with LoRA adapters (hot-swappable)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-modules my-adapter=./my-lora-adapter \
    --port 8000
🐍
Calling vLLM from Python (OpenAI SDK)
from openai import OpenAI

# Point to local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    # Must match the name the server registered: by default the --model value,
    # or the name passed via --served-model-name
    model="./my-merged-model",
    messages=[
        {"role": "user", "content": "Explain photosynthesis."}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
vLLM Configuration | Default | Recommendation | Impact
--tensor-parallel-size | 1 | Match your GPU count | Enables multi-GPU serving
--gpu-memory-utilization | 0.9 | 0.85–0.95 | Higher = more KV cache, higher throughput
--max-model-len | Model default | Set to your actual max | Lower = less memory, faster startup
--quantization | None | awq for 4-bit | ~3–4× smaller weights than FP16, slight quality loss
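To see why `--gpu-memory-utilization` and `--max-model-len` matter, it helps to estimate how much memory the KV cache needs per token. The sketch below is a rough back-of-the-envelope model, not vLLM's actual accounting: it ignores activations and other overhead, and the architecture numbers (32 layers, 8 KV heads, head dimension 128, fp16 cache) are assumptions for an 8B-class model that should be read from the model's own config.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(
    gpu_memory_gb: float,
    gpu_memory_utilization: float,
    weight_memory_gb: float,
    bytes_per_token: int,
) -> int:
    """Tokens that fit in what is left of the memory budget after loading the weights."""
    budget_bytes = (gpu_memory_gb * gpu_memory_utilization - weight_memory_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# Assumed architecture values; check the model's config before relying on them.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(
    gpu_memory_gb=24, gpu_memory_utilization=0.9, weight_memory_gb=16, bytes_per_token=per_token
)
print(per_token, tokens)
```

The cached-token budget is shared by all concurrent requests, which is why a lower maximum sequence length or smaller (quantized) weights leave room for more requests in flight.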

Ollama makes running LLMs locally as easy as Docker. It handles model management, quantization, and provides an API. Perfect for development, privacy-sensitive applications, or edge deployment. Supports macOS, Linux, and Windows.

🔧
Deploying fine-tuned model with Ollama
# 1. Create a Modelfile
cat << 'EOF' > Modelfile
FROM ./model-q4_k_m.gguf

SYSTEM """You are a helpful assistant specialized in medical Q&A.
Always cite sources and recommend consulting a healthcare professional for medical decisions."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF

# 2. Create the Ollama model
ollama create my-medical-model -f Modelfile

# 3. Run it
ollama run my-medical-model

# 4. Serve via API
ollama serve  # Runs on localhost:11434
🐍
Calling Ollama from Python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-medical-model",
        "prompt": "What are the symptoms of diabetes?",
        "stream": False,
    },
)
print(response.json()["response"])

# Or use the official Ollama Python library
import ollama

response = ollama.chat(
    model="my-medical-model",
    messages=[{"role": "user", "content": "What are the symptoms of diabetes?"}],
)
print(response["message"]["content"])
Ollama Strengths

Simplicity: One command to run any model

Cross-platform: macOS, Linux, Windows, Docker

Apple Silicon: Excellent Metal GPU support

Privacy: Everything runs locally

Model management: Pull, list, delete models easily

Ollama Limitations

Throughput: Not designed for high-concurrency

No continuous batching: Sequential requests

Limited scaling: Single-machine only

GGUF only: Must convert from HF format

Production: Better for dev/edge than high-scale

Platform | Type | Pros | Cons | Best For
Modal | Serverless GPU | Pay-per-second, auto-scaling, easy deploy | Cold starts (5–30s) | Variable workloads, prototypes
Replicate | Serverless GPU | Simple API, model hosting | Cost at scale, cold starts | Quick deployment, demos
RunPod | GPU rental | Cheap GPUs, serverless option | Less managed, variable availability | Cost-sensitive production
AWS SageMaker | Managed ML | Enterprise features, integration | Complex, expensive | Enterprise, existing AWS stack
GCP Vertex AI | Managed ML | Good Gemini integration | Complex pricing | GCP-native applications
Together AI | Inference API | Fast, supports custom models | Per-token pricing | Custom model serving
🔧
Modal deployment example
import modal

app = modal.App("my-fine-tuned-model")

# Define the inference function
@app.function(
    gpu="A100",
    image=modal.Image.debian_slim().pip_install("vllm", "torch"),
    secrets=[modal.Secret.from_name("huggingface")],
)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    # Assumes the merged model is available at this path inside the container
    # (for example baked into the image or mounted from a volume).
    # Note: this loads the model on every call; for real traffic, load it once per container.
    llm = LLM(model="./my-merged-model")
    params = SamplingParams(temperature=0.7, max_tokens=512)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

# Deploy
# modal deploy my_app.py
⚡
Latency Optimization
  • Quantization: 4-bit reduces memory, increases speed
  • Speculative decoding: Use draft model for speedup
  • KV cache: Don't recompute for same prefix
  • Streaming: Return tokens as generated
  • Shorter prompts: Less prefill time
📈
Throughput Optimization
  • Continuous batching: vLLM, TGI default
  • Dynamic batching: Group requests together
  • Tensor parallelism: Multi-GPU for large models
  • Prefix caching: Cache common prompts
  • Right-size GPU: Match model to memory
💰
Cost Optimization
  • Spot/preemptible: 60–80% savings
  • Right-size model: Smaller if quality allows
  • Caching: Cache frequent responses
  • Auto-scaling: Scale to zero when idle
  • Batching: Higher util = lower cost/token
Optimization | Latency Impact | Throughput Impact | Complexity
4-bit quantization (AWQ) | -20–40% | +50–100% | Low
Continuous batching | Slight increase | +200–500% | Free (vLLM)
Tensor parallelism (2+ GPU) | -30–50% | +80–180% | Medium
Speculative decoding | -30–50% | Variable | Medium
KV cache / prefix caching | -50–80% for repeat prefixes | +20–50% | Low (vLLM)

∑ Chapter 09 — Key Takeaways

  • Deployment options: vLLM (production), Ollama (local/edge), serverless (variable load), cloud (enterprise)
  • Merge LoRA adapters into base weights for simplified deployment (unless you need multi-tenant adapter swapping)
  • Quantization: AWQ for GPU (vLLM), GGUF for CPU/local (Ollama); Q4_K_M is a good balance
  • vLLM = gold standard: PagedAttention, continuous batching, OpenAI-compatible API
  • Ollama = simplest local deployment: one command, cross-platform, Apple Silicon support
  • Optimize: quantize (2×), continuous batching (4×), tensor parallelism (multi-GPU), caching
10
Chapter 10 · Production Systems
Production MLOps: Versioning, Monitoring & Iteration

Fine-tuning is not a one-time event; it's a continuous process. Production MLOps for fine-tuned models means tracking experiments, versioning models, monitoring quality, and iterating on feedback. This chapter covers the infrastructure and practices that turn one-off fine-tuning into a sustainable competitive advantage.

Every fine-tuning run should be tracked: hyperparameters, dataset version, base model, training metrics, and evaluation results. Without tracking, you can't reproduce results, compare runs, or understand what worked.

Tool | Type | Strengths | Best For
Weights & Biases | SaaS / self-hosted | Best UX, automatic logging, reports | Most teams, production use
MLflow | Self-hosted / Databricks | Open source, model registry built-in | On-prem, Databricks users
Comet | SaaS | Good comparison views | Alternative to W&B
Neptune | SaaS | Good for large experiments | Large-scale experimentation
TensorBoard | Self-hosted | Free, basic, no model registry | Simple projects, learning
🔧
W&B integration with Hugging Face Trainer
import wandb
from transformers import TrainingArguments, Trainer

# Initialize W&B
wandb.init(
    project="medical-fine-tuning",
    name="llama3-8b-lora-r16",
    config={
        "base_model": "meta-llama/Llama-3.1-8B-Instruct",
        "lora_rank": 16,
        "lora_alpha": 32,
        "dataset_version": "v2.3",
        "num_examples": 5000,
    },
)

training_args = TrainingArguments(
    output_dir="./output",
    report_to="wandb",  # Automatic logging
    logging_steps=10,
    # ... other args
)

# `model` comes from your own training setup
trainer = Trainer(
    model=model,
    args=training_args,
    # ... datasets and other arguments
)
trainer.train()

# Log final eval metrics (placeholder values; log what your eval actually produces)
wandb.log({"eval/accuracy": 0.92, "eval/f1": 0.89})
wandb.finish()
✅
What to Track (Minimum)
  • Config: All hyperparameters (LR, rank, epochs, etc.)
  • Data: Dataset name, version, size, hash
  • Model: Base model name and version
  • Metrics: Loss curve, eval metrics per checkpoint
  • Artifacts: Final model weights, adapter files
⭐
What to Track (Best Practice)
  • All minimum items, plus:
  • Code version: Git commit hash
  • Environment: Package versions, GPU type
  • Sample outputs: Example generations per checkpoint
  • Regression metrics: MMLU, general benchmarks

A model registry stores model versions with metadata, enables promotion through stages (dev → staging → production), and provides lineage tracking. It's the single source of truth for which model is deployed where.

Model promotion lifecycle
[Figure: Experiment (many candidates; auto from training) → Staging (eval + regression tests; manual promote) → Production (serving live traffic; after approval) → Archived (retired, kept for rollback; after replacement)]
Registry Option | Integration | Strengths | Best For
MLflow Model Registry | MLflow, Databricks | Open source, full lifecycle | Self-hosted, Databricks
Hugging Face Hub | HF ecosystem | Easy sharing, versioning, spaces | Open models, collaboration
W&B Model Registry | W&B | Linked to experiments, good UX | W&B users
SageMaker Model Registry | AWS | AWS integration, approval workflows | AWS-native teams
DVC + Git | Git | Version control for models | Simple projects, git-native
Minimum Viable Model Registry

If you don't have a formal registry, at minimum: (1) Store models in versioned cloud storage (S3, GCS) with a naming convention (e.g., medical-llama-v2.3-2024-04-15/), (2) Keep a YAML/JSON manifest with model version → storage path → training run ID → eval metrics, (3) Document which version is in production. This beats scattered files and forgotten experiments.
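A manifest of this kind can be kept in a single JSON file. The sketch below is one possible layout, not a standard: the function names, the field names, the bucket name, and the run ID are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Read the manifest, or start an empty one if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"production": None, "models": {}}


def register_model(path: Path, version: str, storage_path: str, run_id: str, eval_metrics: dict) -> dict:
    """Add a model version with its storage path, training run ID, and eval metrics."""
    manifest = load_manifest(path)
    if version in manifest["models"]:
        raise ValueError(f"version {version!r} is already registered")
    manifest["models"][version] = {
        "storage_path": storage_path,
        "training_run_id": run_id,
        "eval_metrics": eval_metrics,
    }
    path.write_text(json.dumps(manifest, indent=2))
    return manifest


def set_production(path: Path, version: str) -> dict:
    """Record which registered version is currently serving production traffic."""
    manifest = load_manifest(path)
    if version not in manifest["models"]:
        raise KeyError(f"unknown version {version!r}")
    manifest["production"] = version
    path.write_text(json.dumps(manifest, indent=2))
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    register_model(
        manifest_path,
        version="v2.3",
        storage_path="s3://example-bucket/medical-llama-v2.3-2024-04-15/",  # hypothetical bucket
        run_id="run-123",  # hypothetical training run ID
        eval_metrics={"golden_set_accuracy": 0.92},  # placeholder value
    )
    manifest = set_production(manifest_path, "v2.3")
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest in version control alongside the training code gives a cheap audit trail of what was promoted and when.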

Offline eval is not production eval. A model that scores well on your test set may perform differently with real users and real queries. A/B testing compares model versions on live traffic to validate that improvements are real.

✅
A/B Testing Best Practices
  • Random assignment: Users randomly see A or B
  • Same conditions: Same prompts, same post-processing
  • Sufficient sample: Run until statistically significant
  • Multiple metrics: Quality, latency, cost, user behavior
  • Holdout group: Always keep baseline for comparison
📊
Metrics to Compare
  • Quality: Task accuracy, user ratings, thumbs up/down
  • Latency: P50, P95, P99 response time
  • Cost: Tokens used per request, GPU utilization
  • Engagement: Completion rate, follow-up questions
  • Errors: Failure rate, refusal rate, format errors
🔧
Simple A/B routing implementation
import hashlib
from datetime import datetime

def get_model_for_request(user_id: str, experiment_config: dict) -> str:
    """Deterministic assignment based on user_id."""
    # Hash user_id for consistent assignment
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_val % 100

    # Route based on traffic split
    if bucket < experiment_config["control_pct"]:  # e.g., 80%
        return experiment_config["control_model"]  # v2.2
    else:
        return experiment_config["treatment_model"]  # v2.3

# Log which variant was served for analysis
# (`analytics` stands in for your own event-logging client)
def log_request(user_id: str, model_version: str, response: str, metrics: dict):
    analytics.log({
        "user_id": user_id,
        "model_version": model_version,
        "latency_ms": metrics["latency"],
        "tokens": metrics["tokens"],
        "timestamp": datetime.now(),
    })
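For the "run until statistically significant" guideline, one common check on a binary metric such as thumbs-up rate is a two-proportion z-test. The sketch below uses only the standard library; the counts are invented for illustration, and the test assumes independent samples that are large enough for the normal approximation.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Made-up counts: thumbs-up out of rated responses for control (A) and treatment (B).
z, p_value = two_proportion_z_test(800, 1000, 840, 1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

Deciding the sample size and significance threshold before the experiment starts avoids the temptation to stop as soon as the numbers look favorable.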
Shadow Mode Before A/B

Before exposing users to a new model, run it in shadow mode: send real traffic to both models, but only show the control model's output. Log the new model's outputs for offline comparison. This catches catastrophic failures (crashes, very bad outputs) before users see them. Only promote to A/B once shadow mode looks good.
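Shadow mode can be sketched as a thin wrapper around the request handler. In this illustration the two models are plain callables and the log is an in-memory list; both are stand-ins, and a production version would typically call the candidate asynchronously so it cannot add user-facing latency.

```python
def serve_with_shadow(prompt, control_model, candidate_model, shadow_log):
    """Answer with the control model; record the candidate's output for offline comparison."""
    response = control_model(prompt)
    record = {"prompt": prompt, "control": response, "candidate": None, "error": None}
    try:
        record["candidate"] = candidate_model(prompt)
    except Exception as exc:  # a crashing candidate must never affect the user
        record["error"] = repr(exc)
    shadow_log.append(record)
    return response


def control(prompt: str) -> str:
    return f"control answer to: {prompt}"


def broken_candidate(prompt: str) -> str:
    raise RuntimeError("out of memory")


shadow_log = []
answer = serve_with_shadow("What is a STEMI?", control, broken_candidate, shadow_log)
print(answer)
print(shadow_log[0]["error"])
```

The user only ever sees the control model's answer, while the log captures both outputs and any candidate failures.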

A fine-tuned model can degrade over time: the input distribution shifts, edge cases emerge, or the world changes. Continuous monitoring detects these issues before users notice.

📊
Operational Metrics
  • Latency: P50, P95, P99 per endpoint
  • Throughput: Requests/sec, tokens/sec
  • Error rate: 5xx, timeouts, OOM
  • GPU utilization: Memory, compute
  • Queue depth: Request backlog
🎯
Quality Metrics
  • User feedback: Thumbs up/down, ratings
  • Format compliance: JSON parse success rate
  • Refusal rate: How often model refuses
  • Output length: Avg tokens, anomalies
  • LLM-as-judge sample: Periodic quality scoring
🔍
Drift Detection
  • Input drift: Embedding distance from training
  • Output drift: Response distribution changes
  • Concept drift: Same inputs, different correct answers
  • Alert: When metrics deviate >2σ from baseline
Metric | Alerting Threshold (Example) | Action When Triggered
P95 latency | >2× baseline for 5 min | Check GPU load, model, batch size
Error rate | >1% for 5 min | Page on-call, check logs
Format compliance | <95% for 1 hour | Review failing examples, consider rollback
User thumbs down rate | >2× baseline for 1 day | Sample and review bad responses
Input embedding drift | >0.2 cosine distance shift | Investigate new input patterns; may need new data
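Two of the checks above are easy to prototype: cosine distance between the mean training embedding and the mean production embedding, and a 2σ deviation test on a scalar metric. The sketch below uses toy two-dimensional vectors and made-up latency values; comparing mean embeddings is one simple choice among several ways to measure input drift.

```python
import math
import statistics


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def deviates(baseline_values, current_value, n_sigma=2.0):
    """True if the current value is more than n_sigma standard deviations from the baseline mean."""
    mean = statistics.fmean(baseline_values)
    sigma = statistics.stdev(baseline_values)
    return abs(current_value - mean) > n_sigma * sigma


# Toy embeddings; real ones would come from an embedding model.
training_embeddings = [[1.0, 0.0], [0.9, 0.1]]
production_embeddings = [[0.2, 0.9], [0.1, 1.0]]
drift = cosine_distance(mean_vector(training_embeddings), mean_vector(production_embeddings))
print(f"embedding drift: {drift:.2f}", "ALERT" if drift > 0.2 else "ok")

# Made-up P95 latency samples in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99]
print(deviates(baseline_latency_ms, 110))
```

Thresholds such as 0.2 and 2σ are starting points; they usually need tuning against the normal variation seen in your own traffic.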
The Monitoring Stack

Metrics: Prometheus + Grafana or Datadog. Logging: Structured logs → aggregator (Loki, CloudWatch, Datadog). Tracing: OpenTelemetry → Jaeger or vendor. LLM-specific: LangSmith, Langfuse, or custom sampled eval. Alerting: PagerDuty, Slack, email for critical metrics.

The best fine-tuning teams don't ship one model; they build a flywheel. Production feedback generates training data, which improves the model, which generates better feedback. This loop compounds over time.

The fine-tuning flywheel: continuous improvement loop
[Figure: Deploy Model → Collect Feedback (user ratings, corrections) → Curate Data (filter, label, clean) → Fine-Tune → Evaluate → Approve → back to Deploy Model]
🔄
Flywheel Data Sources
  • User corrections: When users edit model output
  • Thumbs up/down: Explicit quality signals
  • Support escalations: Cases model couldn't handle
  • A/B test losers: Examples where new model failed
  • Edge cases: Unusual inputs from production logs
⚡
Flywheel Automation
  • Auto-label: Use current model to bootstrap labels
  • Human-in-the-loop: Flag uncertain cases for review
  • Scheduled retraining: Weekly/monthly fine-tuning runs
  • Auto-eval: CI pipeline runs eval on new checkpoints
  • Auto-promote: If eval passes, deploy to staging
The Flywheel Compounds

Month 1: You ship a fine-tuned model. Month 2: You've collected 1,000 examples of user corrections; you retrain and quality improves 5%. Month 3: The better model gets more usage, generating more feedback. Month 6: You have 10,000 curated examples and 20% better quality than month 1. Teams that build the flywheel pull ahead of teams that treat fine-tuning as a one-time event.
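The curation step of the flywheel can begin with something as small as turning user corrections into chat-format training examples. In the sketch below the feedback field names (`prompt`, `model_output`, `corrected_output`) are hypothetical; adapt them to whatever your logging actually records.

```python
def corrections_to_sft(feedback_events):
    """Convert user corrections into deduplicated chat-format SFT examples."""
    examples = []
    seen = set()
    for event in feedback_events:
        prompt = (event.get("prompt") or "").strip()
        original = (event.get("model_output") or "").strip()
        corrected = (event.get("corrected_output") or "").strip()
        # Skip events with no usable correction or no actual change.
        if not prompt or not corrected or corrected == original:
            continue
        key = (prompt, corrected)
        if key in seen:
            continue
        seen.add(key)
        examples.append(
            {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": corrected},
                ]
            }
        )
    return examples


events = [
    {"prompt": "Summarize clause 8.2", "model_output": "It covers A.", "corrected_output": "It covers A and B."},
    {"prompt": "Summarize clause 8.2", "model_output": "It covers A.", "corrected_output": "It covers A and B."},
    {"prompt": "Summarize clause 9.1", "model_output": "Fine.", "corrected_output": "Fine."},
    {"prompt": "Summarize clause 3.4", "model_output": "Draft.", "corrected_output": None},
]
sft_examples = corrections_to_sft(events)
print(len(sft_examples))
```

Corrections collected this way still need the filtering and expert review described earlier before they go into a training set.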

Phase | Checkpoint | Done?
Data | Dataset is versioned and reproducible | ☐
Data | Deduplication completed (exact + near) | ☐
Data | Train/val/test splits verified, no leakage | ☐
Data | Quality audit passed (random sample reviewed) | ☐
Training | Experiment tracked (hyperparams, metrics, artifacts) | ☐
Training | Multiple checkpoints saved | ☐
Training | Training loss and val loss look healthy | ☐
Evaluation | Task-specific eval passed (golden set) | ☐
Evaluation | Regression tests passed (<2% drop on general benchmarks) | ☐
Evaluation | Safety eval passed (refusal rate, harmful content) | ☐
Evaluation | Format compliance verified (JSON parse rate, etc.) | ☐
Deployment | Model registered with version and metadata | ☐
Deployment | Quantization tested (if using) | ☐
Deployment | Inference latency acceptable in staging | ☐
Operations | Monitoring and alerting configured | ☐
Operations | Rollback plan documented and tested | ☐
Operations | Shadow mode or A/B test plan in place | ☐
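Two of the data checks in this list, exact deduplication and train/test leakage, can be automated with content fingerprints. The sketch below handles exact matches after whitespace and case normalization only; near-duplicate detection needs a fuzzier method and is not shown. All function names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each normalized text."""
    seen = set()
    unique = []
    for text in texts:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append(text)
    return unique


def leaked_examples(train_texts, test_texts):
    """Return test examples whose normalized text also appears in the training set."""
    train_fingerprints = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_fingerprints]


train = ["What is a STEMI?", "what is  a stemi?", "Define indemnification."]
test = ["Define  Indemnification.", "Explain PagedAttention."]
train_unique = deduplicate(train)
leaks = leaked_examples(train_unique, test)
print(len(train_unique), leaks)
```

Running a check like this in CI before every training run makes the "no leakage" checkbox something that is verified rather than assumed.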

Fine-tuned models can fail in production. Have a runbook ready.

🚨
Common Failure Modes
  • Quality regression: Model suddenly worse
  • Format failures: JSON/structured output breaks
  • Refusal spike: Model refuses valid requests
  • Harmful output: Model generates bad content
  • Latency spike: Inference slows dramatically
  • OOM: Out of memory crashes
🛡️
Incident Response Steps
  1. Detect: Alerting triggers on anomaly
  2. Assess: Severity? Scope? Cause hypothesis?
  3. Mitigate: Rollback to previous model version
  4. Investigate: Root cause analysis with logs/traces
  5. Fix: Address underlying issue
  6. Postmortem: Document and prevent recurrence
Always Have a Rollback Plan

Before deploying any new model version: (1) Keep the previous version running in staging, (2) Document the exact rollback command/process, (3) Test the rollback procedure in staging, (4) Have a "big red button" that can revert in <5 minutes. The ability to roll back quickly is more important than the ability to deploy quickly.

∑ Chapter 10 — Key Takeaways

  • Track every experiment: hyperparameters, data version, metrics, artifacts (W&B, MLflow)
  • Use a model registry: version models, promote through stages (experiment → staging → production → archived)
  • A/B test in production: shadow mode first, then gradual rollout, measure quality + latency + cost
  • Monitor continuously: operational metrics (latency, errors) + quality metrics (user feedback, format compliance) + drift detection
  • Build the flywheel: production feedback → curated data → fine-tune → deploy → more feedback
  • Always have a rollback plan: test it, document it, be able to execute in <5 minutes