AI Foundation · Domain 09

MLOps & AI Engineering

Building, deploying & operating AI systems at scale: data pipelines, experiment tracking, model serving, CI/CD for ML, monitoring, drift detection, vector databases, LLM APIs, and production infrastructure.

Chapter 9.1
The ML Lifecycle & MLOps Overview

A widely quoted industry figure holds that 87% of ML models never reach production. The bottleneck is not intelligence or algorithms: it is engineering. MLOps is the discipline that bridges the gap between a working notebook and a reliable, monitored, continuously improving production AI system.

Machine learning is not a one-shot task; it is a continuous cycle. Data changes, user behaviour shifts, the world evolves. A model that was accurate at deployment degrades over time unless the full lifecycle is managed as a system, not a project.

The ML Lifecycle: a continuous feedback loop, not a linear pipeline
Data Collect & Clean → Feature Engineer → Train & Evaluate → Deploy & Serve → Monitor & Alert → Retrain & Improve → back to Data (Continuous Feedback Loop)

MLOps = Machine Learning + DevOps + Data Engineering. It is the set of practices, tools, and cultural norms for deploying and maintaining ML models in production reliably. The name is modelled on DevOps, but the challenges are fundamentally different because the artefact is not just code: it is code + data + model.

🧠
ML (Data Science)

Feature engineering, model selection, hyperparameter tuning, evaluation metrics, experimentation. Owns the model quality.

⚙️
Dev (Software Engineering)

Code quality, testing, version control, CI/CD, API design, containerisation. Owns the code quality and deployment pipeline.

🏗️
Ops (Infrastructure)

Monitoring, alerting, scaling, GPU scheduling, cost management, reliability. Owns the system reliability.

Dimension | DevOps | MLOps
Artefact | Code (binary/container) | Code + Data + Model (triple versioning)
Testing | Unit / integration / E2E | Unit + data validation + model quality + bias
CI Trigger | Code commit | Code commit OR data change OR model drift
Versioning | Git for code | Git + DVC/LakeFS for data + model registry
Monitoring | Latency, errors, uptime | All of above + prediction quality + data drift
Rollback | Redeploy previous image | Complex: model + data + feature pipeline must align
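
One way to picture triple versioning is a release manifest that pins code, data, and model together, so that a rollback restores a matching set rather than a single artefact. The sketch below is illustrative only: `ReleaseManifest`, `content_hash`, the field names, and the example values are hypothetical and not part of any tool named in this chapter.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Pins the three artefacts that must be versioned together (hypothetical schema)."""
    code_commit: str   # Git commit of the training code
    data_hash: str     # content hash of the training dataset
    model_hash: str    # content hash of the serialised model

    def release_id(self) -> str:
        # Hash the sorted JSON form so the same triple always yields the same ID.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def content_hash(blob: bytes) -> str:
    """Content-address a blob of bytes (dataset file, model weights, ...)."""
    return hashlib.sha256(blob).hexdigest()


# Same code commit, but the data and the model changed between releases.
v1 = ReleaseManifest("a1b2c3d", content_hash(b"rows-2024-01"), content_hash(b"weights-v1"))
v2 = ReleaseManifest("a1b2c3d", content_hash(b"rows-2024-02"), content_hash(b"weights-v2"))

print(v1.release_id() != v2.release_id())
```

Because the ID depends on all three fields, retraining on new data produces a new release even when no code changed, which is the case that plain Git versioning misses.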

Google (2021) defined three maturity levels for MLOps (Levels 0 to 2); the figure below adapts that scheme and adds a fourth, fully automated level. Most organisations are at Level 0. The progression is not about buying tools; it is about automating the feedback loop.

MLOps Maturity: from manual notebooks to fully automated retraining
  • Level 0 (Manual): Jupyter notebooks, manual deployment, no pipeline automation, no monitoring. ~70% of orgs. Train ↔ Serve: weeks.
  • Level 1 (ML Pipeline): automated training pipeline, feature store, model registry, manual deploy trigger. ~20% of orgs. Train ↔ Serve: days.
  • Level 2 (CI/CD for ML): automated training + deploy, CI tests data + model, canary / blue-green deploy, monitoring triggers retrain. ~8% of orgs. Train ↔ Serve: hours.
  • Level 3 (Full Auto): drift auto-detected, auto retrain + validate, auto deploy if quality OK, human-in-the-loop only. ~2% of orgs. Train ↔ Serve: minutes.
Source: adapted from Google MLOps whitepaper (2021) + industry surveys (2024).
The goal is not Level 3 for all models; it is the right level for each model's business impact.

MLOps Stack: each layer has purpose-built tooling
  • INFRASTRUCTURE: Docker · Kubernetes · GPU Clusters · AWS SageMaker · GCP Vertex · Azure ML · Spot Instances
  • ORCHESTRATION: Airflow · Kubeflow Pipelines · Prefect · Dagster · Flyte · Argo Workflows
  • MONITORING: Evidently · WhyLabs · Fiddler · Arize · NannyML · Prometheus + Grafana · LangSmith
  • SERVING: TorchServe · Triton · vLLM · BentoML · FastAPI · TF Serving
  • VECTOR / LLM: Pinecone · Qdrant · Weaviate · Chroma · LiteLLM
  • TRAINING: PyTorch · HuggingFace · DeepSpeed · FSDP · Ray Train
  • TRACKING: MLflow · W&B · Neptune · ClearML · Comet
  • DATA & FEATURES: DVC · LakeFS · Feast · Tecton · Great Expectations · dbt · Spark · Kafka · Delta Lake

∑ Chapter 9.1 — Key Takeaways

  • ML lifecycle is a continuous feedback loop, not a linear pipeline — data → features → train → deploy → monitor → retrain
  • MLOps = ML + DevOps + Data Engineering — triple versioning (code + data + model) is the key difference from software DevOps
  • Most organisations are at Level 0 (manual) — automation happens in stages, not all at once
  • 87% of models never reach production — engineering, not algorithms, is the bottleneck
  • The MLOps stack has purpose-built tools at every layer — no single tool covers everything
Chapter 9.2
Data Pipelines & Feature Engineering

In production ML, the model is the easy part. The data pipeline (getting the right data, at the right time, in the right format, with the right quality) is often estimated to account for 60–80% of the engineering effort. A model is only as good as the features that feed it.

Data pipelines come in two fundamental paradigms: batch (scheduled, high-throughput, process data in chunks) and streaming (real-time, event-driven, process data as it arrives). Most production systems use both, in the spirit of the “lambda architecture”, which pairs a batch layer with a low-latency speed layer. In ML systems the batch path typically feeds training and the streaming path feeds inference.

Batch vs Streaming Pipelines: most systems need both
  • BATCH PATH: Data Lake → Spark / dbt → Feature Store → Training Job. Scheduled: hourly/daily. High throughput, high latency.
  • STREAMING PATH: Events → Kafka / Flink → Feature Store → Inference API. Event-driven: real-time. Low latency, lower throughput.
Feature store is the bridge: same features for training (batch) and serving (streaming)
Dimension | Batch | Streaming
Latency | Minutes to hours | Milliseconds to seconds
Throughput | Very high (TB/run) | Moderate (events/sec)
Tools | Spark, dbt, Airflow | Kafka, Flink, Spark Streaming
Use case | Training, daily reports, backfill | Real-time inference, fraud, recommendations
Complexity | Lower: easier to debug | Higher: ordering, exactly-once, backpressure

A feature store is a centralised repository of curated, versioned features that are shared between training and serving. Without it, data scientists recompute the same features ad hoc in every experiment, and serving code re-implements training logic — the most common source of training/serving skew.
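
To make the skew problem concrete, here is a minimal sketch in which one feature definition serves both paths. It assumes features are computed from raw order events; `user_features` and the sample data are hypothetical, and a real feature store would add storage, versioning, and backfill on top of this idea.

```python
from datetime import datetime, timedelta


def user_features(order_times, order_values, now):
    """Single feature definition shared by the batch and online paths (hypothetical)."""
    window_start = now - timedelta(days=30)
    # Only orders inside the 30-day window ending at `now` count as recent.
    recent = [v for t, v in zip(order_times, order_values) if window_start <= t <= now]
    past = [t for t in order_times if t <= now]
    return {
        "total_orders_30d": len(recent),
        "avg_order_value": sum(recent) / len(recent) if recent else 0.0,
        "days_since_last_order": (now - max(past)).days if past else None,
    }


times = [datetime(2024, 1, 1), datetime(2024, 1, 20), datetime(2024, 2, 5)]
values = [20.0, 40.0, 60.0]

# Batch path: features as of a historical training timestamp.
train_row = user_features(times, values, now=datetime(2024, 1, 25))
# Online path: the same function, evaluated at request time.
serve_row = user_features(times, values, now=datetime(2024, 2, 10))
```

Since both rows come from the same code, a change to the window or the aggregation cannot silently apply to training but not to serving.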

🍽️
Feast (Open Source)

Python SDK. Offline store (BigQuery, Redshift, file). Online store (Redis, DynamoDB). Git-based feature definitions. Free, self-managed. Great for teams starting with feature stores.

Tecton (Managed)

Enterprise feature platform. Real-time feature pipelines. Built-in monitoring. Automatic backfill. Complex transformations. Higher cost, lower operational burden.

# Feast feature definition (Python SDK; assumes a recent release with the Field/schema API)
from datetime import timedelta
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64
user = Entity(name="user_id", join_keys=["user_id"])
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_order_value", dtype=Float32),
        Field(name="total_orders_30d", dtype=Int64),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=BigQuerySource(
        table="project.dataset.user_features",
        timestamp_field="event_timestamp",
    ),
)
store = FeatureStore(repo_path=".")  # assumes a feature repo in the current directory
# Training: get historical features (point-in-time join)
# entity_df is a DataFrame with user_id and event_timestamp columns
training_df = store.get_historical_features(
    entity_df=entity_df, features=["user_features:avg_order_value"]
).to_df()
# Serving: get latest features (online store lookup, ~5ms)
features = store.get_online_features(
    features=["user_features:avg_order_value"], entity_rows=[{"user_id": 42}]
).to_dict()
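
The point-in-time join mentioned in the training comment can be sketched in plain Python. This is not Feast's implementation, only the idea behind it: for each training event, take the latest feature value recorded at or before the event time, so no future information leaks into training. The function name and the toy history are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, which avoids leaking future data.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx > 0 else None


# Hypothetical history of avg_order_value for one user: (day, value).
history = [(1, 10.0), (5, 12.5), (9, 20.0)]

# Label events observed on days 0, 5 and 7 each see only past feature values.
joined = [point_in_time_lookup(history, day) for day in (0, 5, 7)]
```

A naive join on the latest value would hand every training row the day-9 value, which is exactly the leakage a point-in-time join prevents.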

Data problems are among the most common failure modes in production ML. Great Expectations lets you define expectations about your data (schema, ranges, distributions) and test them automatically. DVC (Data Version Control) tracks data files alongside code in Git, storing the actual data in remote storage.

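# Great Expectations: validate a dataframe
# (fluent API from GX 0.16 to 0.18; the 1.x releases renamed several of these calls)
# df is an existing pandas DataFrame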
import great_expectations as gx
context = gx.get_context()
ds = context.sources.add_pandas("my_ds")
asset = ds.add_dataframe_asset("training_data")
batch = asset.build_batch_request(dataframe=df)
# Define expectations
validator = context.get_validator(batch_request=batch)
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
validator.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)
results = validator.validate()  # Returns success/failure + details
# DVC — version data alongside code
$ dvc init                          # Initialise DVC in git repo
$ dvc add data/training.parquet     # Track data file (creates .dvc pointer)
$ git add data/training.parquet.dvc # Commit pointer to git
$ git commit -m "Add training data v1"
$ dvc push                          # Push actual data to remote (S3, GCS)
# Later: reproduce exact training data
$ git checkout v1.0                 # Checkout code + DVC pointers
$ dvc pull                          # Pull matching data from remote
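
The idea of testable expectations does not depend on any particular library. The sketch below re-creates the three checks from the example above in plain Python, assuming rows are dictionaries; the helper names and the result shape are hypothetical and much simpler than what Great Expectations returns.

```python
def expect_not_null(rows, column):
    return all(row.get(column) is not None for row in rows)


def expect_between(rows, column, min_value, max_value):
    return all(min_value <= row[column] <= max_value for row in rows)


def expect_mean_between(rows, column, min_value, max_value):
    mean = sum(row[column] for row in rows) / len(rows)
    return min_value <= mean <= max_value


def validate(rows):
    """Run every expectation and report which ones failed."""
    checks = {
        "user_id not null": expect_not_null(rows, "user_id"),
        "age in [0, 150]": expect_between(rows, "age", 0, 150),
        "mean income in [20000, 200000]": expect_mean_between(rows, "income", 20_000, 200_000),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return {"success": not failed, "failed": failed}


good = [{"user_id": 1, "age": 34, "income": 52_000}, {"user_id": 2, "age": 51, "income": 88_000}]
bad = [{"user_id": None, "age": 210, "income": 52_000}]

good_result = validate(good)
bad_result = validate(bad)
```

Running a check like this as a pipeline step, and failing the run when `success` is false, is what stops a bad batch from reaching training.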

∑ Chapter 9.2 — Key Takeaways

  • Data pipelines are 60–80% of ML engineering effort — batch for training, streaming for serving
  • Feature stores eliminate training/serving skew by sharing features between training and inference
  • Feast (open source) and Tecton (managed) are the two dominant feature store approaches
  • Great Expectations validates data quality with testable expectations — catch data bugs before they hit the model
  • DVC versions data alongside code — Git tracks pointers, remote stores actual data
Chapter 9.3
Experiment Tracking & Model Registry

Without experiment tracking, ML research is a folder of notebooks named model_v2_final_FINAL_actually_final.ipynb. A model registry is the single source of truth for which model version is in production, why it was promoted, and what it was trained on.

An ML experiment produces multiple artefacts: code, data reference, hyperparameters, metrics, trained model weights, and environment configuration. Without structured tracking, reproducing a result from three months ago is nearly impossible. Experiment trackers log all of these automatically and make them searchable, comparable, and shareable.

📁
Without Tracking

Scattered notebooks. “Which hyperparams gave 92% accuracy?” — nobody knows. Cannot reproduce last month’s best run. Model deployed but nobody knows which commit trained it.

With Tracking

Every run logged: params, metrics, artefacts, code version. Compare 200 runs in a table. One-click reproduce any experiment. Full lineage from data to deployed model.

📊
What Gets Tracked

Hyperparameters, metrics (loss, accuracy, F1), model weights, training data hash, code commit, environment (Python version, packages), run duration, GPU utilisation.

# MLflow — experiment tracking in 10 lines
import mlflow
mlflow.set_experiment("fraud-detection")
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_param("model_type", "XGBoost")
    # ... train model ...
    mlflow.log_metric("accuracy", 0.943)
    mlflow.log_metric("f1_score", 0.891)
    mlflow.log_metric("auc_roc", 0.967)
    mlflow.sklearn.log_model(model, "model")  # Save model artefact
    mlflow.log_artifact("confusion_matrix.png")  # Save any file
# Weights & Biases — experiment tracking
import wandb
wandb.init(project="fraud-detection", config={
    "learning_rate": 0.001,
    "epochs": 50,
    "model_type": "XGBoost"
})
# ... train model, log metrics per step ...
wandb.log({"accuracy": 0.943, "f1": 0.891, "loss": 0.12})
wandb.finish()
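
Underneath the tooling, an experiment tracker is an append-only log of runs plus a way to query it. The following sketch shows that core with the standard library only; `log_run`, `best_run`, and the JSON-lines layout are hypothetical and omit what real trackers add (artefact storage, code version capture, a UI).

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one training run as a JSON line (hypothetical minimal tracker)."""
    record = {"run_id": uuid.uuid4().hex, "timestamp": time.time(),
              "params": params, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record["run_id"]


def best_run(log_path, metric):
    """Return the logged run with the highest value of `metric`."""
    with Path(log_path).open(encoding="utf-8") as fh:
        runs = [json.loads(line) for line in fh if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"learning_rate": 0.01}, {"f1_score": 0.85})
log_run(log_file, {"learning_rate": 0.001}, {"f1_score": 0.89})

# Answers "which hyperparameters gave the best F1?" from the log alone.
winner = best_run(log_file, "f1_score")
```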

A model registry is a versioned catalogue of trained models with metadata, lineage, and lifecycle stages (Registered → Staging → Production → Archived). It answers: “What model is in production right now, who approved it, and what data trained it?”

Model Lifecycle: from training to retirement with approval gates
Registered → Staging (✓ tests pass) → Production → Archived
Each transition requires an approval gate: automated tests + human review
Tools: MLflow Model Registry | SageMaker Model Registry | Vertex AI Model Registry
Tool | Type | Tracking | Registry | UI | Best For
MLflow | Open source | Excellent | Built-in | Functional | Self-hosted, full control
W&B | SaaS (free tier) | Excellent | Built-in | Beautiful | Teams, collaboration, sweeps
Neptune | SaaS | Excellent | Basic | Good | Metadata-heavy experiments
ClearML | Open source | Good | Built-in | Good | End-to-end (track + orchestrate)
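
The lifecycle stages and approval gates can be modelled as a small state machine. The sketch below assumes a strictly linear flow and a gate that needs both passing tests and a named approver; `ModelVersion` and `ALLOWED_TRANSITIONS` are hypothetical and do not mirror the API of any registry listed above.

```python
ALLOWED_TRANSITIONS = {
    "Registered": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived"},
    "Archived": set(),
}


class ModelVersion:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.stage = "Registered"
        self.history = []  # audit trail of (from_stage, to_stage, approver)

    def transition(self, to_stage, tests_passed, approved_by):
        """Move to a new stage only if the transition and the gate are both valid."""
        if to_stage not in ALLOWED_TRANSITIONS[self.stage]:
            raise ValueError(f"{self.stage} -> {to_stage} is not allowed")
        if not tests_passed or not approved_by:
            raise PermissionError("gate requires passing tests and a named approver")
        self.history.append((self.stage, to_stage, approved_by))
        self.stage = to_stage


mv = ModelVersion("fraud-detection", 3)
mv.transition("Staging", tests_passed=True, approved_by="ci-bot")
mv.transition("Production", tests_passed=True, approved_by="alice")
```

The `history` list is what lets the registry answer who approved the production model and when.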

∑ Chapter 9.3 — Key Takeaways

  • Experiment tracking logs params, metrics, artefacts, and code version for every training run
  • MLflow (open source) and W&B (SaaS) are the dominant tools — both take ~10 lines to integrate
  • Model registry provides versioned lifecycle stages: Registered → Staging → Production → Archived
  • Reproducibility requires tracking: code + data + environment + hyperparameters + random seed
Chapter 9.4
Model Serving & Deployment Patterns

A trained model sitting in a notebook is worth nothing. Model serving is the discipline of making predictions available to users — via APIs, batch jobs, or embedded systems — with the latency, throughput, and reliability that production demands.

Batch Inference

Pattern: Run model on all data at scheduled intervals (hourly, daily).

Output: Predictions written to a database/table, then served via lookup.

Latency: Minutes to hours (acceptable for non-interactive use cases).

Scale: Process millions of records per run using distributed compute.

Examples: Email campaigns, daily risk scores, recommendation pre-compute.

Real-Time Inference

Pattern: Model behind an API. A request comes in, a prediction goes out.

Output: JSON response with prediction, typically <100ms.

Latency: Milliseconds (p99 < 200ms for most production systems).

Scale: Auto-scale replicas based on QPS (queries per second).

Examples: Fraud detection, search ranking, chatbots, real-time pricing.

🎯
TorchServe

PyTorch-native model server. MAR packaging format. Built-in batching, metrics, model versioning. Best for PyTorch-only deployments.

🚀
NVIDIA Triton

Multi-framework (PyTorch, TF, ONNX, TensorRT). Dynamic batching. GPU-optimised. Ensemble pipelines. Best for high-throughput GPU serving.

vLLM

LLM-optimised serving. PagedAttention for memory efficiency. Continuous batching. OpenAI-compatible API. Best for LLM inference (typically cited as 2–5× higher throughput than naive serving).

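# FastAPI: simplest model serving API
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, not per request
# Typed request body: FastAPI rejects missing or non-numeric fields with a 422
class Features(BaseModel):
    age: float
    income: float
    credit_score: float
@app.post("/predict")
def predict(features: Features):  # plain def: blocking inference runs in a threadpool
    X = [[features.age, features.income, features.credit_score]]
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probability": probability}
# Run: uvicorn app:app --host 0.0.0.0 --port 8000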
Three deployment strategies: risk vs speed tradeoff
  • Blue-Green: Blue (old) and Green (new) run side by side. 100% traffic switch. Instant rollback. 2× infrastructure cost. Low risk, high cost.
  • Canary: Current (95%), New (5%). Gradual: 5% → 25% → 100%. Monitor metrics at each stage. Rollback = route back to old. Balanced risk & cost.
  • Shadow: Current model serves; Shadow model receives mirrored traffic. Log predictions, don’t serve. Compare offline, promote later. Zero user risk.
Choose based on risk tolerance: Shadow (safest) → Canary (standard) → Blue-Green (fastest switch)
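
A canary split is usually made sticky, so that a given user keeps hitting the same model version while the percentage ramps up. The sketch below shows one common way to do that with a hash of the user ID; the function names, the ID format, and the rollout steps are assumptions for illustration, not a specific load balancer's behaviour.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) using a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send a fixed slice of users to the new model, the rest to the current one."""
    return "new" if bucket(user_id) < canary_percent else "current"


users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 25, 100):  # staged rollout: 5% -> 25% -> 100%
    share = sum(route(u, percent) == "new" for u in users) / len(users)
    print(f"canary={percent}% -> observed share on new model: {share:.1%}")
```

Because buckets are fixed, every user who saw the new model at 5% still sees it at 25%, and rollback is simply setting the percentage back to 0.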

∑ Chapter 9.4 — Key Takeaways

  • Batch inference: scheduled, high-throughput, store predictions. Real-time: API-based, <100ms latency
  • Model servers: TorchServe (PyTorch), Triton (multi-framework GPU), vLLM (LLM-optimised with PagedAttention)
  • Deployment: canary (gradual %), blue-green (instant switch), shadow (zero user risk)
  • FastAPI is the simplest starting point — graduate to model servers as scale demands
Chapter 9.5
CI/CD for Machine Learning

Software CI/CD deploys code. ML CI/CD deploys code + data + models. The pipeline must test data quality, validate model performance, check for bias, and only then deploy — all automatically, triggered by code changes OR data changes OR drift alerts.

ML CI/CD Pipeline: every stage has ML-specific requirements
Code Commit → Lint & Unit Test → Data Validation → Train Model → Evaluate & Test → Register Model → Deploy Canary → Monitor
Phases: CI (Continuous Integration), then CD (Continuous Delivery)
Triggers: code commit | data change | drift alert | scheduled retrain
Orchestrators: Airflow | Kubeflow Pipelines | Prefect | Dagster | GitHub Actions
Key difference from software: the “build” step takes hours (training), not seconds (compile)
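
The "Evaluate & Test" stage typically ends in a quality gate that decides whether the candidate may continue to registration and canary deployment. A minimal sketch follows, assuming the pipeline has already collected metrics into dictionaries; the metric names, the 200 ms latency budget, and `should_promote` itself are hypothetical.

```python
def should_promote(candidate, production, min_gain=0.0, max_latency_ms=200.0):
    """Decide whether a freshly trained model may proceed to canary deployment."""
    reasons = []
    if not candidate["data_validation_passed"]:
        reasons.append("data validation failed")
    if candidate["f1_score"] < production["f1_score"] + min_gain:
        reasons.append("no improvement over production")
    if candidate["p99_latency_ms"] > max_latency_ms:
        reasons.append("latency budget exceeded")
    return (not reasons, reasons)


production = {"f1_score": 0.89}
better = {"data_validation_passed": True, "f1_score": 0.91, "p99_latency_ms": 120.0}
worse = {"data_validation_passed": True, "f1_score": 0.85, "p99_latency_ms": 350.0}

ok, _ = should_promote(better, production)
blocked, why = should_promote(worse, production)
```

Returning the reasons alongside the decision makes a failed pipeline run self-explanatory in the CI log.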
🏗️
Airflow

Python DAGs. Mature ecosystem. Widely adopted. Schedule-based. Good for batch ML pipelines. Steeper learning curve. Not ML-specific but ML-capable.

⚙️
Kubeflow Pipelines

Kubernetes-native. ML-first. Component-based. Built-in experiment tracking. Best for teams already on K8s. Heavier operational footprint.

🐍
Prefect

Python-native. Dynamic DAGs (not just static). Easy local → cloud transition. Good UI. Lower learning curve than Airflow. Growing rapidly.

∑ Chapter 9.5 — Key Takeaways

  • ML CI/CD tests code + data + model quality — not just code
  • Pipeline triggers: code commit, data change, drift alert, or scheduled retrain
  • Orchestrators: Airflow (mature), Kubeflow (K8s-native), Prefect (Python-native, growing fast)
  • The “build” step in ML CI/CD takes hours (training) — caching and incremental training are critical
Chapter 9.6
Monitoring, Drift & Observability

Software breaks loudly — errors, crashes, timeouts. Models break silently: they keep returning predictions, but the predictions get worse. Without monitoring, a model can degrade for weeks before anyone notices. Drift detection is the early warning system.

Software Monitoring

Uptime: Is the service running?

Latency: p50, p95, p99 response times

Errors: HTTP 5xx, exceptions, timeouts

Resources: CPU, memory, GPU utilisation

Throughput: Requests per second

Model Monitoring (additional)

Data drift: Has the input distribution changed?

Concept drift: Has the input–output relationship changed?

Prediction drift: Has the output distribution changed?

Model quality: Accuracy, F1, AUC. Are they degrading?

Feature health: Missing values, outliers, new categories

Three types of drift: different causes, different detection methods
  • Data Drift, P(X) changes: the input distribution shifts between training and production. Example: user demographics shift.
  • Concept Drift, P(Y|X) changes: the relationship between X and Y shifts. Same inputs, different correct label. Example: pandemic changes buying patterns.
  • Prediction Drift, P(Ŷ) changes: the output distribution shifts (before vs after). Symptom, not root cause.
Detection: KS test, PSI (Population Stability Index), Jensen–Shannon divergence, Chi-squared test
Tools: Evidently, WhyLabs, NannyML, Arize, Fiddler, custom Prometheus metrics
# Evidently: drift detection in 5 lines
# (Report / metric_preset API from the 0.4.x releases; newer versions reorganised these imports)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# train_df and prod_df are pandas DataFrames with the same columns
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")   # Visual drift report
# Programmatic access to results
drift_results = report.as_dict()
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining_pipeline()  # your own hook (not part of Evidently): auto-retrain on drift
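
Of the detection methods listed, PSI is simple enough to write out by hand. The sketch below bins a reference sample and a current sample over the reference range and sums (actual − expected) · ln(actual / expected) across bins. The equal-width binning, the `eps` floor for empty bins, and the synthetic samples are assumptions; a commonly quoted rule of thumb treats PSI above roughly 0.25 as significant drift, but thresholds should be tuned per feature.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(1000)]       # roughly uniform on [0, 10)
same = [i / 100 + 0.001 for i in range(1000)]    # almost identical sample
shifted = [i / 100 + 3.0 for i in range(1000)]   # whole distribution moved by 3

stable_score = psi(reference, same)
drift_score = psi(reference, shifted)
```

The near-identical sample scores close to zero, while the shifted one scores far above the rule-of-thumb threshold because three reference bins are now empty in production.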

∑ Chapter 9.6 — Key Takeaways

  • Models break silently — they keep returning predictions that get worse without monitoring
  • Three drift types: data drift (P(X) shifts), concept drift (P(Y|X) shifts), prediction drift (P(Ŷ) shifts — symptom)
  • Detection: KS test, PSI, JS divergence — Evidently (OSS) and WhyLabs (SaaS) are leading tools
  • Retraining triggers: drift above threshold, scheduled cadence, or model quality metric below SLA
Chapter 9.7
Vector Databases & LLM Infrastructure

The LLM era introduced a new class of infrastructure: vector databases for retrieval, embedding pipelines for ingestion, API gateways for model routing, and prompt caching for cost reduction. This chapter covers the LLM-specific infrastructure layer.

A vector database stores high-dimensional embedding vectors and supports approximate nearest neighbour (ANN) search — finding the k most similar vectors to a query vector. This is the retrieval engine behind RAG, semantic search, and recommendation systems.

Vector DB Pipeline: from raw data to nearest-neighbour retrieval
Documents → Chunk → Embed Model → Vector DB → ANN Search → Top-k Results
Index algorithms: HNSW (graph), IVF (partition), PQ (compression). They trade off recall vs speed vs memory.
Vector DB | Type | Index | Filtering | Scale | Best For
Pinecone | Managed SaaS | Proprietary | Excellent | Billions | Zero-ops production RAG
Weaviate | OSS / Cloud | HNSW | GraphQL | Billions | Hybrid search (vector + keyword)
Qdrant | OSS / Cloud | HNSW | Rich filters | Billions | Performance, Rust-based
Chroma | OSS | HNSW | Basic | Millions | Prototyping, local dev
pgvector | Postgres extension | IVFFlat/HNSW | Full SQL | Millions | Already using Postgres
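
It helps to see the exact search that ANN indexes approximate. The sketch below scores every stored vector against the query with cosine similarity and keeps the top k; this brute-force scan is what HNSW or IVF avoid at scale. The three-dimensional vectors and document IDs are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Exact nearest-neighbour search: score every vector, keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
index = {
    "doc_mlops": [0.9, 0.1, 0.0],
    "doc_devops": [0.7, 0.3, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

The scan costs one similarity computation per stored vector, which is fine for thousands of documents and the reason approximate indexes exist for millions or billions.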
💰
Prompt Caching

Cache identical or semantically similar prompts. Can reduce API costs by 30–60%, depending on how repetitive the traffic is. Exact match (hash) or semantic match (embedding similarity > threshold). Tools: Redis, GPTCache, LiteLLM.
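
The exact-match variant is a dictionary keyed by a hash of the request. A minimal in-memory sketch follows; `PromptCache` and `fake_llm` are hypothetical stand-ins, and a production cache would normally live in a shared store with an expiry policy.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache keyed by a hash of the model name and messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_llm):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call_llm(model, messages)
        return self._store[key]


def fake_llm(model, messages):
    # Stand-in for a paid API call; a real client would go here.
    return f"{model} answered: {messages[-1]['content'].upper()}"


cache = PromptCache()
msgs = [{"role": "user", "content": "Explain MLOps"}]
first = cache.get_or_call("gpt-4o", msgs, fake_llm)
second = cache.get_or_call("gpt-4o", msgs, fake_llm)  # served from cache, no API call
```

Including the model name in the key matters: the same prompt sent to a different model must not return the cached answer.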

🚦
Rate Limiting & Quotas

Token-bucket or sliding window rate limiting per user/team. Prevents runaway costs from agent loops. Budget alerts when spend exceeds threshold. Critical for agentic workloads.
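
A token bucket can be sketched in a few lines. Here the "tokens" in the bucket stand for LLM tokens a user may spend: the bucket refills at a steady rate up to a cap, and a request is allowed only if its cost fits. The class, its parameters, and the numbers are illustrative assumptions; the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Token-bucket limiter; `now` is passed in so the logic is easy to test."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def allow(self, cost, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(capacity=1000, refill_per_second=100)
decisions = [
    limiter.allow(cost=800, now=0.0),  # allowed: 1000 available, 200 remain
    limiter.allow(cost=800, now=1.0),  # rejected: 200 + 100 refilled = 300
    limiter.allow(cost=800, now=6.0),  # allowed: 300 + 500 refilled = 800
]
```

An agent stuck in a loop drains its bucket quickly and is then throttled to the refill rate, which caps the spend per user.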

🔄
Gateway & Routing

Single API interface, multiple backends. Route by model capability, cost, latency. Auto-fallback: if OpenAI is down, route to Anthropic. Load balance across API keys. LiteLLM, Portkey, custom.

# LiteLLM: unified LLM API gateway
# Provider API keys are read from environment variables (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY)
from litellm import completion
messages = [{"role": "user", "content": "Explain MLOps"}]
# Same interface for any provider
response = completion(
    model="gpt-4o",        # or "claude-3-5-sonnet-20241022" or "ollama/llama3"
    messages=messages,
    max_tokens=500,
    temperature=0.7
)
print(response.choices[0].message.content)
# Fallback: try GPT-4o, fall back to Claude if it fails
response = completion(
    model="gpt-4o",
    messages=messages,
    fallbacks=["claude-3-5-sonnet-20241022"]  # auto-retry with fallback
)

∑ Chapter 9.7 — Key Takeaways

  • Vector databases power RAG, semantic search, and recommendations via ANN search on embedding vectors
  • HNSW is the dominant index algorithm — graph-based, high recall, fast query
  • Pinecone (managed), Qdrant (OSS, fast), Chroma (prototyping), pgvector (already using Postgres)
  • LLM gateway: unified API, prompt caching, rate limiting, auto-fallback. LiteLLM is a widely used open-source option
Chapter 9.8
Containerisation, Orchestration & Cost Optimisation

ML infrastructure is expensive — GPU hours cost 10–100× CPU hours. Containerisation ensures reproducibility, Kubernetes provides orchestration at scale, and cost optimisation determines whether your ML system is economically viable.

Docker containers package code, dependencies, and model weights into a single reproducible unit. “It works on my machine” becomes “it works in the container.” Multi-stage builds keep images small; GPU access via NVIDIA Container Toolkit.

# Dockerfile for ML model serving (multi-stage)
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install into an isolated prefix so libraries and console scripts (uvicorn) are copied together
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
# Model weights (Dockerfile comments must be on their own line)
COPY model/ ./model/
# FastAPI app
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# Build & run:
# docker build -t fraud-model:v1 .
# docker run -p 8000:8000 fraud-model:v1
# GPU: docker run --gpus all -p 8000:8000 fraud-model:v1
Instance Type | Cost | Availability | Preemption | Use Case
On-Demand | Full price ($2–30/hr for GPU) | Guaranteed | Never | Production serving, SLA-bound
Reserved (1–3yr) | 30–60% discount | Guaranteed | Never | Steady-state training clusters
Spot / Preemptible | 60–90% discount | Not guaranteed | 2min warning | Batch training with checkpointing
💸
The GPU Cost Problem

A single A100 (80GB): roughly $2/hr on-demand, varying by provider. Training a GPT-3-scale model: ~$4.6M (a commonly cited estimate). Fine-tuning LLaMA 70B: ~$5K–50K depending on approach. Inference: vLLM can reduce cost 2–5× vs naive serving. The cost of NOT optimising is enormous.

💡
Spot Instance Strategy

Checkpoint every N steps — when preempted, resume from last checkpoint. Use multiple spot pools (different instance types). Fall back to on-demand if all spots unavailable. Training frameworks (PyTorch Lightning, DeepSpeed) support checkpointing natively.
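
The checkpoint-and-resume loop looks roughly like this. The sketch uses a JSON file and a fake training step so it runs anywhere; `preempt_at` simulates the instance being reclaimed, and all names are hypothetical. Real frameworks save model and optimiser state instead of a small dictionary.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a preemption mid-write cannot corrupt the checkpoint."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def train(path, total_steps, checkpoint_every, preempt_at=None):
    """Toy loop: resumes from the last checkpoint; `preempt_at` simulates a spot reclaim."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
        if step == preempt_at:
            return state  # instance reclaimed; work since the last checkpoint is lost
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(ckpt, total_steps=100, checkpoint_every=10, preempt_at=37)
resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=100, checkpoint_every=10)
```

The preempted run loses only the steps after the last checkpoint (31 to 37 here), which is the trade-off that sets N: frequent checkpoints cost I/O, rare ones cost recomputation.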

📏
Right-Sizing

Match instance type to workload. Don’t use an A100 for a logistic regression. Profile GPU utilisation — if <50%, downsize. CPU inference is 10–100× cheaper for small models.

📉
Auto-Scaling & Scale-to-Zero

Scale replicas based on traffic. Scale to zero during off-hours (no cost). Karpenter (K8s), SageMaker auto-scaling, Cloud Run. Cold start latency is the tradeoff.

🧲
Model Optimisation

Quantisation (FP32 → INT8): typically 2–4× faster, with up to 4× less memory for weights. Distillation: train small model to mimic large one. Pruning: remove unimportant weights. ONNX Runtime: cross-platform optimised inference.
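
As a rough illustration of what FP32 → INT8 quantisation does, the sketch below applies symmetric per-tensor quantisation to a handful of weights: each float is divided by a shared scale and rounded to an integer in [-127, 127], which needs one byte instead of four. This is one simple scheme among several (per-channel and asymmetric variants exist), and the weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]


weights = [0.52, -1.27, 0.003, 0.9, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Small weights such as 0.003 collapse to zero here, which is why quantised models are re-evaluated for accuracy before deployment.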

Dimension | Cloud (AWS/GCP/Azure) | On-Premise
CapEx | None (pay-as-you-go) | High (buy hardware)
OpEx | Variable, can spike | Predictable
GPU Access | Subject to availability | Dedicated
Scaling | Elastic, minutes | Fixed capacity
Data Sovereignty | Provider-dependent | Full control
Best For | Variable workloads, startups | Steady-state, regulated industries

∑ Chapter 9.8 — Key Takeaways

  • Docker containers ensure reproducible ML environments — multi-stage builds keep images small
  • Spot instances save 60–90% on training — requires checkpointing every N steps
  • Cost optimisation: right-size instances, scale to zero, quantise models (INT8 = 2–4× savings)
  • Cloud vs on-prem: cloud for variable workloads, on-prem for steady-state and data sovereignty

🎓 Domain 9 Complete — MLOps & AI Engineering

  • Ch 9.1: ML lifecycle is a continuous feedback loop. MLOps = ML + DevOps + Data Engineering. 87% of models never reach production — engineering is the bottleneck.
  • Ch 9.2: Data pipelines are 60–80% of effort. Feature stores bridge training and serving. DVC versions data alongside code.
  • Ch 9.3: Experiment tracking (MLflow, W&B) logs everything. Model registry manages lifecycle stages with approval gates.
  • Ch 9.4: Batch (scheduled) vs real-time (API) inference. vLLM for LLMs, Triton for GPU. Canary, blue-green, shadow deployment.
  • Ch 9.5: ML CI/CD tests code + data + model quality. Airflow, Kubeflow, Prefect for orchestration. Training is the slow step.
  • Ch 9.6: Models break silently. Data drift, concept drift, prediction drift require continuous monitoring. Evidently and WhyLabs are leading tools.
  • Ch 9.7: Vector databases (Pinecone, Qdrant, pgvector) power RAG. LLM gateways (LiteLLM) provide unified API, caching, fallback.
  • Ch 9.8: Docker for reproducibility. Spot instances save 60–90%. Quantisation, right-sizing, scale-to-zero optimise cost.

MLOps is not about tools — it is about closing the feedback loop.

The best MLOps setup is the simplest one that reliably gets your models to production, monitors their performance, and triggers retraining when they degrade. Start simple (MLflow + FastAPI + Evidently), add complexity only when scale demands it.