AI Foundation · Domain 09

MLOps & AI Engineering

Building, deploying & operating AI systems at scale — data pipelines, experiment tracking, model serving, CI/CD for ML, monitoring, drift detection, vector databases, LLM APIs, and production infrastructure.

9.1

Chapter 9.1

The ML Lifecycle & MLOps Overview

By one widely quoted industry estimate, 87% of ML projects never reach production. The bottleneck is not intelligence or algorithms; it is engineering. MLOps is the discipline that bridges the gap between a working notebook and a reliable, monitored, continuously improving production AI system.

The ML Lifecycle — From Idea to Production Core

Machine learning is not a one-shot task — it is a continuous cycle. Data changes, user behaviour shifts, the world evolves. A model that was accurate at deployment degrades over time unless the full lifecycle is managed as a system, not a project.

The ML Lifecycle — a continuous feedback loop, not a linear pipeline

What Is MLOps? Core

MLOps = Machine Learning + DevOps + Data Engineering. It is the set of practices, tools, and cultural norms for deploying and maintaining ML models in production reliably. The name is modelled on DevOps — but the challenges are fundamentally different because the artefact is not just code, it is code + data + model.

🧠

ML (Data Science)

Feature engineering, model selection, hyperparameter tuning, evaluation metrics, experimentation. Owns the model quality.

⚙️

Dev (Software Engineering)

Code quality, testing, version control, CI/CD, API design, containerisation. Owns the code quality and deployment pipeline.

🏗️

Ops (Infrastructure)

Monitoring, alerting, scaling, GPU scheduling, cost management, reliability. Owns the system reliability.

Dimension	DevOps	MLOps
Artefact	Code (binary/container)	Code + Data + Model (triple versioning)
Testing	Unit / integration / E2E	Unit + data validation + model quality + bias
CI Trigger	Code commit	Code commit OR data change OR model drift
Versioning	Git for code	Git + DVC/LakeFS for data + model registry
Monitoring	Latency, errors, uptime	All of above + prediction quality + data drift
Rollback	Redeploy previous image	Complex — model + data + feature pipeline must align
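
The "triple versioning" row can be made concrete with a small sketch. It assumes a hypothetical `ReleaseRecord` that ties one deployment to a code commit, a data hash, and a model hash; the class name, fields, and sample values are illustrative and not part of any specific tool.

```python
import hashlib
from dataclasses import dataclass


def sha256_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ReleaseRecord:
    """Hypothetical record tying one deployment to its three inputs."""
    code_commit: str   # Git commit SHA of the training code
    data_hash: str     # content hash of the training dataset
    model_hash: str    # content hash of the serialised model

    def fingerprint(self) -> str:
        """One ID that changes if any of the three versions changes."""
        joined = "|".join([self.code_commit, self.data_hash, self.model_hash])
        return sha256_bytes(joined.encode("utf-8"))[:12]


record_v1 = ReleaseRecord(
    code_commit="a1b2c3d",
    data_hash=sha256_bytes(b"training rows, January snapshot"),
    model_hash=sha256_bytes(b"serialised model weights v1"),
)
# Same code and model bytes, but the data snapshot changed.
record_v2 = ReleaseRecord(
    code_commit="a1b2c3d",
    data_hash=sha256_bytes(b"training rows, February snapshot"),
    model_hash=record_v1.model_hash,
)
print(record_v1.fingerprint(), record_v2.fingerprint())
```

The two fingerprints differ even though the code commit is identical, which is why a Git SHA alone cannot identify an ML release.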

MLOps Maturity Levels In-depth

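Google (2021) defined three maturity levels for MLOps: Level 0 (manual process), Level 1 (ML pipeline automation), and Level 2 (CI/CD pipeline automation). Most organisations are at Level 0. The progression is not about buying tools; it is about automating the feedback loop.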

MLOps Maturity — from manual notebooks to fully automated retraining

The MLOps Tech Stack Map Core

MLOps Stack — each layer has purpose-built tooling

∑ Chapter 9.1 — Key Takeaways

ML lifecycle is a continuous feedback loop, not a linear pipeline — data → features → train → deploy → monitor → retrain
MLOps = ML + DevOps + Data Engineering — triple versioning (code + data + model) is the key difference from software DevOps
Most organisations are at Level 0 (manual) — automation happens in stages, not all at once
87% of models never reach production — engineering, not algorithms, is the bottleneck
The MLOps stack has purpose-built tools at every layer — no single tool covers everything

9.2

Chapter 9.2

Data Pipelines & Feature Engineering

In production ML, the model is the easy part. The data pipeline (getting the right data, at the right time, in the right format, with the right quality) is commonly estimated to account for 60–80% of the engineering effort. A model is only as good as the features that feed it.

Batch vs Streaming Pipelines Core

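Data pipelines come in two fundamental paradigms: batch (scheduled, high-throughput, process data in chunks) and streaming (real-time, event-driven, process data as it arrives). Most production systems use both: batch for training sets and backfills, streaming for fresh features at inference time. The related “lambda architecture” runs a batch layer and a streaming (speed) layer side by side over the same data and merges their results at query time.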

Batch vs Streaming Pipelines — most systems need both

Dimension	Batch	Streaming
Latency	Minutes to hours	Milliseconds to seconds
Throughput	Very high (TB/run)	Moderate (events/sec)
Tools	Spark, dbt, Airflow	Kafka, Flink, Spark Streaming
Use case	Training, daily reports, backfill	Real-time inference, fraud, recommendations
Complexity	Lower — easier to debug	Higher — ordering, exactly-once, backpressure
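
The difference between the two paradigms shows up even for a single feature. The sketch below computes the same mean order value twice: once batch-style over the full history, once streaming-style with an incremental update per event. The function and class names are illustrative assumptions, not from any pipeline framework.

```python
from dataclasses import dataclass


def batch_mean(order_values: list[float]) -> float:
    """Batch style: recompute the feature from the full history."""
    return sum(order_values) / len(order_values)


@dataclass
class StreamingMean:
    """Streaming style: update the feature one event at a time."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        # Incremental mean: no need to keep past events in memory.
        self.mean += (value - self.mean) / self.count
        return self.mean


orders = [20.0, 35.0, 50.0, 15.0]
stream = StreamingMean()
for value in orders:
    stream.update(value)
print(batch_mean(orders), stream.mean)
```

Both paths should agree on the final value. When they do not, the batch-trained model sees different features from the ones served online, which is one way training/serving skew arises.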

Feature Stores — Feast & Tecton In-depth

A feature store is a centralised repository of curated, versioned features that are shared between training and serving. Without it, data scientists recompute the same features ad hoc in every experiment, and serving code re-implements training logic — the most common source of training/serving skew.

🍽️

Feast (Open Source)

Python SDK. Offline store (BigQuery, Redshift, file). Online store (Redis, DynamoDB). Git-based feature definitions. Free, self-managed. Great for teams starting with feature stores.

⚡

Tecton (Managed)

Enterprise feature platform. Real-time feature pipelines. Built-in monitoring. Automatic backfill. Complex transformations. Higher cost, lower operational burden.

# Feast feature definition (Python SDK; assumes a recent Feast release with the Field/schema API)
from datetime import timedelta
from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64
# Entity: the join key that features are looked up by
user = Entity(name="user", join_keys=["user_id"])
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_order_value", dtype=Float32),
        Field(name="total_orders_30d", dtype=Int64),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=BigQuerySource(
        name="user_features_source",
        table="project.dataset.user_features",
        timestamp_field="event_timestamp",
    ),
)
# After `feast apply`, load the store from the feature repo (repo path is an assumption)
store = FeatureStore(repo_path=".")
# Training: get historical features (point-in-time join)
# entity_df must contain user_id and event_timestamp columns
training_df = store.get_historical_features(entity_df=entity_df, features=["user_features:avg_order_value"]).to_df()
# Serving: get latest features (online store lookup, typically a few milliseconds)
features = store.get_online_features(entity_rows=[{"user_id": 42}], features=["user_features:avg_order_value"]).to_dict()
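
The "point-in-time join" mentioned in the training comment is what prevents label leakage: each training row may only see feature values that were known at that row's timestamp. A minimal sketch in plain Python, assuming a hypothetical per-user feature log and ignoring the TTL that a real feature store would also apply:

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature log for one user: (timestamp, avg_order_value),
# sorted by timestamp.
feature_log = [
    (datetime(2024, 1, 1), 20.0),
    (datetime(2024, 1, 10), 35.0),
    (datetime(2024, 1, 20), 50.0),
]


def feature_as_of(log, event_time):
    """Return the latest feature value known at event_time, else None."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, event_time)  # entries with ts <= event_time
    return log[idx - 1][1] if idx > 0 else None


# A label observed on 15 Jan must only see features from on or before 15 Jan.
print(feature_as_of(feature_log, datetime(2024, 1, 15)))
```

A naive join on `user_id` alone would pick up the 20 Jan value and leak future information into the training set.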

Data Validation & Versioning Core

Data is the most common failure mode in production ML. Great Expectations lets you define expectations about your data (schema, ranges, distributions) and test them automatically. DVC (Data Version Control) tracks data files alongside code in Git, storing the actual data in remote storage.

# Great Expectations: validate a dataframe
# (fluent API from the 0.17/0.18 releases; GX 1.x renamed several of these calls)
import great_expectations as gx
import pandas as pd
df = pd.read_parquet("data/training.parquet")  # example input; the path is an assumption
context = gx.get_context()
ds = context.sources.add_pandas("my_ds")
asset = ds.add_dataframe_asset("training_data")
batch = asset.build_batch_request(dataframe=df)
# Define expectations
validator = context.get_validator(batch_request=batch)
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
validator.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)
results = validator.validate()  # Returns success/failure + details
if not results.success:
    raise ValueError("Training data failed validation")  # Stop the pipeline before training

# DVC: version data alongside code
$ dvc init                          # Initialise DVC in git repo
$ dvc add data/training.parquet     # Track data file (creates .dvc pointer)
$ git add data/training.parquet.dvc data/.gitignore  # Commit pointer (DVC git-ignores the data itself)
$ git commit -m "Add training data v1"
$ git tag v1.0                      # Tag so this exact version can be checked out later
$ dvc push                          # Push actual data to the configured remote (S3, GCS)
# Later: reproduce exact training data
$ git checkout v1.0                 # Checkout code + DVC pointers
$ dvc pull                          # Pull matching data from remote
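
The pointer mechanism behind this workflow is content addressing: Git stores a small file containing a hash of the data, and the hash identifies the exact bytes. The sketch below mimics that idea with a JSON pointer. The file name suffix and fields are illustrative assumptions; the real `.dvc` file is YAML and its exact layout is not reproduced here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file holding the data file's MD5 and size."""
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    pointer = {"path": data_path.name, "md5": digest, "size": data_path.stat().st_size}
    pointer_path = data_path.parent / (data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Check that the data on disk is the version the pointer describes."""
    expected = json.loads(pointer_path.read_text())["md5"]
    return hashlib.md5(data_path.read_bytes()).hexdigest() == expected


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training.csv"
    data_file.write_text("user_id,age\n1,34\n")
    pointer_file = write_pointer(data_file)      # this small file goes into Git
    unchanged = matches_pointer(data_file, pointer_file)
    data_file.write_text("user_id,age\n1,35\n")  # data silently edited
    changed = matches_pointer(data_file, pointer_file)
print(unchanged, changed)
```

Because the pointer is tiny and text-based, it diffs and merges like code while the large data file lives in object storage.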

∑ Chapter 9.2 — Key Takeaways

Data pipelines are 60–80% of ML engineering effort — batch for training, streaming for serving
Feature stores eliminate training/serving skew by sharing features between training and inference
Feast (open source) and Tecton (managed) are the two dominant feature store approaches
Great Expectations validates data quality with testable expectations — catch data bugs before they hit the model
DVC versions data alongside code — Git tracks pointers, remote stores actual data

9.3

Chapter 9.3

Experiment Tracking & Model Registry

Without experiment tracking, ML research is a folder of notebooks named model_v2_final_FINAL_actually_final.ipynb. A model registry is the single source of truth for which model version is in production, why it was promoted, and what it was trained on.

Why Experiment Tracking Matters Core

An ML experiment produces multiple artefacts: code, data reference, hyperparameters, metrics, trained model weights, and environment configuration. Without structured tracking, reproducing a result from three months ago is nearly impossible. Experiment trackers log all of these automatically and make them searchable, comparable, and shareable.

📁

Without Tracking

Scattered notebooks. “Which hyperparams gave 92% accuracy?” — nobody knows. Cannot reproduce last month’s best run. Model deployed but nobody knows which commit trained it.

✅

With Tracking

Every run logged: params, metrics, artefacts, code version. Compare 200 runs in a table. One-click reproduce any experiment. Full lineage from data to deployed model.

📊

What Gets Tracked

Hyperparameters, metrics (loss, accuracy, F1), model weights, training data hash, code commit, environment (Python version, packages), run duration, GPU utilisation.

# MLflow — experiment tracking in 10 lines
import mlflow
mlflow.set_experiment("fraud-detection")
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_param("model_type", "XGBoost")
    # ... train model ...
    mlflow.log_metric("accuracy", 0.943)
    mlflow.log_metric("f1_score", 0.891)
    mlflow.log_metric("auc_roc", 0.967)
    mlflow.sklearn.log_model(model, "model")  # Save model artefact (for a native XGBoost model, use mlflow.xgboost.log_model)
    mlflow.log_artifact("confusion_matrix.png")  # Save any file

# Weights & Biases — experiment tracking
import wandb
wandb.init(project="fraud-detection", config={
    "learning_rate": 0.001,
    "epochs": 50,
    "model_type": "XGBoost"
})
# ... train model, log metrics per step ...
wandb.log({"accuracy": 0.943, "f1": 0.891, "loss": 0.12})
wandb.finish()

Model Registry & Lifecycle Core

A model registry is a versioned catalogue of trained models with metadata, lineage, and lifecycle stages (Registered → Staging → Production → Archived). It answers: “What model is in production right now, who approved it, and what data trained it?”

Model Lifecycle — from training to retirement with approval gates

Tool	Type	Tracking	Registry	UI	Best For
MLflow	Open source	Excellent	Built-in	Functional	Self-hosted, full control
W&B	SaaS (free tier)	Excellent	Built-in	Beautiful	Teams, collaboration, sweeps
Neptune	SaaS	Excellent	Basic	Good	Metadata-heavy experiments
ClearML	Open source	Good	Built-in	Good	End-to-end (track + orchestrate)
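
The registry idea itself is small enough to sketch without any of the tools above. The example below assumes a hypothetical `ModelVersion` record and an illustrative transition policy; it is not the data model of MLflow, W&B, or any other product.

```python
from dataclasses import dataclass, field

# Allowed stage transitions (illustrative policy, not a specific tool's rules).
ALLOWED = {
    "Registered": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    data_hash: str                                # what data trained it
    stage: str = "Registered"
    history: list = field(default_factory=list)   # (from, to, approver)

    def transition(self, new_stage: str, approved_by: str) -> None:
        """Move to new_stage if the policy allows it, recording who approved."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not allowed")
        self.history.append((self.stage, new_stage, approved_by))
        self.stage = new_stage


mv = ModelVersion(name="fraud-detection", version=3, data_hash="9f2c")
mv.transition("Staging", approved_by="alice")
mv.transition("Production", approved_by="bob")
print(mv.stage, mv.history)
```

The `history` list is what answers "who approved it", and the policy table is the approval gate: a freshly registered model cannot jump straight to Production.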

∑ Chapter 9.3 — Key Takeaways

Experiment tracking logs params, metrics, artefacts, and code version for every training run
MLflow (open source) and W&B (SaaS) are the dominant tools — both take ~10 lines to integrate
Model registry provides versioned lifecycle stages: Registered → Staging → Production → Archived
Reproducibility requires tracking: code + data + environment + hyperparameters + random seed

9.4

Chapter 9.4

Model Serving & Deployment Patterns

A trained model sitting in a notebook is worth nothing. Model serving is the discipline of making predictions available to users — via APIs, batch jobs, or embedded systems — with the latency, throughput, and reliability that production demands.

Batch vs Real-Time Inference Core

Batch Inference

Pattern: Run model on all data at scheduled intervals (hourly, daily).

Output: Predictions written to a database/table, served via lookup.

Latency: Minutes to hours (acceptable for non-interactive use cases).

Scale: Process millions of records per run using distributed compute.

Examples: Email campaigns, daily risk scores, recommendation pre-compute.

Real-Time Inference

Pattern: Model behind an API. A request comes in, a prediction goes out.

Output: JSON response with prediction, typically <100ms.

Latency: Milliseconds (p99 < 200ms for most production systems).

Scale: Auto-scale replicas based on QPS (queries per second).

Examples: Fraud detection, search ranking, chatbots, real-time pricing.

🎯

TorchServe

PyTorch-native model server. MAR packaging format. Built-in batching, metrics, model versioning. Best for PyTorch-only deployments.

🚀

NVIDIA Triton

Multi-framework (PyTorch, TF, ONNX, TensorRT). Dynamic batching. GPU-optimised. Ensemble pipelines. Best for high-throughput GPU serving.

⚡

vLLM

LLM-optimised serving. PagedAttention for memory efficiency. Continuous batching. OpenAI-compatible API. Best for LLM inference (2–5× faster than naive).

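# FastAPI: simplest model serving API
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
app = FastAPI()
model = joblib.load("model.pkl")  # Load once at startup, not per request
class Features(BaseModel):
    # Typed request body: FastAPI rejects malformed input with a 422 response
    age: float
    income: float
    credit_score: float
@app.post("/predict")
def predict(features: Features):
    # Plain def: FastAPI runs it in a threadpool, so the blocking predict call does not stall the event loop
    X = [[features.age, features.income, features.credit_score]]
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probability": probability}
# Run: uvicorn app:app --host 0.0.0.0 --port 8000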

Deployment Strategies In-depth

Three deployment strategies — risk vs speed tradeoff
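
A canary rollout needs a way to send a fixed share of traffic to the new model while keeping each user on the same side between requests. One common approach, sketched here under the assumption that requests carry a stable user ID, is to hash the ID into a bucket from 0 to 99. The `route` function and the labels `stable` and `candidate` are illustrative names.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send canary_percent% of users to the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return "candidate" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5) == "candidate" for u in users) / len(users)
print(f"candidate share: {share:.3f}")
```

Raising `canary_percent` from 5 to 20 to 100 only ever moves users from `stable` to `candidate`, never back. In this framing a blue-green switch is the jump from 0 to 100 in one step, and a shadow deployment would call both models but return only the stable model's answer.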

∑ Chapter 9.4 — Key Takeaways

Batch inference: scheduled, high-throughput, store predictions. Real-time: API-based, <100ms latency
Model servers: TorchServe (PyTorch), Triton (multi-framework GPU), vLLM (LLM-optimised with PagedAttention)
Deployment: canary (gradual %), blue-green (instant switch), shadow (zero user risk)
FastAPI is the simplest starting point — graduate to model servers as scale demands

9.5

Chapter 9.5

CI/CD for Machine Learning

Software CI/CD deploys code. ML CI/CD deploys code + data + models. The pipeline must test data quality, validate model performance, check for bias, and only then deploy — all automatically, triggered by code changes OR data changes OR drift alerts.

How ML CI/CD Differs Core

ML CI/CD Pipeline — every stage has ML-specific requirements
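
The stage that most distinguishes ML CI/CD from software CI/CD is the model quality gate: the pipeline compares the candidate model's metrics against the production baseline and blocks the deploy on regression. A minimal sketch follows; the metric names, the AUC floor, and the regression tolerance are illustrative assumptions that each team would set for itself.

```python
def passes_gate(candidate: dict, baseline: dict,
                min_auc: float = 0.90,
                max_regression: float = 0.01) -> tuple[bool, list[str]]:
    """Return (ok, reasons). All thresholds are illustrative assumptions."""
    reasons = []
    # Absolute floor: never ship a model below this, whatever production does.
    if candidate["auc"] < min_auc:
        reasons.append(f"auc {candidate['auc']:.3f} below floor {min_auc}")
    # Relative check: no metric may drop by more than max_regression.
    for metric, base_value in baseline.items():
        if candidate[metric] < base_value - max_regression:
            reasons.append(f"{metric} regressed vs production")
    return (not reasons, reasons)


baseline = {"auc": 0.960, "f1": 0.880}
good = {"auc": 0.967, "f1": 0.891}
bad = {"auc": 0.955, "f1": 0.800}
print(passes_gate(good, baseline))
print(passes_gate(bad, baseline))
```

In a CI job, a failed gate would typically exit with a non-zero status so that the deploy step never runs.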

🏗️

Airflow

Python DAGs. Mature ecosystem. Widely adopted. Schedule-based. Good for batch ML pipelines. Steeper learning curve. Not ML-specific but ML-capable.

⚙️

Kubeflow Pipelines

Kubernetes-native. ML-first. Component-based. Built-in experiment tracking. Best for teams already on K8s. Heavier operational footprint.

🐍

Prefect

Python-native. Dynamic DAGs (not just static). Easy local → cloud transition. Good UI. Lower learning curve than Airflow. Growing rapidly.

∑ Chapter 9.5 — Key Takeaways

ML CI/CD tests code + data + model quality — not just code
Pipeline triggers: code commit, data change, drift alert, or scheduled retrain
Orchestrators: Airflow (mature), Kubeflow (K8s-native), Prefect (Python-native, growing fast)
The “build” step in ML CI/CD takes hours (training) — caching and incremental training are critical

9.6

Chapter 9.6

Monitoring, Drift & Observability

Software breaks loudly — errors, crashes, timeouts. Models break silently: they keep returning predictions, but the predictions get worse. Without monitoring, a model can degrade for weeks before anyone notices. Drift detection is the early warning system.

Model Monitoring vs Software Monitoring Core

Software Monitoring

Uptime: Is the service running?

Latency: p50, p95, p99 response times

Errors: HTTP 5xx, exceptions, timeouts

Resources: CPU, memory, GPU utilisation

Throughput: Requests per second

Model Monitoring (additional)

Data drift: Has the input distribution changed?

Concept drift: Has the input–output relationship changed?

Prediction drift: Has the output distribution changed?

Model quality: Accuracy, F1, AUC. Are they degrading?

Feature health: Missing values, outliers, new categories

Data Drift, Concept Drift & Prediction Drift In-depth

Three types of drift — different causes, different detection methods

# Evidently: drift detection in 5 lines
# (legacy Report API from the 0.4.x releases; newer releases changed the import paths)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")   # Visual drift report
# Programmatic access to results
drift_results = report.as_dict()
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining_pipeline()  # Auto-retrain on drift (hypothetical hook, defined elsewhere)
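
To see what a drift score measures underneath such a report, here is a hedged sketch of the Population Stability Index (PSI) for one numeric feature in plain Python. The binning scheme (equal-width bins taken from the reference sample) and the `eps` floor for empty bins are assumptions of this sketch; libraries differ in both choices.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
shifted = [v + 3.0 for v in reference]       # same shape, mean moved by 3
print(psi(reference, reference), psi(reference, shifted))
```

A common rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant shift, though the right threshold depends on the feature and the cost of a false alarm.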

∑ Chapter 9.6 — Key Takeaways

Models break silently — they keep returning predictions that get worse without monitoring
Three drift types: data drift (P(X) shifts), concept drift (P(Y|X) shifts), prediction drift (P(Ŷ) shifts — symptom)
Detection: KS test, PSI, JS divergence — Evidently (OSS) and WhyLabs (SaaS) are leading tools
Retraining triggers: drift above threshold, scheduled cadence, or model quality metric below SLA

9.7

Chapter 9.7

Vector Databases & LLM Infrastructure

The LLM era introduced a new class of infrastructure: vector databases for retrieval, embedding pipelines for ingestion, API gateways for model routing, and prompt caching for cost reduction. This chapter covers the LLM-specific infrastructure layer.

Vector Database Architectures Core

A vector database stores high-dimensional embedding vectors and supports approximate nearest neighbour (ANN) search — finding the k most similar vectors to a query vector. This is the retrieval engine behind RAG, semantic search, and recommendation systems.

Vector DB Pipeline — from raw data to nearest-neighbour retrieval

Vector DB	Type	Index	Filtering	Scale	Best For
Pinecone	Managed SaaS	Proprietary	Excellent	Billions	Zero-ops production RAG
Weaviate	OSS / Cloud	HNSW	GraphQL	Billions	Hybrid search (vector + keyword)
Qdrant	OSS / Cloud	HNSW	Rich filters	Billions	Performance, Rust-based
Chroma	OSS	HNSW	Basic	Millions	Prototyping, local dev
pgvector	Postgres extension	IVFFlat/HNSW	Full SQL	Millions	Already using Postgres

LLM API Management & Gateway Patterns In-depth

💰

Prompt Caching

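Cache identical or semantically similar prompts. Can cut API costs substantially (30–60% is a commonly quoted range; the real saving depends on how repetitive the traffic is). Exact match (hash) or semantic match (embedding similarity > threshold). Tools: Redis, GPTCache, LiteLLM.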

🚦

Rate Limiting & Quotas

Token-bucket or sliding window rate limiting per user/team. Prevents runaway costs from agent loops. Budget alerts when spend exceeds threshold. Critical for agentic workloads.

🔄

Gateway & Routing

Single API interface, multiple backends. Route by model capability, cost, latency. Auto-fallback: if OpenAI is down, route to Anthropic. Load balance across API keys. LiteLLM, Portkey, custom.
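
Of the patterns above, exact-match prompt caching is the simplest to sketch. The example assumes a hypothetical `PromptCache` class and a stand-in `fake_llm` function in place of a real provider call; a production version would add expiry and would usually live in Redis or a similar shared store.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache keyed on a hash of model + messages + params."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages, **params):
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps([model, messages, params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call_llm, model, messages, **params):
        key = self._key(model, messages, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call_llm(model, messages, **params)
        return self._store[key]


def fake_llm(model, messages, **params):
    """Stand-in for a real provider call (hypothetical)."""
    return f"[{model}] reply to: {messages[-1]['content']}"


cache = PromptCache()
msgs = [{"role": "user", "content": "Explain MLOps"}]
first = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0.0)
second = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0.0)
print(cache.hits, cache.misses)
```

Including the sampling parameters in the key matters: a cached answer generated at temperature 0 should not be served for a request that asked for temperature 1.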

# LiteLLM: unified LLM API gateway
# Provider API keys are read from environment variables (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY)
from litellm import completion
messages = [{"role": "user", "content": "Explain MLOps"}]
# Same interface for any provider
response = completion(
    model="gpt-4o",        # or "claude-3-5-sonnet-20241022" or "ollama/llama3"
    messages=messages,
    max_tokens=500,
    temperature=0.7
)
print(response.choices[0].message.content)
# Fallback: try GPT-4o, fall back to Claude if it fails
response = completion(
    model="gpt-4o",
    messages=messages,
    fallbacks=["claude-3-5-sonnet-20241022"]  # auto-retry with fallback
)
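
The token-bucket rate limiting described in the cards above can be sketched in a few lines. This version budgets LLM tokens rather than requests and takes the current time as an argument so the behaviour is deterministic; the class name, capacity, and refill rate are illustrative assumptions.

```python
class TokenBucket:
    """Token-bucket limiter; time is passed in so the logic is testable."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity   # start with a full budget
        self.last = now

    def allow(self, cost: float, now: float) -> bool:
        """Spend `cost` tokens if available; refill based on elapsed time."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=1000, refill_per_sec=100)  # LLM tokens, not requests
print(bucket.allow(800, now=0.0))   # fits in the initial budget
print(bucket.allow(800, now=1.0))   # only 300 tokens available after 1 s
print(bucket.allow(800, now=7.0))   # refilled to 900 after 6 more seconds
```

Keeping one bucket per user or team is what stops a single runaway agent loop from consuming the whole budget.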

∑ Chapter 9.7 — Key Takeaways

Vector databases power RAG, semantic search, and recommendations via ANN search on embedding vectors
HNSW is the dominant index algorithm — graph-based, high recall, fast query
Pinecone (managed), Qdrant (OSS, fast), Chroma (prototyping), pgvector (already using Postgres)
LLM gateway: unified API, prompt caching, rate limiting, auto-fallback. LiteLLM is a widely used open-source option

9.8

Chapter 9.8

Containerisation, Orchestration & Cost Optimisation

ML infrastructure is expensive — GPU hours cost 10–100× CPU hours. Containerisation ensures reproducibility, Kubernetes provides orchestration at scale, and cost optimisation determines whether your ML system is economically viable.

Docker for ML Core

Docker containers package code, dependencies, and model weights into a single reproducible unit. “It works on my machine” becomes “it works in the container.” Multi-stage builds keep images small; GPU access via NVIDIA Container Toolkit.

# Dockerfile for ML model serving (multi-stage)
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
# Copy installed packages and their console scripts (e.g. uvicorn) from the builder
COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
COPY --from=builder /usr/local/bin /usr/local/bin
# Model weights
COPY model/ ./model/
# FastAPI app
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# Build & run:
# docker build -t fraud-model:v1 .
# docker run -p 8000:8000 fraud-model:v1
# GPU: docker run --gpus all -p 8000:8000 fraud-model:v1

GPU Scheduling & Spot Instances In-depth

Instance Type	Cost	Availability	Preemption	Use Case
On-Demand	Full price ($2–30/hr for GPU)	Guaranteed	Never	Production serving, SLA-bound
Reserved (1–3yr)	30–60% discount	Guaranteed	Never	Steady-state training clusters
Spot / Preemptible	60–90% discount	Not guaranteed	Short notice (about 2 min on AWS, about 30 s on GCP and Azure)	Batch training with checkpointing

💸

The GPU Cost Problem

A single A100 (80GB): roughly $2–5/hr on-demand, depending on provider. Training a GPT-3-scale model: ~$4.6M by one widely cited 2020 estimate. Fine-tuning LLaMA 70B: ~$5K–50K depending on approach. Inference: vLLM reduces cost 2–5× vs naive serving. The cost of NOT optimising is enormous.

💡

Spot Instance Strategy

Checkpoint every N steps — when preempted, resume from last checkpoint. Use multiple spot pools (different instance types). Fall back to on-demand if all spots unavailable. Training frameworks (PyTorch Lightning, DeepSpeed) support checkpointing natively.
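
The checkpoint-and-resume strategy can be illustrated with a toy training loop. Everything here is a stand-in: the `train` function, the JSON checkpoint, and the simulated preemption at step 37 are assumptions for the sketch, while real frameworks save optimiser and model state in their own formats.

```python
import json
import tempfile
from pathlib import Path


def train(total_steps, ckpt_path: Path, every: int, preempt_at=None):
    """Toy training loop that resumes from ckpt_path if it exists."""
    state = {"step": 0, "loss": 1.0}
    if ckpt_path.exists():
        state = json.loads(ckpt_path.read_text())  # resume from last checkpoint
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise InterruptedError("spot instance reclaimed")
        state["step"] += 1
        state["loss"] *= 0.99  # stand-in for one optimiser step
        if state["step"] % every == 0:
            ckpt_path.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "ckpt.json"
    try:
        train(100, ckpt, every=10, preempt_at=37)  # first attempt is preempted
    except InterruptedError:
        pass
    resumed_from = json.loads(ckpt.read_text())["step"]
    final = train(100, ckpt, every=10)             # second attempt resumes
print(resumed_from, final["step"])
```

The work lost to a preemption is bounded by the checkpoint interval, so the interval is a trade-off between wasted compute and checkpoint I/O. A production version would also write the checkpoint atomically (write to a temporary file, then rename) so that a preemption mid-write cannot corrupt it.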

Cost Optimisation Strategies Core

📏

Right-Sizing

Match instance type to workload. Don’t use an A100 for a logistic regression. Profile GPU utilisation — if <50%, downsize. CPU inference is 10–100× cheaper for small models.

📉

Auto-Scaling & Scale-to-Zero

Scale replicas based on traffic. Scale to zero during off-hours (no cost). Karpenter (K8s), SageMaker auto-scaling, Cloud Run. Cold start latency is the tradeoff.

🧲

Model Optimisation

Quantisation (FP32 → INT8): up to 4× less memory for weights, often 2–4× faster depending on hardware support. Distillation: train small model to mimic large one. Pruning: remove unimportant weights. ONNX Runtime: cross-platform optimised inference.
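
To make the quantisation idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in plain Python. It is one simple scheme among several (per-channel and asymmetric variants exist), and the function names and sample weights are illustrative.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.0, 0.64]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of four, and the rounding error per weight is at most half the scale. Whether that error is acceptable has to be checked on the model's evaluation metrics, not assumed.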

Dimension	Cloud (AWS/GCP/Azure)	On-Premise
CapEx	None (pay-as-you-go)	High (buy hardware)
OpEx	Variable, can spike	Predictable
GPU Access	Subject to availability	Dedicated
Scaling	Elastic, minutes	Fixed capacity
Data Sovereignty	Provider-dependent	Full control
Best For	Variable workloads, startups	Steady-state, regulated industries

∑ Chapter 9.8 — Key Takeaways

Docker containers ensure reproducible ML environments — multi-stage builds keep images small
Spot instances save 60–90% on training — requires checkpointing every N steps
Cost optimisation: right-size instances, scale to zero, quantise models (INT8 = 2–4× savings)
Cloud vs on-prem: cloud for variable workloads, on-prem for steady-state and data sovereignty

🎓 Domain 9 Complete — MLOps & AI Engineering

Ch 9.1: ML lifecycle is a continuous feedback loop. MLOps = ML + DevOps + Data Engineering. 87% of models never reach production — engineering is the bottleneck.
Ch 9.2: Data pipelines are 60–80% of effort. Feature stores bridge training and serving. DVC versions data alongside code.
Ch 9.3: Experiment tracking (MLflow, W&B) logs everything. Model registry manages lifecycle stages with approval gates.
Ch 9.4: Batch (scheduled) vs real-time (API) inference. vLLM for LLMs, Triton for GPU. Canary, blue-green, shadow deployment.
Ch 9.5: ML CI/CD tests code + data + model quality. Airflow, Kubeflow, Prefect for orchestration. Training is the slow step.
Ch 9.6: Models break silently. Data drift, concept drift, prediction drift require continuous monitoring. Evidently and WhyLabs are leading tools.
Ch 9.7: Vector databases (Pinecone, Qdrant, pgvector) power RAG. LLM gateways (LiteLLM) provide unified API, caching, fallback.
Ch 9.8: Docker for reproducibility. Spot instances save 60–90%. Quantisation, right-sizing, scale-to-zero optimise cost.

MLOps is not about tools — it is about closing the feedback loop.

The best MLOps setup is the simplest one that reliably gets your models to production, monitors their performance, and triggers retraining when they degrade. Start simple (MLflow + FastAPI + Evidently), add complexity only when scale demands it.
