MLOps & AI Engineering
Building, deploying & operating AI systems at scale: data pipelines, experiment tracking, model serving, CI/CD for ML, monitoring, drift detection, vector databases, LLM APIs, and production infrastructure.
An often-cited industry estimate is that 87% of ML models never reach production. The bottleneck is not intelligence or algorithms; it is engineering. MLOps is the discipline that bridges the gap between a working notebook and a reliable, monitored, continuously improving production AI system.
Machine learning is not a one-shot task; it is a continuous cycle. Data changes, user behaviour shifts, the world evolves. A model that was accurate at deployment degrades over time unless the full lifecycle is managed as a system, not a project.
MLOps = Machine Learning + DevOps + Data Engineering. It is the set of practices, tools, and cultural norms for deploying and maintaining ML models in production reliably. The name is modelled on DevOps, but the challenges are fundamentally different because the artefact is not just code: it is code + data + model.
Feature engineering, model selection, hyperparameter tuning, evaluation metrics, experimentation. Owns the model quality.
Code quality, testing, version control, CI/CD, API design, containerisation. Owns the code quality and deployment pipeline.
Monitoring, alerting, scaling, GPU scheduling, cost management, reliability. Owns the system reliability.
| Dimension | DevOps | MLOps |
|---|---|---|
| Artefact | Code (binary/container) | Code + Data + Model (triple versioning) |
| Testing | Unit / integration / E2E | Unit + data validation + model quality + bias |
| CI Trigger | Code commit | Code commit OR data change OR model drift |
| Versioning | Git for code | Git + DVC/LakeFS for data + model registry |
| Monitoring | Latency, errors, uptime | All of above + prediction quality + data drift |
| Rollback | Redeploy previous image | Complex: model + data + feature pipeline must align |
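The "triple versioning" and rollback rows of the table can be made concrete with a small sketch. The `ReleaseManifest` type, its field names, and the example identifiers are illustrative assumptions, not part of any particular tool; the point is only that a deployable release pins code, data, and model together, so a rollback restores all three at once.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Hypothetical record pinning the three artefacts of one ML release."""
    code_commit: str    # e.g. a Git commit SHA
    data_version: str   # e.g. a DVC/LakeFS revision or content hash
    model_version: str  # e.g. a model registry version

    def fingerprint(self) -> str:
        # One identifier that changes whenever any of the three parts changes.
        joined = "|".join([self.code_commit, self.data_version, self.model_version])
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


v1 = ReleaseManifest("a1b2c3d", "data-2024-01", "fraud-model:1")
v2 = ReleaseManifest("a1b2c3d", "data-2024-02", "fraud-model:2")  # same code, new data and model

# Rolling back means redeploying the whole v1 manifest, not just the model.
print(v1.fingerprint() != v2.fingerprint())
```

Even though the code commit is identical in both releases, the fingerprints differ, which is why a code-only version number is not enough to identify what is running.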
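Google (2021) defined three maturity levels for MLOps: Level 0 (manual process), Level 1 (ML pipeline automation), and Level 2 (CI/CD pipeline automation). Most organisations are at Level 0. The progression is not about buying tools; it is about automating the feedback loop.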
∑ Chapter 9.1 — Key Takeaways
- ML lifecycle is a continuous feedback loop, not a linear pipeline — data → features → train → deploy → monitor → retrain
- MLOps = ML + DevOps + Data Engineering — triple versioning (code + data + model) is the key difference from software DevOps
- Most organisations are at Level 0 (manual) — automation happens in stages, not all at once
- 87% of models never reach production — engineering, not algorithms, is the bottleneck
- The MLOps stack has purpose-built tools at every layer — no single tool covers everything
In production ML, the model is the easy part. The data pipeline — getting the right data, at the right time, in the right format, with the right quality — accounts for 60–80% of the engineering effort. A model is only as good as the features that feed it.
Data pipelines come in two fundamental paradigms: batch (scheduled, high-throughput, process data in chunks) and streaming (real-time, event-driven, process data as it arrives). Most production systems use both. The “lambda architecture” formalises this by pairing a batch layer, which periodically recomputes complete views and typically feeds training, with a streaming layer that provides low-latency updates for inference.
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (TB/run) | Moderate (events/sec) |
| Tools | Spark, dbt, Airflow | Kafka, Flink, Spark Streaming |
| Use case | Training, daily reports, backfill | Real-time inference, fraud, recommendations |
| Complexity | Lower — easier to debug | Higher — ordering, exactly-once, backpressure |
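To illustrate the two paradigms in the table, the sketch below computes the same feature (a mean order value) both ways. The class and variable names are made up for the example, and a real pipeline would run on Spark or Flink rather than a Python list; this only shows the difference in shape between "process the whole chunk" and "update per event".

```python
from statistics import fmean


def batch_mean(values):
    """Batch style: the whole chunk is available, so compute in one pass."""
    return fmean(values)


class StreamingMean:
    """Streaming style: update the running mean one event at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental update: past events do not need to be kept in memory.
        self.mean += (value - self.mean) / self.count
        return self.mean


order_values = [20.0, 35.0, 50.0, 15.0]

stream = StreamingMean()
for v in order_values:
    stream.update(v)

print(batch_mean(order_values), stream.mean)
```

Both paths should agree on the final value; when they do not, the disagreement is one concrete form of training/serving skew.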
A feature store is a centralised repository of curated, versioned features that are shared between training and serving. Without it, data scientists recompute the same features ad hoc in every experiment, and serving code re-implements training logic — the most common source of training/serving skew.
Feast (open source): Python SDK. Offline store (BigQuery, Redshift, file). Online store (Redis, DynamoDB). Git-based feature definitions. Free, self-managed. Great for teams starting with feature stores.
Tecton (managed): Enterprise feature platform. Real-time feature pipelines. Built-in monitoring. Automatic backfill. Complex transformations. Higher cost, lower operational burden.
    # Feast feature definition (Python SDK).
    # Written against the Field/schema API of recent Feast releases;
    # older releases used Feature and ValueType instead.
    from datetime import timedelta

    from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
    from feast.types import Float32, Int64

    user = Entity(name="user_id", join_keys=["user_id"])

    user_features = FeatureView(
        name="user_features",
        entities=[user],
        ttl=timedelta(days=1),
        schema=[
            Field(name="avg_order_value", dtype=Float32),
            Field(name="total_orders_30d", dtype=Int64),
            Field(name="days_since_last_order", dtype=Int64),
        ],
        online=True,
        source=BigQuerySource(
            table="project.dataset.user_features",
            timestamp_field="event_timestamp",
        ),
    )

    # Assumes a feature repo in the current directory and an entity_df
    # DataFrame with user_id and event_timestamp columns.
    store = FeatureStore(repo_path=".")

    # Training: get historical features (point-in-time join)
    training_df = store.get_historical_features(
        entity_df=entity_df,
        features=["user_features:avg_order_value"],
    ).to_df()

    # Serving: get latest features (online store lookup, ~5ms)
    features = store.get_online_features(
        features=["user_features:avg_order_value"],
        entity_rows=[{"user_id": 42}],
    ).to_dict()
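The point-in-time join mentioned in the Feast example is what prevents future data from leaking into training rows. The following is a minimal sketch of the idea in plain Python, not Feast's implementation; the `feature_history` structure, integer timestamps, and function name are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, avg_order_value),
# sorted by timestamp. Timestamps are plain integers (e.g. day numbers).
feature_history = {
    42: [(1, 20.0), (5, 27.5), (9, 31.0)],
}


def point_in_time_lookup(user_id, event_ts):
    """Return the latest feature value known at or before event_ts, else None."""
    history = feature_history.get(user_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)  # number of rows with ts <= event_ts
    if idx == 0:
        return None  # no feature value existed yet at that time
    return history[idx - 1][1]


# Training rows carry the time of the labelled event.
entity_rows = [(42, 0), (42, 5), (42, 7), (42, 100)]
training_values = [point_in_time_lookup(uid, ts) for uid, ts in entity_rows]
print(training_values)
```

An event at time 7 sees the value written at time 5, never the one written at time 9, which mirrors what the model would have seen had it been serving at that moment.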
Data is the most common failure mode in production ML. Great Expectations lets you define expectations about your data (schema, ranges, distributions) and test them automatically. DVC (Data Version Control) tracks data files alongside code in Git, storing the actual data in remote storage.
    # Great Expectations: validate a dataframe.
    # Uses the pre-1.0 fluent datasource API (context.sources);
    # df is an existing pandas DataFrame.
    import great_expectations as gx

    context = gx.get_context()
    ds = context.sources.add_pandas("my_ds")
    asset = ds.add_dataframe_asset("training_data")
    batch = asset.build_batch_request(dataframe=df)

    # Define expectations
    validator = context.get_validator(batch_request=batch)
    validator.expect_column_values_to_not_be_null("user_id")
    validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
    validator.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)

    results = validator.validate()  # Returns success/failure + details
    # DVC: version data alongside code
    $ dvc init                              # Initialise DVC in git repo
    $ dvc add data/training.parquet         # Track data file (creates .dvc pointer)
    $ git add data/training.parquet.dvc     # Commit pointer to git
    $ git commit -m "Add training data v1"
    $ dvc push                              # Push actual data to remote (S3, GCS)

    # Later: reproduce exact training data
    $ git checkout v1.0                     # Checkout code + DVC pointers
    $ dvc pull                              # Pull matching data from remote
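The pointer idea behind the DVC workflow above (Git tracks a small file, remote storage holds the data) can be sketched with a content hash. This is an illustration of the concept only: the `.pointer.json` format, the function names, and the use of SHA-256 are assumptions for the example and do not reproduce DVC's actual file format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Hash a data file and write a small pointer file next to it."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer


def matches_pointer(data_path: Path, pointer: Path) -> bool:
    """Check that the data on disk is the version the pointer refers to."""
    expected = json.loads(pointer.read_text())["sha256"]
    return hashlib.sha256(data_path.read_bytes()).hexdigest() == expected


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "training.csv"
    data.write_text("user_id,age\n1,34\n")
    ptr = write_pointer(data)  # this small file is what would be committed to Git
    unchanged = matches_pointer(data, ptr)

    data.write_text("user_id,age\n1,35\n")  # data silently changed
    changed_detected = not matches_pointer(data, ptr)

print(unchanged, changed_detected)
```

Because the pointer is tiny and deterministic, it diffs and merges like code, while the large file it describes can live in object storage.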
∑ Chapter 9.2 — Key Takeaways
- Data pipelines are 60–80% of ML engineering effort — batch for training, streaming for serving
- Feature stores eliminate training/serving skew by sharing features between training and inference
- Feast (open source) and Tecton (managed) are the two dominant feature store approaches
- Great Expectations validates data quality with testable expectations — catch data bugs before they hit the model
- DVC versions data alongside code — Git tracks pointers, remote stores actual data
Without experiment tracking, ML research is a folder of notebooks named model_v2_final_FINAL_actually_final.ipynb. A model registry is the single source of truth for which model version is in production, why it was promoted, and what it was trained on.
An ML experiment produces multiple artefacts: code, data reference, hyperparameters, metrics, trained model weights, and environment configuration. Without structured tracking, reproducing a result from three months ago is nearly impossible. Experiment trackers log all of these automatically and make them searchable, comparable, and shareable.
Without tracking: scattered notebooks. “Which hyperparams gave 92% accuracy?” Nobody knows. Last month’s best run cannot be reproduced. A model is deployed, but nobody knows which commit trained it.
With tracking: every run is logged with params, metrics, artefacts, and code version. Compare 200 runs in a table. One-click reproduce any experiment. Full lineage from data to deployed model.
What to log: hyperparameters, metrics (loss, accuracy, F1), model weights, training data hash, code commit, environment (Python version, packages), run duration, GPU utilisation.
    # MLflow: experiment tracking in 10 lines
    import mlflow

    mlflow.set_experiment("fraud-detection")

    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.001)
        mlflow.log_param("epochs", 50)
        mlflow.log_param("model_type", "XGBoost")

        # ... train model ...

        mlflow.log_metric("accuracy", 0.943)
        mlflow.log_metric("f1_score", 0.891)
        mlflow.log_metric("auc_roc", 0.967)
        mlflow.sklearn.log_model(model, "model")     # Save model artefact
        mlflow.log_artifact("confusion_matrix.png")  # Save any file
    # Weights & Biases: experiment tracking
    import wandb

    wandb.init(project="fraud-detection", config={
        "learning_rate": 0.001,
        "epochs": 50,
        "model_type": "XGBoost",
    })

    # ... train model, log metrics per step ...
    wandb.log({"accuracy": 0.943, "f1": 0.891, "loss": 0.12})
    wandb.finish()
A model registry is a versioned catalogue of trained models with metadata, lineage, and lifecycle stages (Staging → Production → Archived). It answers: “What model is in production right now, who approved it, and what data trained it?”
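The lifecycle described above can be sketched as a tiny in-memory registry. This is a hypothetical illustration, not the MLflow or W&B registry API; the class names, fields, and the "one Production version at a time" rule are assumptions chosen to show how the registry answers the three questions.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    data_hash: str           # what data trained it
    stage: str = "Staging"
    approved_by: str = ""    # who approved the promotion


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; real registries persist this metadata."""
    versions: dict = field(default_factory=dict)

    def register(self, data_hash: str) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, data_hash=data_hash)
        self.versions[mv.version] = mv
        return mv

    def promote(self, version: int, approved_by: str) -> None:
        # Assumed rule: only one Production version; archive the previous one.
        for mv in self.versions.values():
            if mv.stage == "Production":
                mv.stage = "Archived"
        target = self.versions[version]
        target.stage = "Production"
        target.approved_by = approved_by

    def production(self):
        return next((mv for mv in self.versions.values() if mv.stage == "Production"), None)


registry = ModelRegistry()
registry.register(data_hash="sha256:aaa")
registry.register(data_hash="sha256:bbb")
registry.promote(1, approved_by="alice")
registry.promote(2, approved_by="bob")

current = registry.production()
print(current.version, current.approved_by, current.data_hash)
```

A single lookup now returns the production version together with its approver and training-data reference, which is the lineage question the registry exists to answer.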
| Tool | Type | Tracking | Registry | UI | Best For |
|---|---|---|---|---|---|
| MLflow | Open source | Excellent | Built-in | Functional | Self-hosted, full control |
| W&B | SaaS (free tier) | Excellent | Built-in | Beautiful | Teams, collaboration, sweeps |
| Neptune | SaaS | Excellent | Basic | Good | Metadata-heavy experiments |
| ClearML | Open source | Good | Built-in | Good | End-to-end (track + orchestrate) |
∑ Chapter 9.3 — Key Takeaways
- Experiment tracking logs params, metrics, artefacts, and code version for every training run
- MLflow (open source) and W&B (SaaS) are the dominant tools — both take ~10 lines to integrate
- Model registry provides versioned lifecycle stages: Registered → Staging → Production → Archived
- Reproducibility requires tracking: code + data + environment + hyperparameters + random seed
A trained model sitting in a notebook is worth nothing. Model serving is the discipline of making predictions available to users — via APIs, batch jobs, or embedded systems — with the latency, throughput, and reliability that production demands.
Pattern: Run model on all data at scheduled intervals (hourly, daily).
Output: Predictions written to a database/table — served via lookup.
Latency: Minutes to hours (acceptable for non-interactive use cases).
Scale: Process millions of records per run using distributed compute.
Examples: Email campaigns, daily risk scores, recommendation pre-compute.
Pattern: Model behind an API — request comes in, prediction goes out.
Output: JSON response with prediction, typically <100ms.
Latency: Milliseconds (p99 < 200ms for most production systems).
Scale: Auto-scale replicas based on QPS (queries per second).
Examples: Fraud detection, search ranking, chatbots, real-time pricing.
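Latency targets such as "p99 < 200ms" are tail percentiles, not averages. The sketch below uses the nearest-rank definition of a percentile on made-up latency samples; the sample values and the 200 ms threshold check are illustrative assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds; one request hit a slow path.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
meets_target = p99 < 200

print(p50, p99, mean, meets_target)
```

The median and the mean both look healthy here, while the tail percentile exposes the slow request, which is why serving targets are usually stated in terms of p95 or p99.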
TorchServe: PyTorch-native model server. MAR packaging format. Built-in batching, metrics, model versioning. Best for PyTorch-only deployments.
Triton Inference Server: Multi-framework (PyTorch, TF, ONNX, TensorRT). Dynamic batching. GPU-optimised. Ensemble pipelines. Best for high-throughput GPU serving.
vLLM: LLM-optimised serving. PagedAttention for memory efficiency. Continuous batching. OpenAI-compatible API. Best for LLM inference (2–5× faster than naive serving).
    # FastAPI: simplest model serving API
    from fastapi import FastAPI
    import joblib

    app = FastAPI()
    model = joblib.load("model.pkl")  # Loaded once at startup, not per request

    @app.post("/predict")
    def predict(features: dict):
        # Plain def: FastAPI runs it in a threadpool, so the blocking
        # model.predict call does not stall the event loop.
        X = [[features["age"], features["income"], features["credit_score"]]]
        prediction = model.predict(X)[0]
        probability = model.predict_proba(X)[0].tolist()
        return {"prediction": int(prediction), "probability": probability}

    # Run: uvicorn app:app --host 0.0.0.0 --port 8000
∑ Chapter 9.4 — Key Takeaways
- Batch inference: scheduled, high-throughput, store predictions. Real-time: API-based, <100ms latency
- Model servers: TorchServe (PyTorch), Triton (multi-framework GPU), vLLM (LLM-optimised with PagedAttention)
- Deployment: canary (gradual %), blue-green (instant switch), shadow (zero user risk)
- FastAPI is the simplest starting point — graduate to model servers as scale demands
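The canary pattern named in the takeaways (send a gradual percentage of traffic to the new model) is often implemented with deterministic hashing so that each user consistently sees the same model. The following is one possible sketch; the function name, bucket scheme, and user IDs are assumptions for illustration.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send a fixed share of users to the canary model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]

# Share of users routed to the canary at a 10% setting.
share = sum(route(u, 10) == "canary" for u in users) / len(users)

# Raising the percentage only adds users; nobody flips back to stable.
sticky = all(route(u, 50) == "canary" for u in users if route(u, 10) == "canary")

print(round(share, 3), sticky)
```

Because the bucket depends only on the user ID, widening the rollout from 10% to 50% keeps the original canary users on the new model, which makes before/after comparisons cleaner.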
Software CI/CD deploys code. ML CI/CD deploys code + data + models. The pipeline must test data quality, validate model performance, check for bias, and only then deploy — all automatically, triggered by code changes OR data changes OR drift alerts.
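The "test data quality, validate model performance, check for bias, and only then deploy" sequence amounts to a promotion gate. A minimal sketch follows; the `CandidateReport` fields, the thresholds, and the metric choice (AUC) are hypothetical and would come from earlier pipeline stages in practice.

```python
from dataclasses import dataclass


@dataclass
class CandidateReport:
    """Hypothetical summary produced by earlier pipeline stages."""
    data_checks_passed: bool
    auc: float
    baseline_auc: float   # metric of the model currently in production
    max_group_gap: float  # largest metric gap between user groups


def promotion_gate(report, min_gain=0.0, max_gap=0.05):
    """Return (ok, reasons). Deploy only if every check passes."""
    reasons = []
    if not report.data_checks_passed:
        reasons.append("data validation failed")
    if report.auc < report.baseline_auc + min_gain:
        reasons.append("model does not beat the production baseline")
    if report.max_group_gap > max_gap:
        reasons.append("metric gap between groups exceeds the bias threshold")
    return (not reasons, reasons)


good = CandidateReport(True, auc=0.967, baseline_auc=0.960, max_group_gap=0.02)
biased = CandidateReport(True, auc=0.970, baseline_auc=0.960, max_group_gap=0.09)

print(promotion_gate(good))
print(promotion_gate(biased))
```

Returning the list of reasons, rather than a bare boolean, gives the CI job something useful to print when a candidate is rejected.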
Airflow: Python DAGs. Mature ecosystem. Widely adopted. Schedule-based. Good for batch ML pipelines. Steeper learning curve. Not ML-specific but ML-capable.
Kubeflow: Kubernetes-native. ML-first. Component-based. Built-in experiment tracking. Best for teams already on K8s. Heavier operational footprint.
Prefect: Python-native. Dynamic DAGs (not just static). Easy local → cloud transition. Good UI. Lower learning curve than Airflow. Growing rapidly.
∑ Chapter 9.5 — Key Takeaways
- ML CI/CD tests code + data + model quality — not just code
- Pipeline triggers: code commit, data change, drift alert, or scheduled retrain
- Orchestrators: Airflow (mature), Kubeflow (K8s-native), Prefect (Python-native, growing fast)
- The “build” step in ML CI/CD takes hours (training) — caching and incremental training are critical
Software breaks loudly — errors, crashes, timeouts. Models break silently: they keep returning predictions, but the predictions get worse. Without monitoring, a model can degrade for weeks before anyone notices. Drift detection is the early warning system.
Uptime: Is the service running?
Latency: p50, p95, p99 response times
Errors: HTTP 5xx, exceptions, timeouts
Resources: CPU, memory, GPU utilisation
Throughput: Requests per second
Data drift: Has the input distribution changed?
Concept drift: Has the input–output relationship changed?
Prediction drift: Has the output distribution changed?
Model quality: Accuracy, F1, AUC — are they degrading?
Feature health: Missing values, outliers, new categories
    # Evidently: drift detection in 5 lines
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=train_df, current_data=prod_df)
    report.save_html("drift_report.html")  # Visual drift report

    # Programmatic access to results
    drift_results = report.as_dict()
    if drift_results["metrics"][0]["result"]["dataset_drift"]:
        # trigger_retraining_pipeline is a placeholder for your own pipeline hook
        trigger_retraining_pipeline()  # Auto-retrain on drift
∑ Chapter 9.6 — Key Takeaways
- Models break silently — they keep returning predictions that get worse without monitoring
- Three drift types: data drift (P(X) shifts), concept drift (P(Y|X) shifts), prediction drift (P(Ŷ) shifts — symptom)
- Detection: KS test, PSI, JS divergence — Evidently (OSS) and WhyLabs (SaaS) are leading tools
- Retraining triggers: drift above threshold, scheduled cadence, or model quality metric below SLA
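Of the detection statistics listed in the takeaways, PSI (Population Stability Index) is simple enough to compute by hand. The sketch below is one common formulation, using equal-width bins over the reference range; the binning scheme, the `eps` floor for empty bins, and the synthetic data are assumptions, and the 0.25 cut-off is a widely quoted rule of thumb rather than a fixed standard.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(1000)]   # training-time feature values
shifted = [x + 3.0 for x in reference]       # production values after a shift

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
should_retrain = drift > 0.25  # rule-of-thumb threshold (assumption)

print(no_drift, round(drift, 2), should_retrain)
```

Comparing a sample with itself gives a PSI of zero, while the shifted sample scores far above the threshold, which is the signal a retraining trigger would act on.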
The LLM era introduced a new class of infrastructure: vector databases for retrieval, embedding pipelines for ingestion, API gateways for model routing, and prompt caching for cost reduction. This chapter covers the LLM-specific infrastructure layer.
A vector database stores high-dimensional embedding vectors and supports approximate nearest neighbour (ANN) search — finding the k most similar vectors to a query vector. This is the retrieval engine behind RAG, semantic search, and recommendation systems.
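As a reference point for what ANN indexes approximate, here is exact nearest-neighbour search by cosine similarity over a handful of toy vectors. The document IDs and three-dimensional "embeddings" are invented for the example; production systems replace the full scan with an index such as HNSW to avoid scoring every vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query, vectors, k=2):
    """Exact search: score every stored vector, return the k most similar ids."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
vectors = {
    "doc-mlops": [0.9, 0.1, 0.0],
    "doc-devops": [0.7, 0.3, 0.1],
    "doc-cooking": [0.0, 0.2, 0.9],
}

result = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(result)
```

The full scan costs one similarity computation per stored vector, which is fine for thousands of documents and is exactly the cost that approximate indexes trade a little recall to avoid at larger scale.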
| Vector DB | Type | Index | Filtering | Scale | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed SaaS | Proprietary | Excellent | Billions | Zero-ops production RAG |
| Weaviate | OSS / Cloud | HNSW | GraphQL | Billions | Hybrid search (vector + keyword) |
| Qdrant | OSS / Cloud | HNSW | Rich filters | Billions | Performance, Rust-based |
| Chroma | OSS | HNSW | Basic | Millions | Prototyping, local dev |
| pgvector | Postgres extension | IVFFlat/HNSW | Full SQL | Millions | Already using Postgres |
Prompt caching: cache identical or semantically similar prompts. Can reduce API costs by 30–60%. Exact match (hash) or semantic match (embedding similarity > threshold). Tools: Redis, GPTCache, LiteLLM.
Rate limiting: token-bucket or sliding-window rate limiting per user/team. Prevents runaway costs from agent loops. Budget alerts when spend exceeds threshold. Critical for agentic workloads.
LLM gateway: single API interface, multiple backends. Route by model capability, cost, latency. Auto-fallback: if OpenAI is down, route to Anthropic. Load balance across API keys. LiteLLM, Portkey, custom.
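The token-bucket rate limiting mentioned above can be sketched in a few lines. The class below is a generic illustration, not the implementation of any gateway product; time is passed in explicitly (an assumption made so the example is deterministic) where a real limiter would read a monotonic clock.

```python
class TokenBucket:
    """Token-bucket limiter: `capacity` is the burst size, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=3, rate=1.0)  # burst of 3, then 1 request per second

# Four requests at t=0, then two at t=1.
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 0.0, 1.0, 1.0)]
print(decisions)
```

The `cost` parameter allows charging by estimated tokens rather than by request, which is closer to how LLM spend actually accrues in agent loops.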
    # LiteLLM: unified LLM API gateway
    from litellm import completion

    messages = [{"role": "user", "content": "Explain MLOps"}]

    # Same interface for any provider
    response = completion(
        model="gpt-4o",  # or "claude-3-sonnet" or "ollama/llama3"
        messages=messages,
        max_tokens=500,
        temperature=0.7,
    )
    print(response.choices[0].message.content)

    # Fallback: try GPT-4o, fall back to Claude if it fails
    response = completion(
        model="gpt-4o",
        messages=messages,
        fallbacks=["claude-3-5-sonnet-20241022"],  # auto-retry with fallback
    )
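Exact-match prompt caching, described earlier in this chapter as hashing the prompt, can be sketched without any external service. The `PromptCache` class and the `fake_llm` stand-in are hypothetical; a production setup would typically back the store with Redis and wrap a real provider call.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache keyed by a hash of model + messages + sampling params."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model, messages, **params):
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call_llm, model, messages, **params):
        k = self.key(model, messages, **params)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        self._store[k] = call_llm(model, messages, **params)
        return self._store[k]


def fake_llm(model, messages, **params):
    # Stand-in for a real provider call, so the sketch runs offline.
    return f"{model} answered: {messages[-1]['content']}"


cache = PromptCache()
msgs = [{"role": "user", "content": "Explain MLOps"}]

first = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0.0)
second = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0.0)  # identical: hit
third = cache.get_or_call(fake_llm, "gpt-4o", msgs, temperature=0.7)   # different params: miss

print(cache.hits, cache.misses)
```

Including the sampling parameters in the key is a deliberate choice: a response generated at one temperature is not necessarily an acceptable answer to a request made at another.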
∑ Chapter 9.7 — Key Takeaways
- Vector databases power RAG, semantic search, and recommendations via ANN search on embedding vectors
- HNSW is the dominant index algorithm — graph-based, high recall, fast query
- Pinecone (managed), Qdrant (OSS, fast), Chroma (prototyping), pgvector (already using Postgres)
- LLM gateway: unified API, prompt caching, rate limiting, auto-fallback — LiteLLM is the standard tool
ML infrastructure is expensive — GPU hours cost 10–100× CPU hours. Containerisation ensures reproducibility, Kubernetes provides orchestration at scale, and cost optimisation determines whether your ML system is economically viable.
Docker containers package code, dependencies, and model weights into a single reproducible unit. “It works on my machine” becomes “it works in the container.” Multi-stage builds keep images small; GPU access via NVIDIA Container Toolkit.
    # Dockerfile for ML model serving (multi-stage)
    FROM python:3.11-slim AS builder
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    FROM python:3.11-slim
    WORKDIR /app
    # Installed packages, plus console scripts such as uvicorn
    COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
    COPY --from=builder /usr/local/bin /usr/local/bin
    # Model weights
    COPY model/ ./model/
    # FastAPI app
    COPY app.py .
    EXPOSE 8000
    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

    # Build & run:
    #   docker build -t fraud-model:v1 .
    #   docker run -p 8000:8000 fraud-model:v1
    #   GPU: docker run --gpus all -p 8000:8000 fraud-model:v1
| Instance Type | Cost | Availability | Preemption | Use Case |
|---|---|---|---|---|
| On-Demand | Full price ($2–30/hr for GPU) | Guaranteed | Never | Production serving, SLA-bound |
| Reserved (1–3yr) | 30–60% discount | Guaranteed | Never | Steady-state training clusters |
| Spot / Preemptible | 60–90% discount | Not guaranteed | 2min warning | Batch training with checkpointing |
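The discount figures in the table do not translate one-to-one into savings, because preempted spot jobs redo some work. The arithmetic below is a rough sketch under assumed numbers: a 100 GPU-hour job, a $2/hr on-demand rate, a 70% spot discount, and 15% of compute repeated after preemptions. Only the discount range comes from the table; the other values are illustrative.

```python
def training_cost(gpu_hours, hourly_rate, discount=0.0, rework_overhead=0.0):
    """Cost of a training job.

    discount: fractional price reduction (e.g. 0.7 for 70% off).
    rework_overhead: extra fraction of compute redone after preemptions.
    """
    effective_hours = gpu_hours * (1.0 + rework_overhead)
    return effective_hours * hourly_rate * (1.0 - discount)


on_demand = training_cost(gpu_hours=100, hourly_rate=2.0)
spot = training_cost(gpu_hours=100, hourly_rate=2.0, discount=0.7, rework_overhead=0.15)
savings = 1.0 - spot / on_demand

print(round(on_demand, 2), round(spot, 2), round(savings, 3))
```

Under these assumptions the job costs about $69 on spot versus $200 on demand, a saving of roughly 65% rather than the headline 70%; more frequent checkpoints shrink the rework term.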
A single A100 (80GB): ~$2/hr on-demand. Training GPT-3-scale model: ~$4.6M. Fine-tuning LLaMA 70B: ~$5K–50K depending on approach. Inference: vLLM reduces cost 2–5× vs naive serving. The cost of NOT optimising is enormous.
Spot instances: checkpoint every N steps, so that a preempted job resumes from the last checkpoint. Use multiple spot pools (different instance types). Fall back to on-demand if all spots are unavailable. Training frameworks (PyTorch Lightning, DeepSpeed) support checkpointing natively.
Right-sizing: match instance type to workload. Don’t use an A100 for a logistic regression. Profile GPU utilisation; if it is <50%, downsize. CPU inference is 10–100× cheaper for small models.
Auto-scaling: scale replicas based on traffic. Scale to zero during off-hours (no cost). Karpenter (K8s), SageMaker auto-scaling, Cloud Run. Cold start latency is the tradeoff.
Model optimisation: quantisation (FP32 → INT8): 2–4× faster, 2–4× less memory. Distillation: train a small model to mimic a large one. Pruning: remove unimportant weights. ONNX Runtime: cross-platform optimised inference.
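To make the FP32 → INT8 quantisation mentioned above concrete, the sketch below applies symmetric per-tensor quantisation to a short list of weights in plain Python. It is a simplified illustration, not how any particular framework implements it; the weight values are made up. Weight storage alone shrinks 4× (4 bytes to 1 byte per value); end-to-end savings are usually lower, because activations and other buffers are not all quantised.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.0, 0.64]

q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

fp32_bytes = len(weights) * 4  # 4 bytes per FP32 value
int8_bytes = len(weights) * 1  # 1 byte per INT8 value

print(q, max_error <= scale / 2 + 1e-12, fp32_bytes // int8_bytes)
```

The rounding error per weight is bounded by half the scale, so tensors with a few large outliers lose more precision; that is one reason practical schemes often quantise per channel rather than per tensor.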
| Dimension | Cloud (AWS/GCP/Azure) | On-Premise |
|---|---|---|
| CapEx | None (pay-as-you-go) | High (buy hardware) |
| OpEx | Variable, can spike | Predictable |
| GPU Access | Subject to availability | Dedicated |
| Scaling | Elastic, minutes | Fixed capacity |
| Data Sovereignty | Provider-dependent | Full control |
| Best For | Variable workloads, startups | Steady-state, regulated industries |
∑ Chapter 9.8 — Key Takeaways
- Docker containers ensure reproducible ML environments — multi-stage builds keep images small
- Spot instances save 60–90% on training — requires checkpointing every N steps
- Cost optimisation: right-size instances, scale to zero, quantise models (INT8 = 2–4× savings)
- Cloud vs on-prem: cloud for variable workloads, on-prem for steady-state and data sovereignty
🎓 Domain 9 Complete — MLOps & AI Engineering
- Ch 9.1: ML lifecycle is a continuous feedback loop. MLOps = ML + DevOps + Data Engineering. 87% of models never reach production — engineering is the bottleneck.
- Ch 9.2: Data pipelines are 60–80% of effort. Feature stores bridge training and serving. DVC versions data alongside code.
- Ch 9.3: Experiment tracking (MLflow, W&B) logs everything. Model registry manages lifecycle stages with approval gates.
- Ch 9.4: Batch (scheduled) vs real-time (API) inference. vLLM for LLMs, Triton for GPU. Canary, blue-green, shadow deployment.
- Ch 9.5: ML CI/CD tests code + data + model quality. Airflow, Kubeflow, Prefect for orchestration. Training is the slow step.
- Ch 9.6: Models break silently. Data drift, concept drift, prediction drift require continuous monitoring. Evidently and WhyLabs are leading tools.
- Ch 9.7: Vector databases (Pinecone, Qdrant, pgvector) power RAG. LLM gateways (LiteLLM) provide unified API, caching, fallback.
- Ch 9.8: Docker for reproducibility. Spot instances save 60–90%. Quantisation, right-sizing, scale-to-zero optimise cost.
MLOps is not about tools — it is about closing the feedback loop.
The best MLOps setup is the simplest one that reliably gets your models to production, monitors their performance, and triggers retraining when they degrade. Start simple (MLflow + FastAPI + Evidently), add complexity only when scale demands it.