AI Foundation · Domain 03 · Chapter 3.1

The ML Landscape

Supervised, unsupervised, semi-supervised, and reinforcement learning — the map before the territory

3.1

Chapter 3.1

The ML Landscape & Learning Paradigms

Machine learning is not one thing. It is a family of approaches united by one idea: let the computer find the pattern in the data, rather than programming the pattern by hand. Everything in this domain is a specific way of doing that.

What Is Machine Learning? Core

In 1959, Arthur Samuel coined the term machine learning while building a self-improving checkers program. A definition widely attributed to him describes it as the field of study that gives computers "the ability to learn without being explicitly programmed." A more precise formulation came from Tom Mitchell in 1997:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Mitchell, T. — Machine Learning, McGraw Hill, 1997

Concrete example — applying Mitchell's definition to a spam filter:

T (Task): Classify incoming emails as spam or not-spam

E (Experience): 50,000 labelled emails, each marked spam / not-spam by human reviewers

P (Performance): Classification accuracy on new, previously-unseen emails, ideally 99%+

As the filter processes more labelled emails (more E), its accuracy on new mail (P at T) improves — that's the entire definition in one sentence.

The critical insight is the inversion of the programming paradigm. In traditional software you write rules ("if subject contains 'FREE MONEY' → spam"). In ML, you provide examples and the algorithm discovers the rules automatically — rules often far too complex and numerous for any human to encode by hand.

The ML Inversion — from programming rules to learning them

Supervised Learning In-depth

Supervised learning is the most widely deployed form of ML. The word "supervised" means a teacher is present: you provide labelled training examples — pairs of (input X, correct answer y) — and the algorithm learns a mapping f: X → y that generalises to new, unseen inputs. Think of a student studying with an answer key: they infer underlying rules from many (question, answer) pairs, then apply those rules to questions they've never seen.

Tasks split into two types by the nature of y:

Regression — y is continuous

e.g. house price, temperature, stock return

Classification — y is a discrete label

e.g. spam/ham, cat/dog, benign/malignant

The choice of output type determines the loss function, output layer, and evaluation metric.

Supervised Learning Loop — predict, measure error, update

The learning signal comes from comparing the model's prediction ŷ to the true label y via a loss function. Gradient descent nudges the model's parameters in the direction that reduces the loss. Repeat this across millions of examples and the model converges to a good approximation of f. Supervised learning's key limitation is label dependency: human-annotated examples are expensive, slow, and often scarce in specialised domains.

Property	Regression	Classification
Output type	Continuous value (ℝ)	Discrete category
Loss function	MSE, MAE, Huber	Cross-entropy, Hinge
Output layer	Linear (single node)	Softmax / Sigmoid
Evaluation	RMSE, R², MAE	Accuracy, F1, AUC-ROC
Example	House price prediction	Email spam detection
Algorithms	Linear regression, Ridge, SVR	Logistic reg, SVM, Random Forest

Unsupervised Learning Core

Unsupervised learning removes the teacher entirely. You provide raw inputs X — no answer key, no rewards. The algorithm must discover hidden structure on its own. Think of giving someone a pile of unlabelled photographs and asking them to organise it: they'll naturally group similar images together even without being told what "similar" means.

Three major tasks fall under unsupervised learning:

Clustering Groups similar data points — k-means, DBSCAN, hierarchical clustering

Dim Reduction Compresses to fewer dimensions while preserving structure — PCA, t-SNE, UMAP

Density Est. Models the underlying probability distribution — Gaussian Mixture Models, KDE

Three Unsupervised Learning Tasks

The unsupervised challenge is evaluation: without ground-truth labels, how do you know if a clustering is good? Metrics like silhouette score (cohesion vs separation) and reconstruction error (for autoencoders) provide proxies, but ultimately require domain-expert judgment. Despite this, unsupervised learning is invaluable — it can operate on the vast stores of unlabelled data that exist in every organisation.
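To make the evaluation problem concrete, here is a minimal sketch that clusters synthetic data with k-means and uses the silhouette score as a proxy for cluster quality. The blob centres, the candidate values of k, and the variable names are illustrative assumptions, not part of the chapter's material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumed for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6,
    random_state=0,
)

# No labels are used: try several k and score each clustering by silhouette.
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher = tighter, better separated
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On clean synthetic data the silhouette score points at the true number of groups. On real data it is only a proxy, and a domain expert still has to judge whether the clusters mean anything.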

Semi-Supervised & Self-Supervised Learning Core

Between the extremes of "all data labelled" (supervised) and "no data labelled" (unsupervised) lies a practical middle ground that powers most of modern AI.

Semi-supervised learning combines a small labelled set with a large unlabelled set. Even without labels, the geometric structure of the data (clusters, manifolds) constrains which labels are plausible: two nearby points probably share a label. Algorithms like label propagation, self-training, and co-training exploit this structure to achieve strong performance even when labels cover only 1–5% of data — critical in medical imaging where expert annotation is expensive.

Self-supervised learning has arguably been the most influential paradigm of the past decade. It creates a training signal directly from raw data, so no human annotation is needed. The core trick: hide or transform part of the input, then train the model either to predict the hidden part (masked or next tokens) or to recognise that two transformed views come from the same source (contrastive methods).

BERT: Masks 15% of tokens and predicts them → learns language understanding

GPT: Predicts the next token → learns to generate coherent text

SimCLR: Two augmented views of the same image must have similar embeddings → learns visual features

Self-supervised learning is how every major language model is pre-trained. Modern LLMs are trained on corpora of up to trillions of tokens, and every token serves as the label for the context that precedes it. Because the labels come for free, pre-training very large models at this scale is economically feasible without human annotation.
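The idea that "the data labels itself" can be shown in a few lines. This is a toy sketch, assuming whitespace tokenisation and a fixed context window; the helper name `next_token_pairs` is hypothetical, and real models use subword tokenisers and much longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn a raw token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]
        target = tokens[i]  # the "label" is simply the token that comes next
        pairs.append((context, target))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

No human wrote a single label here: six raw tokens yield four supervised examples, and the same recipe scales to any amount of text.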

🏷️

Semi-Supervised Learning

A few labelled points anchor the model; unlabelled points fill in the space via geometric structure. Like teaching a child 5 examples of "cat" and letting them identify 10,000 more.

Medical AI: 100 labelled scans + 10,000 unlabelled
Drug discovery: few active compounds + vast chemical space
Web classification: small annotated corpus + crawled text

🤖

Self-Supervised Learning

Labels come from the data structure itself — the supervision signal is derived from the input. No human annotation required.

BERT: predict masked tokens → understand language
GPT: predict next token → generate coherent text
SimCLR: match augmented views → learn visual features
wav2vec 2.0: predict masked audio representations → speech recognition

Reinforcement Learning Introductory

Reinforcement learning (RL) is the third major paradigm. An agent takes actions in an environment and receives numerical rewards or penalties based on those actions. No labelled examples exist — only a reward signal over time. The agent must learn, through trial and error, which sequences of actions lead to the highest cumulative reward.

The central challenge is the exploration-exploitation dilemma: should the agent exploit what it already knows works, or explore new actions that might be better? Too much exploitation → stuck in a local optimum. Too much exploration → never commits to a strategy.
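One simple way to balance the two is an epsilon-greedy strategy: explore a random action with small probability ε, otherwise exploit the best-known action. The sketch below applies it to a three-armed bandit. The reward probabilities, ε, step count, and function name are assumptions chosen for illustration.

```python
import numpy as np


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Estimate arm values on a Bernoulli bandit with epsilon-greedy action selection."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)          # running mean reward per arm
    counts = np.zeros(n_arms, dtype=int)  # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: pick the best arm so far
        reward = float(rng.random() < true_means[arm])  # reward is 1 with prob. true_means[arm]
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("Estimated values:", estimates.round(2))
print("Pull counts:", counts)
```

With ε = 0 the agent can lock onto the first arm that pays out. With ε = 1 it never uses what it has learned. A small ε keeps sampling every arm while spending most pulls on the current best.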

AlphaGo: reward = winning the game

Robot locomotion: reward = distance without falling

RLHF for LLMs: reward = human preference rating

🎮

Game Playing

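AlphaGo combined training on human expert games with self-play and defeated world champion Lee Sedol at Go in 2016. Its successor AlphaZero learned Go, chess, and shogi purely from self-play, with game outcome as the only reward, and surpassed the strongest existing programs after hours to days of training.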

🤖

Robotics

Reward = physical task completion. OpenAI's Dactyl hand and DeepMind's simulated locomotion agents are RL-trained policies, and RL is increasingly used to train gaits for legged robots.

💬

RLHF for LLMs

Humans rate responses → train a reward model → use PPO to fine-tune the LLM toward preferred outputs. The technique that made ChatGPT helpful.

📌 Full RL coverage — Q-learning, policy gradients, PPO, actor-critic — is in Domain 7: Reinforcement Learning. Here we establish only where RL fits in the ML landscape.

The Standard ML Pipeline In-depth

Every ML project — from a weekend Kaggle competition to a production system at Google — follows the same general pipeline. Understanding this sequence prevents the most common failure modes: models trained on wrong data, evaluated on contaminated test sets, or deployed without monitoring.

① Problem Definition What to optimise?

② Data Collection Gather raw data

③ EDA Explore & understand

④ Preprocessing Clean & transform

⑤ Modelling Train & tune

⑥ Evaluation Measure performance

⑦ Deployment Serve & monitor

① Problem Definition

Frame the task precisely: regression or classification? What is the prediction target? What does success look like numerically? Check whether ML is even needed — sometimes a rule-based system is simpler and more reliable.

② Data Collection

Gather from APIs, databases, web scraping, sensors, or surveys. Assess quantity (enough to generalise?) and quality (representative? biased? stale?). Data collection is often 60–70% of total project time in practice.

③ EDA — Exploratory Data Analysis

Histograms, scatter plots, correlation matrices, missing value heatmaps, class imbalance checks, outlier detection. Goal: understand your data before touching a model. Surprises found here cost 10× less than surprises found post-deployment.

④ Preprocessing

Missing value imputation (mean, median, KNN), feature scaling (StandardScaler, MinMaxScaler), categorical encoding (one-hot, ordinal, target encoding), train/val/test split. Critical rule: fit all transformers on training data only — never on validation or test data.
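The "fit on training data only" rule is easiest to follow by wrapping the transformers and the model in a single pipeline. The sketch below uses synthetic data and assumes a scaler plus logistic regression; the dataset and step choices are illustrative, not prescribed by the chapter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration). Split BEFORE fitting any transformer.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from X_train only;
# score()/predict() reuse those statistics on the test set.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_test, y_test)
print("Scaler means come from the training split:", np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which refit the scaler inside each fold and so avoid leaking validation statistics into training.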

⑤ Modelling

Start with a baseline (majority-class classifier, mean predictor). Then iterate: logistic regression → random forest → gradient boosting → neural network. Use cross-validation. Tune with grid search, random search, or Bayesian optimisation (Optuna).

⑥ Evaluation

Choose metrics aligned with business goals. Accuracy misleads on imbalanced classes — prefer F1, precision/recall, or AUC-ROC. Never evaluate on training data. Final evaluation on the held-out test set — used exactly once.
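A short sketch shows why accuracy misleads on imbalanced classes. It assumes a made-up dataset with 95 negatives and 5 positives and a majority-class baseline that never predicts the positive class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are ignored by the dummy baseline

# A constant predictor has nothing to overfit, so scoring it on the
# same data is harmless here; a real model needs a held-out set.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, zero_division=0)
f1 = f1_score(y, y_pred, zero_division=0)
print(f"accuracy={acc:.2f}  recall={rec:.2f}  f1={f1:.2f}")
```

The baseline scores 95% accuracy while finding none of the positives, which is why recall and F1 are the more honest numbers when the rare class is the one that matters.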

⑦ Deployment & Monitoring

Serve via REST API (FastAPI, Flask, Triton) or batch pipeline. Monitor for data drift (input distribution shifts), concept drift (relationship between X and y changes), and performance degradation. Set retraining triggers. Production models without monitoring degrade silently — one of the most common failure modes in real ML systems.

The pipeline is not linear — it is a loop. EDA may force you to redefine the problem. Evaluation may send you back to data collection. Deployment monitoring may trigger the whole cycle again. Budget time for at least 3 full iterations.

When to Use Which Paradigm Core

The most common mistake early practitioners make is defaulting to neural networks or the most complex algorithm available. The right question is always: what data do I have, and what do I need to output?

Choosing a Learning Paradigm — Decision Flow

Paradigm	Data Requirement	Typical Tasks	Key Algorithms	When to Use
Supervised	Labelled pairs (X, y)	Classification, Regression	Linear Reg, RF, SVM, NNs	You have labels and a clear prediction target
Unsupervised	Unlabelled X only	Clustering, Compression, Anomaly	k-means, PCA, DBSCAN, Autoencoders	No labels; explore structure; detect anomalies
Semi-supervised	Few labels + many unlabelled	Classification with scarce labels	Label propagation, self-training, MixMatch	Labelling expensive; abundant unlabelled data
Self-supervised	Raw unlabelled X (large scale)	Pre-training LLMs, vision encoders	BERT, GPT, SimCLR, MAE	Huge unlabelled datasets; build foundation model
Reinforcement	Reward signal from environment	Games, robotics, RLHF, trading	Q-learning, PPO, SAC, DQN	Sequential decisions; feedback from consequences

📋 Chapter 3.1 — Key Takeaways

ML = let the algorithm find patterns instead of programming them explicitly — the key inversion of the traditional software paradigm
Mitchell's definition: learn from experience E on tasks T as measured by performance P — applies to everything from spam filters to AlphaFold
Supervised: labelled {X, y} pairs → learn f: X → y. Two subtypes: regression (continuous) and classification (categorical)
Unsupervised: no labels → discover hidden clusters, compress via dimensionality reduction, or model density — evaluation is harder without ground truth
Self-supervised: labels come from the data itself — predicting masked or next tokens — how GPT and BERT are pre-trained at trillion-token scale with zero human annotation
Reinforcement: learn from reward signals, not labels — powers AlphaGo, robot locomotion, and RLHF for LLMs (full coverage: Domain 7)
The ML pipeline: Problem → Data → EDA → Preprocess → Model → Evaluate → Deploy & Monitor — a loop, not a one-way sequence
Paradigm choice: do you have labels? A reward signal? Or only raw data? — always start with the simplest method that fits your data situation

3.2

Chapter 3.2

Regression — Predicting Continuous Values

Regression is the simplest form of supervised learning — and the best place to understand how ML really works. Every concept here (loss functions, gradient descent, overfitting, regularisation) reappears in every algorithm from SVMs to transformers.

Linear Regression In-depth

The goal of regression is to predict a continuous numerical output from one or more inputs. The simplest version — simple linear regression — assumes the relationship between input x and output y is a straight line. We fit that line to the data by finding the optimal slope (weight) and intercept (bias).

Our running example: predicting a house's sale price from its size in square footage. We assume price increases roughly linearly with size, with some noise from other unmodelled factors (location, condition, year built). The model makes predictions using:

Linear Regression Model
ŷ = w · x + b
ŷ = predicted price | w = weight (slope) | x = input feature (sqft) | b = bias (intercept)

The model has just two parameters — w (how much price increases per unit of size) and b (the base price for a zero-size house, which anchors the line). Training means finding the values of w and b that make the line fit the data as closely as possible.

Linear Regression — finding the best-fit line through data

# Linear regression with sklearn
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[3], [5], [7], [10], [15], [20], [25]])  # size in 100s of sq ft
y = np.array([85, 120, 145, 200, 300, 380, 460])       # price ($1000s)

model = LinearRegression()
model.fit(X, y)

print(f"Weight (slope): {model.coef_[0]:.2f}")                 # ≈ 17.33
print(f"Bias (intercept): {model.intercept_:.2f}")             # ≈ 30.94
print(f"Predict 2000 sqft: ${model.predict([[20]])[0]:.0f}k")  # ≈ $378k

The Cost Function (MSE) In-depth

To fit a line we need to measure how wrong our current w and b are. The residual for one prediction is the gap between the actual value and the predicted value: residual = y − ŷ. If we just summed residuals, positive and negative errors would cancel. Instead we square them.

Mean Squared Error (MSE) averages the squared residuals across all n training examples. Squaring does two things: it makes all errors positive, and it penalises large errors far more than small ones (a 2× larger error contributes 4× the cost). MSE is also smooth everywhere — which means we can take its derivative and use gradient descent.

Loss Functions for Regression
MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²
MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|
RMSE = √MSE ← same units as y, more interpretable
n = number of samples | yᵢ = true value | ŷᵢ = predicted value

A worked example — 4 predictions: actual = [200, 250, 300, 350], predicted = [190, 270, 290, 380]. Residuals: [10, −20, 10, −30]. Squared: [100, 400, 100, 900]. MSE = (100+400+100+900)/4 = 375. RMSE = √375 ≈ $19.4k — that's the typical prediction error in the same units as house price.
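The arithmetic above can be checked in a few lines of NumPy. This sketch uses the same four actual and predicted values and adds MAE for comparison.

```python
import numpy as np

actual = np.array([200, 250, 300, 350], dtype=float)     # true prices ($1000s)
predicted = np.array([190, 270, 290, 380], dtype=float)  # model predictions ($1000s)

residuals = actual - predicted       # y - y_hat
mse = np.mean(residuals ** 2)        # mean squared error
mae = np.mean(np.abs(residuals))     # mean absolute error
rmse = np.sqrt(mse)                  # back in the units of y

print("Residuals:", residuals)
print(f"MSE={mse:.1f}  MAE={mae:.1f}  RMSE={rmse:.1f}")
```

MAE comes out lower than RMSE because squaring gives the single −30 residual more weight than the three smaller ones.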

MSE Cost Function — the bowl we're trying to reach the bottom of

MSE — Use When

Outliers are genuine data (not noise)
Large errors should be penalised heavily
You need a differentiable loss for gradient descent
Default choice for regression — sklearn, PyTorch default

MAE — Use When

Data has many outliers you want to ignore
Median-like behaviour is preferred over mean
House price datasets with luxury anomalies
Not differentiable at 0 — needs subgradient methods

Gradient Descent for Regression In-depth

We have a cost function J(w, b) — now we need to minimise it. Two approaches exist. The Normal Equation solves for the optimal weights analytically in one shot. Gradient Descent takes many small iterative steps, using the derivative of J to determine which direction to move.

Normal Equation (closed-form solution)
w = (XᵀX)⁻¹ · Xᵀy
Exact solution, no iterations needed. Expensive for large feature counts: inverting the p × p matrix XᵀX costs O(p³), where p is the number of features.

Gradient Descent Update Rules (one step)
w ← w − α · (2/n) · Σ (ŷᵢ − yᵢ) · xᵢ
b ← b − α · (2/n) · Σ (ŷᵢ − yᵢ)
α = learning rate (step size) | n = number of samples | repeat until convergence

The intuition: the gradient (derivative of J w.r.t. w) tells you the slope of the cost bowl at your current position. Subtract a fraction (α) of that slope from w to move toward the minimum. If the gradient is positive, w is too large — decrease it. If negative, w is too small — increase it. The learning rate α controls how large each step is — too large and you overshoot; too small and training takes forever.

Gradient descent is not just for linear regression — it's the engine that trains every neural network, transformer, and deep learning system. Understanding it here, where the math is simple, makes every downstream algorithm clearer.
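As a rough illustration, the sketch below runs the update rules for w and b on the house-price numbers and compares the result with the closed-form normal equation. The input is standardised first so that a learning rate of 0.1 converges quickly; that preprocessing step, the learning rate, and the step count are assumptions made for this example.

```python
import numpy as np

x = np.array([3, 5, 7, 10, 15, 20, 25], dtype=float)           # size (100s of sq ft)
y = np.array([85, 120, 145, 200, 300, 380, 460], dtype=float)  # price ($1000s)
x_scaled = (x - x.mean()) / x.std()  # assumption: standardise so alpha=0.1 is stable


def gradient_descent(x, y, alpha=0.1, steps=500):
    """Minimise MSE for y_hat = w*x + b using the update rules for w and b."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (w * x + b) - y                    # y_hat - y
        w -= alpha * (2 / n) * np.sum(error * x)   # step along -dJ/dw
        b -= alpha * (2 / n) * np.sum(error)       # step along -dJ/db
    return w, b


w_gd, b_gd = gradient_descent(x_scaled, y)

# Normal equation on the same data: solve (X^T X) theta = X^T y.
X = np.column_stack([x_scaled, np.ones_like(x_scaled)])
w_ne, b_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(f"Gradient descent: w={w_gd:.4f}, b={b_gd:.4f}")
print(f"Normal equation:  w={w_ne:.4f}, b={b_ne:.4f}")
```

Both routes land on the same parameters. Note that w here is expressed per standard deviation of size, not per 100 sq ft, because the input was standardised.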

Aspect	Normal Equation	Gradient Descent
Formula	w = (XᵀX)⁻¹Xᵀy	Iterative ∂J/∂w updates
Speed, few features	Fast, one computation	Slower, many iterations
Speed, many features	Very slow, O(p³) in the number of features p	Scales well, mini-batches
Memory	Must invert full XᵀX matrix	Can stream mini-batches
Convergence	Always exact	Needs learning rate tuning
Best for	< 1,000 features	Large data, neural networks

Multiple Linear Regression Core

Real predictions need multiple inputs. House price depends not just on square footage, but on bedrooms, bathrooms, location score, age, and dozens of other features. Multiple linear regression extends the model to handle any number of features:

Multiple Linear Regression
ŷ = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₚxₚ + b
Matrix form: ŷ = X·w + b
Each feature xⱼ has its own weight wⱼ. X is the (n × p) feature matrix: n samples, p features.

With multiple features, feature scaling becomes critical. If x₁ (square footage) ranges from 500–3000 and x₂ (bedrooms) ranges from 1–6, the gradient with respect to w₁ is hundreds of times larger than the gradient with respect to w₂, because each partial derivative is proportional to the size of its feature. This creates an elongated cost bowl, steep along w₁ and nearly flat along w₂, where gradient descent zig-zags inefficiently. Scaling all features to a comparable range makes the bowl round and descent fast.

Why Feature Scaling Matters for Gradient Descent

Feature Scaling Methods
StandardScaler: x' = (x − μ) / σ → zero mean, unit variance
MinMaxScaler: x' = (x − min) / (max − min) → range [0, 1]
Use StandardScaler as the default. Use MinMaxScaler when you need a bounded range (e.g., image pixels).
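A quick sketch of both scalers on a small made-up matrix of square footage and bedroom counts (the values are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: square footage, bedrooms (hypothetical values).
X = np.array([[500, 1], [1200, 2], [1800, 3], [2400, 4], [3000, 6]], dtype=float)

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
X_mm = MinMaxScaler().fit_transform(X)     # each column: min 0, max 1

print("Standardised column means:", X_std.mean(axis=0).round(6))
print("Standardised column stds: ", X_std.std(axis=0).round(6))
print("Min-max column ranges:    ", X_mm.min(axis=0), X_mm.max(axis=0))
```

After either transform the two columns live on comparable scales, so no single weight dominates the gradient. In a real project the scaler would be fitted on the training split only.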

Polynomial Regression Core

What if the relationship between x and y is curved, not linear? We can handle this without leaving the linear regression framework — by creating polynomial features: x², x³, and so on. We then fit a standard linear regression to these expanded features. The model is still linear in its parameters (w₀, w₁, w₂...) even though the fit is a curve in the original x space.

Polynomial Regression (degree d)
ŷ = w₀ + w₁x + w₂x² + w₃x³ + ... + w_d·xᵈ
sklearn: PolynomialFeatures(degree=d).fit_transform(X) → then LinearRegression()

The risk: higher polynomial degree → tighter fit on training data → worse generalisation to new data. With 13 or fewer training points, a degree-12 polynomial can pass through every one of them exactly (zero training error) while oscillating wildly between them. It has memorised the data rather than learned the pattern.
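The effect is easy to reproduce. The sketch below fits polynomials of degree 1, 3, and 12 to 15 noisy samples of a sine curve and reports training and test error for each. The target function, noise level, and sample sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fn(x):
    """Assumed ground-truth curve for the demo."""
    return np.sin(3 * x)


rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=15).reshape(-1, 1)
x_test = rng.uniform(-1, 1, size=200).reshape(-1, 1)
y_train = true_fn(x_train).ravel() + rng.normal(0, 0.2, size=15)
y_test = true_fn(x_test).ravel() + rng.normal(0, 0.2, size=200)

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    # Expand x into [x, x^2, ..., x^degree], then fit ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  test MSE={test_mse[degree]:.4f}")
```

Training error can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: once it starts climbing while training error keeps falling, the model is fitting noise.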

Polynomial Degree vs Model Complexity — the overfitting spectrum

Regularisation: Ridge, Lasso, ElasticNet In-depth

Overfitting happens because the model is too complex for the data — it fits noise rather than signal. Regularisation is the standard fix: add a penalty term to the loss function that discourages large weights. Smaller weights = simpler model = less overfitting. The penalty strength is controlled by a hyperparameter λ (lambda).

Regularised Loss Functions
Ridge (L2): J = MSE + λ · Σ wᵢ² → penalise large weights
Lasso (L1): J = MSE + λ · Σ |wᵢ| → can zero weights out
ElasticNet: J = MSE + λ · [α·Σ|wᵢ| + (1−α)·Σwᵢ²]
λ = 0 → no regularisation (plain MSE). λ → ∞ → all weights forced to zero.

Ridge (L2)

Adds λ·Σwᵢ² to the loss. Penalises large weights by squaring them. Weights shrink toward zero but never exactly reach zero. Keeps all features — just with smaller influence.

Best when all features are genuinely useful
Stable — handles correlated features well
sklearn: Ridge(alpha=λ)

Lasso (L1)

Adds λ·Σ|wᵢ| to the loss. The absolute value creates corners in the constraint region — the optimal solution tends to land exactly on an axis, driving some weights to exactly zero. Built-in feature selection.

Best when many features are irrelevant
Produces sparse models
sklearn: Lasso(alpha=λ)

ElasticNet

Combines L1 and L2 penalties. Gets some feature selection (from L1) with the stability of L2 when features are correlated. Controlled by a mixing ratio α.

Best when groups of correlated features exist
More flexible than pure Ridge or Lasso
sklearn: ElasticNet(alpha=λ, l1_ratio=α)
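To see the difference in behaviour, the sketch below fits all three models to synthetic data in which only two of eight features matter, then counts how many coefficients each model sets exactly to zero. The data, the true weights, and the penalty strengths are assumptions picked for the demo; sklearn's internal scaling of the penalty also differs slightly from the textbook formulas.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: only the first two of eight features carry signal (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(0, 1.0, size=200)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.5),
    "elasticnet": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))  # coefficients that are exactly zero
    print(f"{name:10s} coef={model.coef_.round(2)}  exact zeros={zero_counts[name]}")
```

Ridge keeps every feature with a small non-zero weight, while Lasso removes irrelevant ones outright, which is the built-in feature selection described above.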

Why Lasso Creates Sparse Weights (Feature Selection)

Choosing Regularisation Strength — bias-variance tradeoff

Property	Ridge (L2)	Lasso (L1)	ElasticNet
Penalty	λ·Σwᵢ²	λ·Σ|wᵢ|	α·L1 + (1−α)·L2
Feature selection	No — weights shrink, never zero	Yes — exact zeros possible	Partial
Output	Dense — all weights non-zero	Sparse — many weights = 0	Semi-sparse
Best when	All features relevant	Many irrelevant features	Correlated + sparse
sklearn	Ridge(alpha=λ)	Lasso(alpha=λ)	ElasticNet(alpha=λ, l1_ratio=α)

The Bias-Variance Tradeoff In-depth

Why does any ML model fail? The expected prediction error can always be decomposed into three components. Understanding them is essential for diagnosing model problems and choosing the right fix.

Bias-Variance Decomposition
E[(y − ŷ)²] = Bias²(ŷ) + Variance(ŷ) + σ²
Bias = how wrong on average | Variance = sensitivity to training data | σ² = irreducible noise

High Bias (Underfitting)

Model too simple — can't capture the true pattern.

Fix: more features, higher polynomial degree, less regularisation, more complex model.

High Variance (Overfitting)

Model too complex — memorises noise.

Fix: more training data, regularisation (Ridge/Lasso), simpler model, dropout, early stopping.

Bias-Variance Tradeoff — the fundamental tension in ML

Regularisation is bias-variance control. Increasing λ raises bias (simpler model) and reduces variance (less sensitive to training data). The optimal λ sits at the bottom of the U-shaped validation error curve — found by cross-validation, not guesswork.
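Choosing λ by cross-validation can be sketched with sklearn's `RidgeCV`, which tries each candidate penalty and keeps the one with the best validation score. The synthetic data and the grid of candidate values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data: 20 features, only the first one informative (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=60)

# Candidate regularisation strengths on a log scale (sklearn calls lambda "alpha").
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

best_alpha = model.alpha_
print(f"Selected alpha (lambda): {best_alpha:.4g}")
```

If the selected value sits at either end of the grid, widen the grid: the bottom of the U-shaped validation curve probably lies outside the range tried.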

📋 Chapter 3.2 — Key Takeaways

Linear regression: ŷ = w·x + b — learn the weight and bias that minimise MSE on training data
MSE penalises large errors quadratically; RMSE is in the same units as y — more interpretable for reporting
Gradient descent: iteratively nudge w and b in the direction that reduces MSE — the algorithm behind all of deep learning
Feature scaling (StandardScaler) is essential: unscaled features create elongated cost bowls that slow convergence dramatically
Polynomial features extend linear regression to curves — but higher degree risks overfitting (zero training error, high test error)
Ridge (L2) shrinks weights; Lasso (L1) zeros weights (feature selection); ElasticNet combines both
Bias-Variance: Error = Bias² + Variance + σ². Regularisation trades more bias for less variance — find the sweet spot with cross-validation

3.3

Chapter 3.3

Classification — Predicting Categories

Classification is supervised learning for discrete outputs: given an input, which bucket does it belong to? Logistic regression, k-nearest neighbours, and Naive Bayes each answer that question from a completely different angle — probabilistic, geometric, and statistical.

Logistic Regression In-Depth

Despite its name, logistic regression is a classifier, not a regressor. It predicts P(y=1 | x) — the probability that an input belongs to class 1. The trick is the sigmoid function, which squashes any real number from −∞ to +∞ into the range (0, 1), making the output interpretable as a probability.

Why not use ordinary linear regression for classification? A linear model can predict values well below 0 or well above 1, which are meaningless as probabilities. Passing the linear output through a sigmoid fixes the range, but if you keep MSE as the loss the resulting surface is non-convex, so gradient descent is not guaranteed to find the global minimum. The sigmoid + binary cross-entropy combination fixes both problems.

Core Equations: Logistic Regression
σ(z) = 1 / (1 + e⁻ᶻ)
Sigmoid function: squashes any real z to the (0, 1) range
ŷ = σ(wx + b)
Logistic regression model: output is a probability
Loss = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
Binary Cross-Entropy: convex loss, amenable to gradient descent

Decision rule: predict class 1 if P(y=1|x) ≥ 0.5, else class 0. The threshold 0.5 is the default but can be tuned — lower it to increase recall (catch more positives), raise it to increase precision (fewer false alarms).
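The sigmoid, the cross-entropy loss, and the effect of moving the threshold can all be seen on four hand-picked scores. The scores, labels, and the alternative threshold of 0.3 below are hypothetical values chosen to make the effect visible.

```python
import numpy as np


def sigmoid(z):
    """Squash real-valued scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


z = np.array([-2.0, -0.5, 0.5, 3.0])  # hypothetical linear scores w*x + b
y_true = np.array([0, 1, 1, 1])       # hypothetical labels

y_prob = sigmoid(z)
loss = binary_cross_entropy(y_true, y_prob)

pred_default = (y_prob >= 0.5).astype(int)  # default threshold
pred_lenient = (y_prob >= 0.3).astype(int)  # lower threshold -> more positives flagged

print("Probabilities:", y_prob.round(3))
print(f"BCE loss: {loss:.3f}")
print("Threshold 0.5:", pred_default)
print("Threshold 0.3:", pred_lenient)
```

At 0.5 the second example (probability about 0.38) is missed; at 0.3 it is caught, raising recall from 2/3 to 3/3 on this toy set. On real data a lower threshold usually costs some precision in return.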

Logistic Regression — sigmoid output and linear decision boundary

# Logistic Regression — sklearn walkthrough
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Probabilities (first 5): {y_prob[:5].round(3)}")

Sigmoid & Decision Boundaries Core

Logistic regression produces linear decision boundaries — a straight line in 2D, a hyperplane in higher dimensions. This is fast and interpretable, but it fails when the true boundary between classes is curved or complex.

Three escape routes from linearity: polynomial features (add x₁², x₁·x₂, ... as new inputs: same model, richer boundary), the kernel trick (implicitly maps to a very high-dimensional space, used in SVMs), or simply switch to a non-linear model (decision trees, neural networks). The softmax function generalises the sigmoid to K classes: each class gets a score zₖ, and softmax converts all K scores into a probability distribution that sums to 1.

Decision Boundary Complexity — linear to non-linear classifiers

Multiclass Classification Core

When there are K > 2 classes, three strategies extend binary classification:

1️⃣

One-vs-Rest (OvR)

Train K binary classifiers. Classifier k asks: "is this class k vs everything else?" At prediction, pick the class with the highest confidence score.

K models trained
Fast, simple
Works with any binary classifier

2️⃣

One-vs-One (OvO)

Train K(K−1)/2 classifiers — one for every pair of classes. At prediction, majority vote wins. Works well when individual classifiers are fast to train.

K(K−1)/2 models
Better for SVMs
Slower for large K

∑

Softmax (Multinomial)

Single model with K output scores. Softmax converts them to probabilities that sum to 1. The modern default for neural networks and logistic regression.

1 model
Probabilistic output
End-to-end trainable
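The model counts for the two decomposition strategies can be confirmed with sklearn's multiclass wrappers. This sketch assumes a synthetic four-class problem and logistic regression as the underlying binary classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (K = 4), assumed for illustration.
X, y = make_classification(
    n_samples=400,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)  # K binary classifiers
n_ovo = len(ovo.estimators_)  # K(K-1)/2 pairwise classifiers
print(f"OvR trains {n_ovr} models, OvO trains {n_ovo} models")
```

With K = 4 the gap is small (4 versus 6), but OvO grows quadratically: 10 classes already need 45 pairwise models.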

Softmax: Multiclass Probability
P(y=k | x) = e^(zₖ) / Σⱼ e^(zⱼ)
Output: K probabilities, each in (0, 1), all summing to 1
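A minimal NumPy version of softmax, with the standard max-subtraction trick for numerical stability. The example scores are arbitrary.

```python
import numpy as np


def softmax(z):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


scores = np.array([2.0, 1.0, 0.1])  # hypothetical class scores z_k
probs = softmax(scores)
print("Probabilities:", probs.round(3), "sum =", probs.sum())

# With K = 2 and the second score fixed at 0, softmax reduces to the sigmoid.
z = 1.5
two_class = softmax(np.array([z, 0.0]))[0]
print(np.isclose(two_class, 1.0 / (1.0 + np.exp(-z))))
```

The last check is why softmax is described as the multiclass generalisation of the sigmoid: the binary case is recovered exactly.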

One-vs-Rest — K binary classifiers for K-class problem

K-Nearest Neighbours In-Depth

"You are who your neighbours are." KNN is the simplest meaningful classifier: to classify a new point, find the K nearest training points by distance and take a majority vote.

KNN is a lazy learner — there is no training phase. The algorithm memorises all training data and performs all computation at prediction time. This makes training instant but prediction potentially slow on large datasets (O(n·d) per query — must compute distance to every training point).

KNN Core Math
Euclidean (L2): d(x, y) = √Σᵢ(xᵢ − yᵢ)²
Most common metric; sensitive to scale differences
ŷ = majority_vote({y₁, y₂, …, yₖ}) over the K nearest neighbours
Also: Manhattan (L1): Σᵢ|xᵢ−yᵢ| | Minkowski: (Σᵢ|xᵢ−yᵢ|ᵖ)^(1/p)

Choosing K: K=1 memorises training data perfectly (zero training error) but is extremely sensitive to noise and outliers — overfitting. K=n (all points) just predicts the majority class — underfitting. The optimal K is found via cross-validation, typically in the range 3–15. Odd K values avoid ties in binary classification.
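The whole algorithm fits in a few lines. The sketch below is a from-scratch version using Euclidean distance and a majority vote. The six training points and the helper name `knn_predict` are made up for illustration; in practice `sklearn.neighbors.KNeighborsClassifier` with cross-validated K is the usual choice.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                            # indices of the k closest points
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Two small, well-separated groups (hypothetical data).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))
print(knn_predict(X_train, y_train, np.array([6.0, 6.2]), k=3))
```

Notice there is no fit step at all: every prediction scans the full training set, which is exactly the O(n·d) per-query cost noted above.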

KNN with K=1, K=5, K=15 — effect on decision boundary

KNN Strengths	KNN Weaknesses
No training time: instant fit	Slow prediction: O(n·d) per query
Naturally handles multi-class problems	Requires entire training set in memory
Non-linear, complex decision boundaries	Sensitive to irrelevant / noisy features
Simple, transparent, easy to explain	Feature scaling mandatory (distance-based)
Good non-parametric baseline	Curse of dimensionality in high-dim spaces

Naive Bayes Core

Naive Bayes is a probabilistic classifier built on Bayes' theorem. The "naive" assumption is that all features are conditionally independent given the class — in practice this is almost never true, yet the algorithm performs remarkably well, especially on text classification tasks.

Bayes' Theorem Applied to Classification
P(y | x₁,…,xₙ) ∝ P(y) · Πᵢ P(xᵢ | y)
Prior × product of likelihoods; the naive independence assumption lets us factorise
ŷ = argmax_y P(y) · P(x₁|y) · P(x₂|y) · … · P(xₙ|y)
Pick the class y that maximises the joint probability

Why does it work despite the false assumption? The discriminative signal — the difference between class probabilities — is often large enough that the independence error doesn't flip the argmax. In high-dimensional sparse data (like text), there's also very little co-occurrence signal to exploit anyway.

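Worked example (spam filter): Given the five training emails below, compute the Naive Bayes decision for the new message "free click prize". The word "prize" is not in the training vocabulary, so it contributes nothing to either score.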

Email	"free"	"click"	"meeting"	"report"	Label
Email 1	1	1	0	0	spam
Email 2	1	0	0	0	spam
Email 3	0	0	1	1	ham
Email 4	0	0	0	1	ham
Email 5	1	1	0	0	spam

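# Priors: P(spam) = 3/5 = 0.6  |  P(ham) = 2/5 = 0.4
# Raw estimates: P(free|spam)=3/3  P(click|spam)=2/3  P(free|ham)=0/2  P(click|ham)=0/2
# A zero likelihood would wipe out the whole product, so apply Laplace smoothing
# to every estimate. For binary features: (count + 1) / (emails in class + 2)
# P(free|spam)=4/5=0.80  P(click|spam)=3/5=0.60
# P(free|ham)=1/4=0.25   P(click|ham)=1/4=0.25

# Score(spam) = 0.6 × 0.80 × 0.60 = 0.288
# Score(ham)  = 0.4 × 0.25 × 0.25 = 0.025
prediction = "spam"  # argmax wins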
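The same toy table can be run through sklearn's `BernoulliNB` as a sanity check. This is a sketch, not a reproduction of the hand calculation: `BernoulliNB` with `alpha=1.0` applies Laplace smoothing to every estimate and also multiplies in the probability of the absent words ("meeting", "report"), so its probabilities differ from a score built from the present words only. The encoding of the new message as a binary vector is an assumption.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: "free", "click", "meeting", "report" (1 = word present).
X = np.array([
    [1, 1, 0, 0],  # Email 1, spam
    [1, 0, 0, 0],  # Email 2, spam
    [0, 0, 1, 1],  # Email 3, ham
    [0, 0, 0, 1],  # Email 4, ham
    [1, 1, 0, 0],  # Email 5, spam
])
y = np.array(["spam", "spam", "ham", "ham", "spam"])

model = BernoulliNB(alpha=1.0).fit(X, y)  # alpha=1.0 is Laplace (add-one) smoothing

# "free click prize": "prize" is outside the vocabulary, so only free/click are set.
x_new = np.array([[1, 1, 0, 0]])
prediction = model.predict(x_new)[0]
p_spam = model.predict_proba(x_new)[0, list(model.classes_).index("spam")]

print("Prediction:", prediction)
print(f"P(spam | message) = {p_spam:.3f}")
```

The absolute numbers depend on the smoothing convention, but the argmax, and therefore the decision, is unchanged.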

Variant	Feature Type	Use Case	Example
Gaussian NB	Continuous (normal dist)	Iris classification, sensors	Height/weight data
Multinomial NB	Count data (non-negative int)	Text classification (TF)	Word frequency in emails
Bernoulli NB	Binary (0 / 1)	Binary feature classification	Word present / absent
Complement NB	Count data	Imbalanced text classification	Large corpus, class imbalance

Algorithm Comparison Core

Algorithm	Training	Prediction	Interpretable?	Non-linear?	Best For
Logistic Regression	O(n·d·iter)	Fast O(d)	Yes	No (poly features needed)	Linearly separable, probability output needed
KNN	None (lazy)	Slow O(n·d)	Intuitive	Yes (local)	Small dataset, non-linear baseline
Naive Bayes	Very Fast O(n·d)	Very Fast O(K·d)	Yes	Partially (Gaussian)	Text classification, high-dim sparse data

Decision Regions — Logistic Regression vs KNN vs Naive Bayes (two-moons dataset)

📋 Chapter 3.3 — Key Takeaways

Logistic Regression outputs P(y=1|x) via the sigmoid function — linear decision boundary, interpretable coefficients, trained with Binary Cross-Entropy
The decision threshold (default 0.5) can be tuned — lower for higher recall, raise for higher precision; always use the probability output, not just the class label
KNN: no training phase — classify by majority vote of K nearest neighbours. Requires feature scaling; slow at prediction O(n·d); ideal as a non-linear baseline
Naive Bayes: Bayes' theorem + conditional independence assumption — surprisingly accurate despite the naive assumption; very fast; best-in-class for text classification
Multiclass strategies: OvR (K classifiers), OvO (K(K−1)/2 classifiers), or Softmax (single model, modern default)
Linear classifiers fail on non-linear data — use polynomial features, kernel trick, or switch to a non-linear model (decision trees, neural networks)

3.4

Chapter 3.4

Decision Trees & Ensemble Methods

A single decision tree mirrors human reasoning but overfits easily. Ensemble methods — Random Forests and Gradient Boosting — combine hundreds of trees to become the most powerful classical ML algorithms on tabular data.

Decision Trees In-Depth

Decision trees mirror how humans naturally make decisions: a series of if/else questions on features, arriving at a prediction at the leaves. Each internal node tests one feature, each branch is an outcome of that test, and each leaf holds the class label or numeric prediction. The CART algorithm (Classification And Regression Trees), which scikit-learn implements in an optimised form, handles both classification and regression and naturally captures non-linear relationships. CART itself can split on numerical and categorical features, but scikit-learn's trees expect numeric input, so categorical features must be encoded first.

Decision Tree — Iris flower classification (depth 3)

# Decision Tree — sklearn
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
tree.fit(X, y)

# Print tree structure as text
print(export_text(tree, feature_names=list(iris.feature_names)))

# Feature importances
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")

Splitting Criteria In-Depth

At each node, CART exhaustively searches all features and all thresholds to find the split that maximises purity in the child nodes. Three measures of impurity are used in practice:

Gini Impurity

CART default for classification. Probability of misclassifying a random sample. Range: 0 (pure) → 0.5 (balanced binary).

Gini = 1 − Σ pᵢ²

Entropy / Info Gain

Used by ID3 and C4.5 algorithms. Measures disorder. Information Gain = parent entropy − weighted child entropy.

H = −Σ pᵢ log₂(pᵢ)

Variance Reduction

For regression trees. Choose the split that minimises the weighted variance of child nodes — equivalent to minimising MSE.

min Σ wₖ · Var(yₖ)

Gini vs Entropy: both peak when the classes are perfectly balanced (p = 0.5)

Worked example — which split wins? Parent node: 30 samples (20 blue, 10 red). Gini = 1 − (20/30)² − (10/30)² = 0.444.

Split	Left child	Right child	Weighted Gini	Result
Split A	15 blue, 5 red → Gini = 0.375	5 blue, 5 red → Gini = 0.500	0.417	Gain = 0.027
Split B	19 blue, 1 red → Gini = 0.095	1 blue, 9 red → Gini = 0.180	0.123	Gain = 0.321 ✓ Winner
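The split comparison is easy to recompute. A minimal sketch in plain Python, using the class counts from the worked example; the helper names `gini` and `weighted_gini` are illustrative, not part of any library:

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left, right):
    """Size-weighted average impurity of two child nodes."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)

parent = [20, 10]  # 20 blue, 10 red
splits = {
    "A": ([15, 5], [5, 5]),
    "B": ([19, 1], [1, 9]),
}

gains = {}
for name, (left, right) in splits.items():
    gains[name] = gini(parent) - weighted_gini(left, right)
    print(f"Split {name}: weighted Gini = {weighted_gini(left, right):.3f}, gain = {gains[name]:.3f}")

best_split = max(gains, key=gains.get)
```

CART performs this same comparison for every feature and every candidate threshold, keeping the split with the largest gain.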

Overfitting & Pruning Core

An unconstrained decision tree keeps splitting until every leaf is pure, often down to a single sample per leaf: perfect training accuracy, poor test accuracy. This is the classic overfit problem. Two families of solutions exist: pre-pruning (stop growing early) and post-pruning (grow full, then cut).

Pre-Pruning Hyperparameters

max_depth — maximum depth of tree
min_samples_split — min samples to split a node
min_samples_leaf — min samples in any leaf
max_features — max features considered per split
min_impurity_decrease — min gain required to split

Post-Pruning (Cost Complexity)

Grow full tree to max depth
Calculate cost-complexity criterion for subtrees
Remove branches with lowest improvement-to-size ratio
Select best pruned tree via cross-validation
sklearn: ccp_alpha parameter controls pruning strength
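The post-pruning steps above can be sketched with scikit-learn's cost-complexity pruning path. The dataset (breast cancer) and the 5-fold selection loop are assumptions chosen for illustration; clipping the alphas at zero guards against tiny negative values from floating-point error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1: grow the full tree and compute the pruning path (candidate alphas)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha, which prunes the tree down to the root alone
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# Step 2: score each candidate alpha with 5-fold cross-validation
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_train, y_train, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]

# Step 3: refit with the chosen alpha and compare tree sizes
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"Leaves: full = {full_tree.get_n_leaves()}, pruned = {pruned_tree.get_n_leaves()}")
print(f"Test accuracy (pruned): {pruned_tree.score(X_test, y_test):.3f}")
```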

Tree Depth vs Overfitting — hyperparameter max_depth is critical

Bagging & Random Forests In-Depth

A single decision tree has high variance: small perturbations to the training data can produce very different trees. Bagging (Bootstrap Aggregating) addresses this by training N trees on N bootstrapped datasets (sampled with replacement) and averaging their predictions, or taking a majority vote for classification. To the extent that the trees' errors are independent, they cancel out in the average.

Random Forest = Bagging + random feature subset at each split. The critical insight: if you use all features at each node, most trees will pick the same dominant feature at the root, making the trees highly correlated, so they tend to make the same mistakes. By restricting each node to a random subset of features (√d is a common choice for classification), the trees become decorrelated, and averaging less-correlated errors reduces variance substantially. Each bootstrap sample leaves out about 37% of the training samples (≈ 1/e); this out-of-bag (OOB) set gives a free validation estimate.
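Both claims in the paragraph above can be checked numerically. The sketch below first measures the share of samples that one bootstrap draw leaves out, then reads the OOB accuracy from a scikit-learn forest. The dataset and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Part 1: fraction of samples left out of one bootstrap sample
rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)          # sample n indices with replacement
oob_fraction = 1 - np.unique(bootstrap_idx).size / n
print(f"Out-of-bag fraction: {oob_fraction:.3f}")   # close to 1/e

# Part 2: OOB accuracy from a random forest
X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",   # random subset of sqrt(d) features at each split
    oob_score=True,        # score each sample using only trees that never saw it
    n_jobs=-1,
    random_state=42,
)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```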

Random Forest — bagging + random feature subsets = decorrelated ensemble

Random Forest Strengths

Handles high-dimensional data well
Built-in feature importance scores
Robust to outliers and missing values
OOB error — free validation estimate
Fully parallelisable — fast training
One of the best off-the-shelf algorithms

Random Forest Weaknesses

Less interpretable than a single tree
Slow prediction for very large forests
Memory intensive — stores all trees
Poor extrapolation beyond training range
Not ideal for very high-cardinality categoricals

Gradient Boosting In-Depth

Gradient Boosting is fundamentally different from bagging. Trees are built sequentially — each new tree specifically targets the mistakes of the current ensemble. The key insight (Friedman, 1999): fit each new tree to the negative gradient of the loss function with respect to the current predictions. This is gradient descent — but in function space rather than parameter space.

The learning rate (shrinkage) controls how much each tree contributes: F(x) += learning_rate × T(x). A small learning rate requires more trees but generalises better. AdaBoost (the precursor) reweighted misclassified examples; modern Gradient Boosting is more general and works with any differentiable loss function.

Gradient Boosting: Prediction Update Rule
Fₘ(x) = Fₘ₋₁(x) + α · Tₘ(x)
Fₘ = ensemble after m trees | α = learning rate | Tₘ = new tree fit to residuals of Fₘ₋₁
Residual = −∂L(y, Fₘ₋₁(x)) / ∂Fₘ₋₁(x)
Negative gradient of the loss: the "direction of steepest descent" in function space
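The update rule can be written out by hand for the squared-error case, where the negative gradient is simply the residual y − F. This is a minimal sketch on synthetic data (a noisy sine curve, an assumption made for illustration), not a replacement for a production library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1   # alpha in the update rule
n_trees = 100

# F_0: constant prediction (the mean minimises squared loss)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)
trees = []

for m in range(n_trees):
    # For L = 1/2 (y - F)^2 the negative gradient is the ordinary residual y - F
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=m).fit(X, residual)
    # F_m = F_{m-1} + alpha * T_m
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

For other differentiable losses only the `residual` line changes: it becomes the negative gradient of that loss at the current predictions.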

Gradient Boosting — sequential error correction

Random Forest / Bagging	Gradient Boosting
Trees trained in parallel, independently	Trees trained sequentially; each corrects the previous ones
Primarily reduces variance	Primarily reduces bias; can also reduce variance
Robust; hard to overfit	Can overfit if learning_rate too high or too many trees
Fast training (parallelisable)	Slower (sequential), but XGBoost/LightGBM are optimised
Great off-the-shelf baseline	Higher ceiling; frequent Kaggle competition winner

XGBoost, LightGBM & CatBoost Core

Vanilla gradient boosting is slow — it re-evaluates every possible split at every node. The three production-grade libraries solve this with fundamentally different approaches, making gradient boosting practical on millions of rows.

⚡

XGBoost (2016)

Chen & Guestrin. Regularised gradient boosting, second-order gradients (Newton step), hardware-optimised cache-aware computation. The competition standard for years.

Best general-purpose default
Excellent documentation
Wide ecosystem support

🚀

LightGBM (2017)

Microsoft. Histogram-based splits (buckets values → much faster), leaf-wise tree growth (vs level-wise), GOSS sampling and EFB feature bundling. Often substantially faster than XGBoost on large data.

Large datasets (>100K rows)
High-cardinality features
Speed matters

🎯

CatBoost (2017)

Yandex. Native categorical feature handling (no manual encoding), ordered boosting to prevent target leakage, symmetric (oblivious) trees for fast prediction.

Many categorical features
Minimal tuning needed
Avoids target leakage

Feature	XGBoost	LightGBM	CatBoost
Speed	Good	Fastest	Good
Memory usage	High	Low	Medium
Categoricals	Manual encoding	Partial support	Native support
Tuning effort	Moderate	Moderate	Minimal
Tree growth	Level-wise	Leaf-wise	Oblivious trees
Best for	General tabular	Large-scale, fast	Many categoricals

# XGBoost: binary classification example
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Carve a validation set out of the training data for early stopping,
# so the test set stays untouched until the final evaluation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

model = xgb.XGBClassifier(
    n_estimators=200,          # maximum number of trees
    learning_rate=0.1,         # shrinkage: smaller steps generalise better
    max_depth=4,               # tree depth
    subsample=0.8,             # row sampling per tree
    colsample_bytree=0.8,      # feature sampling per tree
    eval_metric='logloss',
    early_stopping_rounds=20,  # stop if no improvement for 20 rounds (constructor argument in XGBoost >= 1.6)
    random_state=42,
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],  # monitored for early stopping
    verbose=False,
)

y_prob = model.predict_proba(X_test)[:, 1]   # probability of positive class
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")

Ensemble Comparison Core

Ensemble Methods — Performance vs Complexity Spectrum

If you have tabular data, try Random Forest first as your baseline. If you need the best possible accuracy, use XGBoost or LightGBM with cross-validated hyperparameter tuning. Gradient-boosted trees have been among the most frequent winning approaches in Kaggle competitions on tabular data.

📋 Chapter 3.4 — Key Takeaways

Decision tree: recursive binary splits using Gini or Entropy — interpretable but prone to high-variance overfitting without pruning
Bagging: train N trees on bootstrapped datasets and aggregate to reduce variance — errors average out when trees are independent
Random Forest: bagging + random feature subsets at each node = decorrelated trees = one of the strongest off-the-shelf algorithms
Gradient Boosting: sequential error correction — each new tree fits the negative gradient of the loss; controlled by learning rate (shrinkage)
XGBoost / LightGBM / CatBoost: production-grade gradient boosting — dominant on structured/tabular data for a decade
Ensemble principle: combining many diverse, weak learners often outperforms a single strong learner

3.5

Chapter 3.5

Support Vector Machines

SVMs answer a deceptively simple question: of all the hyperplanes that separate two classes, which one generalises best? The answer — the maximum margin hyperplane — leads to one of the most elegant and mathematically rigorous algorithms in machine learning.

Maximum Margin Classifier In-Depth

Many hyperplanes can separate two linearly-separable classes. SVMs choose the one that maximises the margin — the perpendicular distance between the boundary and the nearest training points on each side. A wider margin means more room for error on unseen data, giving better generalisation.

The decision boundary is the hyperplane w·x + b = 0. The two margin planes are w·x + b = +1 and w·x + b = −1. The margin width is 2/||w||, so maximising the margin is equivalent to minimising ||w||². This is a constrained quadratic optimisation problem — convex, so it has a unique global solution.

SVM: Hard Margin Optimisation
Objective: minimise ½||w||²
Equivalent to maximising the margin 2/||w||
Subject to: yᵢ(w·xᵢ + b) ≥ 1 for all i
Every point must be on the correct side of its margin plane
Decision: ŷ = sign(w·x + b)

SVM — Maximum Margin Hyperplane and Support Vectors

Support Vectors Core

Support vectors are the training points that lie exactly on the margin boundaries (w·x + b = ±1). They are the only points that matter: if you removed every non-support vector from the training set and retrained, you'd get the exact same model. This is a profound property — the decision boundary is defined by a sparse subset of the data.

What they are

Training points on or inside the margin boundary. Typically just a small fraction of all training points — often fewer than 10%.

Why they matter

They are the only points that influence the boundary. Non-support vectors can be removed, or moved anywhere that stays outside the margin, without changing the model.

Model complexity

Fewer support vectors → simpler, more generalisable model. Many support vectors → complex boundary, may overfit. Good diagnostic metric.

The SVM optimisation is solved in its dual formulation using Lagrange multipliers — one multiplier αᵢ per training point. Non-support vectors have αᵢ = 0, so they contribute nothing. The prediction is: ŷ = sign(Σ αᵢ yᵢ K(xᵢ, x) + b), summing only over support vectors.
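The dual prediction formula can be checked against a fitted scikit-learn model. In the sketch below, `dual_coef_` holds the products αᵢyᵢ for the support vectors only, and the manual sum over support vectors is compared with the library's own decision values. The two-moons data and the values of C and γ are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 0.5
svc = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# dual_coef_ stores alpha_i * y_i, for the support vectors only
alpha_times_y = svc.dual_coef_[0]
support_vectors = svc.support_vectors_
print(f"{len(support_vectors)} of {len(X)} training points are support vectors")

# Decision value: sum_i alpha_i y_i K(x_i, x) + b
K = rbf_kernel(support_vectors, X, gamma=gamma)      # shape (n_SV, n_samples)
manual_decision = alpha_times_y @ K + svc.intercept_[0]

matches = np.allclose(manual_decision, svc.decision_function(X))
print("Manual decision values match sklearn:", matches)
```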

Soft Margin SVM — The C Parameter In-Depth

The hard-margin SVM requires perfect linear separability — it fails completely if even one point is on the wrong side. Real data is always noisy and often overlapping. Soft-margin SVM introduces slack variables ξᵢ ≥ 0 that allow points to violate the margin. The C hyperparameter controls how heavily violations are penalised.

Soft Margin Objective
minimise: ½||w||² + C · Σ ξᵢ
ξᵢ = slack variable: 0 if correctly classified outside margin, >0 if inside or wrong side
High C → penalise violations heavily → small margin, few errors → overfit risk
Low C → tolerate violations → wide margin, more errors → underfit risk
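The effect of C on the margin can be observed directly with a linear SVM, where the margin width is 2/||w||. A small sketch on two overlapping synthetic clusters; the data and the two C values are assumptions picked to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]], cluster_std=1.0, random_state=0)

margins = {}
n_support = {}
for C in (0.01, 100.0):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(svc.coef_)   # margin width 2/||w||
    n_support[C] = int(svc.n_support_.sum())
    print(f"C={C}: margin width = {margins[C]:.2f}, support vectors = {n_support[C]}")
```

A low C should give the wider margin, with more points falling inside it and becoming support vectors.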

SVM C Parameter — bias-variance tradeoff via margin width

The Kernel Trick In-Depth

A linear SVM can only draw straight boundaries. But many real datasets are not linearly separable in their original feature space. One solution is to manually engineer polynomial or interaction features — but for high-dimensional data this is computationally prohibitive. The kernel trick solves this elegantly: it allows the SVM to operate in an implicitly high-dimensional feature space without ever computing the transformation explicitly.

The key insight: the SVM dual formulation only requires computing dot products between feature vectors — never the vectors themselves. A kernel function K(xᵢ, xⱼ) computes the dot product that would result from a high-dimensional mapping φ, without actually applying φ. For the RBF kernel, this implicit feature space is infinite-dimensional.

The Kernel Identity
K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
Kernel = dot product in high-D space, computed without ever computing φ(x)
RBF: K(x, z) = exp(−γ ||x − z||²)
Points close together → K≈1 (similar) | Far apart → K≈0 (dissimilar)
High γ = narrow influence = complex boundary | Low γ = wide influence = smooth boundary
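The kernel identity can be verified by hand for a case where φ is small enough to write down. The sketch below uses the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)² on 2-D inputs, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). This kernel is chosen for illustration because the RBF feature map is infinite-dimensional and cannot be written out.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (no bias term)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def rbf_kernel_value(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # dot product computed in the 3-D feature space
implicit = poly2_kernel(x, z)       # same number, computed in the original 2-D space
print(explicit, implicit)

print(rbf_kernel_value(x, x, gamma=1.0))   # identical points: similarity 1
print(rbf_kernel_value(x, z, gamma=1.0))   # distant points: similarity near 0
```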

The Kernel Trick — separate non-linear data via implicit high-D mapping

Kernel Types Core

SVM Kernels — Linear, Poly(2), Poly(5), RBF on circular data

Kernel	Formula	Parameters	Best For	Limitation
Linear	x · z	None	Linearly separable, high-d text/NLP	Fails on non-linear data
Polynomial	(x · z + c)^d	d, c	Feature interactions (image data)	Sensitive to degree d choice
RBF / Gaussian	exp(−γ||x−z||²)	γ, C	General purpose, most used	Slow on very large datasets
Sigmoid	tanh(γx·z + c)	γ, c	Neural network approximation	Rarely the best choice

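# SVM with RBF kernel + GridSearch
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

# Example data (assumed here so the snippet runs on its own);
# hold out a test set before tuning
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# SVM requires feature scaling: always use a Pipeline, so the scaler
# is re-fitted on the training part of every CV fold
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(kernel='rbf', probability=True)),
])

param_grid = {
    'svm__C':     [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
}

grid_search = GridSearchCV(
    svm_pipeline, param_grid,
    cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params:   {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")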

SVR — Support Vector Regression Reference

Support Vector Regression (SVR) extends the SVM idea to continuous outputs. Instead of maximising a margin between classes, SVR fits a tube of width 2ε around the predictions. Points inside the tube incur zero loss — the model doesn't care about small errors. Only points outside the tube (the support vectors) contribute to the optimisation. This gives SVR a natural insensitivity to small noise.

SVR inherits all the power of kernels — with an RBF kernel, SVR can fit complex non-linear regression surfaces while remaining robust to outliers (thanks to the ε-insensitive loss). The tradeoff: SVR can be slow on large datasets (O(n²) to O(n³) training) and requires careful tuning of ε, C, and the kernel parameters.

SVR: ε-Insensitive Loss
L(y, ŷ) = max(0, |y − ŷ| − ε)
Zero loss inside the ε-tube | Linear loss outside, robust to outliers
Objective: minimise ½||w||² + C · Σ (ξᵢ + ξᵢ*)
ξᵢ, ξᵢ* = slack above/below tube | C = penalty for tube violations
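The ε-insensitive loss is short enough to compute directly. A minimal NumPy sketch; the function name and the sample values are illustrative:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: max(0, |y - y_hat| - epsilon)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 1.0]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction misses by 0.05, which is inside the tube and costs nothing. The other two are charged only for the part of the error that exceeds ε, so the loss grows linearly rather than quadratically with the miss.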

SVR Strengths	SVR Weaknesses
Robust to outliers (ε-insensitive loss)	Slow training O(n² to n³)
Works with any kernel; handles non-linear regression	Doesn't scale to large datasets (>50K rows)
Sparse solution: only support vectors stored	Requires careful tuning of ε, C, γ
Well-founded theoretical guarantees	Feature scaling mandatory

📋 Chapter 3.5 — Key Takeaways

SVM finds the maximum margin hyperplane — widest gap between classes = maximum robustness to noise
Only support vectors (points on/inside the margin) determine the boundary — all other training points are irrelevant
Soft margin C: high C = small margin, strict (overfit risk); low C = wide margin, tolerant (underfit risk)
Kernel trick: evaluate dot products in high-dimensional space without explicit transformation — makes non-linear SVMs tractable
RBF kernel is the most general — controlled by γ (boundary complexity) and C (margin strictness) — tune both via cross-validation
SVMs require feature scaling — always wrap in a Pipeline with StandardScaler

3.6

Chapter 3.6

Clustering & Unsupervised Learning

Unsupervised learning is the art of finding structure without labels. The data speaks for itself — the algorithm must discover what groups, components, or manifolds naturally exist. Clustering and dimensionality reduction are the two pillars.

K-Means Clustering In-Depth

K-Means partitions n points into K clusters by minimising within-cluster variance. The Lloyd algorithm iterates two steps — assignment and update — until centroids stop moving.

① Init: Random K centroids

② Assign: Each point → nearest centroid

③ Update: Centroid = cluster mean

④ Repeat: Until convergence

K-Means Objective: Minimise Inertia
J = Σₖ Σₓ∈Cₖ ||x − μₖ||²
μₖ = centroid of cluster k | Cₖ = set of points assigned to cluster k
Assignment: cᵢ = argminₖ ||xᵢ − μₖ||²
Update: μₖ = (1/|Cₖ|) Σᵢ∈Cₖ xᵢ
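The assignment and update steps translate almost line for line into NumPy. The sketch below is a bare-bones Lloyd loop on synthetic blobs, written for clarity rather than speed. The function name, the random initialisation from data points and the empty-cluster handling are choices made here for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns labels, centroids and the inertia history."""
    rng = np.random.default_rng(seed)
    # Init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(n_iter):
        # Assign: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        history.append(dists[np.arange(len(X)), labels].sum())   # inertia J
        # Update: each centroid moves to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                         # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids, np.array(history)

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])
labels, centroids, inertia_history = lloyd_kmeans(X, k=3)
print("Inertia per iteration:", np.round(inertia_history, 1))
```

The inertia never increases from one iteration to the next, which is why the loop always terminates, though possibly in a local minimum.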

K-Means Lloyd Algorithm — 4-Step Convergence

K-Means Issues & Choosing K Core

⚠️

Choosing K

Elbow method — plot inertia vs K; look for the kink
Silhouette score — measures separation quality (higher = better)
Gap statistic — compare inertia to random null reference
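The first two methods can be sketched with scikit-learn by fitting K-Means over a range of K and recording both inertia and silhouette score. The synthetic data has four well-separated clusters by construction, an assumption that makes the correct answer known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                        # elbow method: plot this against k
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia = {inertias[k]:.1f}, silhouette = {silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("Best k by silhouette:", best_k)
```

Inertia always falls as K grows, so it can only be read for a kink; the silhouette score has a genuine maximum, which makes it easier to automate.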

🎲

Initialisation Sensitivity

Random init can converge to suboptimal local minima
K-Means++: spread initial centroids far apart; expected inertia is provably within an O(log K) factor of optimal
Run multiple times with different seeds; pick lowest inertia

🔵

Spherical Cluster Assumption

Assumes clusters are roughly spherical and equal-size
Fails on elongated, ring-shaped, or crescent clusters
Use DBSCAN for arbitrary shapes, or a GMM for elliptical clusters

📍

Outlier Sensitivity

A single outlier can drag a centroid far from the true cluster center
K-Medoids (PAM) uses actual data points as centroids — more robust
Pre-filter outliers or use robust clustering

K-Means Failure Modes — Shape, Density, Outlier Sensitivity

Choosing K — Elbow Method and Silhouette Score

DBSCAN In-Depth

Density-Based Spatial Clustering of Applications with Noise. Clusters are dense regions of points separated by low-density space. Unlike K-Means, DBSCAN does not require specifying K — it discovers the number of clusters from the data.

⬤

Core Point

Has ≥ min_samples neighbours within radius ε. Anchor of a cluster.

◉

Border Point

Within ε of a core point, but has fewer than min_samples neighbours itself.

Noise / Outlier

Not a core point and not within ε of any core point. Labelled −1.

DBSCAN Hyperparameters
ε (eps): neighbourhood radius
Too small → almost everything is noise | Too large → everything merges into one cluster
min_samples: minimum neighbours to be a core point
Rule of thumb: min_samples ≥ D+1 (D = dimensions) | Use k-distance plot to choose ε
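A short scikit-learn sketch shows the two hyperparameters in use on the two-moons shape, which K-Means cannot separate. The values eps=0.2 and min_samples=5 are assumptions that suit this particular synthetic dataset; real data would need a k-distance plot or similar to choose ε.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                   # -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
n_core = len(db.core_sample_indices_)
print(f"clusters = {n_clusters}, core points = {n_core}, noise points = {n_noise}")
```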

DBSCAN — Core, Border, and Noise Points

K-Means vs DBSCAN — DBSCAN Handles Arbitrary Shapes

Hierarchical Clustering Core

Hierarchical clustering builds a full tree (dendrogram) of cluster merges. No need to specify K upfront — pick any cut height and read off the clusters.

Agglomerative (bottom-up) ↑	Divisive (top-down) ↓
Start: every point is its own cluster	Start: one cluster containing all points
Repeatedly merge the two closest clusters	Recursively split into sub-clusters
Common: Ward, Complete, Average, Single linkage	Less common; computationally expensive

Linkage criteria — how "distance between clusters" is measured:

Single

Min distance any two points across clusters. Tends to chain (elongated clusters).

Complete

Max distance. Produces more compact, balanced clusters.

Average

Average pairwise distance. Trade-off between single and complete.

Ward ★

Minimise increase in total within-cluster variance. Best general-purpose choice.

Dendrogram — Hierarchical Clustering / Cut to Get K Clusters

PCA — Principal Component Analysis In-Depth

PCA reduces dimensionality by finding the directions (principal components) of maximum variance. Each PC is orthogonal to all others — no redundancy. The technique is the eigendecomposition of the covariance matrix (equivalently, SVD of the data matrix).

① Center: Subtract mean from each feature

② Covariance: Σ = (1/n) XᵀX

③ Eigenvectors: Solve Σv = λv

④ Sort: Largest λ = most variance

⑤ Project: X_reduced = X · V_k

PCA: Key Equations
Covariance matrix: Σ = (1/n) XᵀX
X is mean-centered | Σ is symmetric positive semi-definite
Eigendecomposition: Σ = V Λ Vᵀ
V = matrix of eigenvectors (principal components) | Λ = diagonal matrix of eigenvalues
Explained variance ratio: λₖ / Σⱼ λⱼ
Choose k PCs so that cumulative ratio ≥ 90–95%
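The PCA equations can be followed step by step in NumPy and checked against scikit-learn. The Iris data is an illustrative choice. scikit-learn normalises by n − 1 rather than n, which changes the eigenvalues by a constant factor but leaves the explained variance ratios identical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # scaling also centres the data
n = X.shape[0]

cov = (X.T @ X) / n                                    # covariance of the centred data
eigenvalues, eigenvectors = np.linalg.eigh(cov)        # eigh: symmetric input, ascending order
order = np.argsort(eigenvalues)[::-1]                  # sort by largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
k = 2
X_reduced = X @ eigenvectors[:, :k]                    # project onto the top-k components

sk_pca = PCA(n_components=k).fit(X)
print("Manual ratios: ", np.round(explained_ratio[:k], 4))
print("sklearn ratios:", np.round(sk_pca.explained_variance_ratio_, 4))
```

The projected coordinates may differ from scikit-learn's by a sign flip per component, since an eigenvector and its negation are equally valid.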

PCA — Finding Directions of Maximum Variance

Scree Plot — How Many Principal Components to Keep

When to use PCA: high-dimensional data (>50 features), visualisation (reduce to 2D/3D), noise reduction, or to remove multicollinearity before regression. Always scale features first (StandardScaler) — PCA is variance-based and sensitive to scale.

t-SNE & UMAP Core

PCA is linear — it cannot preserve complex non-linear structure. t-SNE and UMAP are non-linear manifold methods that excel at visual exploration of high-dimensional data.

🌀

t-SNE

Models pairwise similarities as probabilities in high-D
Maps to 2D/3D so similar points stay together, dissimilar push apart
Uses t-distribution in 2D to avoid the crowding problem
Perplexity hyperparameter: effective neighbourhood size (5–50)
Stochastic — different runs give different layouts
O(n²) — slow on large datasets

⚡

UMAP

Topological approach — models data on a Riemannian manifold
Faster than t-SNE (sub-quadratic)
Better preserves global structure alongside local
Can be used for dimensionality reduction (not just visualisation)
n_neighbors: controls local vs global structure trade-off
More stable across runs than t-SNE
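As a minimal sketch of the t-SNE workflow, the snippet below embeds a 300-image subset of scikit-learn's digits data into 2D. The subset size and perplexity are illustrative assumptions. UMAP is not shown because it lives in a separate third-party package rather than in scikit-learn.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data[:300], digits.target[:300]   # small subset keeps the O(n^2) cost low

tsne = TSNE(
    n_components=2,
    perplexity=30,      # effective neighbourhood size
    init="pca",
    random_state=0,     # fixes the layout; other seeds give different pictures
)
embedding = tsne.fit_transform(X)               # shape (300, 2), for plotting only
print(embedding.shape)
```

Colouring a scatter plot of `embedding` by `y` is the usual next step. Distances between far-apart groups in such a plot should not be over-interpreted.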

PCA vs t-SNE on High-Dimensional Digit Data

PCA	t-SNE / UMAP
Linear projection	Non-linear manifold embedding
Fast: O(nd²) where d = dimensions	Slow: O(n²) for t-SNE; faster for UMAP
Preserves global variance structure	Preserves local neighbourhood structure
Can be used for downstream ML features	t-SNE: visualisation only; axes meaningless
Deterministic: same result every run	Stochastic: results vary per run (t-SNE)

📋 Chapter 3.6 — Key Takeaways

K-Means: assign each point to its nearest centroid, recompute centroids — fast but assumes spherical, equal-size clusters
DBSCAN: discovers clusters as dense regions separated by low-density space — no K required, naturally handles noise and outliers
Hierarchical clustering: builds a dendrogram showing the full merge history — cut at any height to get the desired number of clusters
PCA: eigendecomposition of the covariance matrix → linear projection retaining maximum variance — always scale features first
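t-SNE / UMAP: non-linear 2D/3D embeddings for exploration and intuition; t-SNE output should not be used as features for downstream ML, whereas UMAP can also serve as general dimensionality reduction
Rule of thumb: use PCA for feature compression before ML; use t-SNE for visualisation only, and UMAP for visualisation or non-linear reduction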

3.7

Chapter 3.7

Model Evaluation, Metrics & Validation

A model is only as good as how you measure it. The wrong metric gives you false confidence; the right metric reveals exactly where your model fails. Evaluation is not an afterthought — it is built into every design decision.

Confusion Matrix In-Depth

The confusion matrix is the foundation of all classification metrics. For binary classification it is a 2×2 table of outcomes that decomposes prediction errors into two fundamentally different types — false positives and false negatives — which carry very different costs depending on the domain.

✅

True Positive (TP)

Predicted positive, actually positive. Correct detection.

✅

True Negative (TN)

Predicted negative, actually negative. Correct rejection.

⚠️

False Positive (FP) — Type I Error

Predicted positive, actually negative. "False alarm." E.g. healthy patient flagged — bad but manageable.

🚨

False Negative (FN) — Type II Error

Predicted negative, actually positive. "Missed case." E.g. cancer patient cleared — potentially fatal.

Confusion Matrix — the foundation of all classification metrics

Classification Metrics In-Depth

🎯

Accuracy

(TP + TN) / N

Fraction of all predictions that are correct. Useless on imbalanced data. Predicting "no anthrax" for every email gives 99.99% accuracy — meaningless.

🔍

Precision

TP / (TP + FP)

Of all predicted positives, how many are actually positive? Use when false positives are costly — spam filter, treatment recommendation.

🔭

Recall (Sensitivity)

TP / (TP + FN)

Of all actual positives, how many did we find? Use when false negatives are costly — cancer screening, fraud detection, security alerts.

⚖️

F1 Score

2·P·R / (P + R)

Harmonic mean of precision and recall. Use when you care about both equally. Fβ: β>1 weighs recall more; β<1 weighs precision more.

Classification Metrics: Reference
Accuracy = (TP + TN) / N
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
Specificity = TN / (TN + FP) ← True Negative Rate
N = total samples | All metrics in range [0, 1] unless stated otherwise
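The reference formulas can be computed from the four confusion-matrix counts and compared with scikit-learn's built-in functions. The ten labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} specificity={specificity:.3f}")
```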

Precision-Recall Curve — choose operating point based on cost tradeoff

Metric	Formula	Use When	Example
Accuracy	(TP+TN)/N	Balanced classes	Handwriting recognition
Precision	TP/(TP+FP)	FP costly	Spam filter, treatment recommendation
Recall	TP/(TP+FN)	FN costly	Cancer screening, fraud detection, security
F1 Score	2PR/(P+R)	Balance P and R	Information retrieval, general NLP
AUC-ROC	Area under ROC	Imbalanced, ranking	Credit scoring, medical diagnosis
MCC	Matthews Corr.	Highly imbalanced	Genomics, rare event detection

ROC Curve & AUC In-Depth

The ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) against False Positive Rate at every possible decision threshold. AUC (Area Under the Curve) collapses this to a single number: the probability that the model ranks a random positive example higher than a random negative example.

AUC = 0.5

Random guessing. Model has zero discrimination ability. Diagonal line.

AUC = 0.7–0.9

Good to excellent discrimination. Acceptable for many real-world problems.

AUC = 1.0

Perfect classifier. L-shaped ROC curve hugging top-left corner.

ROC Curves — Comparing 4 Classifiers (AUC: 0.50 → 0.96)

ROC & AUC: Key Definitions
TPR (Recall) = TP / (TP + FN) ← y-axis of ROC
FPR = FP / (FP + TN) ← x-axis of ROC (= 1 − Specificity)
AUC = P(score(positive) > score(negative))
AUC is threshold-independent: it evaluates model quality across ALL decision thresholds
Prefer AUC over accuracy for imbalanced classification problems
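The ranking interpretation of AUC can be confirmed by brute force: compare every positive score with every negative score and count how often the positive wins. The scores below are synthetic, drawn so that positives tend to score higher, and ties are counted as half a win.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
# Hypothetical scores: positives tend to score higher, with plenty of overlap
scores = rng.normal(loc=y_true * 1.0, scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(f"pairwise = {pairwise_auc:.4f}, roc_auc_score = {sklearn_auc:.4f}")
```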

Regression Metrics Core

For regression problems, evaluation metrics measure how far predictions deviate from true values. The choice of metric determines what kind of errors your model is penalised for — which in turn shapes training and interpretation.

Regression Metrics: Reference
MAE = (1/n) Σ|yᵢ − ŷᵢ|
Mean Absolute Error | same units as y | more robust to outliers than MSE
MSE = (1/n) Σ(yᵢ − ŷᵢ)²
Mean Squared Error | units are y² | penalises large errors heavily
RMSE = √MSE
Root MSE | same units as y | most commonly reported
R² = 1 − SS_res / SS_tot
Coefficient of determination | proportion of variance explained | 1.0 = perfect | 0 = predicting the mean
MAPE = (100%/n) Σ |yᵢ − ŷᵢ| / |yᵢ|
Mean Absolute Percentage Error | interpretable % | undefined when yᵢ = 0
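Each regression metric is one or two lines of NumPy. The sketch below computes them by hand on four made-up values and leaves scikit-learn's equivalents imported for comparison.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum(errors ** 2)                       # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot
mape = np.mean(np.abs(errors) / np.abs(y_true)) * 100   # undefined if any y_true is 0

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} MAPE={mape:.1f}%")
```

The single large miss (2.5 predicted as 4.0) contributes 2.25 of the 3.5 total squared error but only 1.5 of the 3.0 total absolute error, which is the outlier sensitivity the table describes.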

Metric	Range	Units	Outlier Sensitivity	Best For
MAE	[0, ∞)	Same as y	Robust	When outliers are real and shouldn't dominate
MSE	[0, ∞)	y²	Sensitive	Training objective — differentiable everywhere
RMSE	[0, ∞)	Same as y	Moderate	Reporting — interpretable, same unit as target
R²	(-∞, 1]	Unitless	Moderate	Comparing models on the same dataset
MAPE	[0%, ∞)	Percentage	Sensitive (small y)	Business reporting, relative error

R² < 0 is possible — it means your model is worse than simply predicting the mean. A model with R² = 0.85 explains 85% of the variance in the target. Always pair R² with a residual plot to check for systematic bias.

Cross-Validation In-Depth

A single train/test split gives a high-variance performance estimate — one unlucky split can make a great model look bad. Cross-validation rotates the validation set across the full dataset, giving a robust, low-variance estimate of generalisation performance.

5-Fold Cross-Validation — rotate validation fold to use all data

Method	Description	Folds	Data Efficiency	Use When
Hold-out	Single 80/20 split	1	Low	Large dataset (>50K), fast iteration
K-Fold	Rotate K folds	K (5–10)	High	Medium dataset, standard practice
Stratified K-Fold	Maintain class ratio per fold	K	High	Imbalanced classification
Time-Series Split	Respect temporal order, no future leakage	K	Medium	Time-series data
LOOCV	Each sample is its own fold (K=n)	n	Maximum	Very small dataset (<100 samples)
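Two of the strategies in the table are sketched below with scikit-learn: stratified 5-fold scoring of a scaled logistic regression, and a time-series split on twelve ordered rows. The dataset and model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 5-fold: every fold keeps roughly the same class ratio as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Time-series split: training indices always come before validation indices
ordered = np.arange(12).reshape(-1, 1)
ts_folds = list(TimeSeriesSplit(n_splits=3).split(ordered))
for train_idx, val_idx in ts_folds:
    print("train:", train_idx, "validate:", val_idx)
```

Reporting the mean together with the standard deviation across folds shows how much the estimate depends on the particular split.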

Data Leakage In-Depth

Data leakage is the most common and dangerous ML mistake in production. It occurs when information that would not be available at prediction time (test rows, future data, or the target itself) influences the model, artificially inflating validation performance so the model appears to work when it does not.

🎯

Target Leakage

A feature is a consequence of the target, not a cause. E.g., using "days hospitalised" to predict "admitted to hospital" — the feature only exists after the event.

🔀

Train-Test Contamination

Preprocessing (scaling, imputation, encoding) fitted on the full dataset before splitting. The scaler has seen test data — leakage!

⏱️

Temporal Leakage

Future data used to predict past events in time-series. Training on data from 2024 to predict 2023 outcomes.

📋

Duplicate Leakage

Same sample appears in both train and test due to duplicates in the raw dataset. Model memorises rather than generalises.

Data Leakage — always split before fitting any preprocessor

Data leakage is why models perform brilliantly in development and catastrophically in production. The number one rule: split your data first. Fit ALL preprocessing — scalers, encoders, imputers — ONLY on training data. Use sklearn's Pipeline to enforce this automatically.
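The split-first rule can be sketched with a scikit-learn Pipeline. In the example below the scaler's statistics are checked to confirm that they come from the training rows alone. The dataset and model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit preprocessing and model together, on the training data only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# The scaler's statistics come from the training rows alone
scaler_mean = pipeline.named_steps["scaler"].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))

# 3. The test set is only transformed and scored, never fitted on
test_accuracy = pipeline.score(X_test, y_test)
print(f"Scaler fitted on training data only: {uses_train_only}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Passing the same pipeline to a cross-validation routine extends the guarantee to every fold, because the scaler is re-fitted on each fold's training part.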

📋 Chapter 3.7 — Key Takeaways

Confusion matrix: foundation of classification evaluation — TP, TN, FP, FN — understand what each error type costs in your domain
Choose metric by cost: Recall when FN is costly (cancer, fraud); Precision when FP is costly (spam, treatment); F1 when both matter equally
AUC-ROC: threshold-independent; the probability that the model ranks a random positive above a random negative. A strong single summary metric, though under heavy class imbalance it can look optimistic, so check the precision-recall curve as well
R² for regression: proportion of variance explained — 1.0 is perfect, 0 is same as predicting the mean, negative means worse than mean
K-Fold CV: rotate K folds for robust, low-variance performance estimates — always use Stratified K-Fold for classification
Data leakage is the most common production failure: always split first, fit preprocessors only on training data — use sklearn Pipeline
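The metric definitions in these takeaways can be checked by hand. The sketch below uses hypothetical confusion-matrix counts and toy regression targets, with helper names invented for this example; it omits guards for zero denominators to stay short.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the proportion of variance explained."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = classification_metrics(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y_true, y_true)                # predictions match exactly
r2_mean = r_squared(y_true, [2.5] * 4)                # always predict the mean
r2_worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])    # worse than the mean
print(r2_perfect, r2_mean, r2_worse)
```

The three R² cases correspond to the three regimes in the takeaway: a perfect fit, a model no better than the mean, and a model worse than the mean.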


🎓 Domain 3 Complete — Classical Machine Learning

Ch 3.1 — The ML Landscape. Supervised, unsupervised, semi-supervised, and reinforcement learning. Mitchell's definition: learn from experience E to improve at tasks T measured by P. ML ≠ explicit programming — it is function approximation from data.
Ch 3.2 — Linear & Polynomial Regression. Minimise MSE via gradient descent. Regularisation (Ridge / Lasso) prevents overfitting by penalising large weights. Bias-variance tradeoff: high bias = underfits, high variance = overfits. The fundamental tension in all ML.
Ch 3.3 — Logistic Regression, KNN, Naive Bayes. Logistic regression uses the sigmoid to output probabilities; decision boundary is linear. KNN is instance-based — no training, but slow at inference and sensitive to the curse of dimensionality. Naive Bayes applies Bayes' theorem with the strong (but effective) conditional independence assumption.
Ch 3.4 — Decision Trees & Ensembles. Trees choose the split that most reduces impurity, measured by Gini or entropy (information gain). Single trees overfit; ensemble methods mitigate this. Bagging (Random Forest) reduces variance via bootstrap sampling. Boosting (Gradient Boosting, XGBoost) reduces bias by sequentially correcting errors. Gradient-boosted trees such as XGBoost remain a leading choice for structured (tabular) data.
Ch 3.5 — Support Vector Machines. Maximum-margin hyperplane: only the support vectors determine it. The soft-margin parameter C controls tolerance for margin violations. The kernel trick computes inner products in a high-dimensional feature space directly from the original inputs, enabling non-linear decision boundaries without explicit feature maps. The RBF kernel is a common general-purpose default; always scale features.
Ch 3.6 — Clustering & Unsupervised Learning. K-Means minimises inertia — fast but assumes spherical clusters and needs K upfront. DBSCAN discovers clusters as dense regions — no K required, handles noise and arbitrary shapes. Hierarchical clustering builds a full dendrogram — cut at any height for K clusters. PCA: eigenvectors of the covariance matrix for maximum-variance projection. t-SNE / UMAP for non-linear 2D/3D visualisation only.
Ch 3.7 — Model Evaluation & Validation. Confusion matrix is the foundation: TP/TN/FP/FN. Choose metric by error cost: Recall when FN is fatal (cancer), Precision when FP is costly (spam). AUC-ROC is threshold-independent, but under heavy class imbalance it can look optimistic, so check precision-recall as well. K-Fold CV gives robust, low-variance performance estimates. Data leakage, the most common production failure, is prevented by splitting before fitting any preprocessor (use Pipeline).

Domain 3 gave you the full classical ML toolkit — the algorithms that power the majority of production ML systems today. Every algorithm here is a specific answer to the same question: how do we build a function that generalises from training data to unseen examples? Domain 4 takes this further into deep neural networks — where the function is parameterised by millions of learned weights.
