AI Foundation ยท Domain 03 ยท Chapter 3.1

The ML Landscape

Supervised, unsupervised, semi-supervised, and reinforcement learning โ€” the map before the territory

3.1
Chapter 3.1
The ML Landscape & Learning Paradigms

Machine learning is not one thing. It is a family of approaches united by one idea: let the computer find the pattern in the data, rather than programming the pattern by hand. Everything in this domain is a specific way of doing that.

In 1959, Arthur Samuel coined the term machine learning while building a self-improving checkers program, defining it as "the ability to learn without being explicitly programmed." The most precise formulation came from Tom Mitchell in 1997:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Mitchell, T. โ€” Machine Learning, McGraw Hill, 1997

Concrete example โ€” applying Mitchell's definition to a spam filter:

T
TaskClassify incoming emails as spam or not-spam
E
Experience50,000 labelled emails โ€” each marked spam / not-spam by human reviewers
P
PerformanceClassification accuracy on new, previously-unseen emails โ€” ideally 99%+

As the filter processes more labelled emails (more E), its accuracy on new mail (P at T) improves โ€” that's the entire definition in one sentence.

The critical insight is the inversion of the programming paradigm. In traditional software you write rules ("if subject contains 'FREE MONEY' โ†’ spam"). In ML, you provide examples and the algorithm discovers the rules automatically โ€” rules often far too complex and numerous for any human to encode by hand.

The ML Inversion โ€” from programming rules to learning them
Traditional ML Data + Rules programmed by humans Program computer Output / Answers predictions Data + Answers labelled examples ML Algorithm computer Rules / Model learned automatically ML inverts the paradigm โ€” rules emerge from data rather than being hand-written

Supervised learning is the most widely deployed form of ML. The word "supervised" means a teacher is present: you provide labelled training examples โ€” pairs of (input X, correct answer y) โ€” and the algorithm learns a mapping f: X โ†’ y that generalises to new, unseen inputs. Think of a student studying with an answer key: they infer underlying rules from many (question, answer) pairs, then apply those rules to questions they've never seen.

Tasks split into two types by the nature of y:

Regression โ€” y is continuous
e.g. house price, temperature, stock return
Classification โ€” y is a discrete label
e.g. spam/ham, cat/dog, benign/malignant

The choice of output type determines the loss function, output layer, and evaluation metric.

Supervised Learning Loop โ€” predict, measure error, update
Labelled Training Data {(xแตข, yแตข)} โ€” N examples Model f(x ; ฮธ) parameters ฮธ Prediction ลท = f(x;ฮธ) model output Loss L(y, ลท) measure of error Update ฮธ gradient descent updated weights ฮธ' backprop / grad descent

The learning signal comes from comparing the model's prediction ลท to the true label y via a loss function. Gradient descent nudges the model's parameters in the direction that reduces the loss. Repeat this across millions of examples and the model converges to a good approximation of f. Supervised learning's key limitation is label dependency: human-annotated examples are expensive, slow, and often scarce in specialised domains.

PropertyRegressionClassification
Output typeContinuous value (โ„)Discrete category
Loss functionMSE, MAE, HuberCross-entropy, Hinge
Output layerLinear (single node)Softmax / Sigmoid
EvaluationRMSE, Rยฒ, MAEAccuracy, F1, AUC-ROC
ExampleHouse price predictionEmail spam detection
AlgorithmsLinear regression, Ridge, SVRLogistic reg, SVM, Random Forest

Unsupervised learning removes the teacher entirely. You provide raw inputs X โ€” no answer key, no rewards. The algorithm must discover hidden structure on its own. Think of giving someone a pile of unlabelled photographs and asking them to organise it: they'll naturally group similar images together even without being told what "similar" means.

Three major tasks fall under unsupervised learning:

Clustering Groups similar data points โ€” k-means, DBSCAN, hierarchical clustering
Dim Reduction Compresses to fewer dimensions while preserving structure โ€” PCA, t-SNE, UMAP
Density Est. Models the underlying probability distribution โ€” Gaussian Mixture Models, KDE
Three Unsupervised Learning Tasks
โ‘  Clustering Discover groups in unlabelled data โ‘ก Dim Reduction High-D PCA 2-D Compress high-D data to fewer dims โ‘ข Density Estimation Model the probability distribution

The unsupervised challenge is evaluation: without ground-truth labels, how do you know if a clustering is good? Metrics like silhouette score (cohesion vs separation) and reconstruction error (for autoencoders) provide proxies, but ultimately require domain-expert judgment. Despite this, unsupervised learning is invaluable โ€” it can operate on the vast stores of unlabelled data that exist in every organisation.

Between the extremes of "all data labelled" (supervised) and "no data labelled" (unsupervised) lies a practical middle ground that powers most of modern AI.

Semi-supervised learning combines a small labelled set with a large unlabelled set. Even without labels, the geometric structure of the data (clusters, manifolds) constrains which labels are plausible: two nearby points probably share a label. Algorithms like label propagation, self-training, and co-training exploit this structure to achieve strong performance even when labels cover only 1โ€“5% of data โ€” critical in medical imaging where expert annotation is expensive.

Self-supervised learning is the most important paradigm of the past decade. It creates a training signal directly from raw data, eliminating human annotation entirely. The core trick: mask or corrupt part of the input, then train the model to reconstruct the missing part.

BERTMasks 15% of tokens and predicts them โ†’ learns language understanding
GPTPredicts the next token โ†’ learns to generate coherent text
SimCLRTwo augmented views of same image must have similar embeddings โ†’ learns visual features

Self-supervised learning is how every major language model is pre-trained. GPT-4 processed trillions of tokens โ€” each token is the label for the previous one. This technique made trillion-parameter model training economically feasible without any human annotation at scale.

๐Ÿท๏ธ

Semi-Supervised Learning

A few labelled points anchor the model; unlabelled points fill in the space via geometric structure. Like teaching a child 5 examples of "cat" and letting them identify 10,000 more.

  • Medical AI: 100 labelled scans + 10,000 unlabelled
  • Drug discovery: few active compounds + vast chemical space
  • Web classification: small annotated corpus + crawled text
๐Ÿค–

Self-Supervised Learning

Labels come from the data structure itself โ€” the supervision signal is derived from the input. No human annotation required.

  • BERT: predict masked tokens โ†’ understand language
  • GPT: predict next token โ†’ generate coherent text
  • SimCLR: match augmented views โ†’ learn visual features
  • WAV2Vec: predict masked audio โ†’ speech recognition

Reinforcement learning (RL) is the third major paradigm. An agent takes actions in an environment and receives numerical rewards or penalties based on those actions. No labelled examples exist โ€” only a reward signal over time. The agent must learn, through trial and error, which sequences of actions lead to the highest cumulative reward.

The central challenge is the exploration-exploitation dilemma: should the agent exploit what it already knows works, or explore new actions that might be better? Too much exploitation โ†’ stuck in a local optimum. Too much exploration โ†’ never commits to a strategy.

AlphaGoreward = winning the game
Robot locomotionreward = distance without falling
RLHF for LLMsreward = human preference rating
๐ŸŽฎ

Game Playing

AlphaGo / AlphaZero learned Go, chess, and shogi from self-play with game outcome as reward. Beat world champions within days of training.

๐Ÿค–

Robotics

Reward = physical task completion. Boston Dynamics gaits, OpenAI Dactyl hand, DeepMind locomotion โ€” all RL-trained policies.

๐Ÿ’ฌ

RLHF for LLMs

Humans rate responses โ†’ train a reward model โ†’ use PPO to fine-tune the LLM toward preferred outputs. The technique that made ChatGPT helpful.

๐Ÿ“Œ Full RL coverage โ€” Q-learning, policy gradients, PPO, actor-critic โ€” is in Domain 7: Reinforcement Learning. Here we establish only where RL fits in the ML landscape.

Every ML project โ€” from a weekend Kaggle competition to a production system at Google โ€” follows the same general pipeline. Understanding this sequence prevents the most common failure modes: models trained on wrong data, evaluated on contaminated test sets, or deployed without monitoring.

โ‘  Problem Definition What to optimise?
โ‘ก Data Collection Gather raw data
โ‘ข EDA Explore & understand
โ‘ฃ Preprocessing Clean & transform
โ‘ค Modelling Train & tune
โ‘ฅ Evaluation Measure performance
โ‘ฆ Deployment Serve & monitor

โ‘  Problem Definition

Frame the task precisely: regression or classification? What is the prediction target? What does success look like numerically? Check whether ML is even needed โ€” sometimes a rule-based system is simpler and more reliable.

โ‘ก Data Collection

Gather from APIs, databases, web scraping, sensors, or surveys. Assess quantity (enough to generalise?) and quality (representative? biased? stale?). Data collection is often 60โ€“70% of total project time in practice.

โ‘ข EDA โ€” Exploratory Data Analysis

Histograms, scatter plots, correlation matrices, missing value heatmaps, class imbalance checks, outlier detection. Goal: understand your data before touching a model. Surprises found here cost 10ร— less than surprises found post-deployment.

โ‘ฃ Preprocessing

Missing value imputation (mean, median, KNN), feature scaling (StandardScaler, MinMaxScaler), categorical encoding (one-hot, ordinal, target encoding), train/val/test split. Critical rule: fit all transformers on training data only โ€” never on validation or test data.

โ‘ค Modelling

Start with a baseline (majority-class classifier, mean predictor). Then iterate: logistic regression โ†’ random forest โ†’ gradient boosting โ†’ neural network. Use cross-validation. Tune with grid search, random search, or Bayesian optimisation (Optuna).

โ‘ฅ Evaluation

Choose metrics aligned with business goals. Accuracy misleads on imbalanced classes โ€” prefer F1, precision/recall, or AUC-ROC. Never evaluate on training data. Final evaluation on the held-out test set โ€” used exactly once.

โ‘ฆ Deployment & Monitoring

Serve via REST API (FastAPI, Flask, Triton) or batch pipeline. Monitor for data drift (input distribution shifts), concept drift (relationship between X and y changes), and performance degradation. Set retraining triggers. Production models without monitoring degrade silently โ€” one of the most common failure modes in real ML systems.

The pipeline is not linear โ€” it is a loop. EDA may force you to redefine the problem. Evaluation may send you back to data collection. Deployment monitoring may trigger the whole cycle again. Budget time for at least 3 full iterations.

The most common mistake early practitioners make is defaulting to neural networks or the most complex algorithm available. The right question is always: what data do I have, and what do I need to output?

Choosing a Learning Paradigm โ€” Decision Flow
What data do you have? Choose paradigm from data situation Labelled (X, y) Few labels + many unlabelled Unlabelled only Reward signal SUPERVISED Learn f: X โ†’ y SEMI-SUPERVISED Propagate scarce labels UNSUPERVISED Discover structure REINFORCEMENT Max cumulative reward y continuous y category Regression Linear Reg, SVR Classification SVM, RF, NNs Group? Compress? Clustering k-means, DBSCAN Dim Reduction PCA, UMAP, t-SNE Start with the simplest appropriate paradigm โ€” complexity is not a virtue in ML
ParadigmData RequirementTypical TasksKey AlgorithmsWhen to Use
SupervisedLabelled pairs (X, y)Classification, RegressionLinear Reg, RF, SVM, NNsYou have labels and a clear prediction target
UnsupervisedUnlabelled X onlyClustering, Compression, Anomalyk-means, PCA, DBSCAN, AutoencodersNo labels; explore structure; detect anomalies
Semi-supervisedFew labels + many unlabelledClassification with scarce labelsLabel propagation, self-training, MixMatchLabelling expensive; abundant unlabelled data
Self-supervisedRaw unlabelled X (large scale)Pre-training LLMs, vision encodersBERT, GPT, SimCLR, MAEHuge unlabelled datasets; build foundation model
ReinforcementReward signal from environmentGames, robotics, RLHF, tradingQ-learning, PPO, SAC, DQNSequential decisions; feedback from consequences

๐Ÿ“‹ Chapter 3.1 โ€” Key Takeaways

  • ML = let the algorithm find patterns instead of programming them explicitly โ€” the key inversion of the traditional software paradigm
  • Mitchell's definition: learn from experience E on tasks T as measured by performance P โ€” applies to everything from spam filters to AlphaFold
  • Supervised: labelled {X, y} pairs โ†’ learn f: X โ†’ y. Two subtypes: regression (continuous) and classification (categorical)
  • Unsupervised: no labels โ†’ discover hidden clusters, compress via dimensionality reduction, or model density โ€” evaluation is harder without ground truth
  • Self-supervised: labels come from the data itself โ€” predicting masked or next tokens โ€” how GPT and BERT are pre-trained at trillion-token scale with zero human annotation
  • Reinforcement: learn from reward signals, not labels โ€” powers AlphaGo, robot locomotion, and RLHF for LLMs (full coverage: Domain 7)
  • The ML pipeline: Problem โ†’ Data โ†’ EDA โ†’ Preprocess โ†’ Model โ†’ Evaluate โ†’ Deploy & Monitor โ€” a loop, not a one-way sequence
  • Paradigm choice: do you have labels? A reward signal? Or only raw data? โ€” always start with the simplest method that fits your data situation
3.2
Chapter 3.2
Regression โ€” Predicting Continuous Values

Regression is the simplest form of supervised learning โ€” and the best place to understand how ML really works. Every concept here (loss functions, gradient descent, overfitting, regularisation) reappears in every algorithm from SVMs to transformers.

The goal of regression is to predict a continuous numerical output from one or more inputs. The simplest version โ€” simple linear regression โ€” assumes the relationship between input x and output y is a straight line. We fit that line to the data by finding the optimal slope (weight) and intercept (bias).

Our running example: predicting a house's sale price from its size in square footage. We assume price increases roughly linearly with size, with some noise from other unmodelled factors (location, condition, year built). The model makes predictions using:

Linear Regression Model ลท = w ยท x + b ลท = predicted price | w = weight (slope) | x = input feature (sqft) | b = bias (intercept)

The model has just two parameters โ€” w (how much price increases per unit of size) and b (the base price for a zero-size house, which anchors the line). Training means finding the values of w and b that make the line fit the data as closely as possible.

Linear Regression โ€” finding the best-fit line through data
0 5 10 15 20 25 30 Square footage (100s) 0 100 200 300 400 500 House Price ($1000s) residual = y โˆ’ ลท ลท = 16.2ยทx + 42.5
from sklearn.linear_model import LinearRegression import numpy as np X = np.array([[3],[5],[7],[10],[15],[20],[25]]) # sq footage (100s) y = np.array([85, 120, 145, 200, 300, 380, 460]) # price ($1000s) model = LinearRegression() model.fit(X, y) print(f"Weight (slope): {model.coef_[0]:.2f}") # 16.20 print(f"Bias (intercept): {model.intercept_:.2f}") # 42.50 print(f"Predict 2000 sqft: ${model.predict([[20]])[0]:.0f}k") # $367k

To fit a line we need to measure how wrong our current w and b are. The residual for one prediction is the gap between the actual value and the predicted value: residual = y โˆ’ ลท. If we just summed residuals, positive and negative errors would cancel. Instead we square them.

Mean Squared Error (MSE) averages the squared residuals across all n training examples. Squaring does two things: it makes all errors positive, and it penalises large errors far more than small ones (a 2ร— larger error contributes 4ร— the cost). MSE is also smooth everywhere โ€” which means we can take its derivative and use gradient descent.

Loss Functions for Regression MSE = (1/n) ยท ฮฃแตข (yแตข โˆ’ ลทแตข)ยฒ MAE = (1/n) ยท ฮฃแตข |yแตข โˆ’ ลทแตข| RMSE = โˆšMSE โ† same units as y, more interpretable n = number of samples | yแตข = true value | ลทแตข = predicted value

A worked example โ€” 4 predictions: actual = [200, 250, 300, 350], predicted = [190, 270, 290, 380]. Residuals: [10, โˆ’20, 10, โˆ’30]. Squared: [100, 400, 100, 900]. MSE = (100+400+100+900)/4 = 375. RMSE = โˆš375 โ‰ˆ $19.4k โ€” that's the typical prediction error in the same units as house price.

MSE Cost Function โ€” the bowl we're trying to reach the bottom of
1D Cost Landscape (fixed b) Weight w MSE Cost J(w) w=16.2 Optimal w too small w too large 2D Cost Landscape (contours) Weight w โ†’ Bias b โ†’ min J(w,b) Start

MSE โ€” Use When

  • Outliers are genuine data (not noise)
  • Large errors should be penalised heavily
  • You need a differentiable loss for gradient descent
  • Default choice for regression โ€” sklearn, PyTorch default

MAE โ€” Use When

  • Data has many outliers you want to ignore
  • Median-like behaviour is preferred over mean
  • House price datasets with luxury anomalies
  • Not differentiable at 0 โ€” needs subgradient methods

We have a cost function J(w, b) โ€” now we need to minimise it. Two approaches exist. The Normal Equation solves for the optimal weights analytically in one shot. Gradient Descent takes many small iterative steps, using the derivative of J to determine which direction to move.

Normal Equation (closed-form solution) w = (Xแต€X)โปยน ยท Xแต€y Exact solution โ€” no iterations needed. Expensive for large feature counts (O(nยณ) matrix inversion).
Gradient Descent Update Rules (one step) w โ† w โˆ’ ฮฑ ยท (2/n) ยท ฮฃ (ลทแตข โˆ’ yแตข) ยท xแตข b โ† b โˆ’ ฮฑ ยท (2/n) ยท ฮฃ (ลทแตข โˆ’ yแตข) ฮฑ = learning rate (step size) | n = number of samples | repeat until convergence

The intuition: the gradient (derivative of J w.r.t. w) tells you the slope of the cost bowl at your current position. Subtract a fraction (ฮฑ) of that slope from w to move toward the minimum. If the gradient is positive, w is too large โ€” decrease it. If negative, w is too small โ€” increase it. The learning rate ฮฑ controls how large each step is โ€” too large and you overshoot; too small and training takes forever.

Gradient descent is not just for linear regression โ€” it's the engine that trains every neural network, transformer, and deep learning system. Understanding it here, where the math is simple, makes every downstream algorithm clearer.

AspectNormal EquationGradient Descent
Formulaw = (Xแต€X)โปยนXแต€yIterative โˆ‚J/โˆ‚w updates
Speed โ€” small nFast, one computationSlow โ€” many iterations
Speed โ€” large nVery slow O(nยณ)Scales well, mini-batches
MemoryMust invert full Xแต€X matrixCan stream mini-batches
ConvergenceAlways exactNeeds learning rate tuning
Best for< 1,000 featuresLarge data, neural networks

Real predictions need multiple inputs. House price depends not just on square footage, but on bedrooms, bathrooms, location score, age, and dozens of other features. Multiple linear regression extends the model to handle any number of features:

Multiple Linear Regression ลท = wโ‚xโ‚ + wโ‚‚xโ‚‚ + wโ‚ƒxโ‚ƒ + ... + wโ‚™xโ‚™ + b Matrix form: ลท = Xยทw + b Each feature xโฑผ has its own weight wโฑผ. X is the (n ร— p) feature matrix.

With multiple features, feature scaling becomes critical. If xโ‚ (square footage) ranges from 500โ€“3000 and xโ‚‚ (bedrooms) ranges from 1โ€“6, the gradient for wโ‚ is tiny compared to wโ‚‚. This creates an elongated cost bowl where gradient descent zig-zags inefficiently. Scaling all features to a comparable range makes the bowl round and descent fast.

Why Feature Scaling Matters for Gradient Descent
Without Scaling Elongated cost bowl โ†’ zig-zag Start wโ‚ (sqft: 500โ€“3000) wโ‚‚ (beds: 1โ€“6) With Scaling Circular bowl โ†’ fast direct path Start wโ‚' (scaled) wโ‚‚' (scaled)
Feature Scaling Methods StandardScaler: x' = (x โˆ’ ฮผ) / ฯƒ โ†’ zero mean, unit variance MinMaxScaler: x' = (x โˆ’ min) / (max โˆ’ min) โ†’ range [0, 1] Use StandardScaler as the default. MinMaxScaler when you need a bounded range (e.g., image pixels).

What if the relationship between x and y is curved, not linear? We can handle this without leaving the linear regression framework โ€” by creating polynomial features: xยฒ, xยณ, and so on. We then fit a standard linear regression to these expanded features. The model is still linear in its parameters (wโ‚€, wโ‚, wโ‚‚...) even though the fit is a curve in the original x space.

Polynomial Regression (degree d) ลท = wโ‚€ + wโ‚x + wโ‚‚xยฒ + wโ‚ƒxยณ + ... + wโ‚xแตˆ sklearn: PolynomialFeatures(degree=d).fit_transform(X) โ†’ then LinearRegression()

The risk: higher polynomial degree โ†’ tighter fit on training data โ†’ worse generalisation to new data. A degree-12 polynomial can pass through every training point perfectly (zero training error) while oscillating wildly between them โ€” it has memorised the data rather than learned the pattern.

Polynomial Degree vs Model Complexity โ€” the overfitting spectrum
Degree 1 Underfitting (High Bias) Train error: HIGH Test error: HIGH Degree 3 Good Fit Train error: LOW Test error: LOW Degree 12 Overfitting (High Variance) Train error: ~0 Test error: HIGH

Overfitting happens because the model is too complex for the data โ€” it fits noise rather than signal. Regularisation is the standard fix: add a penalty term to the loss function that discourages large weights. Smaller weights = simpler model = less overfitting. The penalty strength is controlled by a hyperparameter ฮป (lambda).

Regularised Loss Functions Ridge (L2): J = MSE + ฮป ยท ฮฃ wแตขยฒ โ†’ penalise large weights Lasso (L1): J = MSE + ฮป ยท ฮฃ |wแตข| โ†’ can zero weights out ElasticNet: J = MSE + ฮป ยท [ฮฑยทฮฃ|wแตข| + (1โˆ’ฮฑ)ยทฮฃwแตขยฒ] ฮป = 0 โ†’ no regularisation (plain MSE). ฮป โ†’ โˆž โ†’ all weights forced to zero.

Ridge (L2)

Adds ฮปยทฮฃwแตขยฒ to the loss. Penalises large weights by squaring them. Weights shrink toward zero but never exactly reach zero. Keeps all features โ€” just with smaller influence.

  • Best when all features are genuinely useful
  • Stable โ€” handles correlated features well
  • sklearn: Ridge(alpha=ฮป)

Lasso (L1)

Adds ฮปยทฮฃ|wแตข| to the loss. The absolute value creates corners in the constraint region โ€” the optimal solution tends to land exactly on an axis, driving some weights to exactly zero. Built-in feature selection.

  • Best when many features are irrelevant
  • Produces sparse models
  • sklearn: Lasso(alpha=ฮป)

ElasticNet

Combines L1 and L2 penalties. Gets some feature selection (from L1) with the stability of L2 when features are correlated. Controlled by a mixing ratio ฮฑ.

  • Best when groups of correlated features exist
  • More flexible than pure Ridge or Lasso
  • sklearn: ElasticNet(alpha=ฮป, l1_ratio=ฮฑ)
Why Lasso Creates Sparse Weights (Feature Selection)
Ridge (L2) โ€” Circle Constraint ฮป constraint optimal wโ‚โ‰ 0, wโ‚‚โ‰ 0 wโ‚ wโ‚‚ Optimal lies off-axis โ€” weights shrink, not zeroed Lasso (L1) โ€” Diamond Constraint ฮป constraint optimal wโ‚=0! wโ‚ wโ‚‚ Optimal hits corner โ†’ exact zero โ†’ feature dropped
Choosing Regularisation Strength โ€” bias-variance tradeoff
Regularisation Strength ฮป โ†’ Error 0.001 0.01 0.1 1 10 100 Train error Val error Optimal ฮป โ† ฮป too small Overfitting ฮป too large โ†’ Underfitting
PropertyRidge (L2)Lasso (L1)ElasticNet
Penaltyฮปยทฮฃwแตขยฒฮปยทฮฃ|wแตข|ฮฑยทL1 + (1โˆ’ฮฑ)ยทL2
Feature selectionNo โ€” weights shrink, never zeroYes โ€” exact zeros possiblePartial
OutputDense โ€” all weights non-zeroSparse โ€” many weights = 0Semi-sparse
Best whenAll features relevantMany irrelevant featuresCorrelated + sparse
sklearnRidge(alpha=ฮป)Lasso(alpha=ฮป)ElasticNet(alpha, l1_ratio)

Why does any ML model fail? The expected prediction error can always be decomposed into three components. Understanding them is essential for diagnosing model problems and choosing the right fix.

Bias-Variance Decomposition E[(y โˆ’ ลท)ยฒ] = Biasยฒ(ลท) + Variance(ลท) + ฯƒยฒ Bias = how wrong on average | Variance = sensitivity to training data | ฯƒยฒ = irreducible noise
High Bias (Underfitting)
Model too simple โ€” can't capture the true pattern.

Fix: more features, higher polynomial degree, less regularisation, more complex model.
High Variance (Overfitting)
Model too complex โ€” memorises noise.

Fix: more training data, regularisation (Ridge/Lasso), simpler model, dropout, early stopping.
Bias-Variance Tradeoff โ€” the fundamental tension in ML
Error Components vs Model Complexity Model Complexity โ†’ Error Noise floor ฯƒยฒ Biasยฒ Variance Total Error Sweet Spot โ† Underfitting Overfitting โ†’ Bias vs Variance Low Bias Low Variance โœ“ Low Bias High Variance High Bias Low Variance High Bias High Variance

Regularisation is bias-variance control. Increasing ฮป raises bias (simpler model) and reduces variance (less sensitive to training data). The optimal ฮป sits at the bottom of the U-shaped validation error curve โ€” found by cross-validation, not guesswork.

๐Ÿ“‹ Chapter 3.2 โ€” Key Takeaways

  • Linear regression: ลท = wยทx + b โ€” learn the weight and bias that minimise MSE on training data
  • MSE penalises large errors quadratically; RMSE is in the same units as y โ€” more interpretable for reporting
  • Gradient descent: iteratively nudge w and b in the direction that reduces MSE โ€” the algorithm behind all of deep learning
  • Feature scaling (StandardScaler) is essential: unscaled features create elongated cost bowls that slow convergence dramatically
  • Polynomial features extend linear regression to curves โ€” but higher degree risks overfitting (zero training error, high test error)
  • Ridge (L2) shrinks weights; Lasso (L1) zeros weights (feature selection); ElasticNet combines both
  • Bias-Variance: Error = Biasยฒ + Variance + ฯƒยฒ. Regularisation trades more bias for less variance โ€” find the sweet spot with cross-validation
3.3
Chapter 3.3
Classification โ€” Predicting Categories

Classification is supervised learning for discrete outputs: given an input, which bucket does it belong to? Logistic regression, k-nearest neighbours, and Naive Bayes each answer that question from a completely different angle โ€” probabilistic, geometric, and statistical.

Despite its name, logistic regression is a classifier, not a regressor. It predicts P(y=1 | x) โ€” the probability that an input belongs to class 1. The trick is the sigmoid function, which squashes any real number from โˆ’โˆž to +โˆž into the range (0, 1), making the output interpretable as a probability.

Why not use ordinary linear regression for classification? A linear model can predict values well below 0 or well above 1, which are meaningless as probabilities, and the resulting loss surface is non-convex with MSE โ€” gradient descent is not guaranteed to find the global minimum. The sigmoid + binary cross-entropy combination fixes both problems.

Core Equations โ€” Logistic Regression ฯƒ(z) = 1 / (1 + eโปแถป) Sigmoid function โ€” squashes any real z to the (0, 1) range ลท = ฯƒ(wx + b) Logistic regression model โ€” output is a probability Loss = โˆ’[yยทlog(ลท) + (1โˆ’y)ยทlog(1โˆ’ลท)] Binary Cross-Entropy โ€” convex loss, amenable to gradient descent

Decision rule: predict class 1 if P(y=1|x) โ‰ฅ 0.5, else class 0. The threshold 0.5 is the default but can be tuned โ€” lower it to increase recall (catch more positives), raise it to increase precision (fewer false alarms).

Logistic Regression โ€” sigmoid output and linear decision boundary
SIGMOID FUNCTION 0.5 threshold boundary z = wx + b ฯƒ(z) โˆ’6 0 +6 1.0 0.5 0.0 Predict Class 0 Predict Class 1 DECISION BOUNDARY Feature 1 (hours studied) Feature 2 (tests) wx+b=0 ร— ร— ร— ร— ร— ร— ร— ร— P(pass) < 0.5 P(pass) > 0.5
# Logistic Regression โ€” sklearn walkthrough
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Probabilities (first 5): {y_prob[:5].round(3)}")

Logistic regression produces linear decision boundaries โ€” a straight line in 2D, a hyperplane in higher dimensions. This is fast and interpretable, but it fails when the true boundary between classes is curved or complex.

Three escape routes from linearity: polynomial features (add xยฒ, xยทy, ... as new inputs โ€” same model, richer boundary), the kernel trick (implicitly maps to a very high-dimensional space, used in SVMs), or simply switch to a non-linear model (decision trees, neural networks). The softmax function generalises the sigmoid to K classes โ€” each class gets a score zโ‚–, and softmax converts all K scores into a probability distribution that sums to 1.

Decision Boundary Complexity โ€” linear to non-linear classifiers
LOGISTIC REGRESSION ร— ร— ร— ร— ร— Linear boundary โœ“ POLYNOMIAL FEATURES ร— ร— ร— ร— ร— ร— Circular boundary โœ“ NEURAL NET / RBF KERNEL ร— ร— ร— ร— ร— Complex boundary โœ“

When there are K > 2 classes, three strategies extend binary classification:

1๏ธโƒฃ

One-vs-Rest (OvR)

Train K binary classifiers. Classifier k asks: "is this class k vs everything else?" At prediction, pick the class with the highest confidence score.

  • K models trained
  • Fast, simple
  • Works with any binary classifier
2๏ธโƒฃ

One-vs-One (OvO)

Train K(Kโˆ’1)/2 classifiers โ€” one for every pair of classes. At prediction, majority vote wins. Works well when individual classifiers are fast to train.

  • K(Kโˆ’1)/2 models
  • Better for SVMs
  • Slower for large K
โˆ‘

Softmax (Multinomial)

Single model with K output scores. Softmax converts them to probabilities that sum to 1. The modern default for neural networks and logistic regression.

  • 1 model
  • Probabilistic output
  • End-to-end trainable
Softmax โ€” Multiclass Probability P(y=k | x) = e^(zโ‚–) / ฮฃโฑผ e^(zโฑผ) Output: K probabilities โ€” each in (0,1), all summing to 1
One-vs-Rest โ€” K binary classifiers for K-class problem
CLASSIFIER 1: BLUE vs REST Conf: 0.82 CLASSIFIER 2: GREEN vs REST Conf: 0.45 CLASSIFIER 3: RED vs REST Conf: 0.31 argmax(0.82, 0.45, 0.31) โ†’ Class 1 (Blue)

"You are who your neighbours are." KNN is the simplest meaningful classifier: to classify a new point, find the K nearest training points by distance and take a majority vote.

KNN is a lazy learner โ€” there is no training phase. The algorithm memorises all training data and performs all computation at prediction time. This makes training instant but prediction potentially slow on large datasets (O(nยทd) per query โ€” must compute distance to every training point).

KNN Core Math Euclidean: d(x, y) = โˆšฮฃแตข(xแตข โˆ’ yแตข)ยฒ L2 distance โ€” most common metric; sensitive to scale differences ลท = majority_vote({yโ‚, yโ‚‚, โ€ฆ, yโ‚–}) [K nearest neighbours] Also: Manhattan (L1): ฮฃแตข|xแตขโˆ’yแตข| | Minkowski: (ฮฃแตข|xแตขโˆ’yแตข|แต–)^(1/p)

Choosing K: K=1 memorises training data perfectly (zero training error) but is extremely sensitive to noise and outliers โ€” overfitting. K=n (all points) just predicts the majority class โ€” underfitting. The optimal K is found via cross-validation, typically in the range 3โ€“15. Odd K values avoid ties in binary classification.

KNN with K=1, K=5, K=15 โ€” effect on decision boundary
K=1 โ€” Overfit ร— ร— ร— ร— ร— โ˜… Jagged, noise-sensitive K=5 โ€” Balanced โœ“ ร— ร— ร— ร— ร— โ˜… 3 green, 2 blue โ†’ Class 1 Smooth, generalises well K=15 โ€” Risk Underfit ร— ร— ร— ร— ร— โ˜… Very smooth, may miss patterns
KNN Strengths
KNN Weaknesses
No training time โ€” instant fit
Slow prediction: O(nยทd) per query
Naturally handles multi-class problems
Requires entire training set in memory
Non-linear, complex decision boundaries
Sensitive to irrelevant / noisy features
Simple, transparent, easy to explain
Feature scaling mandatory (distance-based)
Good non-parametric baseline
Curse of dimensionality in high-dim spaces

Naive Bayes is a probabilistic classifier built on Bayes' theorem. The "naive" assumption is that all features are conditionally independent given the class โ€” in practice this is almost never true, yet the algorithm performs remarkably well, especially on text classification tasks.

Bayes' Theorem Applied to Classification P(y | xโ‚,โ€ฆ,xโ‚™) โˆ P(y) ยท ฮ  P(xแตข | y) Prior ร— Product of likelihoods โ€” the naive independence assumption lets us factorise ลท = argmax_y P(y) ยท P(xโ‚|y) ยท P(xโ‚‚|y) ยท โ€ฆ ยท P(xโ‚™|y) Pick the class y that maximises the joint probability

Why does it work despite the false assumption? The discriminative signal โ€” the difference between class probabilities โ€” is often large enough that the independence error doesn't flip the argmax. In high-dimensional sparse data (like text), there's also very little co-occurrence signal to exploit anyway.

Worked example โ€” spam filter: Given 5 training emails below, compute the Naive Bayes decision for the new message "free click prize":

Email"free""click""meeting""report"Label
Email 11100spam
Email 21000spam
Email 30011ham
Email 40001ham
Email 51100spam
# P(spam) = 3/5 = 0.6  |  P(ham) = 2/5 = 0.4
# P(free|spam)=3/3=1.0  P(click|spam)=2/3โ‰ˆ0.67
# P(free|ham)=0/2=0.0   P(click|ham)=0/2=0.0  (add Laplace smoothing: +1)
# With smoothing: P(free|ham)โ‰ˆ0.17  P(click|ham)โ‰ˆ0.17

# Score(spam) = 0.6 ร— 1.0 ร— 0.67 โ‰ˆ 0.40
# Score(ham)  = 0.4 ร— 0.17 ร— 0.17 โ‰ˆ 0.012
prediction = "spam"  # argmax wins
VariantFeature TypeUse CaseExample
Gaussian NBContinuous (normal dist)Iris classification, sensorsHeight/weight data
Multinomial NBCount data (non-negative int)Text classification (TF)Word frequency in emails
Bernoulli NBBinary (0 / 1)Binary feature classificationWord present / absent
Complement NBCount dataImbalanced text classificationLarge corpus, class imbalance
AlgorithmTrainingPredictionInterpretable?Non-linear?Best For
Logistic Regression O(nยทdยทiter) Fast O(d) Yes No (poly features needed) Linearly separable, probability output needed
KNN None (lazy) Slow O(nยทd) Intuitive Yes (local) Small dataset, non-linear baseline
Naive Bayes Very Fast O(nยทd) Very Fast O(Kยทd) Yes Partially (Gaussian) Text classification, high-dim sparse data
Decision Regions โ€” Logistic Regression vs KNN vs Naive Bayes (two-moons dataset)
LOGISTIC REGRESSION ร— ร— ร— ร— ร— Linear โ€” misclassifies โœ— KNN (K=5) ร— ร— ร— ร— ร— Irregular โ€” accurate โœ“ NAIVE BAYES (GAUSSIAN) ร— ร— ร— ร— ร— Elliptical contours

๐Ÿ“‹ Chapter 3.3 โ€” Key Takeaways

  • Logistic Regression outputs P(y=1|x) via the sigmoid function โ€” linear decision boundary, interpretable coefficients, trained with Binary Cross-Entropy
  • The decision threshold (default 0.5) can be tuned โ€” lower for higher recall, raise for higher precision; always use the probability output, not just the class label
  • KNN: no training phase โ€” classify by majority vote of K nearest neighbours. Requires feature scaling; slow at prediction O(nยทd); ideal as a non-linear baseline
  • Naive Bayes: Bayes' theorem + conditional independence assumption โ€” surprisingly accurate despite the naive assumption; very fast; best-in-class for text classification
  • Multiclass strategies: OvR (K classifiers), OvO (K(Kโˆ’1)/2 classifiers), or Softmax (single model, modern default)
  • Linear classifiers fail on non-linear data โ€” use polynomial features, kernel trick, or switch to a non-linear model (decision trees, neural networks)
3.4
Chapter 3.4
Decision Trees & Ensemble Methods

A single decision tree mirrors human reasoning but overfits easily. Ensemble methods โ€” Random Forests and Gradient Boosting โ€” combine hundreds of trees to become the most powerful classical ML algorithms on tabular data.

Decision trees mirror how humans naturally make decisions: a series of if/else questions on features, arriving at a prediction at the leaves. Each internal node tests one feature, each branch is an outcome of that test, and each leaf holds the class label or numeric prediction. The CART algorithm (Classification And Regression Trees) is the sklearn default and can handle both classification and regression, numerical and categorical features, and naturally captures non-linear relationships.

Decision Tree โ€” Iris flower classification (depth 3)
petal length โ‰ค 2.45? samples: 150 ยท gini: 0.667 Root Node True ๐ŸŒธ Setosa samples: 50 gini: 0.0 โœ“ Pure False petal width โ‰ค 1.75? samples: 100 ยท gini: 0.5 Internal Node True ๐ŸŒผ Versicolor samples: 54 gini: 0.168 False ๐ŸŒบ Virginica samples: 46 gini: 0.043 Decision Node Leaf Node Gini=0 โ†’ perfectly pure leaf
# Decision Tree โ€” sklearn
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
tree.fit(X, y)

# Print tree structure as text
print(export_text(tree, feature_names=list(iris.feature_names)))

# Feature importances
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")

At each node, CART exhaustively searches all features and all thresholds to find the split that maximises purity in the child nodes. Three measures of impurity are used in practice:

Gini Impurity

CART default for classification. Probability of misclassifying a random sample. Range: 0 (pure) โ†’ 0.5 (balanced binary).

Gini = 1 โˆ’ ฮฃ pแตขยฒ

Entropy / Info Gain

Used by ID3 and C4.5 algorithms. Measures disorder. Information Gain = parent entropy โˆ’ weighted child entropy.

H = โˆ’ฮฃ pแตข logโ‚‚(pแตข)

Variance Reduction

For regression trees. Choose the split that minimises the weighted variance of child nodes โ€” equivalent to minimising MSE.

min ฮฃ wโ‚– ยท Var(yโ‚–)
Gini vs Entropy โ€” both peak at maximum class imbalance
0 0.5 1.0 Fraction of class 1 (pโ‚) 0 0.25 0.5 1.0 Impurity max impurity Gini impurity Entropy Misclassification

Worked example โ€” which split wins? Parent node: 30 samples (20 blue, 10 red). Gini = 1 โˆ’ (20/30)ยฒ โˆ’ (10/30)ยฒ = 0.444.

SplitLeft childRight childWeighted GiniResult
Split A 15 blue, 5 red โ†’ Gini = 0.375 5 blue, 5 red โ†’ Gini = 0.500 0.417 Gain = 0.027
Split B 19 blue, 1 red โ†’ Gini = 0.099 1 blue, 9 red โ†’ Gini = 0.180 0.120 Gain = 0.324 โœ“ Winner

An unconstrained decision tree grows until every leaf contains a single sample โ€” perfect training accuracy, terrible test accuracy. This is the classic overfit problem. Two families of solutions exist: pre-pruning (stop growing early) and post-pruning (grow full, then cut).

Pre-Pruning Hyperparameters

  • max_depth โ€” maximum depth of tree
  • min_samples_split โ€” min samples to split a node
  • min_samples_leaf โ€” min samples in any leaf
  • max_features โ€” max features considered per split
  • min_impurity_decrease โ€” min gain required to split

Post-Pruning (Cost Complexity)

  • Grow full tree to max depth
  • Calculate cost-complexity criterion for subtrees
  • Remove branches with lowest improvement-to-size ratio
  • Select best pruned tree via cross-validation
  • sklearn: ccp_alpha parameter controls pruning strength
Tree Depth vs Overfitting โ€” hyperparameter max_depth is critical
MAX_DEPTH=1 โ€” Underfit ร— ร— ร— ร— ร— ร— Many errors โœ— MAX_DEPTH=3 โ€” Good Fit โœ“ ร— ร— ร— ร— ร— Well generalised โœ“ MAX_DEPTH=โˆž โ€” Overfit ร— ร— ร— ร— ร— Memorised โ€” fails on new data

A single decision tree has high variance โ€” small perturbations to the training data produce completely different trees. Bagging (Bootstrap Aggregating) solves this by training N trees on N bootstrapped datasets (sampled with replacement) and averaging their predictions. The individual errors are largely independent, so they cancel out.

Random Forest = Bagging + random feature subset at each split. The critical insight: if you use all features at each node, every tree will pick the same dominant feature at the root, making all trees highly correlated โ€” they'd all make the same mistakes. By restricting each node to a random subset of โˆšd features, the trees become decorrelated, and averaging truly independent errors dramatically reduces variance. About 37% of training samples are never used in each tree โ€” this out-of-bag (OOB) set is a free validation estimate.

Random Forest โ€” bagging + random feature subsets = decorrelated ensemble
Training Data n samples ยท d features sample w/ replacement sample w/ replacement Bootstrap Sample 1 ~63% unique rows Bootstrap Sample 2 ~63% unique rows Bootstrap Sample 3 ~63% unique rows ๐ŸŒฒ Tree 1 โˆšd features / split ๐ŸŒฒ Tree 2 โˆšd features / split ๐ŸŒฒ Tree 3 โˆšd features / split Majority Vote / Average โ€” Reduces Variance Final Prediction

Random Forest Strengths

  • Handles high-dimensional data well
  • Built-in feature importance scores
  • Robust to outliers and missing values
  • OOB error โ€” free validation estimate
  • Fully parallelisable โ€” fast training
  • One of the best off-the-shelf algorithms

Random Forest Weaknesses

  • Less interpretable than a single tree
  • Slow prediction for very large forests
  • Memory intensive โ€” stores all trees
  • Poor extrapolation beyond training range
  • Not ideal for very high-cardinality categoricals

Gradient Boosting is fundamentally different from bagging. Trees are built sequentially โ€” each new tree specifically targets the mistakes of the current ensemble. The key insight (Friedman, 1999): fit each new tree to the negative gradient of the loss function with respect to the current predictions. This is gradient descent โ€” but in function space rather than parameter space.

The learning rate (shrinkage) controls how much each tree contributes: F(x) += learning_rate ร— T(x). A small learning rate requires more trees but generalises better. AdaBoost (the precursor) reweighted misclassified examples; modern Gradient Boosting is more general and works with any differentiable loss function.

Gradient Boosting โ€” Prediction Update Rule Fโ‚˜(x) = Fโ‚˜โ‚‹โ‚(x) + ฮฑ ยท Tโ‚˜(x) Fโ‚˜ = ensemble after m trees | ฮฑ = learning rate | Tโ‚˜ = new tree fit to residuals of Fโ‚˜โ‚‹โ‚ Residual = โˆ’โˆ‚L(y, Fโ‚˜โ‚‹โ‚(x)) / โˆ‚Fโ‚˜โ‚‹โ‚(x) Negative gradient of the loss โ€” the "direction of steepest descent" in function space
Gradient Boosting โ€” sequential error correction
Fโ‚€(x) mean(y) ๐ŸŒฒ Tree 1 fits residuals of Fโ‚€ residuals (large) Fโ‚(x) Fโ‚€ + ฮฑยทTโ‚ ๐ŸŒฒ Tree 2 fits residuals of Fโ‚ residuals (smaller) Fโ‚‚(x) Fโ‚ + ฮฑยทTโ‚‚ ยทยทยท Each tree corrects previous mistakes โ€” residuals shrink with each iteration
Random Forest / Bagging
Gradient Boosting
Trees trained in parallel โ€” independent
Trees trained sequentially โ€” each corrects previous
Primarily reduces variance
Reduces both bias and variance
Robust โ€” hard to overfit
Can overfit if learning_rate too high or too many trees
Fast training (parallelisable)
Slower (sequential) โ€” but XGBoost/LightGBM are optimised
Great off-the-shelf baseline
Higher ceiling โ€” Kaggle competition winner

Vanilla gradient boosting is slow โ€” it re-evaluates every possible split at every node. The three production-grade libraries solve this with fundamentally different approaches, making gradient boosting practical on millions of rows.

โšก

XGBoost (2016)

Chen & Guestrin. Regularised gradient boosting, second-order gradients (Newton step), hardware-optimised cache-aware computation. The competition standard for years.

  • Best general-purpose default
  • Excellent documentation
  • Wide ecosystem support
๐Ÿš€

LightGBM (2017)

Microsoft. Histogram-based splits (buckets values โ†’ much faster), leaf-wise tree growth (vs level-wise), GOSS & EFB sampling. 10ร— faster than XGBoost on large data.

  • Large datasets (>100K rows)
  • High-cardinality features
  • Speed matters
๐ŸŽฏ

CatBoost (2017)

Yandex. Native categorical feature handling (no manual encoding), ordered boosting to prevent target leakage, symmetric (oblivious) trees for fast prediction.

  • Many categorical features
  • Minimal tuning needed
  • Avoids target leakage
FeatureXGBoostLightGBMCatBoost
SpeedGoodFastestGood
Memory usageHighLowMedium
CategoricalsManual encodingPartial supportNative support
Tuning effortModerateModerateMinimal
Tree growthLevel-wiseLeaf-wiseOblivious trees
Best forGeneral tabularLarge-scale, fastMany categoricals
# XGBoost โ€” binary classification example
import xgboost as xgb
from sklearn.datasets   import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics    import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=200,          # number of trees
    learning_rate=0.1,        # shrinkage โ€” prevents overfitting
    max_depth=4,               # tree depth
    subsample=0.8,             # row sampling per tree
    colsample_bytree=0.8,      # feature sampling per tree
    eval_metric='logloss',
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=20,  # stop if no improvement for 20 rounds
    verbose=False,
)

y_prob = model.predict_proba(X_test)[:, 1]   # probability of positive class
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
Ensemble Methods โ€” Performance vs Complexity Spectrum
Relative Performance on Tabular Data โ†’ 0 100 Decision Tree High variance, interpretable Bagged Trees Variance reduced Random Forest Best off-the-shelf baseline Gradient Boosting Highly competitive XGBoost / LightGBM โ˜… SOTA

If you have tabular data, try Random Forest first as your baseline. If you need the best possible accuracy, use XGBoost or LightGBM with cross-validated hyperparameter tuning. These two algorithms have won more Kaggle competitions than any other approach combined.

๐Ÿ“‹ Chapter 3.4 โ€” Key Takeaways

  • Decision tree: recursive binary splits using Gini or Entropy โ€” interpretable but prone to high-variance overfitting without pruning
  • Bagging: train N trees on bootstrapped datasets and aggregate to reduce variance โ€” errors average out when trees are independent
  • Random Forest: bagging + random feature subsets at each node = decorrelated trees = one of the strongest off-the-shelf algorithms
  • Gradient Boosting: sequential error correction โ€” each new tree fits the negative gradient of the loss; controlled by learning rate (shrinkage)
  • XGBoost / LightGBM / CatBoost: production-grade gradient boosting โ€” dominant on structured/tabular data for a decade
  • Ensemble principle: combining many diverse, weak learners consistently outperforms any single strong learner
3.5
Chapter 3.5
Support Vector Machines

SVMs answer a deceptively simple question: of all the hyperplanes that separate two classes, which one generalises best? The answer โ€” the maximum margin hyperplane โ€” leads to one of the most elegant and mathematically rigorous algorithms in machine learning.

Many hyperplanes can separate two linearly-separable classes. SVMs choose the one that maximises the margin โ€” the perpendicular distance between the boundary and the nearest training points on each side. A wider margin means more room for error on unseen data, giving better generalisation.

The decision boundary is the hyperplane wยทx + b = 0. The two margin planes are wยทx + b = +1 and wยทx + b = โˆ’1. The margin width is 2/||w||, so maximising the margin is equivalent to minimising ||w||ยฒ. This is a constrained quadratic optimisation problem โ€” convex, so it has a unique global solution.

SVM โ€” Hard Margin Optimisation Objective: minimise ยฝ||w||ยฒ Equivalent to maximising the margin 2/||w|| Subject to: yแตข(wยทxแตข + b) โ‰ฅ 1 for all i Every point must be on the correct side of its margin plane Decision: ลท = sign(wยทx + b)
SVM โ€” Maximum Margin Hyperplane and Support Vectors
wยทx + b = 0 wยทx + b = +1 wยทx + b = -1 2/||w|| SV ร— ร— ร— ร— ร— ร— ร— ร— ร— SV ร— Class +1 Class โˆ’1 Only support vectors determine the boundary

Support vectors are the training points that lie exactly on the margin boundaries (wยทx + b = ยฑ1). They are the only points that matter: if you removed every non-support vector from the training set and retrained, you'd get the exact same model. This is a profound property โ€” the decision boundary is defined by a sparse subset of the data.

What they are

Training points on or inside the margin boundary. Typically just a small fraction of all training points โ€” often fewer than 10%.

Why they matter

They are the only points that influence the boundary. Non-support vectors can be moved or removed without changing the model at all.

Model complexity

Fewer support vectors โ†’ simpler, more generalisable model. Many support vectors โ†’ complex boundary, may overfit. Good diagnostic metric.

The SVM optimisation is solved in its dual formulation using Lagrange multipliers โ€” one multiplier ฮฑแตข per training point. Non-support vectors have ฮฑแตข = 0, so they contribute nothing. The prediction is: ลท = sign(ฮฃ ฮฑแตข yแตข K(xแตข, x) + b), summing only over support vectors.

The hard-margin SVM requires perfect linear separability โ€” it fails completely if even one point is on the wrong side. Real data is always noisy and often overlapping. Soft-margin SVM introduces slack variables ฮพแตข โ‰ฅ 0 that allow points to violate the margin. The C hyperparameter controls how heavily violations are penalised.

Soft Margin Objective minimise: ยฝ||w||ยฒ + C ยท ฮฃ ฮพแตข ฮพแตข = slack variable: 0 if correctly classified outside margin, >0 if inside or wrong side High C โ†’ penalise violations heavily โ†’ small margin, few errors โ†’ overfit risk Low C โ†’ tolerate violations โ†’ wide margin, more errors โ†’ underfit risk
SVM C Parameter โ€” bias-variance tradeoff via margin width
C=0.01 โ€” Wide Margin ร— ร— ร— ร— ร— Tolerates violations C=1.0 โ€” Balanced โœ“ ร— ร— ร— ร— Balanced margin C=100 โ€” Narrow Margin ร— ร— ร— ร— ร— Avoids violations, narrow

A linear SVM can only draw straight boundaries. But many real datasets are not linearly separable in their original feature space. One solution is to manually engineer polynomial or interaction features โ€” but for high-dimensional data this is computationally prohibitive. The kernel trick solves this elegantly: it allows the SVM to operate in an implicitly high-dimensional feature space without ever computing the transformation explicitly.

The key insight: the SVM dual formulation only requires computing dot products between feature vectors โ€” never the vectors themselves. A kernel function K(xแตข, xโฑผ) computes the dot product that would result from a high-dimensional mapping ฯ†, without actually applying ฯ†. For the RBF kernel, this implicit feature space is infinite-dimensional.

The Kernel Identity K(xแตข, xโฑผ) = ฯ†(xแตข) ยท ฯ†(xโฑผ) Kernel = dot product in high-D space โ€” computed without ever computing ฯ†(x) RBF: K(x, z) = exp(โˆ’ฮณ ||x โˆ’ z||ยฒ) Points close together โ†’ Kโ‰ˆ1 (similar) | Far apart โ†’ Kโ‰ˆ0 (dissimilar) High ฮณ = narrow influence = complex boundary | Low ฮณ = wide influence = smooth boundary
The Kernel Trick โ€” separate non-linear data via implicit high-D mapping
ORIGINAL 2D โ€” NOT SEPARABLE ร— ร— ร— ร— ร— ร— ร— ร— No linear boundary works โ†’ kernel ฯ†(x) FEATURE SPACE โ€” SEPARABLE separating hyperplane ร— ร— ร— ร— ร— ร— Kernel evaluates this implicitly
SVM Kernels โ€” Linear, Poly(2), Poly(5), RBF on circular data
LINEAR โ€” FAILS ร— ร— ร— ร— ร— Many errors POLY(d=2) ร— ร— ร— ร— ร— Partial success POLY(d=5) ร— ร— ร— ร— ร— Better fit RBF โ€” NEAR PERFECT โœ“ ร— ร— ร— ร— ร— ร— Circular boundary โœ“
KernelFormulaParametersBest ForLimitation
Linearx ยท zNoneLinearly separable, high-d text/NLPFails on non-linear data
Polynomial(x ยท z + c)^dd, cFeature interactions (image data)Sensitive to degree d choice
RBF / Gaussianexp(โˆ’ฮณ||xโˆ’z||ยฒ)ฮณ, CGeneral purpose โ€” most usedSlow on very large datasets
Sigmoidtanh(ฮณxยทz + c)ฮณ, cNeural network approximationRarely the best choice
# SVM with RBF kernel + GridSearch
  from sklearn.svm            import SVC
  from sklearn.preprocessing  import StandardScaler
  from sklearn.pipeline       import Pipeline
  from sklearn.model_selection import GridSearchCV

  # SVM requires feature scaling โ€” always use a Pipeline!
  svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(kernel='rbf', probability=True)),
  ])

  param_grid = {
    'svm__C':     [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
  }

  grid_search = GridSearchCV(
    svm_pipeline, param_grid,
    cv=5, scoring='accuracy', n_jobs=-1
  )
  grid_search.fit(X_train, y_train)

  print(f"Best params:   {grid_search.best_params_}")
  print(f"Best CV score: {grid_search.best_score_:.4f}")

Support Vector Regression (SVR) extends the SVM idea to continuous outputs. Instead of maximising a margin between classes, SVR fits a tube of width 2ฮต around the predictions. Points inside the tube incur zero loss โ€” the model doesn't care about small errors. Only points outside the tube (the support vectors) contribute to the optimisation. This gives SVR a natural insensitivity to small noise.

SVR inherits all the power of kernels โ€” with an RBF kernel, SVR can fit complex non-linear regression surfaces while remaining robust to outliers (thanks to the ฮต-insensitive loss). The tradeoff: SVR can be slow on large datasets (O(nยฒ) to O(nยณ) training) and requires careful tuning of ฮต, C, and the kernel parameters.

SVR โ€” ฮต-Insensitive Loss L(y, ลท) = max(0, |y โˆ’ ลท| โˆ’ ฮต) Zero loss inside the ฮต-tube | Linear loss outside โ€” robust to outliers Objective: minimise ยฝ||w||ยฒ + C ยท ฮฃ (ฮพแตข + ฮพแตข*) ฮพแตข, ฮพแตข* = slack above/below tube | C = penalty for tube violations
SVR Strengths
SVR Weaknesses
Robust to outliers (ฮต-insensitive loss)
Slow training O(nยฒ to nยณ)
Works with any kernel โ€” handles non-linear regression
Doesn't scale to large datasets (>50K rows)
Sparse solution โ€” only support vectors stored
Requires careful tuning of ฮต, C, ฮณ
Well-founded theoretical guarantees
Feature scaling mandatory

๐Ÿ“‹ Chapter 3.5 โ€” Key Takeaways

  • SVM finds the maximum margin hyperplane โ€” widest gap between classes = maximum robustness to noise
  • Only support vectors (points on/inside the margin) determine the boundary โ€” all other training points are irrelevant
  • Soft margin C: high C = small margin, strict (overfit risk); low C = wide margin, tolerant (underfit risk)
  • Kernel trick: evaluate dot products in high-dimensional space without explicit transformation โ€” makes non-linear SVMs tractable
  • RBF kernel is the most general โ€” controlled by ฮณ (boundary complexity) and C (margin strictness) โ€” tune both via cross-validation
  • SVMs require feature scaling โ€” always wrap in a Pipeline with StandardScaler
3.6
Chapter 3.6
Clustering & Unsupervised Learning

Unsupervised learning is the art of finding structure without labels. The data speaks for itself โ€” the algorithm must discover what groups, components, or manifolds naturally exist. Clustering and dimensionality reduction are the two pillars.

K-Means partitions n points into K clusters by minimising within-cluster variance. The Lloyd algorithm iterates two steps โ€” assignment and update โ€” until centroids stop moving.

โ‘  InitRandom K centroids
โ‘ก AssignEach point โ†’ nearest centroid
โ‘ข UpdateCentroid = cluster mean
โ‘ฃ RepeatUntil convergence
K-Means Objective โ€” Minimise Inertia J = ฮฃโ‚– ฮฃโ‚“โˆˆCโ‚– ||x โˆ’ ฮผโ‚–||ยฒ ฮผโ‚– = centroid of cluster k  |  Cโ‚– = set of points assigned to cluster k Assignment: cแตข = argminโ‚– ||xแตข โˆ’ ฮผโ‚–||ยฒ Update: ฮผโ‚– = (1/|Cโ‚–|) ฮฃแตขโˆˆCโ‚– xแตข
K-Means Lloyd Algorithm โ€” 4-Step Convergence
Iteration 0 Iteration 1 Iteration 2 Converged โœ“ โ˜… โ˜… โ˜… Initialize: random centroids โ˜… โ˜… โ˜… Assign: color by nearest centroid โ˜… โ˜… โ˜… Update: move centroids to means โ˜… โ˜… โ˜… Converged: stable assignment
โš ๏ธ

Choosing K

  • Elbow method โ€” plot inertia vs K; look for the kink
  • Silhouette score โ€” measures separation quality (higher = better)
  • Gap statistic โ€” compare inertia to random null reference
๐ŸŽฒ

Initialisation Sensitivity

  • Random init can converge to suboptimal local minima
  • K-Means++: spread initial centroids far apart โ€” provably โ‰ค O(log K) of optimal
  • Run multiple times with different seeds; pick lowest inertia
๐Ÿ”ต

Spherical Cluster Assumption

  • Assumes clusters are roughly spherical and equal-size
  • Fails on elongated, ring-shaped, or crescent clusters
  • Use DBSCAN or GMM for arbitrary shapes
๐Ÿ“

Outlier Sensitivity

  • A single outlier can drag a centroid far from the true cluster center
  • K-Medoids (PAM) uses actual data points as centroids โ€” more robust
  • Pre-filter outliers or use robust clustering
K-Means Failure Modes โ€” Shape, Density, Outlier Sensitivity
Non-spherical: fails Crescent shapes โ€” wrong split Different densities: fails Splits large cluster โ€” missed tight one Outliers: distorted centroids โ˜… โ˜… Outliers (โœ•) pull centroid off-center
Choosing K โ€” Elbow Method and Silhouette Score
Elbow Method Number of clusters K Inertia K=3 elbow 1 2 3 4 5 6 7 8 9 10 Silhouette Score K Silhouette Best 2 3 4 5 6 7 8 9 10

Density-Based Spatial Clustering of Applications with Noise. Clusters are dense regions of points separated by low-density space. Unlike K-Means, DBSCAN does not require specifying K โ€” it discovers the number of clusters from the data.

โฌค

Core Point

Has โ‰ฅ min_samples neighbours within radius ฮต. Anchor of a cluster.

โ—‰

Border Point

Within ฮต of a core point, but has fewer than min_samples neighbours itself.

โœ•

Noise / Outlier

Not a core point and not within ฮต of any core point. Labelled โˆ’1.

DBSCAN Hyperparameters ฮต (eps) โ€” neighbourhood radius Too small โ†’ almost everything is noise | Too large โ†’ everything merges into one cluster min_samples โ€” minimum neighbours to be a core point Rule of thumb: min_samples โ‰ฅ D+1 (D = dimensions) | Use k-distance plot to choose ฮต
DBSCAN โ€” Core, Border, and Noise Points
C Core point โ‰ฅ min_samples within ฮต ฮต Border point Within ฮต of core, < min_samples Noise / Outlier (label = โˆ’1) Outside all ฮต-neighbourhoods Cluster 1 Cluster 2 Noise
K-Means vs DBSCAN โ€” DBSCAN Handles Arbitrary Shapes
K-Means (wrong) DBSCAN (correct) Straight boundary cuts through crescents Follows density โ€” correct crescent clusters

Hierarchical clustering builds a full tree (dendrogram) of cluster merges. No need to specify K upfront โ€” pick any cut height and read off the clusters.

Agglomerative (bottom-up) โ†‘
Divisive (top-down) โ†“
Start: every point is its own cluster
Start: one cluster containing all points
Repeatedly merge the two closest clusters
Recursively split into sub-clusters
Common: Ward, Complete, Average, Single linkage
Less common; computationally expensive

Linkage criteria โ€” how "distance between clusters" is measured:

Single

Min distance any two points across clusters. Tends to chain (elongated clusters).

Complete

Max distance. Produces more compact, balanced clusters.

Average

Average pairwise distance. Trade-off between single and complete.

Ward โ˜…

Minimise increase in total within-cluster variance. Best general-purpose choice.

Dendrogram โ€” Hierarchical Clustering / Cut to Get K Clusters
0 0.5 1.0 1.5 2.0 2.8 Merge distance (height) A B C D E F G Cut โ†’ 2 clusters Cut โ†’ 3 clusters [A B C D] [E F G]

PCA reduces dimensionality by finding the directions (principal components) of maximum variance. Each PC is orthogonal to all others โ€” no redundancy. The technique is the eigendecomposition of the covariance matrix (equivalently, SVD of the data matrix).

โ‘  CenterSubtract mean from each feature
โ‘ก Covarianceฮฃ = (1/n) Xแต€X
โ‘ข EigenvectorsSolve ฮฃv = ฮปv
โ‘ฃ SortLargest ฮป = most variance
โ‘ค ProjectX_reduced = X ยท V_k
PCA โ€” Key Equations Covariance matrix: ฮฃ = (1/n) Xแต€X X is mean-centered | ฮฃ is symmetric positive semi-definite Eigendecomposition: ฮฃ = V ฮ› Vแต€ V = matrix of eigenvectors (principal components) | ฮ› = diagonal matrix of eigenvalues Explained variance ratio: ฮปโ‚– / ฮฃโฑผ ฮปโฑผ Choose k PCs so that cumulative ratio โ‰ฅ 90โ€“95%
PCA โ€” Finding Directions of Maximum Variance
Original 2D Data + Principal Components PC1 ฮปโ‚=2.8 (85%) PC2 ฮปโ‚‚=0.4 (15%) After PCA: 2D โ†’ 1D (PC1 only) PC1 axis PC1 85% PC2 15% Reduce 2Dโ†’1D, retain 85% variance
Scree Plot โ€” How Many Principal Components to Keep
Principal Component Explained Variance % 0 10 20 30 40 50 1 2 3 4 5 6 7 8 9 10 95% variance at PC=3 cumulative

When to use PCA: high-dimensional data (>50 features), visualisation (reduce to 2D/3D), noise reduction, or to remove multicollinearity before regression. Always scale features first (StandardScaler) โ€” PCA is variance-based and sensitive to scale.

PCA is linear โ€” it cannot preserve complex non-linear structure. t-SNE and UMAP are non-linear manifold methods that excel at visual exploration of high-dimensional data.

๐ŸŒ€

t-SNE

  • Models pairwise similarities as probabilities in high-D
  • Maps to 2D/3D so similar points stay together, dissimilar push apart
  • Uses t-distribution in 2D to avoid the crowding problem
  • Perplexity hyperparameter: effective neighbourhood size (5โ€“50)
  • Stochastic โ€” different runs give different layouts
  • O(nยฒ) โ€” slow on large datasets
โšก

UMAP

  • Topological approach โ€” models data on a Riemannian manifold
  • Faster than t-SNE (sub-quadratic)
  • Better preserves global structure alongside local
  • Can be used for dimensionality reduction (not just visualisation)
  • n_neighbors: controls local vs global structure trade-off
  • More stable across runs than t-SNE
PCA vs t-SNE on High-Dimensional Digit Data
PCA โ€” Linear Projection t-SNE โ€” Non-Linear Manifold Classes heavily overlap โ€” poor separation Clear cluster separation โ€” local structure preserved
PCA
t-SNE / UMAP
Linear projection
Non-linear manifold embedding
Fast: O(ndยฒ) where d = dimensions
Slow: O(nยฒ) for t-SNE; faster for UMAP
Preserves global variance structure
Preserves local neighbourhood structure
Can be used for downstream ML features
t-SNE: visualisation only โ€” axes meaningless
Deterministic โ€” same result every run
Stochastic โ€” results vary per run (t-SNE)

๐Ÿ“‹ Chapter 3.6 โ€” Key Takeaways

  • K-Means: assign each point to its nearest centroid, recompute centroids โ€” fast but assumes spherical, equal-size clusters
  • DBSCAN: discovers clusters as dense regions separated by low-density space โ€” no K required, naturally handles noise and outliers
  • Hierarchical clustering: builds a dendrogram showing the full merge history โ€” cut at any height to get the desired number of clusters
  • PCA: eigendecomposition of the covariance matrix โ†’ linear projection retaining maximum variance โ€” always scale features first
  • t-SNE / UMAP: non-linear 2D/3D visualisation โ€” use for exploration and intuition, not as features for downstream ML
  • Rule of thumb: use PCA for feature compression before ML; use t-SNE or UMAP for visualisation only
3.7
Chapter 3.7
Model Evaluation, Metrics & Validation

A model is only as good as how you measure it. The wrong metric gives you false confidence; the right metric reveals exactly where your model fails. Evaluation is not an afterthought โ€” it is built into every design decision.

The confusion matrix is the foundation of all classification metrics. For binary classification it is a 2ร—2 table of outcomes that decomposes prediction errors into two fundamentally different types โ€” false positives and false negatives โ€” which carry very different costs depending on the domain.

โœ…

True Positive (TP)

Predicted positive, actually positive. Correct detection.

โœ…

True Negative (TN)

Predicted negative, actually negative. Correct rejection.

โš ๏ธ

False Positive (FP) โ€” Type I Error

Predicted positive, actually negative. "False alarm." E.g. healthy patient flagged โ€” bad but manageable.

๐Ÿšจ

False Negative (FN) โ€” Type II Error

Predicted negative, actually positive. "Missed case." E.g. cancer patient cleared โ€” potentially fatal.

Confusion Matrix โ€” the foundation of all classification metrics
Predicted NEGATIVE Predicted POSITIVE Actual POSITIVE Actual NEGATIVE FN False Negative Missed! Actual=1, Pred=0 Type II Error ๐Ÿšจ 10 TP True Positive Correct! Actual=1, Pred=1 โœ… Hit 90 TN True Negative Correct! Actual=0, Pred=0 โœ… Correct Rejection 850 FP False Positive False alarm! Actual=0, Pred=1 Type I Error โš ๏ธ 50 Recall = 90/(90+10) = 0.90 Precision = 90/(90+50) = 0.643 Example: 100 cancer patients, 900 healthy
๐ŸŽฏ

Accuracy

(TP + TN) / N

Fraction of all predictions that are correct. Useless on imbalanced data. Predicting "no anthrax" for every email gives 99.99% accuracy โ€” meaningless.

๐Ÿ”

Precision

TP / (TP + FP)

Of all predicted positives, how many are actually positive? Use when false positives are costly โ€” spam filter, treatment recommendation.

๐Ÿ”ญ

Recall (Sensitivity)

TP / (TP + FN)

Of all actual positives, how many did we find? Use when false negatives are costly โ€” cancer screening, fraud detection, security alerts.

โš–๏ธ

F1 Score

2ยทPยทR / (P + R)

Harmonic mean of precision and recall. Use when you care about both equally. Fฮฒ: ฮฒ>1 weighs recall more; ฮฒ<1 weighs precision more.

Classification Metrics โ€” Reference Accuracy = (TP + TN) / N Precision = TP / (TP + FP) Recall = TP / (TP + FN) F1 = 2 ยท Precision ยท Recall / (Precision + Recall) Specificity = TN / (TN + FP) โ† True Negative Rate N = total samples | All metrics in range [0, 1] unless stated otherwise
Precision-Recall Curve โ€” choose operating point based on cost tradeoff
Recall โ†’ Precision โ†‘ 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.5 0.8 1.0 F1=0.4 F1=0.6 F1=0.8 threshold=0.5 High precision, low recall (conservative) Low precision, high recall PR Curve F1 iso-lines Op. point
MetricFormulaUse WhenExample
Accuracy(TP+TN)/NBalanced classesHandwriting recognition
PrecisionTP/(TP+FP)FP costlySpam filter, treatment recommendation
RecallTP/(TP+FN)FN costlyCancer screening, fraud detection, security
F1 Score2PR/(P+R)Balance P and RInformation retrieval, general NLP
AUC-ROCArea under ROCImbalanced, rankingCredit scoring, medical diagnosis
MCCMatthews Corr.Highly imbalancedGenomics, rare event detection

The ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) against False Positive Rate at every possible decision threshold. AUC (Area Under the Curve) collapses this to a single number: the probability that the model ranks a random positive example higher than a random negative example.

AUC = 0.5

Random guessing. Model has zero discrimination ability. Diagonal line.

AUC = 0.7โ€“0.9

Good to excellent discrimination. Acceptable for many real-world problems.

AUC = 1.0

Perfect classifier. L-shaped ROC curve hugging top-left corner.

ROC Curves โ€” Comparing 4 Classifiers (AUC: 0.50 โ†’ 0.96)
False Positive Rate (FPR) โ†’ True Positive Rate (TPR / Recall) โ†‘ 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 Better models hug the top-left corner โ†’ Random AUC=0.50 LogReg AUC=0.78 RandForest 0.91 XGBoost AUC=0.96 Perfect AUC=1.00
ROC & AUC โ€” Key Definitions TPR (Recall) = TP / (TP + FN) โ† y-axis of ROC FPR = FP / (FP + TN) โ† x-axis of ROC (= 1 โˆ’ Specificity) AUC = P(score(positive) > score(negative)) AUC is threshold-independent โ€” evaluates model quality across ALL decision thresholds Prefer AUC over accuracy for imbalanced classification problems

For regression problems, evaluation metrics measure how far predictions deviate from true values. The choice of metric determines what kind of errors your model is penalised for โ€” which in turn shapes training and interpretation.

Regression Metrics โ€” Reference MAE = (1/n) ฮฃ|yแตข โˆ’ ลทแตข| Mean Absolute Error โ€” same units as y โ€” robust to outliers MSE = (1/n) ฮฃ(yแตข โˆ’ ลทแตข)ยฒ Mean Squared Error โ€” units are yยฒ โ€” penalises large errors heavily RMSE = โˆšMSE Root MSE โ€” same units as y โ€” most commonly reported Rยฒ = 1 โˆ’ SS_res / SS_tot Coefficient of determination โ€” proportion of variance explained | 1.0 = perfect | 0 = predicting the mean MAPE = (1/n) ฮฃ|yแตข โˆ’ ลทแตข| / yแตข ยท 100% Mean Absolute Percentage Error โ€” interpretable % โ€” undefined when yแตข = 0
MetricRangeUnitsOutlier SensitivityBest For
MAE[0, โˆž)Same as yRobustWhen outliers are real and shouldn't dominate
MSE[0, โˆž)yยฒSensitiveTraining objective โ€” differentiable everywhere
RMSE[0, โˆž)Same as yModerateReporting โ€” interpretable, same unit as target
Rยฒ(-โˆž, 1]UnitlessModerateComparing models on the same dataset
MAPE[0%, โˆž)PercentageSensitive (small y)Business reporting, relative error

Rยฒ < 0 is possible โ€” it means your model is worse than simply predicting the mean. A model with Rยฒ = 0.85 explains 85% of the variance in the target. Always pair Rยฒ with a residual plot to check for systematic bias.

A single train/test split gives a high-variance performance estimate โ€” one unlucky split can make a great model look bad. Cross-validation rotates the validation set across the full dataset, giving a robust, low-variance estimate of generalisation performance.

5-Fold Cross-Validation โ€” rotate validation fold to use all data
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Validation Training Training Training Training Score: 0.87 Training Validation Training Training Training Score: 0.89 Training Training Validation Training Training Score: 0.85 Training Training Training Validation Training Score: 0.91 Training Training Training Training Validation Score: 0.88 CV Score = mean(0.87, 0.89, 0.85, 0.91, 0.88) = 0.88 ยฑ 0.02
MethodDescriptionFoldsData EfficiencyUse When
Hold-outSingle 80/20 split1LowLarge dataset (>50K), fast iteration
K-FoldRotate K foldsK (5โ€“10)HighMedium dataset, standard practice
Stratified K-FoldMaintain class ratio per foldKHighImbalanced classification
Time-Series SplitRespect temporal order, no future leakageKMediumTime-series data
LOOCVEach sample is its own fold (K=n)nMaximumVery small dataset (<100 samples)

Data leakage is the most common and dangerous ML mistake in production. It occurs when information from outside the training set influences the model โ€” artificially inflating validation performance so the model appears to work when it does not.

๐ŸŽฏ

Target Leakage

A feature is a consequence of the target, not a cause. E.g., using "days hospitalised" to predict "admitted to hospital" โ€” the feature only exists after the event.

๐Ÿ”€

Train-Test Contamination

Preprocessing (scaling, imputation, encoding) fitted on the full dataset before splitting. The scaler has seen test data โ€” leakage!

โฑ๏ธ

Temporal Leakage

Future data used to predict past events in time-series. Training on data from 2024 to predict 2023 outcomes.

๐Ÿ“‹

Duplicate Leakage

Same sample appears in both train and test due to duplicates in the raw dataset. Model memorises rather than generalises.

Data Leakage โ€” always split before fitting any preprocessor
โŒ WRONG โ€” Leakage Full Dataset Fit Scaler on all data โš ๏ธ Train/Test Split Train Model Scaler fitted on test data โ€” leakage! Validation score is inflated. โœ… CORRECT โ€” Pipeline Train/Test Split First Training Data only Fit Scaler on train only โœ“ Train model on scaled train โœ“ sklearn Pipeline enforces this automatically โ€” fit on train, transform both.

Data leakage is why models perform brilliantly in development and catastrophically in production. The number one rule: split your data first. Fit ALL preprocessing โ€” scalers, encoders, imputers โ€” ONLY on training data. Use sklearn's Pipeline to enforce this automatically.

๐Ÿ“‹ Chapter 3.7 โ€” Key Takeaways

  • Confusion matrix: foundation of classification evaluation โ€” TP, TN, FP, FN โ€” understand what each error type costs in your domain
  • Choose metric by cost: Recall when FN is costly (cancer, fraud); Precision when FP is costly (spam, treatment); F1 when both matter equally
  • AUC-ROC: threshold-independent โ€” probability that the model ranks a random positive above a random negative โ€” best single metric for imbalanced classification
  • Rยฒ for regression: proportion of variance explained โ€” 1.0 is perfect, 0 is same as predicting the mean, negative means worse than mean
  • K-Fold CV: rotate K folds for robust, low-variance performance estimates โ€” always use Stratified K-Fold for classification
  • Data leakage is the most common production failure: always split first, fit preprocessors only on training data โ€” use sklearn Pipeline
๐ŸŽ“ Domain 3 Complete โ€” Classical Machine Learning
  • Ch 3.1 โ€” The ML Landscape. Supervised, unsupervised, semi-supervised, and reinforcement learning. Mitchell's definition: learn from experience E to improve at tasks T measured by P. ML โ‰  explicit programming โ€” it is function approximation from data.
  • Ch 3.2 โ€” Linear & Polynomial Regression. Minimise MSE via gradient descent. Regularisation (Ridge / Lasso) prevents overfitting by penalising large weights. Bias-variance tradeoff: high bias = underfits, high variance = overfits. The fundamental tension in all ML.
  • Ch 3.3 โ€” Logistic Regression, KNN, Naive Bayes. Logistic regression uses the sigmoid to output probabilities; decision boundary is linear. KNN is instance-based โ€” no training, but slow at inference and sensitive to the curse of dimensionality. Naive Bayes applies Bayes' theorem with the strong (but effective) conditional independence assumption.
  • Ch 3.4 โ€” Decision Trees & Ensembles. Trees split on information gain (Gini / entropy). Single trees overfit โ€” ensemble methods fix this. Bagging (Random Forest) reduces variance via bootstrap sampling. Boosting (Gradient Boosting, XGBoost) reduces bias by sequentially correcting errors. XGBoost is still the dominant algorithm for structured data.
  • Ch 3.5 โ€” Support Vector Machines. Maximum-margin hyperplane โ€” only support vectors matter. Soft margin C controls tolerance. The kernel trick projects data into high-dimensional space via the inner product โ€” enabling non-linear decision boundaries without explicit feature maps. RBF kernel is the most general choice; always scale features.
  • Ch 3.6 โ€” Clustering & Unsupervised Learning. K-Means minimises inertia โ€” fast but assumes spherical clusters and needs K upfront. DBSCAN discovers clusters as dense regions โ€” no K required, handles noise and arbitrary shapes. Hierarchical clustering builds a full dendrogram โ€” cut at any height for K clusters. PCA: eigenvectors of the covariance matrix for maximum-variance projection. t-SNE / UMAP for non-linear 2D/3D visualisation only.
  • Ch 3.7 โ€” Model Evaluation & Validation. Confusion matrix is the foundation: TP/TN/FP/FN. Choose metric by error cost โ€” Recall when FN is fatal (cancer), Precision when FP is costly (spam). AUC-ROC is threshold-independent and handles class imbalance. K-Fold CV gives robust, low-variance performance estimates. Data leakage โ€” the most common production failure โ€” is prevented by splitting before fitting any preprocessor (use Pipeline).

Domain 3 gave you the full classical ML toolkit โ€” the algorithms that power the majority of production ML systems today. Every algorithm here is a specific answer to the same question: how do we build a function that generalises from training data to unseen examples? Domain 4 takes this further into deep neural networks โ€” where the function is parameterised by millions of learned weights.