The ML Landscape
Supervised, unsupervised, semi-supervised, and reinforcement learning โ the map before the territory
Machine learning is not one thing. It is a family of approaches united by one idea: let the computer find the pattern in the data, rather than programming the pattern by hand. Everything in this domain is a specific way of doing that.
What Is Machine Learning? Core
In 1959, Arthur Samuel coined the term machine learning while building a self-improving checkers program, defining it as "the ability to learn without being explicitly programmed." The most precise formulation came from Tom Mitchell in 1997:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Mitchell, T. โ Machine Learning, McGraw Hill, 1997Concrete example โ applying Mitchell's definition to a spam filter:
As the filter processes more labelled emails (more E), its accuracy on new mail (P at T) improves โ that's the entire definition in one sentence.
The critical insight is the inversion of the programming paradigm. In traditional software you write rules ("if subject contains 'FREE MONEY' โ spam"). In ML, you provide examples and the algorithm discovers the rules automatically โ rules often far too complex and numerous for any human to encode by hand.
Supervised Learning In-depth
Supervised learning is the most widely deployed form of ML. The word "supervised" means a teacher is present: you provide labelled training examples โ pairs of (input X, correct answer y) โ and the algorithm learns a mapping f: X โ y that generalises to new, unseen inputs. Think of a student studying with an answer key: they infer underlying rules from many (question, answer) pairs, then apply those rules to questions they've never seen.
Tasks split into two types by the nature of y:
The choice of output type determines the loss function, output layer, and evaluation metric.
The learning signal comes from comparing the model's prediction ลท to the true label y via a loss function. Gradient descent nudges the model's parameters in the direction that reduces the loss. Repeat this across millions of examples and the model converges to a good approximation of f. Supervised learning's key limitation is label dependency: human-annotated examples are expensive, slow, and often scarce in specialised domains.
| Property | Regression | Classification |
|---|---|---|
| Output type | Continuous value (โ) | Discrete category |
| Loss function | MSE, MAE, Huber | Cross-entropy, Hinge |
| Output layer | Linear (single node) | Softmax / Sigmoid |
| Evaluation | RMSE, Rยฒ, MAE | Accuracy, F1, AUC-ROC |
| Example | House price prediction | Email spam detection |
| Algorithms | Linear regression, Ridge, SVR | Logistic reg, SVM, Random Forest |
Unsupervised Learning Core
Unsupervised learning removes the teacher entirely. You provide raw inputs X โ no answer key, no rewards. The algorithm must discover hidden structure on its own. Think of giving someone a pile of unlabelled photographs and asking them to organise it: they'll naturally group similar images together even without being told what "similar" means.
Three major tasks fall under unsupervised learning:
The unsupervised challenge is evaluation: without ground-truth labels, how do you know if a clustering is good? Metrics like silhouette score (cohesion vs separation) and reconstruction error (for autoencoders) provide proxies, but ultimately require domain-expert judgment. Despite this, unsupervised learning is invaluable โ it can operate on the vast stores of unlabelled data that exist in every organisation.
Semi-Supervised & Self-Supervised Learning Core
Between the extremes of "all data labelled" (supervised) and "no data labelled" (unsupervised) lies a practical middle ground that powers most of modern AI.
Semi-supervised learning combines a small labelled set with a large unlabelled set. Even without labels, the geometric structure of the data (clusters, manifolds) constrains which labels are plausible: two nearby points probably share a label. Algorithms like label propagation, self-training, and co-training exploit this structure to achieve strong performance even when labels cover only 1โ5% of data โ critical in medical imaging where expert annotation is expensive.
Self-supervised learning is the most important paradigm of the past decade. It creates a training signal directly from raw data, eliminating human annotation entirely. The core trick: mask or corrupt part of the input, then train the model to reconstruct the missing part.
Self-supervised learning is how every major language model is pre-trained. GPT-4 processed trillions of tokens โ each token is the label for the previous one. This technique made trillion-parameter model training economically feasible without any human annotation at scale.
Semi-Supervised Learning
A few labelled points anchor the model; unlabelled points fill in the space via geometric structure. Like teaching a child 5 examples of "cat" and letting them identify 10,000 more.
- Medical AI: 100 labelled scans + 10,000 unlabelled
- Drug discovery: few active compounds + vast chemical space
- Web classification: small annotated corpus + crawled text
Self-Supervised Learning
Labels come from the data structure itself โ the supervision signal is derived from the input. No human annotation required.
- BERT: predict masked tokens โ understand language
- GPT: predict next token โ generate coherent text
- SimCLR: match augmented views โ learn visual features
- WAV2Vec: predict masked audio โ speech recognition
Reinforcement Learning Introductory
Reinforcement learning (RL) is the third major paradigm. An agent takes actions in an environment and receives numerical rewards or penalties based on those actions. No labelled examples exist โ only a reward signal over time. The agent must learn, through trial and error, which sequences of actions lead to the highest cumulative reward.
The central challenge is the exploration-exploitation dilemma: should the agent exploit what it already knows works, or explore new actions that might be better? Too much exploitation โ stuck in a local optimum. Too much exploration โ never commits to a strategy.
Game Playing
AlphaGo / AlphaZero learned Go, chess, and shogi from self-play with game outcome as reward. Beat world champions within days of training.
Robotics
Reward = physical task completion. Boston Dynamics gaits, OpenAI Dactyl hand, DeepMind locomotion โ all RL-trained policies.
RLHF for LLMs
Humans rate responses โ train a reward model โ use PPO to fine-tune the LLM toward preferred outputs. The technique that made ChatGPT helpful.
๐ Full RL coverage โ Q-learning, policy gradients, PPO, actor-critic โ is in Domain 7: Reinforcement Learning. Here we establish only where RL fits in the ML landscape.
The Standard ML Pipeline In-depth
Every ML project โ from a weekend Kaggle competition to a production system at Google โ follows the same general pipeline. Understanding this sequence prevents the most common failure modes: models trained on wrong data, evaluated on contaminated test sets, or deployed without monitoring.
โ Problem Definition
Frame the task precisely: regression or classification? What is the prediction target? What does success look like numerically? Check whether ML is even needed โ sometimes a rule-based system is simpler and more reliable.
โก Data Collection
Gather from APIs, databases, web scraping, sensors, or surveys. Assess quantity (enough to generalise?) and quality (representative? biased? stale?). Data collection is often 60โ70% of total project time in practice.
โข EDA โ Exploratory Data Analysis
Histograms, scatter plots, correlation matrices, missing value heatmaps, class imbalance checks, outlier detection. Goal: understand your data before touching a model. Surprises found here cost 10ร less than surprises found post-deployment.
โฃ Preprocessing
Missing value imputation (mean, median, KNN), feature scaling (StandardScaler, MinMaxScaler), categorical encoding (one-hot, ordinal, target encoding), train/val/test split. Critical rule: fit all transformers on training data only โ never on validation or test data.
โค Modelling
Start with a baseline (majority-class classifier, mean predictor). Then iterate: logistic regression โ random forest โ gradient boosting โ neural network. Use cross-validation. Tune with grid search, random search, or Bayesian optimisation (Optuna).
โฅ Evaluation
Choose metrics aligned with business goals. Accuracy misleads on imbalanced classes โ prefer F1, precision/recall, or AUC-ROC. Never evaluate on training data. Final evaluation on the held-out test set โ used exactly once.
โฆ Deployment & Monitoring
Serve via REST API (FastAPI, Flask, Triton) or batch pipeline. Monitor for data drift (input distribution shifts), concept drift (relationship between X and y changes), and performance degradation. Set retraining triggers. Production models without monitoring degrade silently โ one of the most common failure modes in real ML systems.
The pipeline is not linear โ it is a loop. EDA may force you to redefine the problem. Evaluation may send you back to data collection. Deployment monitoring may trigger the whole cycle again. Budget time for at least 3 full iterations.
When to Use Which Paradigm Core
The most common mistake early practitioners make is defaulting to neural networks or the most complex algorithm available. The right question is always: what data do I have, and what do I need to output?
| Paradigm | Data Requirement | Typical Tasks | Key Algorithms | When to Use |
|---|---|---|---|---|
| Supervised | Labelled pairs (X, y) | Classification, Regression | Linear Reg, RF, SVM, NNs | You have labels and a clear prediction target |
| Unsupervised | Unlabelled X only | Clustering, Compression, Anomaly | k-means, PCA, DBSCAN, Autoencoders | No labels; explore structure; detect anomalies |
| Semi-supervised | Few labels + many unlabelled | Classification with scarce labels | Label propagation, self-training, MixMatch | Labelling expensive; abundant unlabelled data |
| Self-supervised | Raw unlabelled X (large scale) | Pre-training LLMs, vision encoders | BERT, GPT, SimCLR, MAE | Huge unlabelled datasets; build foundation model |
| Reinforcement | Reward signal from environment | Games, robotics, RLHF, trading | Q-learning, PPO, SAC, DQN | Sequential decisions; feedback from consequences |
๐ Chapter 3.1 โ Key Takeaways
- ML = let the algorithm find patterns instead of programming them explicitly โ the key inversion of the traditional software paradigm
- Mitchell's definition: learn from experience E on tasks T as measured by performance P โ applies to everything from spam filters to AlphaFold
- Supervised: labelled {X, y} pairs โ learn f: X โ y. Two subtypes: regression (continuous) and classification (categorical)
- Unsupervised: no labels โ discover hidden clusters, compress via dimensionality reduction, or model density โ evaluation is harder without ground truth
- Self-supervised: labels come from the data itself โ predicting masked or next tokens โ how GPT and BERT are pre-trained at trillion-token scale with zero human annotation
- Reinforcement: learn from reward signals, not labels โ powers AlphaGo, robot locomotion, and RLHF for LLMs (full coverage: Domain 7)
- The ML pipeline: Problem โ Data โ EDA โ Preprocess โ Model โ Evaluate โ Deploy & Monitor โ a loop, not a one-way sequence
- Paradigm choice: do you have labels? A reward signal? Or only raw data? โ always start with the simplest method that fits your data situation
Regression is the simplest form of supervised learning โ and the best place to understand how ML really works. Every concept here (loss functions, gradient descent, overfitting, regularisation) reappears in every algorithm from SVMs to transformers.
Linear Regression In-depth
The goal of regression is to predict a continuous numerical output from one or more inputs. The simplest version โ simple linear regression โ assumes the relationship between input x and output y is a straight line. We fit that line to the data by finding the optimal slope (weight) and intercept (bias).
Our running example: predicting a house's sale price from its size in square footage. We assume price increases roughly linearly with size, with some noise from other unmodelled factors (location, condition, year built). The model makes predictions using:
The model has just two parameters โ w (how much price increases per unit of size) and b (the base price for a zero-size house, which anchors the line). Training means finding the values of w and b that make the line fit the data as closely as possible.
The Cost Function (MSE) In-depth
To fit a line we need to measure how wrong our current w and b are. The residual for one prediction is the gap between the actual value and the predicted value: residual = y โ ลท. If we just summed residuals, positive and negative errors would cancel. Instead we square them.
Mean Squared Error (MSE) averages the squared residuals across all n training examples. Squaring does two things: it makes all errors positive, and it penalises large errors far more than small ones (a 2ร larger error contributes 4ร the cost). MSE is also smooth everywhere โ which means we can take its derivative and use gradient descent.
A worked example โ 4 predictions: actual = [200, 250, 300, 350], predicted = [190, 270, 290, 380]. Residuals: [10, โ20, 10, โ30]. Squared: [100, 400, 100, 900]. MSE = (100+400+100+900)/4 = 375. RMSE = โ375 โ $19.4k โ that's the typical prediction error in the same units as house price.
MSE โ Use When
- Outliers are genuine data (not noise)
- Large errors should be penalised heavily
- You need a differentiable loss for gradient descent
- Default choice for regression โ sklearn, PyTorch default
MAE โ Use When
- Data has many outliers you want to ignore
- Median-like behaviour is preferred over mean
- House price datasets with luxury anomalies
- Not differentiable at 0 โ needs subgradient methods
Gradient Descent for Regression In-depth
We have a cost function J(w, b) โ now we need to minimise it. Two approaches exist. The Normal Equation solves for the optimal weights analytically in one shot. Gradient Descent takes many small iterative steps, using the derivative of J to determine which direction to move.
The intuition: the gradient (derivative of J w.r.t. w) tells you the slope of the cost bowl at your current position. Subtract a fraction (ฮฑ) of that slope from w to move toward the minimum. If the gradient is positive, w is too large โ decrease it. If negative, w is too small โ increase it. The learning rate ฮฑ controls how large each step is โ too large and you overshoot; too small and training takes forever.
Gradient descent is not just for linear regression โ it's the engine that trains every neural network, transformer, and deep learning system. Understanding it here, where the math is simple, makes every downstream algorithm clearer.
| Aspect | Normal Equation | Gradient Descent |
|---|---|---|
| Formula | w = (XแตX)โปยนXแตy | Iterative โJ/โw updates |
| Speed โ small n | Fast, one computation | Slow โ many iterations |
| Speed โ large n | Very slow O(nยณ) | Scales well, mini-batches |
| Memory | Must invert full XแตX matrix | Can stream mini-batches |
| Convergence | Always exact | Needs learning rate tuning |
| Best for | < 1,000 features | Large data, neural networks |
Multiple Linear Regression Core
Real predictions need multiple inputs. House price depends not just on square footage, but on bedrooms, bathrooms, location score, age, and dozens of other features. Multiple linear regression extends the model to handle any number of features:
With multiple features, feature scaling becomes critical. If xโ (square footage) ranges from 500โ3000 and xโ (bedrooms) ranges from 1โ6, the gradient for wโ is tiny compared to wโ. This creates an elongated cost bowl where gradient descent zig-zags inefficiently. Scaling all features to a comparable range makes the bowl round and descent fast.
Polynomial Regression Core
What if the relationship between x and y is curved, not linear? We can handle this without leaving the linear regression framework โ by creating polynomial features: xยฒ, xยณ, and so on. We then fit a standard linear regression to these expanded features. The model is still linear in its parameters (wโ, wโ, wโ...) even though the fit is a curve in the original x space.
The risk: higher polynomial degree โ tighter fit on training data โ worse generalisation to new data. A degree-12 polynomial can pass through every training point perfectly (zero training error) while oscillating wildly between them โ it has memorised the data rather than learned the pattern.
Regularisation: Ridge, Lasso, ElasticNet In-depth
Overfitting happens because the model is too complex for the data โ it fits noise rather than signal. Regularisation is the standard fix: add a penalty term to the loss function that discourages large weights. Smaller weights = simpler model = less overfitting. The penalty strength is controlled by a hyperparameter ฮป (lambda).
Ridge (L2)
Adds ฮปยทฮฃwแตขยฒ to the loss. Penalises large weights by squaring them. Weights shrink toward zero but never exactly reach zero. Keeps all features โ just with smaller influence.
- Best when all features are genuinely useful
- Stable โ handles correlated features well
- sklearn:
Ridge(alpha=ฮป)
Lasso (L1)
Adds ฮปยทฮฃ|wแตข| to the loss. The absolute value creates corners in the constraint region โ the optimal solution tends to land exactly on an axis, driving some weights to exactly zero. Built-in feature selection.
- Best when many features are irrelevant
- Produces sparse models
- sklearn:
Lasso(alpha=ฮป)
ElasticNet
Combines L1 and L2 penalties. Gets some feature selection (from L1) with the stability of L2 when features are correlated. Controlled by a mixing ratio ฮฑ.
- Best when groups of correlated features exist
- More flexible than pure Ridge or Lasso
- sklearn:
ElasticNet(alpha=ฮป, l1_ratio=ฮฑ)
| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | ฮปยทฮฃwแตขยฒ | ฮปยทฮฃ|wแตข| | ฮฑยทL1 + (1โฮฑ)ยทL2 |
| Feature selection | No โ weights shrink, never zero | Yes โ exact zeros possible | Partial |
| Output | Dense โ all weights non-zero | Sparse โ many weights = 0 | Semi-sparse |
| Best when | All features relevant | Many irrelevant features | Correlated + sparse |
| sklearn | Ridge(alpha=ฮป) | Lasso(alpha=ฮป) | ElasticNet(alpha, l1_ratio) |
The Bias-Variance Tradeoff In-depth
Why does any ML model fail? The expected prediction error can always be decomposed into three components. Understanding them is essential for diagnosing model problems and choosing the right fix.
Fix: more features, higher polynomial degree, less regularisation, more complex model.
Fix: more training data, regularisation (Ridge/Lasso), simpler model, dropout, early stopping.
Regularisation is bias-variance control. Increasing ฮป raises bias (simpler model) and reduces variance (less sensitive to training data). The optimal ฮป sits at the bottom of the U-shaped validation error curve โ found by cross-validation, not guesswork.
๐ Chapter 3.2 โ Key Takeaways
- Linear regression: ลท = wยทx + b โ learn the weight and bias that minimise MSE on training data
- MSE penalises large errors quadratically; RMSE is in the same units as y โ more interpretable for reporting
- Gradient descent: iteratively nudge w and b in the direction that reduces MSE โ the algorithm behind all of deep learning
- Feature scaling (StandardScaler) is essential: unscaled features create elongated cost bowls that slow convergence dramatically
- Polynomial features extend linear regression to curves โ but higher degree risks overfitting (zero training error, high test error)
- Ridge (L2) shrinks weights; Lasso (L1) zeros weights (feature selection); ElasticNet combines both
- Bias-Variance: Error = Biasยฒ + Variance + ฯยฒ. Regularisation trades more bias for less variance โ find the sweet spot with cross-validation
Classification is supervised learning for discrete outputs: given an input, which bucket does it belong to? Logistic regression, k-nearest neighbours, and Naive Bayes each answer that question from a completely different angle โ probabilistic, geometric, and statistical.
Logistic Regression In-Depth
Despite its name, logistic regression is a classifier, not a regressor. It predicts P(y=1 | x) โ the probability that an input belongs to class 1. The trick is the sigmoid function, which squashes any real number from โโ to +โ into the range (0, 1), making the output interpretable as a probability.
Why not use ordinary linear regression for classification? A linear model can predict values well below 0 or well above 1, which are meaningless as probabilities, and the resulting loss surface is non-convex with MSE โ gradient descent is not guaranteed to find the global minimum. The sigmoid + binary cross-entropy combination fixes both problems.
Decision rule: predict class 1 if P(y=1|x) โฅ 0.5, else class 0. The threshold 0.5 is the default but can be tuned โ lower it to increase recall (catch more positives), raise it to increase precision (fewer false alarms).
# Logistic Regression โ sklearn walkthrough from sklearn.linear_model import LogisticRegression from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=42) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) model = LogisticRegression() model.fit(X_train, y_train) y_pred = model.predict(X_test) y_prob = model.predict_proba(X_test)[:, 1] # probability of class 1 print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}") print(f"Probabilities (first 5): {y_prob[:5].round(3)}")
Sigmoid & Decision Boundaries Core
Logistic regression produces linear decision boundaries โ a straight line in 2D, a hyperplane in higher dimensions. This is fast and interpretable, but it fails when the true boundary between classes is curved or complex.
Three escape routes from linearity: polynomial features (add xยฒ, xยทy, ... as new inputs โ same model, richer boundary), the kernel trick (implicitly maps to a very high-dimensional space, used in SVMs), or simply switch to a non-linear model (decision trees, neural networks). The softmax function generalises the sigmoid to K classes โ each class gets a score zโ, and softmax converts all K scores into a probability distribution that sums to 1.
Multiclass Classification Core
When there are K > 2 classes, three strategies extend binary classification:
One-vs-Rest (OvR)
Train K binary classifiers. Classifier k asks: "is this class k vs everything else?" At prediction, pick the class with the highest confidence score.
- K models trained
- Fast, simple
- Works with any binary classifier
One-vs-One (OvO)
Train K(Kโ1)/2 classifiers โ one for every pair of classes. At prediction, majority vote wins. Works well when individual classifiers are fast to train.
- K(Kโ1)/2 models
- Better for SVMs
- Slower for large K
Softmax (Multinomial)
Single model with K output scores. Softmax converts them to probabilities that sum to 1. The modern default for neural networks and logistic regression.
- 1 model
- Probabilistic output
- End-to-end trainable
K-Nearest Neighbours In-Depth
"You are who your neighbours are." KNN is the simplest meaningful classifier: to classify a new point, find the K nearest training points by distance and take a majority vote.
KNN is a lazy learner โ there is no training phase. The algorithm memorises all training data and performs all computation at prediction time. This makes training instant but prediction potentially slow on large datasets (O(nยทd) per query โ must compute distance to every training point).
Choosing K: K=1 memorises training data perfectly (zero training error) but is extremely sensitive to noise and outliers โ overfitting. K=n (all points) just predicts the majority class โ underfitting. The optimal K is found via cross-validation, typically in the range 3โ15. Odd K values avoid ties in binary classification.
Naive Bayes Core
Naive Bayes is a probabilistic classifier built on Bayes' theorem. The "naive" assumption is that all features are conditionally independent given the class โ in practice this is almost never true, yet the algorithm performs remarkably well, especially on text classification tasks.
Why does it work despite the false assumption? The discriminative signal โ the difference between class probabilities โ is often large enough that the independence error doesn't flip the argmax. In high-dimensional sparse data (like text), there's also very little co-occurrence signal to exploit anyway.
Worked example โ spam filter: Given 5 training emails below, compute the Naive Bayes decision for the new message "free click prize":
| "free" | "click" | "meeting" | "report" | Label | |
|---|---|---|---|---|---|
| Email 1 | 1 | 1 | 0 | 0 | spam |
| Email 2 | 1 | 0 | 0 | 0 | spam |
| Email 3 | 0 | 0 | 1 | 1 | ham |
| Email 4 | 0 | 0 | 0 | 1 | ham |
| Email 5 | 1 | 1 | 0 | 0 | spam |
# P(spam) = 3/5 = 0.6 | P(ham) = 2/5 = 0.4 # P(free|spam)=3/3=1.0 P(click|spam)=2/3โ0.67 # P(free|ham)=0/2=0.0 P(click|ham)=0/2=0.0 (add Laplace smoothing: +1) # With smoothing: P(free|ham)โ0.17 P(click|ham)โ0.17 # Score(spam) = 0.6 ร 1.0 ร 0.67 โ 0.40 # Score(ham) = 0.4 ร 0.17 ร 0.17 โ 0.012 prediction = "spam" # argmax wins
| Variant | Feature Type | Use Case | Example |
|---|---|---|---|
| Gaussian NB | Continuous (normal dist) | Iris classification, sensors | Height/weight data |
| Multinomial NB | Count data (non-negative int) | Text classification (TF) | Word frequency in emails |
| Bernoulli NB | Binary (0 / 1) | Binary feature classification | Word present / absent |
| Complement NB | Count data | Imbalanced text classification | Large corpus, class imbalance |
Algorithm Comparison Core
| Algorithm | Training | Prediction | Interpretable? | Non-linear? | Best For |
|---|---|---|---|---|---|
| Logistic Regression | O(nยทdยทiter) | Fast O(d) | Yes | No (poly features needed) | Linearly separable, probability output needed |
| KNN | None (lazy) | Slow O(nยทd) | Intuitive | Yes (local) | Small dataset, non-linear baseline |
| Naive Bayes | Very Fast O(nยทd) | Very Fast O(Kยทd) | Yes | Partially (Gaussian) | Text classification, high-dim sparse data |
๐ Chapter 3.3 โ Key Takeaways
- Logistic Regression outputs P(y=1|x) via the sigmoid function โ linear decision boundary, interpretable coefficients, trained with Binary Cross-Entropy
- The decision threshold (default 0.5) can be tuned โ lower for higher recall, raise for higher precision; always use the probability output, not just the class label
- KNN: no training phase โ classify by majority vote of K nearest neighbours. Requires feature scaling; slow at prediction O(nยทd); ideal as a non-linear baseline
- Naive Bayes: Bayes' theorem + conditional independence assumption โ surprisingly accurate despite the naive assumption; very fast; best-in-class for text classification
- Multiclass strategies: OvR (K classifiers), OvO (K(Kโ1)/2 classifiers), or Softmax (single model, modern default)
- Linear classifiers fail on non-linear data โ use polynomial features, kernel trick, or switch to a non-linear model (decision trees, neural networks)
A single decision tree mirrors human reasoning but overfits easily. Ensemble methods โ Random Forests and Gradient Boosting โ combine hundreds of trees to become the most powerful classical ML algorithms on tabular data.
Decision Trees In-Depth
Decision trees mirror how humans naturally make decisions: a series of if/else questions on features, arriving at a prediction at the leaves. Each internal node tests one feature, each branch is an outcome of that test, and each leaf holds the class label or numeric prediction. The CART algorithm (Classification And Regression Trees) is the sklearn default and can handle both classification and regression, numerical and categorical features, and naturally captures non-linear relationships.
# Decision Tree โ sklearn from sklearn.tree import DecisionTreeClassifier, export_text from sklearn.datasets import load_iris iris = load_iris() X, y = iris.data, iris.target tree = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42) tree.fit(X, y) # Print tree structure as text print(export_text(tree, feature_names=list(iris.feature_names))) # Feature importances for name, imp in zip(iris.feature_names, tree.feature_importances_): print(f"{name}: {imp:.3f}")
Splitting Criteria In-Depth
At each node, CART exhaustively searches all features and all thresholds to find the split that maximises purity in the child nodes. Three measures of impurity are used in practice:
Gini Impurity
CART default for classification. Probability of misclassifying a random sample. Range: 0 (pure) โ 0.5 (balanced binary).
Entropy / Info Gain
Used by ID3 and C4.5 algorithms. Measures disorder. Information Gain = parent entropy โ weighted child entropy.
Variance Reduction
For regression trees. Choose the split that minimises the weighted variance of child nodes โ equivalent to minimising MSE.
Worked example โ which split wins? Parent node: 30 samples (20 blue, 10 red). Gini = 1 โ (20/30)ยฒ โ (10/30)ยฒ = 0.444.
| Split | Left child | Right child | Weighted Gini | Result |
|---|---|---|---|---|
| Split A | 15 blue, 5 red โ Gini = 0.375 | 5 blue, 5 red โ Gini = 0.500 | 0.417 | Gain = 0.027 |
| Split B | 19 blue, 1 red โ Gini = 0.099 | 1 blue, 9 red โ Gini = 0.180 | 0.120 | Gain = 0.324 โ Winner |
Overfitting & Pruning Core
An unconstrained decision tree grows until every leaf contains a single sample โ perfect training accuracy, terrible test accuracy. This is the classic overfit problem. Two families of solutions exist: pre-pruning (stop growing early) and post-pruning (grow full, then cut).
Pre-Pruning Hyperparameters
max_depthโ maximum depth of treemin_samples_splitโ min samples to split a nodemin_samples_leafโ min samples in any leafmax_featuresโ max features considered per splitmin_impurity_decreaseโ min gain required to split
Post-Pruning (Cost Complexity)
- Grow full tree to max depth
- Calculate cost-complexity criterion for subtrees
- Remove branches with lowest improvement-to-size ratio
- Select best pruned tree via cross-validation
- sklearn:
ccp_alphaparameter controls pruning strength
Bagging & Random Forests In-Depth
A single decision tree has high variance โ small perturbations to the training data produce completely different trees. Bagging (Bootstrap Aggregating) solves this by training N trees on N bootstrapped datasets (sampled with replacement) and averaging their predictions. The individual errors are largely independent, so they cancel out.
Random Forest = Bagging + random feature subset at each split. The critical insight: if you use all features at each node, every tree will pick the same dominant feature at the root, making all trees highly correlated โ they'd all make the same mistakes. By restricting each node to a random subset of โd features, the trees become decorrelated, and averaging truly independent errors dramatically reduces variance. About 37% of training samples are never used in each tree โ this out-of-bag (OOB) set is a free validation estimate.
Random Forest Strengths
- Handles high-dimensional data well
- Built-in feature importance scores
- Robust to outliers and missing values
- OOB error โ free validation estimate
- Fully parallelisable โ fast training
- One of the best off-the-shelf algorithms
Random Forest Weaknesses
- Less interpretable than a single tree
- Slow prediction for very large forests
- Memory intensive โ stores all trees
- Poor extrapolation beyond training range
- Not ideal for very high-cardinality categoricals
Gradient Boosting In-Depth
Gradient Boosting is fundamentally different from bagging. Trees are built sequentially โ each new tree specifically targets the mistakes of the current ensemble. The key insight (Friedman, 1999): fit each new tree to the negative gradient of the loss function with respect to the current predictions. This is gradient descent โ but in function space rather than parameter space.
The learning rate (shrinkage) controls how much each tree contributes: F(x) += learning_rate ร T(x). A small learning rate requires more trees but generalises better. AdaBoost (the precursor) reweighted misclassified examples; modern Gradient Boosting is more general and works with any differentiable loss function.
XGBoost, LightGBM & CatBoost Core
Vanilla gradient boosting is slow โ it re-evaluates every possible split at every node. The three production-grade libraries solve this with fundamentally different approaches, making gradient boosting practical on millions of rows.
XGBoost (2016)
Chen & Guestrin. Regularised gradient boosting, second-order gradients (Newton step), hardware-optimised cache-aware computation. The competition standard for years.
- Best general-purpose default
- Excellent documentation
- Wide ecosystem support
LightGBM (2017)
Microsoft. Histogram-based splits (buckets values โ much faster), leaf-wise tree growth (vs level-wise), GOSS & EFB sampling. 10ร faster than XGBoost on large data.
- Large datasets (>100K rows)
- High-cardinality features
- Speed matters
CatBoost (2017)
Yandex. Native categorical feature handling (no manual encoding), ordered boosting to prevent target leakage, symmetric (oblivious) trees for fast prediction.
- Many categorical features
- Minimal tuning needed
- Avoids target leakage
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed | Good | Fastest | Good |
| Memory usage | High | Low | Medium |
| Categoricals | Manual encoding | Partial support | Native support |
| Tuning effort | Moderate | Moderate | Minimal |
| Tree growth | Level-wise | Leaf-wise | Oblivious trees |
| Best for | General tabular | Large-scale, fast | Many categoricals |
# XGBoost โ binary classification example import xgboost as xgb from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.metrics import roc_auc_score X, y = load_breast_cancer(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) model = xgb.XGBClassifier( n_estimators=200, # number of trees learning_rate=0.1, # shrinkage โ prevents overfitting max_depth=4, # tree depth subsample=0.8, # row sampling per tree colsample_bytree=0.8, # feature sampling per tree eval_metric='logloss', random_state=42, ) model.fit( X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=20, # stop if no improvement for 20 rounds verbose=False, ) y_prob = model.predict_proba(X_test)[:, 1] # probability of positive class print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
Ensemble Comparison Core
If you have tabular data, try Random Forest first as your baseline. If you need the best possible accuracy, use XGBoost or LightGBM with cross-validated hyperparameter tuning. These two algorithms have won more Kaggle competitions than any other approach combined.
๐ Chapter 3.4 โ Key Takeaways
- Decision tree: recursive binary splits using Gini or Entropy โ interpretable but prone to high-variance overfitting without pruning
- Bagging: train N trees on bootstrapped datasets and aggregate to reduce variance โ errors average out when trees are independent
- Random Forest: bagging + random feature subsets at each node = decorrelated trees = one of the strongest off-the-shelf algorithms
- Gradient Boosting: sequential error correction โ each new tree fits the negative gradient of the loss; controlled by learning rate (shrinkage)
- XGBoost / LightGBM / CatBoost: production-grade gradient boosting โ dominant on structured/tabular data for a decade
- Ensemble principle: combining many diverse, weak learners consistently outperforms any single strong learner
SVMs answer a deceptively simple question: of all the hyperplanes that separate two classes, which one generalises best? The answer โ the maximum margin hyperplane โ leads to one of the most elegant and mathematically rigorous algorithms in machine learning.
Maximum Margin Classifier In-Depth
Many hyperplanes can separate two linearly-separable classes. SVMs choose the one that maximises the margin โ the perpendicular distance between the boundary and the nearest training points on each side. A wider margin means more room for error on unseen data, giving better generalisation.
The decision boundary is the hyperplane wยทx + b = 0. The two margin planes are wยทx + b = +1 and wยทx + b = โ1. The margin width is 2/||w||, so maximising the margin is equivalent to minimising ||w||ยฒ. This is a constrained quadratic optimisation problem โ convex, so it has a unique global solution.
Support Vectors Core
Support vectors are the training points that lie exactly on the margin boundaries (wยทx + b = ยฑ1). They are the only points that matter: if you removed every non-support vector from the training set and retrained, you'd get the exact same model. This is a profound property โ the decision boundary is defined by a sparse subset of the data.
What they are
Training points on or inside the margin boundary. Typically just a small fraction of all training points โ often fewer than 10%.
Why they matter
They are the only points that influence the boundary. Non-support vectors can be moved or removed without changing the model at all.
Model complexity
Fewer support vectors โ simpler, more generalisable model. Many support vectors โ complex boundary, may overfit. Good diagnostic metric.
The SVM optimisation is solved in its dual formulation using Lagrange multipliers โ one multiplier ฮฑแตข per training point. Non-support vectors have ฮฑแตข = 0, so they contribute nothing. The prediction is: ลท = sign(ฮฃ ฮฑแตข yแตข K(xแตข, x) + b), summing only over support vectors.
Soft Margin SVM โ The C Parameter In-Depth
The hard-margin SVM requires perfect linear separability โ it fails completely if even one point is on the wrong side. Real data is always noisy and often overlapping. Soft-margin SVM introduces slack variables ฮพแตข โฅ 0 that allow points to violate the margin. The C hyperparameter controls how heavily violations are penalised.
The Kernel Trick In-Depth
A linear SVM can only draw straight boundaries. But many real datasets are not linearly separable in their original feature space. One solution is to manually engineer polynomial or interaction features โ but for high-dimensional data this is computationally prohibitive. The kernel trick solves this elegantly: it allows the SVM to operate in an implicitly high-dimensional feature space without ever computing the transformation explicitly.
The key insight: the SVM dual formulation only requires computing dot products between feature vectors โ never the vectors themselves. A kernel function K(xแตข, xโฑผ) computes the dot product that would result from a high-dimensional mapping ฯ, without actually applying ฯ. For the RBF kernel, this implicit feature space is infinite-dimensional.
Kernel Types Core
| Kernel | Formula | Parameters | Best For | Limitation |
|---|---|---|---|---|
| Linear | x ยท z | None | Linearly separable, high-d text/NLP | Fails on non-linear data |
| Polynomial | (x ยท z + c)^d | d, c | Feature interactions (image data) | Sensitive to degree d choice |
| RBF / Gaussian | exp(โฮณ||xโz||ยฒ) | ฮณ, C | General purpose โ most used | Slow on very large datasets |
| Sigmoid | tanh(ฮณxยทz + c) | ฮณ, c | Neural network approximation | Rarely the best choice |
# SVM with RBF kernel + GridSearch from sklearn.svm import SVC from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCV # SVM requires feature scaling โ always use a Pipeline! svm_pipeline = Pipeline([ ('scaler', StandardScaler()), ('svm', SVC(kernel='rbf', probability=True)), ]) param_grid = { 'svm__C': [0.1, 1, 10, 100], 'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1], } grid_search = GridSearchCV( svm_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1 ) grid_search.fit(X_train, y_train) print(f"Best params: {grid_search.best_params_}") print(f"Best CV score: {grid_search.best_score_:.4f}")
SVR โ Support Vector Regression Reference
Support Vector Regression (SVR) extends the SVM idea to continuous outputs. Instead of maximising a margin between classes, SVR fits a tube of width 2ฮต around the predictions. Points inside the tube incur zero loss โ the model doesn't care about small errors. Only points outside the tube (the support vectors) contribute to the optimisation. This gives SVR a natural insensitivity to small noise.
SVR inherits all the power of kernels โ with an RBF kernel, SVR can fit complex non-linear regression surfaces while remaining robust to outliers (thanks to the ฮต-insensitive loss). The tradeoff: SVR can be slow on large datasets (O(nยฒ) to O(nยณ) training) and requires careful tuning of ฮต, C, and the kernel parameters.
๐ Chapter 3.5 โ Key Takeaways
- SVM finds the maximum margin hyperplane โ widest gap between classes = maximum robustness to noise
- Only support vectors (points on/inside the margin) determine the boundary โ all other training points are irrelevant
- Soft margin C: high C = small margin, strict (overfit risk); low C = wide margin, tolerant (underfit risk)
- Kernel trick: evaluate dot products in high-dimensional space without explicit transformation โ makes non-linear SVMs tractable
- RBF kernel is the most general โ controlled by ฮณ (boundary complexity) and C (margin strictness) โ tune both via cross-validation
- SVMs require feature scaling โ always wrap in a Pipeline with StandardScaler
Unsupervised learning is the art of finding structure without labels. The data speaks for itself โ the algorithm must discover what groups, components, or manifolds naturally exist. Clustering and dimensionality reduction are the two pillars.
K-Means Clustering In-Depth
K-Means partitions n points into K clusters by minimising within-cluster variance. The Lloyd algorithm iterates two steps โ assignment and update โ until centroids stop moving.
K-Means Issues & Choosing K Core
Choosing K
- Elbow method โ plot inertia vs K; look for the kink
- Silhouette score โ measures separation quality (higher = better)
- Gap statistic โ compare inertia to random null reference
Initialisation Sensitivity
- Random init can converge to suboptimal local minima
- K-Means++: spread initial centroids far apart โ provably โค O(log K) of optimal
- Run multiple times with different seeds; pick lowest inertia
Spherical Cluster Assumption
- Assumes clusters are roughly spherical and equal-size
- Fails on elongated, ring-shaped, or crescent clusters
- Use DBSCAN or GMM for arbitrary shapes
Outlier Sensitivity
- A single outlier can drag a centroid far from the true cluster center
- K-Medoids (PAM) uses actual data points as centroids โ more robust
- Pre-filter outliers or use robust clustering
DBSCAN In-Depth
Density-Based Spatial Clustering of Applications with Noise. Clusters are dense regions of points separated by low-density space. Unlike K-Means, DBSCAN does not require specifying K โ it discovers the number of clusters from the data.
Core Point
Has โฅ min_samples neighbours within radius ฮต. Anchor of a cluster.
Border Point
Within ฮต of a core point, but has fewer than min_samples neighbours itself.
Noise / Outlier
Not a core point and not within ฮต of any core point. Labelled โ1.
Hierarchical Clustering Core
Hierarchical clustering builds a full tree (dendrogram) of cluster merges. No need to specify K upfront โ pick any cut height and read off the clusters.
Linkage criteria โ how "distance between clusters" is measured:
Single
Min distance any two points across clusters. Tends to chain (elongated clusters).
Complete
Max distance. Produces more compact, balanced clusters.
Average
Average pairwise distance. Trade-off between single and complete.
Ward โ
Minimise increase in total within-cluster variance. Best general-purpose choice.
PCA โ Principal Component Analysis In-Depth
PCA reduces dimensionality by finding the directions (principal components) of maximum variance. Each PC is orthogonal to all others โ no redundancy. The technique is the eigendecomposition of the covariance matrix (equivalently, SVD of the data matrix).
When to use PCA: high-dimensional data (>50 features), visualisation (reduce to 2D/3D), noise reduction, or to remove multicollinearity before regression. Always scale features first (StandardScaler) โ PCA is variance-based and sensitive to scale.
t-SNE & UMAP Core
PCA is linear โ it cannot preserve complex non-linear structure. t-SNE and UMAP are non-linear manifold methods that excel at visual exploration of high-dimensional data.
t-SNE
- Models pairwise similarities as probabilities in high-D
- Maps to 2D/3D so similar points stay together, dissimilar push apart
- Uses t-distribution in 2D to avoid the crowding problem
- Perplexity hyperparameter: effective neighbourhood size (5โ50)
- Stochastic โ different runs give different layouts
- O(nยฒ) โ slow on large datasets
UMAP
- Topological approach โ models data on a Riemannian manifold
- Faster than t-SNE (sub-quadratic)
- Better preserves global structure alongside local
- Can be used for dimensionality reduction (not just visualisation)
- n_neighbors: controls local vs global structure trade-off
- More stable across runs than t-SNE
๐ Chapter 3.6 โ Key Takeaways
- K-Means: assign each point to its nearest centroid, recompute centroids โ fast but assumes spherical, equal-size clusters
- DBSCAN: discovers clusters as dense regions separated by low-density space โ no K required, naturally handles noise and outliers
- Hierarchical clustering: builds a dendrogram showing the full merge history โ cut at any height to get the desired number of clusters
- PCA: eigendecomposition of the covariance matrix โ linear projection retaining maximum variance โ always scale features first
- t-SNE / UMAP: non-linear 2D/3D visualisation โ use for exploration and intuition, not as features for downstream ML
- Rule of thumb: use PCA for feature compression before ML; use t-SNE or UMAP for visualisation only
A model is only as good as how you measure it. The wrong metric gives you false confidence; the right metric reveals exactly where your model fails. Evaluation is not an afterthought โ it is built into every design decision.
Confusion Matrix In-Depth
The confusion matrix is the foundation of all classification metrics. For binary classification it is a 2ร2 table of outcomes that decomposes prediction errors into two fundamentally different types โ false positives and false negatives โ which carry very different costs depending on the domain.
True Positive (TP)
Predicted positive, actually positive. Correct detection.
True Negative (TN)
Predicted negative, actually negative. Correct rejection.
False Positive (FP) โ Type I Error
Predicted positive, actually negative. "False alarm." E.g. healthy patient flagged โ bad but manageable.
False Negative (FN) โ Type II Error
Predicted negative, actually positive. "Missed case." E.g. cancer patient cleared โ potentially fatal.
Classification Metrics In-Depth
Accuracy
(TP + TN) / N
Fraction of all predictions that are correct. Useless on imbalanced data. Predicting "no anthrax" for every email gives 99.99% accuracy โ meaningless.
Precision
TP / (TP + FP)
Of all predicted positives, how many are actually positive? Use when false positives are costly โ spam filter, treatment recommendation.
Recall (Sensitivity)
TP / (TP + FN)
Of all actual positives, how many did we find? Use when false negatives are costly โ cancer screening, fraud detection, security alerts.
F1 Score
2ยทPยทR / (P + R)
Harmonic mean of precision and recall. Use when you care about both equally. Fฮฒ: ฮฒ>1 weighs recall more; ฮฒ<1 weighs precision more.
| Metric | Formula | Use When | Example |
|---|---|---|---|
| Accuracy | (TP+TN)/N | Balanced classes | Handwriting recognition |
| Precision | TP/(TP+FP) | FP costly | Spam filter, treatment recommendation |
| Recall | TP/(TP+FN) | FN costly | Cancer screening, fraud detection, security |
| F1 Score | 2PR/(P+R) | Balance P and R | Information retrieval, general NLP |
| AUC-ROC | Area under ROC | Imbalanced, ranking | Credit scoring, medical diagnosis |
| MCC | Matthews Corr. | Highly imbalanced | Genomics, rare event detection |
ROC Curve & AUC In-Depth
The ROC curve (Receiver Operating Characteristic) plots True Positive Rate (Recall) against False Positive Rate at every possible decision threshold. AUC (Area Under the Curve) collapses this to a single number: the probability that the model ranks a random positive example higher than a random negative example.
AUC = 0.5
Random guessing. Model has zero discrimination ability. Diagonal line.
AUC = 0.7โ0.9
Good to excellent discrimination. Acceptable for many real-world problems.
AUC = 1.0
Perfect classifier. L-shaped ROC curve hugging top-left corner.
Regression Metrics Core
For regression problems, evaluation metrics measure how far predictions deviate from true values. The choice of metric determines what kind of errors your model is penalised for โ which in turn shapes training and interpretation.
| Metric | Range | Units | Outlier Sensitivity | Best For |
|---|---|---|---|---|
| MAE | [0, โ) | Same as y | Robust | When outliers are real and shouldn't dominate |
| MSE | [0, โ) | yยฒ | Sensitive | Training objective โ differentiable everywhere |
| RMSE | [0, โ) | Same as y | Moderate | Reporting โ interpretable, same unit as target |
| Rยฒ | (-โ, 1] | Unitless | Moderate | Comparing models on the same dataset |
| MAPE | [0%, โ) | Percentage | Sensitive (small y) | Business reporting, relative error |
Rยฒ < 0 is possible โ it means your model is worse than simply predicting the mean. A model with Rยฒ = 0.85 explains 85% of the variance in the target. Always pair Rยฒ with a residual plot to check for systematic bias.
Cross-Validation In-Depth
A single train/test split gives a high-variance performance estimate โ one unlucky split can make a great model look bad. Cross-validation rotates the validation set across the full dataset, giving a robust, low-variance estimate of generalisation performance.
| Method | Description | Folds | Data Efficiency | Use When |
|---|---|---|---|---|
| Hold-out | Single 80/20 split | 1 | Low | Large dataset (>50K), fast iteration |
| K-Fold | Rotate K folds | K (5โ10) | High | Medium dataset, standard practice |
| Stratified K-Fold | Maintain class ratio per fold | K | High | Imbalanced classification |
| Time-Series Split | Respect temporal order, no future leakage | K | Medium | Time-series data |
| LOOCV | Each sample is its own fold (K=n) | n | Maximum | Very small dataset (<100 samples) |
Data Leakage In-Depth
Data leakage is the most common and dangerous ML mistake in production. It occurs when information from outside the training set influences the model โ artificially inflating validation performance so the model appears to work when it does not.
Target Leakage
A feature is a consequence of the target, not a cause. E.g., using "days hospitalised" to predict "admitted to hospital" โ the feature only exists after the event.
Train-Test Contamination
Preprocessing (scaling, imputation, encoding) fitted on the full dataset before splitting. The scaler has seen test data โ leakage!
Temporal Leakage
Future data used to predict past events in time-series. Training on data from 2024 to predict 2023 outcomes.
Duplicate Leakage
Same sample appears in both train and test due to duplicates in the raw dataset. Model memorises rather than generalises.
Data leakage is why models perform brilliantly in development and catastrophically in production. The number one rule: split your data first. Fit ALL preprocessing โ scalers, encoders, imputers โ ONLY on training data. Use sklearn's Pipeline to enforce this automatically.
๐ Chapter 3.7 โ Key Takeaways
- Confusion matrix: foundation of classification evaluation โ TP, TN, FP, FN โ understand what each error type costs in your domain
- Choose metric by cost: Recall when FN is costly (cancer, fraud); Precision when FP is costly (spam, treatment); F1 when both matter equally
- AUC-ROC: threshold-independent โ probability that the model ranks a random positive above a random negative โ best single metric for imbalanced classification
- Rยฒ for regression: proportion of variance explained โ 1.0 is perfect, 0 is same as predicting the mean, negative means worse than mean
- K-Fold CV: rotate K folds for robust, low-variance performance estimates โ always use Stratified K-Fold for classification
- Data leakage is the most common production failure: always split first, fit preprocessors only on training data โ use sklearn Pipeline
- Ch 3.1 โ The ML Landscape. Supervised, unsupervised, semi-supervised, and reinforcement learning. Mitchell's definition: learn from experience E to improve at tasks T measured by P. ML โ explicit programming โ it is function approximation from data.
- Ch 3.2 โ Linear & Polynomial Regression. Minimise MSE via gradient descent. Regularisation (Ridge / Lasso) prevents overfitting by penalising large weights. Bias-variance tradeoff: high bias = underfits, high variance = overfits. The fundamental tension in all ML.
- Ch 3.3 โ Logistic Regression, KNN, Naive Bayes. Logistic regression uses the sigmoid to output probabilities; decision boundary is linear. KNN is instance-based โ no training, but slow at inference and sensitive to the curse of dimensionality. Naive Bayes applies Bayes' theorem with the strong (but effective) conditional independence assumption.
- Ch 3.4 โ Decision Trees & Ensembles. Trees split on information gain (Gini / entropy). Single trees overfit โ ensemble methods fix this. Bagging (Random Forest) reduces variance via bootstrap sampling. Boosting (Gradient Boosting, XGBoost) reduces bias by sequentially correcting errors. XGBoost is still the dominant algorithm for structured data.
- Ch 3.5 โ Support Vector Machines. Maximum-margin hyperplane โ only support vectors matter. Soft margin C controls tolerance. The kernel trick projects data into high-dimensional space via the inner product โ enabling non-linear decision boundaries without explicit feature maps. RBF kernel is the most general choice; always scale features.
- Ch 3.6 โ Clustering & Unsupervised Learning. K-Means minimises inertia โ fast but assumes spherical clusters and needs K upfront. DBSCAN discovers clusters as dense regions โ no K required, handles noise and arbitrary shapes. Hierarchical clustering builds a full dendrogram โ cut at any height for K clusters. PCA: eigenvectors of the covariance matrix for maximum-variance projection. t-SNE / UMAP for non-linear 2D/3D visualisation only.
- Ch 3.7 โ Model Evaluation & Validation. Confusion matrix is the foundation: TP/TN/FP/FN. Choose metric by error cost โ Recall when FN is fatal (cancer), Precision when FP is costly (spam). AUC-ROC is threshold-independent and handles class imbalance. K-Fold CV gives robust, low-variance performance estimates. Data leakage โ the most common production failure โ is prevented by splitting before fitting any preprocessor (use Pipeline).
Domain 3 gave you the full classical ML toolkit โ the algorithms that power the majority of production ML systems today. Every algorithm here is a specific answer to the same question: how do we build a function that generalises from training data to unseen examples? Domain 4 takes this further into deep neural networks โ where the function is parameterised by millions of learned weights.