Mathematics & Statistics for AI
Linear algebra, calculus, gradients, and the mathematical language that powers every machine learning model.
Every neural network layer is a matrix multiplication. Every word embedding is a vector. Every attention score is a dot product. Linear algebra is not a prerequisite for machine learning; it is machine learning. If you understand how vectors transform under matrix operations, every model architecture in this course will click into place.
Before any algorithm, you need to understand the three fundamental objects of linear algebra. Think of them as increasing levels of generality:
Scalar
A single number. Examples: temperature (21.5 °C), learning rate (0.001), a pixel intensity (127). Denoted by a lowercase italic: x.
Vector
An ordered list of numbers: a point in n-dimensional space. Examples: an RGB pixel [255, 128, 0], a 512-dimensional word embedding. Denoted in bold: x.
Matrix
A 2D table of numbers: rows and columns. Examples: a 28×28 grayscale image, a weight matrix in a neural network. Denoted in bold uppercase: X.
A 512-dimensional word embedding is just a vector with 512 numbers. When GPT processes a prompt of 100 tokens, it works with a 100 × 512 matrix: each row is one token's embedding. A dataset of 1,000 images (28×28 pixels) becomes a 1,000 × 784 matrix.
Why does matrix multiplication matter? Because a neural network layer is literally: output = W·x + b. Every forward pass through a model is a sequence of matrix multiplications punctuated by nonlinear activation functions. If you can visualise matrix multiplication, you can visualise what a neural network does.
Matrix multiplication works by taking the dot product of each row of the left matrix with each column of the right matrix. For element C[i,j]: you multiply corresponding entries of row i and column j, then sum them up.
Transpose flips rows and columns: if A is (m × n), then Aᵀ is (n × m). You see it constantly: attention computes QKᵀ, meaning each query vector is dot-producted with each key vector. The transpose makes the shapes line up.
Hadamard (element-wise) product multiplies corresponding entries. Notation: A ⊙ B. It appears in LSTM gating (forget gate × cell state), masked attention, and dropout.
W = torch.randn(4, 3) # Weight matrix: 4 outputs, 3 inputs
x = torch.randn(3, 1) # Input vector: 3 features
b = torch.randn(4, 1) # Bias vector: 4 outputs
y = W @ x + b # @ = matrix multiply in Python
# y.shape = (4, 1) # 4-dimensional output
| Operation | Notation | Where in AI |
|---|---|---|
| Matrix Multiply | C = AB | Every neural network layer, attention (QKᵀ) |
| Transpose | Aᵀ | Attention scores: QKᵀ aligns query/key dims |
| Hadamard | A ⊙ B | LSTM gates, masked attention, dropout |
| Outer Product | uvᵀ | Attention weight × value in some formulations |
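These operations are easy to check interactively. A minimal PyTorch sketch of transpose, Hadamard product, and outer product, using toy values not taken from the text:

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # shape (2, 3)

# Transpose: (2, 3) -> (3, 2)
At = A.T

# Hadamard (element-wise) product: shapes must match (or broadcast)
B = torch.ones(2, 3) * 2
had = A * B                        # every entry of A doubled

# Outer product of two vectors: (3,) and (2,) -> (3, 2) matrix
u = torch.tensor([1., 2., 3.])
v = torch.tensor([10., 20.])
outer = torch.outer(u, v)

print(At.shape)      # torch.Size([3, 2])
print(outer.shape)   # torch.Size([3, 2])
```

Note how the outer product builds a full matrix from two vectors, while Hadamard keeps the shape unchanged.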
The dot product is the single most important operation in modern AI. It measures how aligned two vectors are: if they point in the same direction, the dot product is large and positive. If they're perpendicular, it's zero. If they point in opposite directions, it's large and negative.
Cosine Similarity
Normalise both vectors to unit length, then take the dot product. The result is between −1 (opposite) and +1 (identical direction). This is how embedding search works: "Is 'king' similar to 'queen'?" becomes "Is cos(king, queen) close to 1?"
Attention = Dot Products
In Transformers, attention score = Q·Kᵀ / √d_k. Each query vector is dot-producted with each key vector. A high dot product means "this token should attend to that token." This is the core of GPT, BERT, and every LLM.
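The attention-score computation just described can be sketched in a few lines; the tiny sizes (4 tokens, d_k = 8) are illustrative only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8                   # toy sizes: 4 tokens, key dimension 8
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)

# scores[i, j] = dot product of query i with key j, scaled by sqrt(d_k)
scores = Q @ K.T / d_k ** 0.5
weights = F.softmax(scores, dim=-1)   # each row becomes a distribution

print(scores.shape)             # (4, 4): every query against every key
print(weights.sum(dim=-1))      # each row sums to 1
```

The softmax rows summing to 1 is what makes each token's attention a probability distribution over the other tokens.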
Norms measure the "size" or "length" of a vector. They appear everywhere in regularisation: when you penalise large weights, you're literally penalising the norm of the weight vector.
| Norm | Formula | Intuition | Where in AI |
|---|---|---|---|
| L1 (Manhattan) | Σ|xᵢ| | Sum of absolute values, taxicab distance | L1 regularisation → sparse weights (Lasso) |
| L2 (Euclidean) | √(Σxᵢ²) | Straight-line distance | L2 regularisation (weight decay), gradient clipping |
| Frobenius | √(Σᵢⱼ Aᵢⱼ²) | L2 norm for matrices | Weight matrix regularisation, LoRA rank analysis |
| Max (L∞) | max|xᵢ| | Largest single element | Adversarial robustness, gradient clipping by max |
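The table's norms map directly onto torch.linalg calls; a quick check with hand-computable values:

```python
import torch

x = torch.tensor([3., -4.])

l1 = x.abs().sum()                      # L1: |3| + |-4| = 7
l2 = torch.linalg.vector_norm(x)        # L2: sqrt(9 + 16) = 5
linf = x.abs().max()                    # L-infinity: largest magnitude = 4

A = torch.tensor([[1., 2.], [3., 4.]])
fro = torch.linalg.matrix_norm(A)       # Frobenius: sqrt(1+4+9+16) = sqrt(30)

print(l1.item(), l2.item(), linf.item())   # 7.0 5.0 4.0
```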
In practice, L2 regularisation is usually applied through the weight_decay parameter in the optimiser.
Intuition first: Most vectors change direction when you multiply them by a matrix. But certain special vectors only get scaled β they keep pointing the same way. These are eigenvectors, and the scale factor is the eigenvalue. Think of them as the "natural axes" of a transformation.
Eigendecomposition
- A = QΛQ⁻¹ where Q = matrix of eigenvectors, Λ = diagonal matrix of eigenvalues
- Only works for square matrices with n linearly independent eigenvectors
- Symmetric matrices always have real eigenvalues and orthogonal eigenvectors
PCA Connection
- PCA = find eigenvectors of the covariance matrix
- Eigenvector with largest eigenvalue = direction of most variance
- Keep top-k eigenvectors → reduce dimensions from n to k
- Used in data preprocessing, visualisation, noise reduction
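The PCA recipe above can be sketched directly, using hypothetical 2-D data stretched along one axis so the top eigenvector is obvious:

```python
import torch

torch.manual_seed(0)
# Toy data: 200 points, stretched ~6x more along the first axis
X = torch.randn(200, 2) @ torch.tensor([[3., 0.], [0., 0.5]])

Xc = X - X.mean(dim=0)                      # centre the data
cov = Xc.T @ Xc / (Xc.shape[0] - 1)         # 2x2 covariance matrix

eigvals, eigvecs = torch.linalg.eigh(cov)   # eigenvalues in ascending order
top = eigvecs[:, -1]                        # eigenvector of largest eigenvalue

# Project onto the top principal component: 2D -> 1D
Z = Xc @ top
print(eigvals)   # largest eigenvalue ~ variance along the stretched axis
```

torch.linalg.eigh is the right call here because a covariance matrix is always symmetric.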
SVD is the generalised eigendecomposition that works for any matrix, not just square ones. It decomposes any m×n matrix into three factors: rotation × scale × rotation.
Low-rank approximation: Keep only the top-k singular values (set the rest to zero). This gives the best rank-k approximation of the original matrix, provably optimal. This is how you compress a matrix while losing the least information.
Dimensionality Reduction
Truncated SVD reduces features while preserving the most important structure. LSA (Latent Semantic Analysis) applies SVD to term-document matrices for text.
Recommender Systems
The Netflix Prize was won by SVD-based methods. Factorise the user-movie rating matrix into user preferences × movie features.
Connection to PCA
PCA of centered data = SVD of the data matrix. The right singular vectors (V) are the principal components. Same result, different computation.
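A short sketch of truncated SVD and the low-rank approximation claim; the matrix sizes are arbitrary toy values. The Frobenius error of the rank-k truncation equals the root-sum-square of the dropped singular values (Eckart–Young):

```python
import torch

torch.manual_seed(0)
A = torch.randn(6, 4)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)

# Rank-2 approximation: keep the top-2 singular values, drop the rest
k = 2
A_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Approximation error = sqrt of the sum of squared dropped singular values
err = torch.linalg.matrix_norm(A - A_k)
print(S)     # singular values, descending
print(err)   # matches (S[2]^2 + S[3]^2).sqrt()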
A tensor is the N-dimensional generalisation of vectors (1D) and matrices (2D). In deep learning, almost all data lives in tensors. Understanding tensor shapes is the single most practical skill for debugging neural networks.
Common Tensor Shapes
- Images: (B, C, H, W): batch, channels, height, width
- Text: (B, T, D): batch, sequence length, embedding dim
- Attention: (B, heads, T, T): batch, heads, query, key
- Video: (B, T, C, H, W): a 5D tensor!
Broadcasting
- PyTorch can operate on tensors of different shapes
- Rules: dimensions compared right to left
- Size 1 is "broadcast" (stretched) to match the other
- Adding bias (4,1) to output (4,32) works via broadcasting
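The broadcasting rules above can be verified directly; the shapes are the ones from the bullets plus one extra illustrative case:

```python
import torch

out = torch.randn(4, 32)    # 4 output features x 32 batch columns
bias = torch.randn(4, 1)    # one bias per output feature

y = out + bias              # size-1 dim stretches: (4, 1) -> (4, 32)

# Same right-to-left rule with a middle dimension of size 1
a = torch.randn(32, 1, 512)
b = torch.randn(128, 512)
c = a + b                   # -> (32, 128, 512)

print(y.shape)              # torch.Size([4, 32])
print(c.shape)              # torch.Size([32, 128, 512])
```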
# Scalar (0D tensor)
x = torch.tensor(3.14) # shape: ()
# Vector (1D tensor): one word embedding
emb = torch.randn(512) # shape: (512,)
# Matrix (2D tensor): batch of embeddings
batch_emb = torch.randn(32, 512) # shape: (32, 512) = batch × dim
# 3D tensor: sequence of embeddings
seq = torch.randn(32, 128, 512) # shape: (batch, seq_len, d_model)
# 4D tensor: image batch (B, C, H, W)
images = torch.randn(16, 3, 224, 224) # 16 RGB images, 224×224
Every tensor carries three key attributes: .shape (dimensions), .dtype (float32, int64, etc.), and .device (cpu or cuda). When debugging, the first thing to check is always tensor.shape. Most bugs are shape mismatches.
- Neural network layers = matrix multiplication: output = Wx + b. This is the single most important equation in deep learning.
- Dot product measures alignment between vectors; it powers attention (Q·Kᵀ) and cosine similarity in embedding search.
- L1/L2 norms of weight matrices are what regularisation literally penalises: smaller norms → better generalisation.
- Eigenvectors = natural axes of a transformation; PCA finds them for data, the Hessian's eigenvalues classify critical points.
- SVD decomposes any matrix into UΣVᵀ, enabling low-rank approximation, compression, and recommendation systems.
- Tensors generalise matrices to N dimensions: shape (B, T, D) is the universal LLM data format; (B, C, H, W) for images.
Backpropagation, the algorithm that trains every neural network, is four lines of calculus applied recursively. Understand the chain rule deeply and everything else follows. Every time you call loss.backward() in PyTorch, the machine is doing exactly the calculus described in this chapter.
A derivative answers one question: "If I nudge this input by a tiny amount, how much does the output change?" In machine learning, that becomes: "If I nudge this weight by ε, how much does the loss change?" The answer tells you exactly how to adjust the weight to reduce the error.
Derivative (Single Variable)
- Rate of change of f(x) at a specific point
- Geometrically: the slope of the tangent line
- f'(x) = lim(ε→0) [f(x+ε) − f(x)] / ε
- If f'(x) > 0: function is increasing at x
- If f'(x) < 0: function is decreasing at x
Gradient (Multi-Variable)
- Vector of ALL partial derivatives
- ∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₙ]
- Points in the direction of steepest ascent
- We subtract the gradient to go downhill
- A model the size of GPT-4 (reportedly ~1.8 trillion parameters) has one gradient entry per parameter at every step
Partial derivative: when a function has many inputs (like millions of weights), the partial derivative ∂L/∂wᵢ holds all other weights fixed and measures how L changes with respect to wᵢ alone. The gradient stacks all of these into one vector.
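A standard sanity check that ties these definitions together: compare an autograd gradient with the finite-difference definition of the derivative. The toy loss L(w) = Σ wᵢ² is an assumption for illustration:

```python
import torch

def f(w):
    return (w ** 2).sum()          # toy "loss": L(w) = sum of squares

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
loss = f(w)
loss.backward()                    # analytic gradient: dL/dw_i = 2*w_i

# Finite-difference check of one partial derivative: nudge w[0] by eps
eps = 1e-4
with torch.no_grad():
    w_plus = w.clone()
    w_plus[0] += eps
    numeric = (f(w_plus) - f(w)) / eps

print(w.grad)          # tensor([ 2., -4.,  6.])
print(numeric.item())  # close to 2.0
```

If the two numbers disagree, either the analytic gradient or the forward computation has a bug; this "gradient check" trick predates autograd.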
A neural network is a chain of composed functions: input → linear → activation → linear → activation → … → loss. The chain rule lets us compute how the loss depends on any weight, no matter how deep the network. It's the mathematical foundation of backpropagation.
Why This Works for Deep Networks
A 100-layer network is 100 composed functions. The chain rule produces a product of 100 local gradients. Backprop computes all of them in one backward pass, at the same cost as one forward pass. This is why deep learning is computationally tractable.
Vanishing & Exploding Gradients
If each local gradient is < 1, the product shrinks exponentially through 100 layers: vanishing gradients. If > 1, it explodes. Solutions: residual connections (ResNet), LayerNorm, gradient clipping, careful initialisation (He/Xavier).
When your function takes a vector in and produces a vector out, the derivative is no longer a single number or even a vector β it's a matrix. The Jacobian matrix captures how every output component changes with respect to every input component.
H Positive Definite
All eigenvalues positive → all directions curve upward → local minimum. The loss increases no matter which direction you move. This is where we want the optimiser to converge.
H Indefinite
Mixed positive and negative eigenvalues → saddle point. Curves up in some directions, down in others. Most critical points in deep learning are saddle points, not local minima.
H ≈ 0 (Flat)
Near-zero eigenvalues → flat region. Loss barely changes. Research suggests flat minima generalise better than sharp ones (SAM optimiser exploits this).
The Taylor series approximates any smooth function locally with a polynomial. The key insight: gradient descent is the first-order Taylor approximation. It assumes the loss surface is locally linear, which is only accurate when the learning rate is small. This is why large learning rates cause training to diverge: the linear approximation breaks down.
| Method | Taylor Order | Cost per Step | Convergence |
|---|---|---|---|
| Gradient Descent | 1st order | O(n), one gradient | Slow but cheap |
| Newton's Method | 2nd order | O(n³), Hessian inverse | Fast but prohibitive for n > 10K |
| L-BFGS | Quasi-2nd order | O(n), approximate Hessian | Good for small models |
| Adam | Adaptive 1st order | O(n), per-param rates | Default for deep learning |
Computing gradients for a model with millions of parameters by hand is impossible. Automatic differentiation (autodiff) solves this: the computer applies the chain rule automatically by recording every operation during the forward pass, then replaying them in reverse to compute all gradients simultaneously.
Forward Mode
- Compute derivative alongside the function value
- One pass gives derivatives w.r.t. one input
- Efficient when: few inputs, many outputs
- Cost: O(n) passes for n input variables
- Rarely used in deep learning
Reverse Mode (Backprop)
- Record operations, then replay in reverse
- One pass gives derivatives w.r.t. ALL inputs
- Efficient when: many inputs, few outputs
- Cost: O(1) backward passes (same cost as forward)
- This is backpropagation
Crucially, loss.backward() is one call, not a loop over parameters.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
# Forward pass: build computational graph
z = w * x ** 2 + 5 # z = 3x² + 5
# dz/dx = 6x = 12, dz/dw = x² = 4
# Backward pass: compute ALL gradients via reverse-mode autodiff
z.backward()
print(x.grad) # tensor(12.): dz/dx = 6x evaluated at x=2
print(w.grad) # tensor(4.): dz/dw = x² evaluated at x=2
In PyTorch, every tensor with requires_grad=True records operations in a computational graph. When you call .backward(), PyTorch walks the graph in reverse, applying the chain rule at each node. The gradients accumulate in each leaf tensor's .grad attribute. This is the "gradient tape": record forward, replay backward.
| PyTorch Concept | What It Does | Calculus Equivalent |
|---|---|---|
| requires_grad=True | Tell PyTorch to track this tensor's operations | Mark as a variable to differentiate w.r.t. |
| .backward() | Trigger reverse-mode autodiff | Apply chain rule from output to all inputs |
| .grad | Access computed gradient | ∂L/∂w evaluated at current values |
| .detach() | Stop gradient tracking | Treat as a constant (no derivative) |
| torch.no_grad() | Disable gradient computation (inference) | Skip all derivative bookkeeping |
- Derivative = rate of change; gradient = vector of all partial derivatives pointing in the direction of steepest ascent on the loss surface; we subtract it to descend.
- Chain rule: d/dx[f(g(x))] = f'(g(x))·g'(x), the mathematical foundation of backpropagation, applied recursively through every layer.
- Jacobian generalises gradient to vector functions; Hessian captures curvature, but is expensive (O(n²)) and rarely computed explicitly for large models.
- Gradient descent is a first-order Taylor approximation, valid only when the learning rate is small enough that the linear approximation holds.
- PyTorch autograd uses reverse-mode autodiff: one backward pass computes ALL gradients simultaneously, making million-parameter training tractable.
Machine learning models are probability machines. A classifier outputs P(cat | image). An LLM outputs P(next token | context). A diffusion model reverses a probabilistic noise process. Every loss function, every prediction, and every training signal is grounded in probability theory. Without it, you're flying blind.
A probability P(A) is a number between 0 and 1 that measures how likely an event A is. The sample space Ω is the set of all possible outcomes; an event is a subset of Ω. Three axioms define the entire theory: P(Ω) = 1, P(A) ≥ 0 for any event, and for disjoint events, P(A ∪ B) = P(A) + P(B).
Joint Probability
P(A ∩ B): the probability that both A and B occur. If independent: P(A ∩ B) = P(A) · P(B). Naive Bayes assumes feature independence to use this simplification.
Conditional Probability
P(A | B) = P(A ∩ B) / P(B): the probability of A given that B occurred. Every autoregressive LLM computes P(next_token | all_previous_tokens) at each step.
Independence
A and B are independent if P(A | B) = P(A). Knowing B tells you nothing about A. In practice, few real features are truly independent, but assuming they are often works (Naive Bayes).
When GPT-4 generates the next token, it computes a probability for every word in its vocabulary β typically 50,000+ values that must sum to 1. This is a categorical distribution over tokens. The softmax function transforms raw model outputs (logits) into this valid probability distribution. Every chapter in this course connects back to this fundamental mechanism.
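The softmax step described above is a one-liner; a minimal demonstration on a toy 4-token "vocabulary":

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])   # raw model scores (toy vocab)
probs = F.softmax(logits, dim=0)

print(probs)            # all entries positive
print(probs.sum())      # 1.0: a valid categorical distribution
print(probs.argmax())   # index 0: highest logit -> highest probability
```

Softmax preserves the ordering of the logits while squashing them into a valid distribution, which is why argmax of the probabilities equals argmax of the logits.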
A random variable X is a function that maps outcomes to numbers. Flip a coin: X = 1 (heads) or X = 0 (tails). A distribution describes the probabilities of all possible values of X. Distributions come in two flavours:
Discrete
- PMF: P(X = x) for each possible value
- All probabilities must sum to 1
- Examples: coin flips, dice rolls, token IDs
- Classification output is discrete
Continuous
- PDF: f(x) such that area under curve = 1
- P(X = exact value) = 0; use P(a ≤ X ≤ b)
- Examples: weight values, pixel intensities, latent codes
- CDF: F(x) = P(X ≤ x), always non-decreasing
| Distribution | Type | Parameters | AI Application |
|---|---|---|---|
| Bernoulli | Discrete | p (success probability) | Binary classification: spam/not-spam, dropout mask |
| Categorical | Discrete | p₁, p₂, …, pₖ | Softmax output: LLM next-token prediction over vocab |
| Gaussian | Continuous | μ (mean), σ² (variance) | Weight init, noise injection, VAE latent space, diffusion |
| Uniform | Continuous | a (min), b (max) | Random init, uniform noise, exploration in RL |
| Poisson | Discrete | λ (rate) | Event count modelling: clicks, failures, arrivals |
Expected value E[X] is the weighted average of all possible outcomes. "If I ran the experiment infinitely many times, what's the average?" For discrete X: E[X] = Σ xᵢ P(xᵢ). For continuous X: E[X] = ∫ x f(x) dx. In ML, the loss function is an expectation: L = E[ℓ(ŷ, y)] averaged over the data distribution.
Correlation
Normalised covariance: ρ = Cov(X,Y) / (σ_X · σ_Y). Always between −1 and +1. A correlation of +1 means a perfect linear relationship. Zero correlation does not imply independence (except for jointly Gaussian variables).
Covariance Matrix
For n features, the covariance matrix Σ is n×n, where Σᵢⱼ = Cov(Xᵢ, Xⱼ). Diagonal = variances, off-diagonal = covariances. PCA = eigendecomposition of Σ (Ch 2.1).
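A quick sketch with synthetic data: build a 2×2 covariance matrix and recover the correlation from it. The near-linear relationship y ≈ 2x is an assumption chosen so the result is easy to predict:

```python
import torch

torch.manual_seed(0)
n = 2000
x = torch.randn(n)
y = 2 * x + 0.1 * torch.randn(n)   # y almost perfectly linear in x

data = torch.stack([x, y])          # shape (variables, observations)
cov = torch.cov(data)               # 2x2 covariance matrix

# Correlation = off-diagonal covariance normalised by both std devs
corr = cov[0, 1] / (cov[0, 0].sqrt() * cov[1, 1].sqrt())
print(cov)    # diagonal = variances, off-diagonal = Cov(x, y)
print(corr)   # close to 1: a near-perfect linear relationship
```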
Bayes' theorem is the most important equation in probabilistic reasoning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, compute a posterior belief. This is the core logic of Bayesian machine learning, spam filters, medical diagnosis, and A/B testing.
The Four Terms
- Prior P(θ): What you believed before data
- Likelihood P(D|θ): How probable the data is under this model
- Evidence P(D): Total probability of the data (normaliser)
- Posterior P(θ|D): Updated belief after seeing data
In ML Terms
- MLE: Find θ maximising P(D|θ); ignores the prior
- MAP: Find θ maximising P(θ|D); includes the prior
- L2 regularisation = Gaussian prior on weights
- L1 regularisation = Laplace prior on weights
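Bayes' theorem is four lines of arithmetic. This sketch uses hypothetical numbers for the classic diagnostic-test example (1% base rate, 99% sensitivity, 5% false-positive rate), not figures from the text:

```python
# Hypothetical numbers for a diagnostic-test belief update
prior = 0.01          # P(disease): 1% base rate
sensitivity = 0.99    # P(positive | disease)
false_pos = 0.05      # P(positive | no disease)

# Evidence: total probability of seeing a positive test
evidence = sensitivity * prior + false_pos * (1 - prior)

# Posterior via Bayes' theorem: P(disease | positive)
posterior = sensitivity * prior / evidence
print(f"P(disease | positive) = {posterior:.3f}")   # ~0.167
```

Despite the 99% sensitivity, the posterior is only about 17%, because the prior (1% base rate) dominates; this is exactly the prior-versus-likelihood tension the four terms describe.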
The Gaussian (normal) distribution is the most important continuous distribution in machine learning. It appears in weight initialisation, noise injection, latent spaces, diffusion models, and the Central Limit Theorem guarantees it emerges from the sum of many independent random variables.
68-95-99.7 Rule
68% of values within ±1σ, 95% within ±2σ, 99.7% within ±3σ. Values beyond 3σ are extreme outliers. This is why gradient clipping often clips at 3–5× the standard deviation.
Weight Initialisation
Xavier: W ~ N(0, 1/n_in). He: W ~ N(0, 2/n_in). Proper init keeps activations and gradients at a consistent scale through layers, preventing vanishing/exploding gradients.
Gaussian Noise
Diffusion models add Gaussian noise progressively, then learn to reverse it. VAEs encode data as μ + σ·ε where ε ~ N(0,1). Dropout is Bernoulli noise; data augmentation often adds Gaussian noise.
Standardisation (z-score normalisation) transforms features to zero mean and unit variance: z = (x − μ) / σ. Almost every ML pipeline starts with this step: it ensures all features contribute equally and speeds up gradient descent convergence.
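The z-score transform in code, applied to a synthetic feature with mean ≈ 50 and std ≈ 10:

```python
import torch

torch.manual_seed(0)
x = 50 + 10 * torch.randn(1000)    # synthetic feature: mean ~50, std ~10

z = (x - x.mean()) / x.std()       # z-score normalisation

print(z.mean())   # ~0
print(z.std())    # ~1
```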
When an LLM computes P(next_token | context), it still needs to choose a token. The simplest approach, always picking the most probable token (greedy decoding), produces repetitive, boring text. Instead, we sample from the distribution, with several control knobs.
Temperature τ
Divide logits by τ before softmax. τ < 1 sharpens (more confident). τ > 1 flattens (more random). τ → 0 = greedy. τ → ∞ = uniform random.
Top-k Sampling
Only consider the k most probable tokens. Redistributes probability among top-k and samples from that subset. k=1 = greedy, k=50 is common.
Top-p (Nucleus)
Sample from the smallest set of tokens whose cumulative probability ≥ p. Adapts k dynamically: fewer tokens when confident, more when uncertain. p=0.9 is standard.
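The three knobs can be combined in one small sampling helper. This is a simplified sketch, not a production decoder; the function name and toy logits are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([4.0, 3.0, 1.0, 0.5, -1.0])   # toy vocab of 5 tokens

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then optional top-k / top-p filtering, then sample."""
    probs = F.softmax(logits / temperature, dim=-1)

    if top_k is not None:                  # keep only the k most probable tokens
        thresh = probs.topk(top_k).values.min()
        probs = torch.where(probs >= thresh, probs, torch.zeros_like(probs))
    if top_p is not None:                  # smallest set with cumulative mass >= p
        sorted_p, idx = probs.sort(descending=True)
        keep = sorted_p.cumsum(0) - sorted_p < top_p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[idx[keep]] = True
        probs = torch.where(mask, probs, torch.zeros_like(probs))

    probs = probs / probs.sum()            # renormalise the surviving mass
    return torch.multinomial(probs, 1).item()

token = sample(logits, temperature=0.7, top_k=3)
print(token)   # always one of the 3 highest-logit tokens: 0, 1, or 2
```

With a very low temperature the helper reduces to greedy decoding, because softmax pushes essentially all mass onto the top logit.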
Law of Large Numbers (LLN): As sample size grows, the sample mean converges to the true mean. This is why test set accuracy is a reliable estimate of true model performance, given enough test samples. It's also why the mini-batch gradient is an unbiased estimator of the full-batch gradient: E[∇L_batch] = ∇L_full.
Central Limit Theorem (CLT): The sum (or average) of many independent random variables approaches a Gaussian distribution, regardless of the original distribution. This is why Gaussians appear everywhere: any quantity that arises from the aggregate of many small, independent effects will be approximately normal. Practical implication: larger mini-batches produce lower-variance gradient estimates (variance decreases as 1/n), giving smoother training.
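Both theorems are easy to see numerically. Averaging uniform draws (nothing Gaussian about them) centres on the true mean, and the variance of the average shrinks as 1/n; the sample counts here are arbitrary toy choices:

```python
import torch

torch.manual_seed(0)
# Uniform(0, 1): mean 0.5, variance 1/12 -- decidedly not Gaussian
samples = torch.rand(10_000, 64)

means_small = samples[:, :4].mean(dim=1)   # averages of 4 draws ("batch" of 4)
means_large = samples.mean(dim=1)          # averages of 64 draws

# LLN: both centre on 0.5. CLT + 1/n: larger batches -> smaller spread.
print(means_large.mean())                    # ~0.5
print(means_small.var(), means_large.var())  # ratio ~16 (= 64/4)
```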
- LLM token generation = sampling from a categorical distribution over 50,000+ vocabulary items; softmax converts logits to valid probabilities.
- Distributions: Bernoulli for binary, Categorical/Softmax for multi-class, Gaussian for continuous. Know which one your model outputs.
- Bayes' theorem: posterior ∝ likelihood × prior. Bayesian learning in one equation. L2 regularisation = Gaussian prior on weights.
- Covariance matrix captures feature relationships, the foundation of PCA. Diagonal = variances, off-diagonal = covariances.
- Temperature controls softmax sharpness: low τ = greedy/predictable, high τ = diverse/creative. Top-p adapts dynamically.
- CLT: why mini-batch gradient descent works. Sample averages converge to true expectations; larger batches = lower-variance estimates.
Every time you train a neural network, you are doing maximum likelihood estimation. Your choice of loss function is not arbitrary; it follows directly from your assumption about how outputs are distributed. Understanding MLE, MAP, and full Bayesian inference reveals why loss functions are what they are, and why regularisation works.
Core idea: Given observed data D = {x₁, x₂, …, xₙ}, find the parameters θ that make the data most probable. The likelihood function L(θ) = P(D | θ) measures how well θ explains the data. MLE finds θ* = argmax L(θ).
In practice we maximise the log-likelihood instead: log L(θ) = Σ log P(xᵢ | θ). This is equivalent (log is monotonic) but easier: products become sums, which are numerically stable and analytically simpler. Minimising the negative log-likelihood = maximising the log-likelihood. This is your loss function.
1. Assume y = f(x;θ) + ε with Gaussian noise ε ~ N(0, σ²)
2. P(y | x, θ) = (1/√(2πσ²)) exp(−(y − f(x;θ))² / (2σ²))
3. log P = −(y − f(x;θ))² / (2σ²) − const
4. Maximise log P ⟺ minimise (y − ŷ)²
∴ MSE loss = MLE under Gaussian assumption
1. Assume y ~ Bernoulli(p) with p = model output (sigmoid)
2. P(y | x, θ) = p^y · (1−p)^(1−y)
3. log P = y·log(p) + (1−y)·log(1−p)
4. Maximise log P ⟺ minimise −[y log p + (1−y) log(1−p)]
∴ BCE loss = MLE under Bernoulli assumption
| Output Assumption | MLE Loss Function | When Used |
|---|---|---|
| Gaussian (continuous) | MSE / L2 loss | Regression: predict price, temperature, age |
| Bernoulli (binary) | Binary cross-entropy | Binary classification: spam/not-spam |
| Categorical (multi-class) | Categorical cross-entropy | Multi-class: ImageNet, LLM next-token |
| Laplace (heavy tails) | L1 / MAE loss | Robust regression, outlier-resistant |
MLE can overfit: it finds the parameters that best explain the training data, with no constraint on how "reasonable" those parameters are. MAP adds a prior belief: what kind of parameter values do you expect? The prior acts as a regulariser.
Regularisation is not a trick; it's a Bayesian prior on what you believe about your weights. When you add a weight decay term to the loss, you're implicitly saying "I believe weights should be small." The specific penalty shape reveals the prior distribution:
1. Assume a Gaussian prior on each weight: P(w) ∝ exp(−λw²)
2. log P(w) = −λw² + const
3. MAP loss = L_data + λ Σ wᵢ²
4. This is L2 / Ridge / weight decay
Effect: Shrinks all weights toward zero, keeps them small but non-zero. Smooth, differentiable everywhere.
1. Assume a Laplace prior on each weight: P(w) ∝ exp(−λ|w|)
2. log P(w) = −λ|w| + const
3. MAP loss = L_data + λ Σ |wᵢ|
4. This is L1 / Lasso
Effect: Drives some weights to exactly zero, performing feature selection. Produces sparse models.
L2 in Practice
- PyTorch: weight_decay=0.01 in AdamW
- Every LLM uses weight decay (typically 0.01–0.1)
- Keeps weights small → better generalisation
- AdamW decouples weight decay from Adam updates
L1 in Practice
- Produces sparse weights: many exactly zero
- Acts as automatic feature selection
- Elastic Net combines L1 + L2
- Less common in deep learning, common in classical ML
MLE and MAP both produce point estimates: a single "best" θ. Full Bayesian inference goes further: it computes the entire posterior distribution P(θ | D). Instead of saying "the best weight is 0.42", Bayesian inference says "the weight is probably between 0.35 and 0.49, with 95% probability." This is uncertainty quantification.
Conjugate Priors
Special prior-likelihood pairs where the posterior has the same family as the prior. Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Categorical. These give closed-form posteriors: no sampling needed.
MCMC Sampling
When the posterior has no closed form, draw samples from it using Markov Chain Monte Carlo. Metropolis-Hastings, Hamiltonian MC (HMC). Gold standard for accuracy, but slow for large models.
Variational Inference
Approximate the true posterior with a simpler distribution (e.g., Gaussian). Optimise the parameters of the approximation to minimise KL divergence. Fast and scalable; used in VAEs.
| Method | Output | Cost | When to Use |
|---|---|---|---|
| MLE | Point estimate θ* | O(training) | Large datasets, standard deep learning |
| MAP | Point estimate θ* | O(training) | When regularisation helps (almost always) |
| MCMC | Samples from P(θ|D) | Very expensive | Small datasets, need calibrated uncertainty |
| Variational Inference | Approximate P(θ|D) | Moderate | VAEs, Bayesian neural nets, latent variable models |
Two philosophical camps interpret probability differently. Frequentists see probability as long-run frequency of events; parameters are fixed but unknown. Bayesians treat parameters as random variables with distributions that represent degrees of belief. Most deep learning is frequentist MLE in practice, but Bayesian ideas (priors, posterior, uncertainty) are increasingly important.
Frequentist
Probability: Long-run frequency of events
Estimation: MLE (maximise P(data | θ))
Uncertainty: Confidence intervals (frequentist coverage)
Hypothesis testing: p-values, null hypothesis
Regularisation: Ad-hoc penalty term
Deep learning: Standard SGD + weight decay
Bayesian
Probability: Degree of belief / uncertainty
Estimation: Compute posterior P(θ | data)
Uncertainty: Credible intervals (direct probability)
Hypothesis testing: Model comparison via Bayes factors
Regularisation: Natural consequence of the prior
Deep learning: BNNs, MC Dropout, ensembles
Probabilistic Graphical Models (PGMs) represent complex joint distributions as graphs. Nodes are random variables, edges encode conditional dependencies. PGMs make the structure of a probabilistic model explicit β you can literally see which variables depend on which.
Bayesian Networks (Directed)
- Directed acyclic graph (DAG)
- Arrows encode causal/conditional relationships
- P(A,B,C) = P(A) · P(B|A) · P(C|A,B)
- D-separation determines conditional independence
- Examples: medical diagnosis, spam filters, causal reasoning
Markov Random Fields (Undirected)
- Undirected edges = symmetric correlations
- No causal direction implied
- Clique potentials define the joint
- Markov blanket: a node is independent of all others given its neighbours
- Examples: image segmentation, physics simulations
EM solves MLE when there are latent (hidden) variables that prevent direct optimisation. The classic example: fitting a Gaussian Mixture Model (GMM) when you don't know which Gaussian generated each data point. EM alternates between two steps until convergence:
E-step: Given current parameters, compute the expected value of the latent variables (soft assignments: "this point is 80% from Gaussian 1, 20% from Gaussian 2"). M-step: Given the soft assignments, update parameters (means, variances, mixing weights) to maximise the expected log-likelihood. Repeat until convergence. Guaranteed to increase likelihood at each step (but may converge to a local optimum).
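The two steps above can be sketched as a compact 1-D, two-component GMM fit; the data, initial guesses, and iteration count are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
# Synthetic data from two Gaussians: N(-2, 0.5^2) and N(3, 1^2)
x = torch.cat([-2 + 0.5 * torch.randn(300), 3 + torch.randn(300)])

mu = torch.tensor([-1.0, 1.0])     # initial mean guesses
sigma = torch.tensor([1.0, 1.0])
pi = torch.tensor([0.5, 0.5])      # mixing weights

def gauss(x, mu, sigma):
    # Gaussian PDF, broadcasting over components
    return torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * (2 * torch.pi) ** 0.5)

for _ in range(50):
    # E-step: soft assignment of every point to each component
    resp = pi * gauss(x[:, None], mu, sigma)      # shape (600, 2)
    resp = resp / resp.sum(dim=1, keepdim=True)

    # M-step: re-estimate parameters from the soft assignments
    nk = resp.sum(dim=0)
    mu = (resp * x[:, None]).sum(dim=0) / nk
    sigma = ((resp * (x[:, None] - mu) ** 2).sum(dim=0) / nk).sqrt()
    pi = nk / x.numel()

print(mu)   # close to [-2, 3] (component order depends on initialisation)
```

Each loop iteration is one full E-step/M-step cycle; with well-separated clusters like these, convergence takes only a handful of iterations.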
| Algorithm | Latent Variable | Application |
|---|---|---|
| GMM + EM | Cluster assignment | Clustering, density estimation, speaker diarisation |
| HMM + Baum-Welch | Hidden states | Speech recognition (pre-deep-learning), genomics |
| LDA + EM/VI | Topic assignment | Topic modelling in text |
| VAE training | Latent code z | Image generation, representation learning |
- Neural network training IS MLE: your choice of loss function encodes your distributional assumption.
- MSE loss = Gaussian output assumption; cross-entropy = categorical/Bernoulli assumption. The math decides the loss, not convention.
- L2 regularisation = Gaussian prior on weights (MAP); L1 = Laplace prior. Regularisation is not a trick, it's a Bayesian statement about what you believe.
- Full Bayesian inference gives posterior distributions, enabling uncertainty quantification for safety-critical and small-data settings.
- PGMs encode conditional independence; HMMs were the sequence model before RNNs; autoregressive LLMs are implicit directed graphical models.
Every loss function in deep learning has its roots in information theory. Cross-entropy loss, the training objective for every LLM, is literally the cross-entropy between the true data distribution and the model's predicted distribution. Understanding entropy, cross-entropy, and KL divergence reveals why we train the way we do.
Entropy measures uncertainty, surprise, or unpredictability. Claude Shannon defined it in 1948: given a random variable X with possible outcomes, how much "information" does each outcome carry? The less likely an event, the more surprising (informative) it is when it occurs.
Information Content
- I(x) = −log₂ P(x) bits
- P = 0.5 (fair coin): I = 1 bit
- P = 0.99 (almost certain): I = 0.015 bits
- P = 0.001 (rare event): I = 9.97 bits
- Rare events carry more information
Entropy = Expected Information
- H(X) = −Σ P(x) · log P(x)
- Average surprise across all outcomes
- Uniform distribution → maximum entropy
- Peaked distribution → low entropy
- Measured in bits (log₂) or nats (ln)
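Entropy in a few lines; the three example distributions are toy values chosen to show the extremes:

```python
import torch

def entropy(p):
    # H(X) = -sum p * log2(p), in bits
    return -(p * p.log2()).sum()

coin = torch.tensor([0.5, 0.5])
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])
peaked = torch.tensor([0.97, 0.01, 0.01, 0.01])

print(entropy(coin))      # 1 bit: one fair coin flip
print(entropy(uniform))   # 2 bits: the maximum for 4 outcomes
print(entropy(peaked))    # ~0.24 bits: nearly certain -> little surprise
```

The uniform case hits the maximum (log₂ of the number of outcomes), matching the bullet above.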
Cross-entropy H(P, Q) measures how many bits distribution Q needs to encode samples from distribution P. If Q is a perfect model of P (Q = P), then H(P, Q) = H(P), the minimum. If Q is wrong, it wastes extra bits. Training a model minimises cross-entropy between the true labels and model predictions.
Categorical Cross-Entropy
- Multi-class: one-hot labels, softmax outputs
- L = −Σ yᵢ log(pᵢ) = −log(p_correct)
- Used for: ImageNet, LLM next-token, NER
- PyTorch: F.cross_entropy(logits, labels)
Binary Cross-Entropy
- Binary: one sigmoid output
- L = −[y log(p) + (1−y) log(1−p)]
- Used for: spam detection, multi-label tasks
- PyTorch: F.binary_cross_entropy_with_logits
import torch
import torch.nn.functional as F
logits = torch.tensor([1.2, 0.5, 3.1, 0.8]) # raw model outputs
y_true = torch.tensor(2) # correct class index = 2
loss = F.cross_entropy(logits.unsqueeze(0), y_true.unsqueeze(0))
# Internally: softmax(logits) → −log(p_class_2)
p_class2 = F.softmax(logits, dim=0)[2]
print(f"P(correct class) = {p_class2:.3f}") # higher = lower loss
print(f"Cross-entropy loss = {loss:.3f}")
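For comparison, the binary case can be sketched in plain Python, mirroring the formula above; up to numerical stabilisation, this is what `F.binary_cross_entropy_with_logits` computes:

```python
import math

def bce_with_logits(logit: float, y: float) -> float:
    """Binary cross-entropy from a raw logit:
    L = -[y*log(p) + (1-y)*log(1-p)], where p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce_with_logits(4.0, 1.0), 4))  # confident and correct: tiny loss
print(round(bce_with_logits(4.0, 0.0), 4))  # confident and wrong: huge loss
```

The asymmetry in the printout is the "confidently wrong is expensive" property of cross-entropy.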
Kullback-Leibler divergence measures the "extra cost" of using distribution Q when the true distribution is P. It's the gap between cross-entropy and true entropy – the inefficiency penalty for using the wrong model.
Key Properties
- D_KL ≥ 0 always (Gibbs' inequality)
- D_KL = 0 iff P = Q
- Not a true metric: asymmetric, no triangle inequality
- Forward KL (D_KL(P‖Q)): mode-covering → Q tries to cover all of P
- Reverse KL (D_KL(Q‖P)): mode-seeking → Q locks onto one mode of P
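A quick numerical check of these properties, plus the identity H(P, Q) = H(P) + D_KL(P‖Q), using two hand-picked toy distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum p * log2(p / q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]   # "true" distribution
Q = [0.5, 0.3, 0.2]   # model's distribution
print(kl(P, Q) >= 0)                                               # Gibbs' inequality
print(abs(cross_entropy(P, Q) - (entropy(P) + kl(P, Q))) < 1e-12)  # H(P,Q) = H(P) + KL
print(kl(P, Q) != kl(Q, P))                                        # asymmetric
```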
Where KL Appears in AI
- VAE loss: reconstruction + KL(posterior ‖ prior)
- RLHF/PPO: KL penalty between policy and reference
- Knowledge distillation: KL(teacher ‖ student)
- Diffusion models: KL between forward/reverse
- Training objective: minimising CE = minimising KL
Mutual information I(X;Y) measures how much knowing Y reduces your uncertainty about X. Unlike correlation, MI captures any statistical relationship – not just linear ones.
Feature Selection
Select features with highest MI to the target label. Unlike correlation, MI detects non-linear relationships. Used in preprocessing for classical ML.
Contrastive Learning
SimCLR, CLIP, and BYOL maximise MI between representations of augmented views of the same data. InfoNCE loss is a lower bound on MI.
Information Bottleneck
Compress input X into representation Z while preserving MI with target Y. The theoretical framework for why deep networks learn good representations.
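MI can be computed directly from a discrete joint distribution. A minimal sketch using a toy case (Y = X² with X uniform on {−1, 0, 1}) where the correlation is exactly zero but MI is large, since Y is fully determined by X:

```python
import math

def mutual_info(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(joint)) for j in range(len(joint[0]))
        if joint[i][j] > 0
    )

# Rows = X in {-1, 0, 1}; columns = Y in {0, 1}; Y = X**2.
joint = [[0, 1/3],
         [1/3, 0],
         [0, 1/3]]
print(round(mutual_info(joint), 3))  # high MI despite zero correlation
```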
Shannon's source coding theorem (1948): the optimal average code length for messages from a source with entropy H is at least H bits. You cannot compress below the entropy limit. This means a better probability model = better compression. Language models are, in a deep sense, compressors – a model with low perplexity assigns high probability to real text, which means it can encode that text in fewer bits.
Minimum Description Length (MDL) formalises model selection: the best model is the one that gives the shortest total description (model complexity + data encoded by model). This connects directly to the bias-variance tradeoff: a too-complex model has a long description; a too-simple model encodes data poorly. Bits-back coding shows VAEs are theoretically equivalent to lossless compression schemes – the ELBO loss is the expected code length.
- Entropy H(X) = expected surprise: high entropy = uniform = uncertain; low entropy = peaked = predictable.
- Cross-entropy loss = −log P(correct class) – the model is penalised for assigning low probability to the truth. Being confidently wrong is catastrophically expensive.
- H(P, Q) = H(P) + D_KL(P‖Q): minimising cross-entropy = minimising KL divergence from model to data distribution.
- KL divergence appears in VAE loss, RLHF PPO penalty, and knowledge distillation – it's the universal measure of distributional mismatch.
- LLM perplexity = e^(cross-entropy) – directly measures how well the model predicts text. Lower perplexity = better language model.
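The perplexity relationship in one line of Python (the per-token loss value here is a made-up illustration):

```python
import math

# Perplexity = exp(mean cross-entropy in nats). A model with perplexity ~20
# is, on average, "as uncertain as a uniform choice over ~20 tokens".
avg_ce_nats = 3.0                 # hypothetical mean per-token training loss
perplexity = math.exp(avg_ce_nats)
print(round(perplexity, 1))       # lower loss -> lower perplexity -> better LM
```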
Given a loss function L(θ), find the parameters θ that minimise it. In deep learning, loss landscapes are non-convex with millions of dimensions – we need smart optimisers, adaptive learning rates, and carefully tuned schedules. The learning rate is the single most important hyperparameter. Get it wrong and nothing else matters.
The loss landscape is a high-dimensional surface – we want the lowest point. The gradient ∇L tells us which direction is "uphill" at our current position. Gradient descent takes a step in the opposite direction: downhill. Repeat until the gradient is near zero (we've reached a minimum or saddle point).
α Just Right
Smooth convergence toward minimum. Steps shrink naturally as gradient flattens near the bottom. Training loss decreases steadily.
α Too Large
Overshoot past the minimum. Loss oscillates or explodes. Training diverges – you see NaN in your loss. The #1 training failure mode.
α Too Small
Tiny steps, extremely slow progress. May get stuck in a saddle point or flat region. Training takes forever (or budget runs out first).
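All three regimes can be reproduced on the simplest possible loss, L(θ) = θ², whose gradient is 2θ (a toy sketch; real losses are high-dimensional and non-convex):

```python
def gd(lr, steps=50, theta=1.0):
    """Gradient descent on L(theta) = theta**2 (gradient = 2*theta)."""
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

print(abs(gd(lr=0.1)) < 1e-4)   # just right: converges to ~0
print(abs(gd(lr=1.5)) > 1e6)    # too large: |theta| doubles each step -> diverges
print(abs(gd(lr=0.001)) > 0.5)  # too small: barely moved after 50 steps
```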
Computing the gradient over the entire dataset each step is accurate but slow. The solution: estimate the gradient from a random subset (mini-batch). The noise from subsampling is not a bug – it's a feature.
| Variant | Batch Size | Gradient Quality | Speed | When Used |
|---|---|---|---|---|
| Full-Batch GD | N (all data) | Exact gradient | Very slow per step | Small datasets, convex problems |
| Stochastic GD | 1 sample | Very noisy estimate | Fast per step | Online learning, streaming data |
| Mini-Batch GD | 32–512 | Good estimate | Best trade-off | Industry standard for all DL |
Why Noise Helps
- Mini-batch noise = implicit regularisation
- Helps escape saddle points and sharp minima
- Smaller batches → flatter minima → better generalisation
- Large-batch training needs special tricks (LARS, LAMB)
Gradient Accumulation
- GPU can only fit batch size B in memory
- Accumulate gradients over K steps before updating
- Effective batch size = B × K
- Simulates large batch without extra memory
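Why this works: for equally sized micro-batches, the mean of the micro-batch gradients equals the full-batch gradient. A framework-free sketch with a toy linear model and hypothetical data:

```python
# (x, y) pairs with y = 2x; loss = mean of 0.5*(w*x - y)^2 over a batch
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def grad(w, batch):
    """d/dw of the mean squared error over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
full = grad(w, data)                      # one pass over all 4 samples
micro = [data[:2], data[2:]]              # two micro-batches of size B = 2
accum = sum(grad(w, mb) for mb in micro) / len(micro)  # effective batch = B*K
print(abs(full - accum) < 1e-12)          # identical update direction
```

In a real framework, calling `.backward()` K times before one `optimizer.step()` does the same averaging implicitly (gradients accumulate in-place).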
Vanilla SGD has two problems: it oscillates in steep dimensions and moves slowly in flat ones. Momentum fixes the first. Adaptive learning rates fix the second. Adam combines both – and it's the default for almost all deep learning.
| Optimiser | Key Idea | Pros | When to Use |
|---|---|---|---|
| SGD | Basic gradient step | Simple, well-understood | Rarely used alone |
| SGD + Momentum | Accumulate velocity β·v + ∇L | Dampens oscillations, faster | CNNs, ResNets, fine-tuning |
| RMSProp | Per-parameter adaptive LR via √(avg g²) | Handles sparse gradients | RNNs, non-stationary objectives |
| Adam | Momentum + RMSProp combined | Fast, robust, low tuning | Default for most DL tasks |
| AdamW | Adam + decoupled weight decay | Better generalisation than Adam | All LLM/Transformer training |
| Adafactor | Adam with factored second moments | Memory efficient (no per-param state) | Very large models (T5, PaLM) |
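The Adam update itself is only a few lines. A minimal sketch on a toy 1-D problem, with the standard default hyperparameters, showing momentum and adaptive scaling combined:

```python
import math

def adam_step(theta, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp-style scaling (v),
    with bias correction for the zero-initialised moment estimates."""
    m = b1 * m + (1 - b1) * g          # first moment (velocity)
    v = b2 * v + (1 - b2) * g * g      # second moment (squared gradients)
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    g = 2 * theta                      # gradient of L = theta**2
    theta, m, v = adam_step(theta, g, m, v, t)
print(abs(theta) < 1.0)                # has moved close to the minimum at 0
```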
A constant learning rate is rarely optimal. The standard practice: warm up from a very small LR, then gradually decay it. Warmup prevents early training instability (large initial gradients). Decay allows fine-grained convergence near the end.
| Schedule | Shape | When Used |
|---|---|---|
| Constant | Flat line | Baselines, very short training |
| Step Decay | Staircase drops every N epochs | ResNet training (divide by 10 at epoch 30, 60) |
| Cosine Annealing | Smooth cosine curve from α_max to α_min | LLM pre-training, standard modern choice |
| Warmup + Cosine | Linear rise then cosine decay | All Transformer training from scratch |
| OneCycleLR | Rise to peak then decay to near-zero | Fast training with super-convergence |
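Warmup + cosine decay is simple enough to write by hand (the step counts and learning rates here are illustrative, not a recipe):

```python
import math

def lr_at(step, total_steps=1000, warmup=100, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total_steps - warmup)  # 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # 0.0: start tiny to avoid early instability
print(lr_at(100))   # 3e-4: peak at the end of warmup
print(lr_at(1000))  # 3e-5: decayed to the floor
```

PyTorch ships equivalents (`torch.optim.lr_scheduler`), but the formula is worth knowing.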
A convex function has a single global minimum – gradient descent is guaranteed to find it. Linear regression (MSE) and logistic regression have convex losses. But deep neural networks are non-convex – the loss surface is filled with local minima, saddle points, and flat plateaus.
The Good News
- Deep networks rarely get stuck in bad local minima
- Saddle points are far more common than local minima in high dimensions
- Momentum and Adam help escape saddle points
- Many local minima have similar loss values (the landscape is "benign")
Flat Minima Hypothesis
- Flat (wide) minima generalise better than sharp (narrow) ones
- Small perturbations in weights don't hurt performance
- SAM optimiser: explicitly seeks flat minima
- Small batch SGD implicitly finds flatter minima
Lagrange multipliers solve the problem of optimising f(x) subject to constraints g(x) = 0. The key insight: at the optimal point, the gradient of f must be parallel to the gradient of g. Introduce multiplier λ and solve ∇f = λ∇g. The KKT conditions extend this to inequality constraints (g(x) ≤ 0).
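A worked toy example: maximise f(x, y) = xy subject to g(x, y) = x + y − 1 = 0. Solving ∇f = λ∇g gives (y, x) = λ(1, 1), so x = y = λ = 1/2. Checking the conditions numerically:

```python
# At the candidate optimum of f(x, y) = x*y under x + y = 1:
x = y = 0.5
lam = 0.5
grad_f = (y, x)        # (df/dx, df/dy) = (y, x)
grad_g = (1.0, 1.0)    # (dg/dx, dg/dy)
print(grad_f == (lam * grad_g[0], lam * grad_g[1]))  # gradients parallel
print(x + y - 1 == 0)                                # constraint satisfied
```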
Where constrained optimisation appears in AI: Support Vector Machines maximise the classification margin subject to correct classification constraints (a quadratic program). RLHF constrains the new policy to stay close to the reference policy (KL constraint). Safety-constrained RL limits reward optimisation to stay within safety bounds. Optimal transport problems are constrained linear programs.
Newton's method uses the Hessian (second derivative matrix) to account for curvature. Instead of following the gradient blindly, it finds the optimal step by solving ∇²L · δ = −∇L. In theory, it converges in far fewer steps. In practice, the Hessian is n×n for n parameters – for GPT-3 with 175B parameters, that's 175B × 175B entries. Completely infeasible.
Quasi-Newton methods (L-BFGS) approximate the Hessian from gradient history, avoiding the O(n²) cost. They work well for small models (up to ~millions of parameters) and are sometimes used in fine-tuning. The natural gradient uses the Fisher information matrix as a metric – accounting for the geometry of probability distributions rather than Euclidean parameter space. K-FAC and Shampoo are practical approximations used in some large-scale training runs.
- Gradient descent: step in the −∇L direction; learning rate controls step size – the most critical hyperparameter to tune.
- Mini-batch GD (B=32–512) is the industry standard: balances gradient accuracy with speed, and noise helps generalisation.
- Adam = momentum + adaptive learning rates – the default choice for most deep learning tasks.
- AdamW (Adam + decoupled weight decay) is preferred for all LLM and Transformer training.
- Warmup + cosine annealing schedule is standard for training large models from scratch – prevents early instability.
- Deep networks are non-convex, but saddle points (not local minima) are the main obstacle – momentum and adaptive methods help escape them.
Graphs encode relational structure – who connects to whom, what depends on what, which atoms bond to which. Graph neural networks extend deep learning to molecules, social networks, and knowledge bases. Self-attention in Transformers is a GNN on a fully-connected graph. This chapter is a concise reference for the graph concepts that matter in modern AI.
A graph G = (V, E) consists of vertices (nodes) V and edges (links) E. Edges can be directed (A → B) or undirected (A ↔ B), weighted or unweighted. Graphs are represented in code as adjacency matrices (dense, good for small graphs) or adjacency lists (sparse, good for large graphs).
Social Networks
Undirected: users are nodes, friendships are edges. Community detection, influence propagation, friend recommendation.
Molecular Graphs
Atoms are nodes, bonds are edges. Drug discovery, protein structure prediction (AlphaFold), material science.
Computation Graphs
Operations are nodes, data flow is edges. DAGs (no cycles). PyTorch autograd builds one every forward pass.
| Algorithm | Strategy | Complexity | AI Application |
|---|---|---|---|
| BFS | Explore level by level | O(V + E) | Shortest path (unweighted), social network distance |
| DFS | Explore deep first, backtrack | O(V + E) | Cycle detection, topological sort |
| Dijkstra | Greedy shortest path (weighted) | O((V+E) log V) | Robotics path planning, network routing |
| A* | Dijkstra + heuristic | O(b^d) worst case | Game AI, planning agents, maze solving |
| Topological Sort | Linearise a DAG | O(V + E) | Execution order in computation graphs |
| Graph Colouring | Assign colours, no adjacent match | NP-hard (general) | Scheduling, register allocation |
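As a concrete instance of the BFS row, here is unweighted shortest-path distance ("degrees of separation") on a toy social graph:

```python
from collections import deque

def bfs_distance(graph, start):
    """Shortest-path distances from `start` in an unweighted graph (BFS)."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:          # first visit = shortest distance
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Toy social network (adjacency list, undirected edges listed both ways)
friends = {"ana": ["ben", "cal"], "ben": ["ana", "dia"],
           "cal": ["ana", "dia"], "dia": ["ben", "cal", "eve"],
           "eve": ["dia"]}
print(bfs_distance(friends, "ana"))  # {'ana': 0, 'ben': 1, 'cal': 1, 'dia': 2, 'eve': 3}
```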
from collections import deque
def topo_sort(graph): # graph = adjacency list
in_deg = {n: 0 for n in graph}
for u in graph:
for v in graph[u]:
in_deg[v] += 1
q = deque(n for n in in_deg if in_deg[n] == 0)
order = []
while q:
u = q.popleft(); order.append(u)
for v in graph[u]:
in_deg[v] -= 1
if in_deg[v] == 0: q.append(v)
return order
# DAG: x→Wx, W→Wx, Wx→Wx+b, b→Wx+b
dag = {"x": ["Wx"], "W": ["Wx"], "Wx": ["Wx+b"], "b": ["Wx+b"], "Wx+b": []}
print(topo_sort(dag)) # ['x', 'W', 'b', 'Wx', 'Wx+b']
When you call .backward(), PyTorch topologically sorts this DAG and walks it in reverse order, computing gradients at each node via the chain rule. This is why the order of operations matters for autograd.
Standard neural networks expect fixed-size inputs (vectors, grids). Graphs have variable numbers of nodes and edges with no fixed ordering. GNNs solve this with the message passing framework: each node updates its representation by aggregating information from its neighbours.
| GNN Variant | Key Idea | Application |
|---|---|---|
| GCN | Spectral convolution, normalised adjacency | Node classification, citation networks |
| GraphSAGE | Sample + aggregate – scales to large graphs | Pinterest recommendation (billions of nodes) |
| GAT | Attention weights on edges | Heterogeneous graphs, knowledge graphs |
| MPNN | General message passing framework | Molecular property prediction |
| GIN | Maximally expressive – as powerful as WL test | Graph isomorphism, graph classification |
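One round of message passing, stripped to its core: mean aggregation over the neighbourhood with self-loops, a simplification of the GCN update with no learned weight matrix or nonlinearity:

```python
import numpy as np

# Tiny graph: node 0 connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # adjacency matrix
H = np.array([[1.0], [0.0], [0.0]])      # one scalar feature per node

A_hat = A + np.eye(3)                     # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1)) # row-normalise: mean over neighbourhood
H_new = D_inv @ A_hat @ H                 # each node averages itself + neighbours
print(H_new.ravel())                      # node 0's feature has spread to 1 and 2
```

Stacking K such rounds (with learned weights and nonlinearities between them) lets information travel K hops, which is the message passing framework in the table above.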
Knowledge graphs store facts as (subject, predicate, object) triples. For example: (Paris, isCapitalOf, France), (France, isPartOf, Europe). Major KGs include Wikidata (100M+ entities), Google Knowledge Graph, and domain-specific KGs for medicine, law, and science.
KG embeddings like TransE learn vector representations where subject + relation ≈ object in embedding space. This enables link prediction ("if we know Paris is in France and France is in Europe, can we infer Paris is in Europe?"). RAG with KGs retrieves relevant triples and injects them into LLM prompts – grounding generation in verified facts.
Limitations: KGs are brittle, require manual curation, can't handle uncertainty or nuance well, and struggle with temporal facts. Hybrid approaches (KG + LLM) are an active research frontier.
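The TransE idea in miniature, with hand-picked 2-D embeddings (hypothetical values; a trained model would learn these so that plausible triples score near zero):

```python
import numpy as np

emb = {
    "Paris":       np.array([1.0, 2.0]),
    "France":      np.array([3.0, 3.0]),
    "isCapitalOf": np.array([2.0, 1.0]),
}

# TransE scores a triple (s, r, o) by || emb[s] + emb[r] - emb[o] ||.
good = np.linalg.norm(emb["Paris"] + emb["isCapitalOf"] - emb["France"])
bad = np.linalg.norm(emb["France"] + emb["isCapitalOf"] - emb["Paris"])
print(good)  # ~0: a triple the model believes
print(bad)   # large: an implausible triple
```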
Why symbolic AI died: The search space for reasoning grows exponentially with problem size – the combinatorial explosion. A game tree for Go has ~10¹⁷⁰ positions. Brute-force search is impossible. This is why statistical (learning-based) methods replaced rule-based AI.
For ML practitioners, Big-O notation matters most when reasoning about scaling. Standard Transformer attention is O(n²) in sequence length – the main bottleneck for long-context models. Linear attention variants reduce this to O(n) but often sacrifice quality.
| Operation | Complexity | Example in AI |
|---|---|---|
| Embedding lookup | O(1) | Token embedding in LLMs |
| Matrix multiply (n×n) | O(n³) | Dense layer forward pass |
| Self-attention | O(n²·d) | Transformer – main bottleneck |
| Linear attention | O(n·d²) | Mamba, RWKV, RetNet |
| Sorting | O(n log n) | Top-k token selection |
| k-NN search (brute) | O(n·d) | Vector DB exact search |
| ANN search (HNSW) | O(log n) | Vector DB approximate search |
- Graphs encode relational structure – nodes, edges, and their properties represent social networks, molecules, computation flows, and knowledge bases.
- Message passing is the universal GNN framework: aggregate neighbour information to update node representations – repeat for K layers to see K hops.
- GNNs excel at molecular graphs (AlphaFold), knowledge graphs, and recommendation systems – self-attention is a GNN on a fully-connected graph.
- Knowledge graphs store facts as (subject, predicate, object) triples – can augment LLM retrieval via RAG with structured facts.
- Standard Transformer attention is O(n²) in sequence length – the main scaling bottleneck driving research into linear attention alternatives.
- Linear algebra: neural net layers = Wx + b; attention = QKᵀ; embeddings are vectors; tensors shape everything.
- Calculus: backprop = chain rule applied recursively; gradient = direction of steepest ascent on loss surface.
- Probability: model outputs are distributions; LLM tokens are sampled from a categorical distribution over 50K+ items.
- Bayesian inference: MSE = Gaussian MLE; cross-entropy = Bernoulli/categorical MLE; L2 regularisation = Gaussian prior (MAP).
- Information theory: CE loss = −log(p_correct); minimising cross-entropy = minimising KL divergence from model to data.
- Optimisation: Adam = momentum + adaptive LR; warmup + cosine annealing is the standard LLM training recipe.
- Graphs: GNNs use message passing; knowledge graphs store (s, p, o) triples; self-attention = GNN on complete graph.
- These are not prerequisites – they are the explanation for why ML works.