Mathematics & Statistics for AI
Linear algebra, calculus, gradients, and the mathematical language that powers every machine learning model.
Every neural network layer is a matrix multiplication. Every word embedding is a vector. Every attention score is a dot product. Linear algebra is not a prerequisite for machine learning; it is machine learning. If you understand how vectors transform under matrix operations, every model architecture in this course will click into place.
Before any algorithm, you need to understand the three fundamental objects of linear algebra. Think of them as increasing levels of generality:
Scalar
A single number. Examples: temperature (21.5 °C), learning rate (0.001), a pixel intensity (127). Denoted by a lowercase italic: x.
Vector
An ordered list of numbers: a point in n-dimensional space. Examples: an RGB pixel [255, 128, 0], a 512-dimensional word embedding. Denoted in bold: x.
Matrix
A 2D table of numbers: rows and columns. Examples: a 28×28 grayscale image, a weight matrix in a neural network. Denoted in bold uppercase: X.
A 512-dimensional word embedding is just a vector with 512 numbers. When GPT processes a prompt of 100 tokens, it works with a 100 × 512 matrix: each row is one token's embedding. A dataset of 1,000 images (28×28 pixels) becomes a 1,000 × 784 matrix.
Why does matrix multiplication matter? Because a neural network layer is literally: output = W·x + b. Every forward pass through a model is a sequence of matrix multiplications punctuated by nonlinear activation functions. If you can visualise matrix multiplication, you can visualise what a neural network does.
Matrix multiplication works by taking the dot product of each row of the left matrix with each column of the right matrix. For element C[i,j]: you multiply corresponding entries of row i and column j, then sum them up.
Transpose flips rows and columns: if A is (m × n), then Aᵀ is (n × m). You see it constantly: attention computes QKᵀ, meaning each query vector is dot-producted with each key vector. The transpose makes the shapes line up.
Hadamard (element-wise) product multiplies corresponding entries. Notation: A ⊙ B. It appears in LSTM gating (forget gate × cell state), masked attention, and dropout.
W = torch.randn(4, 3) # Weight matrix: 4 outputs, 3 inputs
x = torch.randn(3, 1) # Input vector: 3 features
b = torch.randn(4, 1) # Bias vector: 4 outputs
y = W @ x + b # @ = matrix multiply in Python
# y.shape = (4, 1) # 4-dimensional output
| Operation | Notation | Where in AI |
|---|---|---|
| Matrix Multiply | C = AB | Every neural network layer, attention (QKᵀ) |
| Transpose | Aᵀ | Attention scores: QKᵀ aligns query/key dims |
| Hadamard | A ⊙ B | LSTM gates, masked attention, dropout |
| Outer Product | uvᵀ | Attention weight × value in some formulations |
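These operations are easy to check interactively. A minimal PyTorch sketch of transpose, Hadamard product, and outer product, using toy values not taken from the text:

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # shape (2, 3)

# Transpose: (2, 3) -> (3, 2)
At = A.T

# Hadamard (element-wise) product: shapes must match (or broadcast)
B = torch.ones(2, 3) * 2
had = A * B                        # every entry of A doubled

# Outer product of two vectors: (3,) and (2,) -> (3, 2) matrix
u = torch.tensor([1., 2., 3.])
v = torch.tensor([10., 20.])
outer = torch.outer(u, v)

print(At.shape)      # torch.Size([3, 2])
print(outer.shape)   # torch.Size([3, 2])
```

Note how the outer product builds a full matrix from two vectors, while Hadamard keeps the shape unchanged.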
The dot product is the single most important operation in modern AI. It measures how aligned two vectors are: if they point in the same direction, the dot product is large and positive. If they're perpendicular, it's zero. If they point in opposite directions, it's large and negative.
Cosine Similarity
Normalise both vectors to unit length, then take the dot product. The result is between −1 (opposite) and +1 (identical direction). This is how embedding search works: "Is 'king' similar to 'queen'?" becomes "Is cos(king, queen) close to 1?"
Attention = Dot Products
In Transformers, attention score = Q·Kᵀ / √d_k. Each query vector is dot-producted with each key vector. A high dot product means "this token should attend to that token." This is the core of GPT, BERT, and every LLM.
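The attention-score computation just described can be sketched in a few lines; the tiny sizes (4 tokens, d_k = 8) are illustrative only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8                   # toy sizes: 4 tokens, key dimension 8
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)

# scores[i, j] = dot product of query i with key j, scaled by sqrt(d_k)
scores = Q @ K.T / d_k ** 0.5
weights = F.softmax(scores, dim=-1)   # each row becomes a distribution

print(scores.shape)             # (4, 4): every query against every key
print(weights.sum(dim=-1))      # each row sums to 1
```

The softmax rows summing to 1 is what makes each token's attention a probability distribution over the other tokens.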
Norms measure the "size" or "length" of a vector. They appear everywhere in regularisation: when you penalise large weights, you're literally penalising the norm of the weight vector.
| Norm | Formula | Intuition | Where in AI |
|---|---|---|---|
| L1 (Manhattan) | Σ|xᵢ| | Sum of absolute values, taxicab distance | L1 regularisation → sparse weights (Lasso) |
| L2 (Euclidean) | √(Σxᵢ²) | Straight-line distance | L2 regularisation (weight decay), gradient clipping |
| Frobenius | √(Σᵢⱼ Aᵢⱼ²) | L2 norm for matrices | Weight matrix regularisation, LoRA rank analysis |
| Max (L∞) | max|xᵢ| | Largest single element | Adversarial robustness, gradient clipping by max |
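The table's norms map directly onto torch.linalg calls; a quick check with hand-computable values:

```python
import torch

x = torch.tensor([3., -4.])

l1 = x.abs().sum()                      # L1: |3| + |-4| = 7
l2 = torch.linalg.vector_norm(x)        # L2: sqrt(9 + 16) = 5
linf = x.abs().max()                    # L-infinity: largest magnitude = 4

A = torch.tensor([[1., 2.], [3., 4.]])
fro = torch.linalg.matrix_norm(A)       # Frobenius: sqrt(1+4+9+16) = sqrt(30)

print(l1.item(), l2.item(), linf.item())   # 7.0 5.0 4.0
```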
In practice, L2 regularisation is usually applied through the weight_decay parameter in the optimiser.
Intuition first: Most vectors change direction when you multiply them by a matrix. But certain special vectors only get scaled β they keep pointing the same way. These are eigenvectors, and the scale factor is the eigenvalue. Think of them as the "natural axes" of a transformation.
Eigendecomposition
- A = QΛQ⁻¹ where Q = matrix of eigenvectors, Λ = diagonal matrix of eigenvalues
- Only works for square matrices with n linearly independent eigenvectors
- Symmetric matrices always have real eigenvalues and orthogonal eigenvectors
PCA Connection
- PCA = find eigenvectors of the covariance matrix
- Eigenvector with largest eigenvalue = direction of most variance
- Keep top-k eigenvectors → reduce dimensions from n to k
- Used in data preprocessing, visualisation, noise reduction
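The PCA recipe above can be sketched directly, using hypothetical 2-D data stretched along one axis so the top eigenvector is obvious:

```python
import torch

torch.manual_seed(0)
# Toy data: 200 points, stretched ~6x more along the first axis
X = torch.randn(200, 2) @ torch.tensor([[3., 0.], [0., 0.5]])

Xc = X - X.mean(dim=0)                      # centre the data
cov = Xc.T @ Xc / (Xc.shape[0] - 1)         # 2x2 covariance matrix

eigvals, eigvecs = torch.linalg.eigh(cov)   # eigenvalues in ascending order
top = eigvecs[:, -1]                        # eigenvector of largest eigenvalue

# Project onto the top principal component: 2D -> 1D
Z = Xc @ top
print(eigvals)   # largest eigenvalue ~ variance along the stretched axis
```

torch.linalg.eigh is the right call here because a covariance matrix is always symmetric.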
SVD is the generalised eigendecomposition that works for any matrix, not just square ones. It decomposes any m×n matrix into three factors: rotation × scale × rotation.
Low-rank approximation: Keep only the top-k singular values (set the rest to zero). This gives the best rank-k approximation of the original matrix, provably optimal. This is how you compress a matrix while losing the least information.
Dimensionality Reduction
Truncated SVD reduces features while preserving the most important structure. LSA (Latent Semantic Analysis) applies SVD to term-document matrices for text.
Recommender Systems
The Netflix Prize was won by SVD-based methods. Factorise the user-movie rating matrix into user preferences × movie features.
Connection to PCA
PCA of centered data = SVD of the data matrix. The right singular vectors (V) are the principal components. Same result, different computation.
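A short sketch of truncated SVD and the low-rank approximation claim; the matrix sizes are arbitrary toy values. The Frobenius error of the rank-k truncation equals the root-sum-square of the dropped singular values (Eckart–Young):

```python
import torch

torch.manual_seed(0)
A = torch.randn(6, 4)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)

# Rank-2 approximation: keep the top-2 singular values, drop the rest
k = 2
A_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Approximation error = sqrt of the sum of squared dropped singular values
err = torch.linalg.matrix_norm(A - A_k)
print(S)     # singular values, descending
print(err)   # matches (S[2]^2 + S[3]^2).sqrt()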
A tensor is the N-dimensional generalisation of vectors (1D) and matrices (2D). In deep learning, almost all data lives in tensors. Understanding tensor shapes is the single most practical skill for debugging neural networks.
Common Tensor Shapes
- Images: (B, C, H, W): batch, channels, height, width
- Text: (B, T, D): batch, sequence length, embedding dim
- Attention: (B, heads, T, T): batch, heads, query, key
- Video: (B, T, C, H, W): a 5D tensor!
Broadcasting
- PyTorch can operate on tensors of different shapes
- Rules: dimensions compared right to left
- Size 1 is "broadcast" (stretched) to match the other
- Adding bias (4,1) to output (4,32) works via broadcasting
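The broadcasting rules above can be verified directly; the shapes are the ones from the bullets plus one extra illustrative case:

```python
import torch

out = torch.randn(4, 32)    # 4 output features x 32 batch columns
bias = torch.randn(4, 1)    # one bias per output feature

y = out + bias              # size-1 dim stretches: (4, 1) -> (4, 32)

# Same right-to-left rule with a middle dimension of size 1
a = torch.randn(32, 1, 512)
b = torch.randn(128, 512)
c = a + b                   # -> (32, 128, 512)

print(y.shape)              # torch.Size([4, 32])
print(c.shape)              # torch.Size([32, 128, 512])
```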
# Scalar (0D tensor)
x = torch.tensor(3.14) # shape: ()
# Vector (1D tensor): one word embedding
emb = torch.randn(512) # shape: (512,)
# Matrix (2D tensor): batch of embeddings
batch_emb = torch.randn(32, 512) # shape: (32, 512) = batch × dim
# 3D tensor: sequence of embeddings
seq = torch.randn(32, 128, 512) # shape: (batch, seq_len, d_model)
# 4D tensor: image batch (B, C, H, W)
images = torch.randn(16, 3, 224, 224) # 16 RGB images, 224×224
Every tensor carries three key attributes: .shape (dimensions), .dtype (float32, int64, etc.), and .device (cpu or cuda). When debugging, the first thing to check is always tensor.shape. Most bugs are shape mismatches.
- Neural network layers = matrix multiplication: output = Wx + b. This is the single most important equation in deep learning.
- Dot product measures alignment between vectors; it powers attention (Q·Kᵀ) and cosine similarity in embedding search.
- L1/L2 norms of weight matrices are what regularisation literally penalises: smaller norms → better generalisation.
- Eigenvectors = natural axes of a transformation; PCA finds them for data, the Hessian's eigenvalues classify critical points.
- SVD decomposes any matrix into UΣVᵀ, enabling low-rank approximation, compression, and recommendation systems.
- Tensors generalise matrices to N dimensions: shape (B, T, D) is the universal LLM data format; (B, C, H, W) for images.
Backpropagation, the algorithm that trains every neural network, is four lines of calculus applied recursively. Understand the chain rule deeply and everything else follows. Every time you call loss.backward() in PyTorch, the machine is doing exactly the calculus described in this chapter.
A derivative answers one question: "If I nudge this input by a tiny amount, how much does the output change?" In machine learning, that becomes: "If I nudge this weight by ε, how much does the loss change?" The answer tells you exactly how to adjust the weight to reduce the error.
Derivative (Single Variable)
- Rate of change of f(x) at a specific point
- Geometrically: the slope of the tangent line
- f'(x) = lim(ε→0) [f(x+ε) − f(x)] / ε
- If f'(x) > 0: function is increasing at x
- If f'(x) < 0: function is decreasing at x
Gradient (Multi-Variable)
- Vector of ALL partial derivatives
- ∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₙ]
- Points in the direction of steepest ascent
- We subtract the gradient to go downhill
- A model the size of GPT-4 (reportedly ~1.8 trillion parameters) has one gradient entry per parameter at every step
Partial derivative: when a function has many inputs (like millions of weights), the partial derivative ∂L/∂wᵢ holds all other weights fixed and measures how L changes with respect to wᵢ alone. The gradient stacks all of these into one vector.
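A standard sanity check that ties these definitions together: compare an autograd gradient with the finite-difference definition of the derivative. The toy loss L(w) = Σ wᵢ² is an assumption for illustration:

```python
import torch

def f(w):
    return (w ** 2).sum()          # toy "loss": L(w) = sum of squares

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
loss = f(w)
loss.backward()                    # analytic gradient: dL/dw_i = 2*w_i

# Finite-difference check of one partial derivative: nudge w[0] by eps
eps = 1e-4
with torch.no_grad():
    w_plus = w.clone()
    w_plus[0] += eps
    numeric = (f(w_plus) - f(w)) / eps

print(w.grad)          # tensor([ 2., -4.,  6.])
print(numeric.item())  # close to 2.0
```

If the two numbers disagree, either the analytic gradient or the forward computation has a bug; this "gradient check" trick predates autograd.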
A neural network is a chain of composed functions: input → linear → activation → linear → activation → … → loss. The chain rule lets us compute how the loss depends on any weight, no matter how deep the network. It's the mathematical foundation of backpropagation.
Why This Works for Deep Networks
A 100-layer network is 100 composed functions. The chain rule produces a product of 100 local gradients. Backprop computes all of them in one backward pass, at the same cost as one forward pass. This is why deep learning is computationally tractable.
Vanishing & Exploding Gradients
If each local gradient is < 1, the product shrinks exponentially through 100 layers: vanishing gradients. If > 1, it explodes. Solutions: residual connections (ResNet), LayerNorm, gradient clipping, careful initialisation (He/Xavier).
When your function takes a vector in and produces a vector out, the derivative is no longer a single number or even a vector β it's a matrix. The Jacobian matrix captures how every output component changes with respect to every input component.
H Positive Definite
All eigenvalues positive → all directions curve upward → local minimum. The loss increases no matter which direction you move. This is where we want the optimiser to converge.
H Indefinite
Mixed positive and negative eigenvalues → saddle point. Curves up in some directions, down in others. Most critical points in deep learning are saddle points, not local minima.
H ≈ 0 (Flat)
Near-zero eigenvalues → flat region. Loss barely changes. Research suggests flat minima generalise better than sharp ones (SAM optimiser exploits this).
The Taylor series approximates any smooth function locally with a polynomial. The key insight: gradient descent is the first-order Taylor approximation. It assumes the loss surface is locally linear, which is only accurate when the learning rate is small. This is why large learning rates cause training to diverge: the linear approximation breaks down.
| Method | Taylor Order | Cost per Step | Convergence |
|---|---|---|---|
| Gradient Descent | 1st order | O(n), one gradient | Slow but cheap |
| Newton's Method | 2nd order | O(n³), Hessian inverse | Fast but prohibitive for n > 10K |
| L-BFGS | Quasi-2nd order | O(n), approximate Hessian | Good for small models |
| Adam | Adaptive 1st order | O(n), per-param rates | Default for deep learning |
Computing gradients for a model with millions of parameters by hand is impossible. Automatic differentiation (autodiff) solves this: the computer applies the chain rule automatically by recording every operation during the forward pass, then replaying them in reverse to compute all gradients simultaneously.
Forward Mode
- Compute derivative alongside the function value
- One pass gives derivatives w.r.t. one input
- Efficient when: few inputs, many outputs
- Cost: O(n) passes for n input variables
- Rarely used in deep learning
Reverse Mode (Backprop)
- Record operations, then replay in reverse
- One pass gives derivatives w.r.t. ALL inputs
- Efficient when: many inputs, few outputs
- Cost: O(1) backward passes (same cost as forward)
- This is backpropagation
Crucially, loss.backward() is one call, not a loop over parameters.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
# Forward pass: build computational graph
z = w * x ** 2 + 5 # z = 3x² + 5
# dz/dx = 6x = 12, dz/dw = x² = 4
# Backward pass: compute ALL gradients via reverse-mode autodiff
z.backward()
print(x.grad) # tensor(12.): dz/dx = 6x evaluated at x=2
print(w.grad) # tensor(4.): dz/dw = x² evaluated at x=2
In PyTorch, every tensor with requires_grad=True records operations in a computational graph. When you call .backward(), PyTorch walks the graph in reverse, applying the chain rule at each node. The gradients accumulate in each leaf tensor's .grad attribute. This is the "gradient tape": record forward, replay backward.
| PyTorch Concept | What It Does | Calculus Equivalent |
|---|---|---|
| requires_grad=True | Tell PyTorch to track this tensor's operations | Mark as a variable to differentiate w.r.t. |
| .backward() | Trigger reverse-mode autodiff | Apply chain rule from output to all inputs |
| .grad | Access computed gradient | ∂L/∂w evaluated at current values |
| .detach() | Stop gradient tracking | Treat as a constant (no derivative) |
| torch.no_grad() | Disable gradient computation (inference) | Skip all derivative bookkeeping |
- Derivative = rate of change; gradient = vector of all partial derivatives pointing in the direction of steepest ascent on the loss surface; we subtract it to descend.
- Chain rule: d/dx[f(g(x))] = f'(g(x))·g'(x), the mathematical foundation of backpropagation, applied recursively through every layer.
- Jacobian generalises gradient to vector functions; Hessian captures curvature, but is expensive (O(n²)) and rarely computed explicitly for large models.
- Gradient descent is a first-order Taylor approximation, valid only when the learning rate is small enough that the linear approximation holds.
- PyTorch autograd uses reverse-mode autodiff: one backward pass computes ALL gradients simultaneously, making million-parameter training tractable.
Machine learning models are probability machines. A classifier outputs P(cat | image). An LLM outputs P(next token | context). A diffusion model reverses a probabilistic noise process. Every loss function, every prediction, and every training signal is grounded in probability theory. Without it, you're flying blind.
A probability P(A) is a number between 0 and 1 that measures how likely an event A is. The sample space Ω is the set of all possible outcomes; an event is a subset of Ω. Three axioms define the entire theory: P(Ω) = 1, P(A) ≥ 0 for any event, and for disjoint events, P(A ∪ B) = P(A) + P(B).
Joint Probability
P(A ∩ B): the probability that both A and B occur. If independent: P(A ∩ B) = P(A) · P(B). Naive Bayes assumes feature independence to use this simplification.
Conditional Probability
P(A | B) = P(A ∩ B) / P(B): the probability of A given that B occurred. Every autoregressive LLM computes P(next_token | all_previous_tokens) at each step.
Independence
A and B are independent if P(A | B) = P(A). Knowing B tells you nothing about A. In practice, few real features are truly independent, but assuming they are often works (Naive Bayes).
When GPT-4 generates the next token, it computes a probability for every word in its vocabulary β typically 50,000+ values that must sum to 1. This is a categorical distribution over tokens. The softmax function transforms raw model outputs (logits) into this valid probability distribution. Every chapter in this course connects back to this fundamental mechanism.
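The softmax step described above is a one-liner; a minimal demonstration on a toy 4-token "vocabulary":

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])   # raw model scores (toy vocab)
probs = F.softmax(logits, dim=0)

print(probs)            # all entries positive
print(probs.sum())      # 1.0: a valid categorical distribution
print(probs.argmax())   # index 0: highest logit -> highest probability
```

Softmax preserves the ordering of the logits while squashing them into a valid distribution, which is why argmax of the probabilities equals argmax of the logits.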
A random variable X is a function that maps outcomes to numbers. Flip a coin: X = 1 (heads) or X = 0 (tails). A distribution describes the probabilities of all possible values of X. Distributions come in two flavours:
Discrete
- PMF: P(X = x) for each possible value
- All probabilities must sum to 1
- Examples: coin flips, dice rolls, token IDs
- Classification output is discrete
Continuous
- PDF: f(x) such that area under curve = 1
- P(X = exact value) = 0; use P(a ≤ X ≤ b)
- Examples: weight values, pixel intensities, latent codes
- CDF: F(x) = P(X ≤ x), always non-decreasing
| Distribution | Type | Parameters | AI Application |
|---|---|---|---|
| Bernoulli | Discrete | p (success probability) | Binary classification: spam/not-spam, dropout mask |
| Categorical | Discrete | p₁, p₂, …, pₖ | Softmax output: LLM next-token prediction over vocab |
| Gaussian | Continuous | μ (mean), σ² (variance) | Weight init, noise injection, VAE latent space, diffusion |
| Uniform | Continuous | a (min), b (max) | Random init, uniform noise, exploration in RL |
| Poisson | Discrete | λ (rate) | Event count modelling: clicks, failures, arrivals |
Expected value E[X] is the weighted average of all possible outcomes. "If I ran the experiment infinitely many times, what's the average?" For discrete X: E[X] = Σ xᵢ P(xᵢ). For continuous X: E[X] = ∫ x f(x) dx. In ML, the loss function is an expectation: L = E[ℓ(ŷ, y)] averaged over the data distribution.
Correlation
Normalised covariance: ρ = Cov(X,Y) / (σ_X · σ_Y). Always between −1 and +1. A correlation of +1 means a perfect linear relationship. Zero correlation does not imply independence (except for jointly Gaussian variables).
Covariance Matrix
For n features, the covariance matrix Σ is n×n, where Σᵢⱼ = Cov(Xᵢ, Xⱼ). Diagonal = variances, off-diagonal = covariances. PCA = eigendecomposition of Σ (Ch 2.1).
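A quick sketch with synthetic data: build a 2×2 covariance matrix and recover the correlation from it. The near-linear relationship y ≈ 2x is an assumption chosen so the result is easy to predict:

```python
import torch

torch.manual_seed(0)
n = 2000
x = torch.randn(n)
y = 2 * x + 0.1 * torch.randn(n)   # y almost perfectly linear in x

data = torch.stack([x, y])          # shape (variables, observations)
cov = torch.cov(data)               # 2x2 covariance matrix

# Correlation = off-diagonal covariance normalised by both std devs
corr = cov[0, 1] / (cov[0, 0].sqrt() * cov[1, 1].sqrt())
print(cov)    # diagonal = variances, off-diagonal = Cov(x, y)
print(corr)   # close to 1: a near-perfect linear relationship
```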
Bayes' theorem is the most important equation in probabilistic reasoning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, compute a posterior belief. This is the core logic of Bayesian machine learning, spam filters, medical diagnosis, and A/B testing.
The Four Terms
- Prior P(θ): What you believed before data
- Likelihood P(D|θ): How probable the data is under this model
- Evidence P(D): Total probability of the data (normaliser)
- Posterior P(θ|D): Updated belief after seeing data
In ML Terms
- MLE: Find θ maximising P(D|θ); ignores the prior
- MAP: Find θ maximising P(θ|D); includes the prior
- L2 regularisation = Gaussian prior on weights
- L1 regularisation = Laplace prior on weights
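Bayes' theorem is four lines of arithmetic. This sketch uses hypothetical numbers for the classic diagnostic-test example (1% base rate, 99% sensitivity, 5% false-positive rate), not figures from the text:

```python
# Hypothetical numbers for a diagnostic-test belief update
prior = 0.01          # P(disease): 1% base rate
sensitivity = 0.99    # P(positive | disease)
false_pos = 0.05      # P(positive | no disease)

# Evidence: total probability of seeing a positive test
evidence = sensitivity * prior + false_pos * (1 - prior)

# Posterior via Bayes' theorem: P(disease | positive)
posterior = sensitivity * prior / evidence
print(f"P(disease | positive) = {posterior:.3f}")   # ~0.167
```

Despite the 99% sensitivity, the posterior is only about 17%, because the prior (1% base rate) dominates; this is exactly the prior-versus-likelihood tension the four terms describe.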
The Gaussian (normal) distribution is the most important continuous distribution in machine learning. It appears in weight initialisation, noise injection, latent spaces, diffusion models, and the Central Limit Theorem guarantees it emerges from the sum of many independent random variables.
68-95-99.7 Rule
68% of values within ±1σ, 95% within ±2σ, 99.7% within ±3σ. Values beyond 3σ are extreme outliers. This is why gradient clipping often clips at 3–5× the standard deviation.
Weight Initialisation
Xavier: W ~ N(0, 1/n_in). He: W ~ N(0, 2/n_in). Proper init keeps activations and gradients at a consistent scale through layers, preventing vanishing/exploding gradients.
Gaussian Noise
Diffusion models add Gaussian noise progressively, then learn to reverse it. VAEs encode data as μ + σ·ε where ε ~ N(0,1). Dropout is Bernoulli noise; data augmentation often adds Gaussian noise.
Standardisation (z-score normalisation) transforms features to zero mean and unit variance: z = (x − μ) / σ. Almost every ML pipeline starts with this step: it ensures all features contribute equally and speeds up gradient descent convergence.
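The z-score transform in code, applied to a synthetic feature with mean ≈ 50 and std ≈ 10:

```python
import torch

torch.manual_seed(0)
x = 50 + 10 * torch.randn(1000)    # synthetic feature: mean ~50, std ~10

z = (x - x.mean()) / x.std()       # z-score normalisation

print(z.mean())   # ~0
print(z.std())    # ~1
```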
When an LLM computes P(next_token | context), it still needs to choose a token. The simplest approach, always picking the most probable token (greedy decoding), produces repetitive, boring text. Instead, we sample from the distribution, with several control knobs.
Temperature τ
Divide logits by τ before softmax. τ < 1 sharpens (more confident). τ > 1 flattens (more random). τ → 0 = greedy. τ → ∞ = uniform random.
Top-k Sampling
Only consider the k most probable tokens. Redistributes probability among top-k and samples from that subset. k=1 = greedy, k=50 is common.
Top-p (Nucleus)
Sample from the smallest set of tokens whose cumulative probability ≥ p. Adapts k dynamically: fewer tokens when confident, more when uncertain. p=0.9 is standard.
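The three knobs can be combined in one small sampling helper. This is a simplified sketch, not a production decoder; the function name and toy logits are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([4.0, 3.0, 1.0, 0.5, -1.0])   # toy vocab of 5 tokens

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then optional top-k / top-p filtering, then sample."""
    probs = F.softmax(logits / temperature, dim=-1)

    if top_k is not None:                  # keep only the k most probable tokens
        thresh = probs.topk(top_k).values.min()
        probs = torch.where(probs >= thresh, probs, torch.zeros_like(probs))
    if top_p is not None:                  # smallest set with cumulative mass >= p
        sorted_p, idx = probs.sort(descending=True)
        keep = sorted_p.cumsum(0) - sorted_p < top_p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[idx[keep]] = True
        probs = torch.where(mask, probs, torch.zeros_like(probs))

    probs = probs / probs.sum()            # renormalise the surviving mass
    return torch.multinomial(probs, 1).item()

token = sample(logits, temperature=0.7, top_k=3)
print(token)   # always one of the 3 highest-logit tokens: 0, 1, or 2
```

With a very low temperature the helper reduces to greedy decoding, because softmax pushes essentially all mass onto the top logit.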
Law of Large Numbers (LLN): As sample size grows, the sample mean converges to the true mean. This is why test set accuracy is a reliable estimate of true model performance, given enough test samples. It's also why the mini-batch gradient is an unbiased estimator of the full-batch gradient: E[∇L_batch] = ∇L_full.
Central Limit Theorem (CLT): The sum (or average) of many independent random variables approaches a Gaussian distribution, regardless of the original distribution. This is why Gaussians appear everywhere: any quantity that arises from the aggregate of many small, independent effects will be approximately normal. Practical implication: larger mini-batches produce lower-variance gradient estimates (variance decreases as 1/n), giving smoother training.
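Both theorems are easy to see numerically. Averaging uniform draws (nothing Gaussian about them) centres on the true mean, and the variance of the average shrinks as 1/n; the sample counts here are arbitrary toy choices:

```python
import torch

torch.manual_seed(0)
# Uniform(0, 1): mean 0.5, variance 1/12 -- decidedly not Gaussian
samples = torch.rand(10_000, 64)

means_small = samples[:, :4].mean(dim=1)   # averages of 4 draws ("batch" of 4)
means_large = samples.mean(dim=1)          # averages of 64 draws

# LLN: both centre on 0.5. CLT + 1/n: larger batches -> smaller spread.
print(means_large.mean())                    # ~0.5
print(means_small.var(), means_large.var())  # ratio ~16 (= 64/4)
```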
- LLM token generation = sampling from a categorical distribution over 50,000+ vocabulary items; softmax converts logits to valid probabilities.
- Distributions: Bernoulli for binary, Categorical/Softmax for multi-class, Gaussian for continuous. Know which one your model outputs.
- Bayes' theorem: posterior ∝ likelihood × prior. Bayesian learning in one equation. L2 regularisation = Gaussian prior on weights.
- Covariance matrix captures feature relationships, the foundation of PCA. Diagonal = variances, off-diagonal = covariances.
- Temperature controls softmax sharpness: low τ = greedy/predictable, high τ = diverse/creative. Top-p adapts dynamically.
- CLT: why mini-batch gradient descent works. Sample averages converge to true expectations; larger batches = lower-variance estimates.
Every time you train a neural network, you are doing maximum likelihood estimation. Your choice of loss function is not arbitrary; it follows directly from your assumption about how outputs are distributed. Understanding MLE, MAP, and full Bayesian inference reveals why loss functions are what they are, and why regularisation works.
Core idea: Given observed data D = {x₁, x₂, …, xₙ}, find the parameters θ that make the data most probable. The likelihood function L(θ) = P(D | θ) measures how well θ explains the data. MLE finds θ* = argmax L(θ).
In practice we maximise the log-likelihood instead: log L(θ) = Σ log P(xᵢ | θ). This is equivalent (log is monotonic) but easier: products become sums, which are numerically stable and analytically simpler. Minimising the negative log-likelihood = maximising the log-likelihood. This is your loss function.
1. Assume y = f(x;θ) + ε with Gaussian noise ε ~ N(0, σ²)
2. P(y | x, θ) = (1/√(2πσ²)) exp(−(y − f(x;θ))² / (2σ²))
3. log P = −(y − f(x;θ))² / (2σ²) − const
4. Maximise log P ⟺ minimise (y − ŷ)²
∴ MSE loss = MLE under Gaussian assumption
1. Assume y ~ Bernoulli(p) with p = model output (sigmoid)
2. P(y | x, θ) = p^y · (1−p)^(1−y)
3. log P = y·log(p) + (1−y)·log(1−p)
4. Maximise log P ⟺ minimise −[y log p + (1−y) log(1−p)]
∴ BCE loss = MLE under Bernoulli assumption
| Output Assumption | MLE Loss Function | When Used |
|---|---|---|
| Gaussian (continuous) | MSE / L2 loss | Regression: predict price, temperature, age |
| Bernoulli (binary) | Binary cross-entropy | Binary classification: spam/not-spam |
| Categorical (multi-class) | Categorical cross-entropy | Multi-class: ImageNet, LLM next-token |
| Laplace (heavy tails) | L1 / MAE loss | Robust regression, outlier-resistant |
MLE can overfit: it finds the parameters that best explain the training data, with no constraint on how "reasonable" those parameters are. MAP adds a prior belief: what kind of parameter values do you expect? The prior acts as a regulariser.
Regularisation is not a trick; it's a Bayesian prior on what you believe about your weights. When you add a weight decay term to the loss, you're implicitly saying "I believe weights should be small." The specific penalty shape reveals the prior distribution:
1. Assume a Gaussian prior on each weight: P(w) ∝ exp(−λw²)
2. log P(w) = −λw² + const
3. MAP loss = L_data + λ Σ wᵢ²
4. This is L2 / Ridge / weight decay
Effect: Shrinks all weights toward zero, keeps them small but non-zero. Smooth, differentiable everywhere.
1. Assume a Laplace prior on each weight: P(w) ∝ exp(−λ|w|)
2. log P(w) = −λ|w| + const
3. MAP loss = L_data + λ Σ |wᵢ|
4. This is L1 / Lasso
Effect: Drives some weights to exactly zero, performing feature selection. Produces sparse models.
L2 in Practice
- PyTorch: weight_decay=0.01 in AdamW
- Every LLM uses weight decay (typically 0.01–0.1)
- Keeps weights small → better generalisation
- AdamW decouples weight decay from Adam updates
L1 in Practice
- Produces sparse weights: many exactly zero
- Acts as automatic feature selection
- Elastic Net combines L1 + L2
- Less common in deep learning, common in classical ML
MLE and MAP both produce point estimates: a single "best" θ. Full Bayesian inference goes further: it computes the entire posterior distribution P(θ | D). Instead of saying "the best weight is 0.42", Bayesian inference says "the weight is probably between 0.35 and 0.49, with 95% probability." This is uncertainty quantification.
Conjugate Priors
Special prior-likelihood pairs where the posterior has the same family as the prior. Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Categorical. These give closed-form posteriors: no sampling needed.
MCMC Sampling
When the posterior has no closed form, draw samples from it using Markov Chain Monte Carlo. Metropolis-Hastings, Hamiltonian MC (HMC). Gold standard for accuracy, but slow for large models.
Variational Inference
Approximate the true posterior with a simpler distribution (e.g., Gaussian). Optimise the parameters of the approximation to minimise KL divergence. Fast and scalable; used in VAEs.
| Method | Output | Cost | When to Use |
|---|---|---|---|
| MLE | Point estimate θ* | O(training) | Large datasets, standard deep learning |
| MAP | Point estimate θ* | O(training) | When regularisation helps (almost always) |
| MCMC | Samples from P(θ|D) | Very expensive | Small datasets, need calibrated uncertainty |
| Variational Inference | Approximate P(θ|D) | Moderate | VAEs, Bayesian neural nets, latent variable models |
Two philosophical camps interpret probability differently. Frequentists see probability as long-run frequency of events; parameters are fixed but unknown. Bayesians treat parameters as random variables with distributions that represent degrees of belief. Most deep learning is frequentist MLE in practice, but Bayesian ideas (priors, posterior, uncertainty) are increasingly important.
Frequentist
Probability: Long-run frequency of events
Estimation: MLE (maximise P(data | θ))
Uncertainty: Confidence intervals (frequentist coverage)
Hypothesis testing: p-values, null hypothesis
Regularisation: Ad-hoc penalty term
Deep learning: Standard SGD + weight decay
Bayesian
Probability: Degree of belief / uncertainty
Estimation: Compute posterior P(θ | data)
Uncertainty: Credible intervals (direct probability)
Hypothesis testing: Model comparison via Bayes factors
Regularisation: Natural consequence of the prior
Deep learning: BNNs, MC Dropout, ensembles
Probabilistic Graphical Models (PGMs) represent complex joint distributions as graphs. Nodes are random variables, edges encode conditional dependencies. PGMs make the structure of a probabilistic model explicit β you can literally see which variables depend on which.
Bayesian Networks (Directed)
- Directed acyclic graph (DAG)
- Arrows encode causal/conditional relationships
- P(A,B,C) = P(A) · P(B|A) · P(C|A,B)
- D-separation determines conditional independence
- Examples: medical diagnosis, spam filters, causal reasoning
Markov Random Fields (Undirected)
- Undirected edges = symmetric correlations
- No causal direction implied
- Clique potentials define the joint
- Markov blanket: a node is independent of all others given its neighbours
- Examples: image segmentation, physics simulations
EM solves MLE when there are latent (hidden) variables that prevent direct optimisation. The classic example: fitting a Gaussian Mixture Model (GMM) when you don't know which Gaussian generated each data point. EM alternates between two steps until convergence:
E-step: Given current parameters, compute the expected value of the latent variables (soft assignments: "this point is 80% from Gaussian 1, 20% from Gaussian 2"). M-step: Given the soft assignments, update parameters (means, variances, mixing weights) to maximise the expected log-likelihood. Repeat until convergence. Guaranteed to increase likelihood at each step (but may converge to a local optimum).
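The two steps above can be sketched as a compact 1-D, two-component GMM fit; the data, initial guesses, and iteration count are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
# Synthetic data from two Gaussians: N(-2, 0.5^2) and N(3, 1^2)
x = torch.cat([-2 + 0.5 * torch.randn(300), 3 + torch.randn(300)])

mu = torch.tensor([-1.0, 1.0])     # initial mean guesses
sigma = torch.tensor([1.0, 1.0])
pi = torch.tensor([0.5, 0.5])      # mixing weights

def gauss(x, mu, sigma):
    # Gaussian PDF, broadcasting over components
    return torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * (2 * torch.pi) ** 0.5)

for _ in range(50):
    # E-step: soft assignment of every point to each component
    resp = pi * gauss(x[:, None], mu, sigma)      # shape (600, 2)
    resp = resp / resp.sum(dim=1, keepdim=True)

    # M-step: re-estimate parameters from the soft assignments
    nk = resp.sum(dim=0)
    mu = (resp * x[:, None]).sum(dim=0) / nk
    sigma = ((resp * (x[:, None] - mu) ** 2).sum(dim=0) / nk).sqrt()
    pi = nk / x.numel()

print(mu)   # close to [-2, 3] (component order depends on initialisation)
```

Each loop iteration is one full E-step/M-step cycle; with well-separated clusters like these, convergence takes only a handful of iterations.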
| Algorithm | Latent Variable | Application |
|---|---|---|
| GMM + EM | Cluster assignment | Clustering, density estimation, speaker diarisation |
| HMM + Baum-Welch | Hidden states | Speech recognition (pre-deep-learning), genomics |
| LDA + EM/VI | Topic assignment | Topic modelling in text |
| VAE training | Latent code z | Image generation, representation learning |
- Neural network training IS MLE: your choice of loss function encodes your distributional assumption.
- MSE loss = Gaussian output assumption; cross-entropy = categorical/Bernoulli assumption. The math decides the loss, not convention.
- L2 regularisation = Gaussian prior on weights (MAP); L1 = Laplace prior. Regularisation is not a trick, it's a Bayesian statement about what you believe.
- Full Bayesian inference gives posterior distributions, enabling uncertainty quantification for safety-critical and small-data settings.
- PGMs encode conditional independence; HMMs were the sequence model before RNNs; autoregressive LLMs are implicit directed graphical models.
Every loss function in deep learning has its roots in information theory. Cross-entropy loss, the training objective for every LLM, is literally the cross-entropy between the true data distribution and the model's predicted distribution. Understanding entropy, cross-entropy, and KL divergence reveals why we train the way we do.
Entropy measures uncertainty, surprise, or unpredictability. Claude Shannon defined it in 1948: given a random variable X with possible outcomes, how much "information" does each outcome carry? The less likely an event, the more surprising (informative) it is when it occurs.
Information Content
- I(x) = −log₂ P(x) bits
- P = 0.5 (fair coin): I = 1 bit
- P = 0.99 (almost certain): I = 0.015 bits
- P = 0.001 (rare event): I = 9.97 bits
- Rare events carry more information
Entropy = Expected Information
- H(X) = −Σ P(x) · log P(x)
- Average surprise across all outcomes
- Uniform distribution → maximum entropy
- Peaked distribution → low entropy
- Measured in bits (log₂) or nats (ln)
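Entropy in a few lines; the three example distributions are toy values chosen to show the extremes:

```python
import torch

def entropy(p):
    # H(X) = -sum p * log2(p), in bits
    return -(p * p.log2()).sum()

coin = torch.tensor([0.5, 0.5])
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])
peaked = torch.tensor([0.97, 0.01, 0.01, 0.01])

print(entropy(coin))      # 1 bit: one fair coin flip
print(entropy(uniform))   # 2 bits: the maximum for 4 outcomes
print(entropy(peaked))    # ~0.24 bits: nearly certain -> little surprise
```

The uniform case hits the maximum (log₂ of the number of outcomes), matching the bullet above.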
Cross-entropy H(P, Q) measures how many bits distribution Q needs to encode samples from distribution P. If Q is a perfect model of P (Q = P), then H(P, Q) = H(P), the minimum. If Q is wrong, it wastes extra bits. Training a model minimises cross-entropy between the true labels and model predictions.
Categorical Cross-Entropy
- Multi-class: one-hot labels, softmax outputs
- L = −Σ yᵢ log(pᵢ) = −log(p_correct)
- Used for: ImageNet, LLM next-token, NER
- PyTorch: F.cross_entropy(logits, labels)
Binary Cross-Entropy
- Binary: one sigmoid output
- L = −[y log(p) + (1−y) log(1−p)]
- Used for: spam detection, multi-label tasks
- PyTorch: F.binary_cross_entropy_with_logits
import torch
import torch.nn.functional as F
logits = torch.tensor([1.2, 0.5, 3.1, 0.8]) # raw model outputs
y_true = torch.tensor(2) # correct class index = 2
loss = F.cross_entropy(logits.unsqueeze(0), y_true.unsqueeze(0))
# Internally: softmax(logits) → −log(p_class_2)
p_class2 = F.softmax(logits, dim=0)[2]
print(f"P(correct class) = {p_class2:.3f}") # higher = lower loss
print(f"Cross-entropy loss = {loss:.3f}")
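For comparison, the binary case can be sketched in plain Python, mirroring the formula above; up to numerical stabilisation, this is what `F.binary_cross_entropy_with_logits` computes:

```python
import math

def bce_with_logits(logit: float, y: float) -> float:
    """Binary cross-entropy from a raw logit:
    L = -[y*log(p) + (1-y)*log(1-p)], where p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce_with_logits(4.0, 1.0), 4))  # confident and correct: tiny loss
print(round(bce_with_logits(4.0, 0.0), 4))  # confident and wrong: huge loss
```

The asymmetry in the printout is the "confidently wrong is expensive" property of cross-entropy.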
Kullback-Leibler divergence measures the "extra cost" of using distribution Q when the true distribution is P. It's the gap between cross-entropy and true entropy – the inefficiency penalty for using the wrong model.
Key Properties
- D_KL ≥ 0 always (Gibbs' inequality)
- D_KL = 0 iff P = Q
- Not a true metric: asymmetric, no triangle inequality
- Forward KL (D_KL(P‖Q)): mode-covering → Q tries to cover all of P
- Reverse KL (D_KL(Q‖P)): mode-seeking → Q locks onto one mode of P
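A quick numerical check of these properties, plus the identity H(P, Q) = H(P) + D_KL(P‖Q), using two hand-picked toy distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum p * log2(p / q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]   # "true" distribution
Q = [0.5, 0.3, 0.2]   # model's distribution
print(kl(P, Q) >= 0)                                               # Gibbs' inequality
print(abs(cross_entropy(P, Q) - (entropy(P) + kl(P, Q))) < 1e-12)  # H(P,Q) = H(P) + KL
print(kl(P, Q) != kl(Q, P))                                        # asymmetric
```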
Where KL Appears in AI
- VAE loss: reconstruction + KL(posterior ‖ prior)
- RLHF/PPO: KL penalty between policy and reference
- Knowledge distillation: KL(teacher ‖ student)
- Diffusion models: KL between forward/reverse
- Training objective: minimising CE = minimising KL
Mutual information I(X;Y) measures how much knowing Y reduces your uncertainty about X. Unlike correlation, MI captures any statistical relationship – not just linear ones.
Feature Selection
Select features with highest MI to the target label. Unlike correlation, MI detects non-linear relationships. Used in preprocessing for classical ML.
Contrastive Learning
SimCLR, CLIP, and BYOL maximise MI between representations of augmented views of the same data. InfoNCE loss is a lower bound on MI.
Information Bottleneck
Compress input X into representation Z while preserving MI with target Y. The theoretical framework for why deep networks learn good representations.
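MI can be computed directly from a discrete joint distribution. A minimal sketch using a toy case (Y = X² with X uniform on {−1, 0, 1}) where the correlation is exactly zero but MI is large, since Y is fully determined by X:

```python
import math

def mutual_info(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(joint)) for j in range(len(joint[0]))
        if joint[i][j] > 0
    )

# Rows = X in {-1, 0, 1}; columns = Y in {0, 1}; Y = X**2.
joint = [[0, 1/3],
         [1/3, 0],
         [0, 1/3]]
print(round(mutual_info(joint), 3))  # high MI despite zero correlation
```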
Shannon's source coding theorem (1948): the optimal average code length for messages from a source with entropy H is at least H bits. You cannot compress below the entropy limit. This means a better probability model = better compression. Language models are, in a deep sense, compressors – a model with low perplexity assigns high probability to real text, which means it can encode that text in fewer bits.
Minimum Description Length (MDL) formalises model selection: the best model is the one that gives the shortest total description (model complexity + data encoded by model). This connects directly to the bias-variance tradeoff: a too-complex model has a long description; a too-simple model encodes data poorly. Bits-back coding shows VAEs are theoretically equivalent to lossless compression schemes – the ELBO loss is the expected code length.
- Entropy H(X) = expected surprise: high entropy = uniform = uncertain; low entropy = peaked = predictable.
- Cross-entropy loss = −log P(correct class) – the model is penalised for assigning low probability to the truth. Being confidently wrong is catastrophically expensive.
- H(P, Q) = H(P) + D_KL(P‖Q): minimising cross-entropy = minimising KL divergence from model to data distribution.
- KL divergence appears in VAE loss, RLHF PPO penalty, and knowledge distillation – it's the universal measure of distributional mismatch.
- LLM perplexity = e^(cross-entropy) – directly measures how well the model predicts text. Lower perplexity = better language model.
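The perplexity relationship in one line of Python (the per-token loss value here is a made-up illustration):

```python
import math

# Perplexity = exp(mean cross-entropy in nats). A model with perplexity ~20
# is, on average, "as uncertain as a uniform choice over ~20 tokens".
avg_ce_nats = 3.0                 # hypothetical mean per-token training loss
perplexity = math.exp(avg_ce_nats)
print(round(perplexity, 1))       # lower loss -> lower perplexity -> better LM
```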
Given a loss function L(θ), find the parameters θ that minimise it. In deep learning, loss landscapes are non-convex with millions of dimensions – we need smart optimisers, adaptive learning rates, and carefully tuned schedules. The learning rate is the single most important hyperparameter. Get it wrong and nothing else matters.
The loss landscape is a high-dimensional surface – we want the lowest point. The gradient ∇L tells us which direction is "uphill" at our current position. Gradient descent takes a step in the opposite direction: downhill. Repeat until the gradient is near zero (we've reached a minimum or saddle point).
α Just Right
Smooth convergence toward minimum. Steps shrink naturally as gradient flattens near the bottom. Training loss decreases steadily.
α Too Large
Overshoot past the minimum. Loss oscillates or explodes. Training diverges – you see NaN in your loss. The #1 training failure mode.
α Too Small
Tiny steps, extremely slow progress. May get stuck in a saddle point or flat region. Training takes forever (or budget runs out first).
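All three regimes can be reproduced on the simplest possible loss, L(θ) = θ², whose gradient is 2θ (a toy sketch; real losses are high-dimensional and non-convex):

```python
def gd(lr, steps=50, theta=1.0):
    """Gradient descent on L(theta) = theta**2 (gradient = 2*theta)."""
    for _ in range(steps):
        theta -= lr * 2 * theta
    return theta

print(abs(gd(lr=0.1)) < 1e-4)   # just right: converges to ~0
print(abs(gd(lr=1.5)) > 1e6)    # too large: |theta| doubles each step -> diverges
print(abs(gd(lr=0.001)) > 0.5)  # too small: barely moved after 50 steps
```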
Computing the gradient over the entire dataset each step is accurate but slow. The solution: estimate the gradient from a random subset (mini-batch). The noise from subsampling is not a bug – it's a feature.
| Variant | Batch Size | Gradient Quality | Speed | When Used |
|---|---|---|---|---|
| Full-Batch GD | N (all data) | Exact gradient | Very slow per step | Small datasets, convex problems |
| Stochastic GD | 1 sample | Very noisy estimate | Fast per step | Online learning, streaming data |
| Mini-Batch GD | 32–512 | Good estimate | Best trade-off | Industry standard for all DL |
Why Noise Helps
- Mini-batch noise = implicit regularisation
- Helps escape saddle points and sharp minima
- Smaller batches → flatter minima → better generalisation
- Large-batch training needs special tricks (LARS, LAMB)
Gradient Accumulation
- GPU can only fit batch size B in memory
- Accumulate gradients over K steps before updating
- Effective batch size = B × K
- Simulates large batch without extra memory
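Why this works: for equally sized micro-batches, the mean of the micro-batch gradients equals the full-batch gradient. A framework-free sketch with a toy linear model and hypothetical data:

```python
# (x, y) pairs with y = 2x; loss = mean of 0.5*(w*x - y)^2 over a batch
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def grad(w, batch):
    """d/dw of the mean squared error over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
full = grad(w, data)                      # one pass over all 4 samples
micro = [data[:2], data[2:]]              # two micro-batches of size B = 2
accum = sum(grad(w, mb) for mb in micro) / len(micro)  # effective batch = B*K
print(abs(full - accum) < 1e-12)          # identical update direction
```

In a real framework, calling `.backward()` K times before one `optimizer.step()` does the same averaging implicitly (gradients accumulate in-place).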
Vanilla SGD has two problems: it oscillates in steep dimensions and moves slowly in flat ones. Momentum fixes the first. Adaptive learning rates fix the second. Adam combines both – and it's the default for almost all deep learning.
| Optimiser | Key Idea | Pros | When to Use |
|---|---|---|---|
| SGD | Basic gradient step | Simple, well-understood | Rarely used alone |
| SGD + Momentum | Accumulate velocity β·v + ∇L | Dampens oscillations, faster | CNNs, ResNets, fine-tuning |
| RMSProp | Per-parameter adaptive LR via √(avg g²) | Handles sparse gradients | RNNs, non-stationary objectives |
| Adam | Momentum + RMSProp combined | Fast, robust, low tuning | Default for most DL tasks |
| AdamW | Adam + decoupled weight decay | Better generalisation than Adam | All LLM/Transformer training |
| Adafactor | Adam with factored second moments | Memory efficient (no per-param state) | Very large models (T5, PaLM) |
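The Adam update itself is only a few lines. A minimal sketch on a toy 1-D problem, with the standard default hyperparameters, showing momentum and adaptive scaling combined:

```python
import math

def adam_step(theta, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp-style scaling (v),
    with bias correction for the zero-initialised moment estimates."""
    m = b1 * m + (1 - b1) * g          # first moment (velocity)
    v = b2 * v + (1 - b2) * g * g      # second moment (squared gradients)
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    g = 2 * theta                      # gradient of L = theta**2
    theta, m, v = adam_step(theta, g, m, v, t)
print(abs(theta) < 1.0)                # has moved close to the minimum at 0
```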
A constant learning rate is rarely optimal. The standard practice: warm up from a very small LR, then gradually decay it. Warmup prevents early training instability (large initial gradients). Decay allows fine-grained convergence near the end.
| Schedule | Shape | When Used |
|---|---|---|
| Constant | Flat line | Baselines, very short training |
| Step Decay | Staircase drops every N epochs | ResNet training (divide by 10 at epoch 30, 60) |
| Cosine Annealing | Smooth cosine curve from α_max to α_min | LLM pre-training, standard modern choice |
| Warmup + Cosine | Linear rise then cosine decay | All Transformer training from scratch |
| OneCycleLR | Rise to peak then decay to near-zero | Fast training with super-convergence |
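Warmup + cosine decay is simple enough to write by hand (the step counts and learning rates here are illustrative, not a recipe):

```python
import math

def lr_at(step, total_steps=1000, warmup=100, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total_steps - warmup)  # 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # 0.0: start tiny to avoid early instability
print(lr_at(100))   # 3e-4: peak at the end of warmup
print(lr_at(1000))  # 3e-5: decayed to the floor
```

PyTorch ships equivalents (`torch.optim.lr_scheduler`), but the formula is worth knowing.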
A convex function has a single global minimum – gradient descent is guaranteed to find it. Linear regression (MSE) and logistic regression have convex losses. But deep neural networks are non-convex – the loss surface is filled with local minima, saddle points, and flat plateaus.
The Good News
- Deep networks rarely get stuck in bad local minima
- Saddle points are far more common than local minima in high dimensions
- Momentum and Adam help escape saddle points
- Many local minima have similar loss values (the landscape is "benign")
Flat Minima Hypothesis
- Flat (wide) minima generalise better than sharp (narrow) ones
- Small perturbations in weights don't hurt performance
- SAM optimiser: explicitly seeks flat minima
- Small batch SGD implicitly finds flatter minima
Lagrange multipliers solve the problem of optimising f(x) subject to constraints g(x) = 0. The key insight: at the optimal point, the gradient of f must be parallel to the gradient of g. Introduce multiplier λ and solve ∇f = λ∇g. The KKT conditions extend this to inequality constraints (g(x) ≤ 0).
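A worked toy example: maximise f(x, y) = xy subject to g(x, y) = x + y − 1 = 0. Solving ∇f = λ∇g gives (y, x) = λ(1, 1), so x = y = λ = 1/2. Checking the conditions numerically:

```python
# At the candidate optimum of f(x, y) = x*y under x + y = 1:
x = y = 0.5
lam = 0.5
grad_f = (y, x)        # (df/dx, df/dy) = (y, x)
grad_g = (1.0, 1.0)    # (dg/dx, dg/dy)
print(grad_f == (lam * grad_g[0], lam * grad_g[1]))  # gradients parallel
print(x + y - 1 == 0)                                # constraint satisfied
```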
Where constrained optimisation appears in AI: Support Vector Machines maximise the classification margin subject to correct classification constraints (a quadratic program). RLHF constrains the new policy to stay close to the reference policy (KL constraint). Safety-constrained RL limits reward optimisation to stay within safety bounds. Optimal transport problems are constrained linear programs.
Newton's method uses the Hessian (second derivative matrix) to account for curvature. Instead of following the gradient blindly, it finds the optimal step by solving ∇²L · δ = −∇L. In theory, it converges in far fewer steps. In practice, the Hessian is n×n for n parameters – for GPT-3 with 175B parameters, that's 175B × 175B entries. Completely infeasible.
Quasi-Newton methods (L-BFGS) approximate the Hessian from gradient history, avoiding the O(n²) cost. They work well for small models (up to ~millions of parameters) and are sometimes used in fine-tuning. The natural gradient uses the Fisher information matrix as a metric – accounting for the geometry of probability distributions rather than Euclidean parameter space. K-FAC and Shampoo are practical approximations used in some large-scale training runs.
- Gradient descent: step in the −∇L direction; learning rate controls step size – the most critical hyperparameter to tune.
- Mini-batch GD (B=32–512) is the industry standard: balances gradient accuracy with speed, and noise helps generalisation.
- Adam = momentum + adaptive learning rates – the default choice for most deep learning tasks.
- AdamW (Adam + decoupled weight decay) is preferred for all LLM and Transformer training.
- Warmup + cosine annealing schedule is standard for training large models from scratch – prevents early instability.
- Deep networks are non-convex, but saddle points (not local minima) are the main obstacle – momentum and adaptive methods help escape them.
Graphs encode relational structure – who connects to whom, what depends on what, which atoms bond to which. Graph neural networks extend deep learning to molecules, social networks, and knowledge bases. Self-attention in Transformers is a GNN on a fully-connected graph. This chapter is a concise reference for the graph concepts that matter in modern AI.
A graph G = (V, E) consists of vertices (nodes) V and edges (links) E. Edges can be directed (A → B) or undirected (A ↔ B), weighted or unweighted. Graphs are represented in code as adjacency matrices (dense, good for small graphs) or adjacency lists (sparse, good for large graphs).
Social Networks
Undirected: users are nodes, friendships are edges. Community detection, influence propagation, friend recommendation.
Molecular Graphs
Atoms are nodes, bonds are edges. Drug discovery, protein structure prediction (AlphaFold), material science.
Computation Graphs
Operations are nodes, data flow is edges. DAGs (no cycles). PyTorch autograd builds one every forward pass.
| Algorithm | Strategy | Complexity | AI Application |
|---|---|---|---|
| BFS | Explore level by level | O(V + E) | Shortest path (unweighted), social network distance |
| DFS | Explore deep first, backtrack | O(V + E) | Cycle detection, topological sort |
| Dijkstra | Greedy shortest path (weighted) | O((V+E) log V) | Robotics path planning, network routing |
| A* | Dijkstra + heuristic | O(b^d) worst case | Game AI, planning agents, maze solving |
| Topological Sort | Linearise a DAG | O(V + E) | Execution order in computation graphs |
| Graph Colouring | Assign colours, no adjacent match | NP-hard (general) | Scheduling, register allocation |
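As a concrete instance of the BFS row, here is unweighted shortest-path distance ("degrees of separation") on a toy social graph:

```python
from collections import deque

def bfs_distance(graph, start):
    """Shortest-path distances from `start` in an unweighted graph (BFS)."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:          # first visit = shortest distance
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Toy social network (adjacency list, undirected edges listed both ways)
friends = {"ana": ["ben", "cal"], "ben": ["ana", "dia"],
           "cal": ["ana", "dia"], "dia": ["ben", "cal", "eve"],
           "eve": ["dia"]}
print(bfs_distance(friends, "ana"))  # {'ana': 0, 'ben': 1, 'cal': 1, 'dia': 2, 'eve': 3}
```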
from collections import deque
def topo_sort(graph): # graph = adjacency list
in_deg = {n: 0 for n in graph}
for u in graph:
for v in graph[u]:
in_deg[v] += 1
q = deque(n for n in in_deg if in_deg[n] == 0)
order = []
while q:
u = q.popleft(); order.append(u)
for v in graph[u]:
in_deg[v] -= 1
if in_deg[v] == 0: q.append(v)
return order
# DAG: x→Wx, W→Wx, Wx→Wx+b, b→Wx+b
dag = {"x": ["Wx"], "W": ["Wx"], "Wx": ["Wx+b"], "b": ["Wx+b"], "Wx+b": []}
print(topo_sort(dag)) # ['x', 'W', 'b', 'Wx', 'Wx+b']
When you call .backward(), PyTorch topologically sorts this DAG and walks it in reverse order, computing gradients at each node via the chain rule. This is why the order of operations matters for autograd.
Standard neural networks expect fixed-size inputs (vectors, grids). Graphs have variable numbers of nodes and edges with no fixed ordering. GNNs solve this with the message passing framework: each node updates its representation by aggregating information from its neighbours.
| GNN Variant | Key Idea | Application |
|---|---|---|
| GCN | Spectral convolution, normalised adjacency | Node classification, citation networks |
| GraphSAGE | Sample + aggregate – scales to large graphs | Pinterest recommendation (billions of nodes) |
| GAT | Attention weights on edges | Heterogeneous graphs, knowledge graphs |
| MPNN | General message passing framework | Molecular property prediction |
| GIN | Maximally expressive – as powerful as WL test | Graph isomorphism, graph classification |
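One round of message passing, stripped to its core: mean aggregation over the neighbourhood with self-loops, a simplification of the GCN update with no learned weight matrix or nonlinearity:

```python
import numpy as np

# Tiny graph: node 0 connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # adjacency matrix
H = np.array([[1.0], [0.0], [0.0]])      # one scalar feature per node

A_hat = A + np.eye(3)                     # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1)) # row-normalise: mean over neighbourhood
H_new = D_inv @ A_hat @ H                 # each node averages itself + neighbours
print(H_new.ravel())                      # node 0's feature has spread to 1 and 2
```

Stacking K such rounds (with learned weights and nonlinearities between them) lets information travel K hops, which is the message passing framework in the table above.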
Knowledge graphs store facts as (subject, predicate, object) triples. For example: (Paris, isCapitalOf, France), (France, isPartOf, Europe). Major KGs include Wikidata (100M+ entities), Google Knowledge Graph, and domain-specific KGs for medicine, law, and science.
KG embeddings like TransE learn vector representations where subject + relation ≈ object in embedding space. This enables link prediction ("if we know Paris is in France and France is in Europe, can we infer Paris is in Europe?"). RAG with KGs retrieves relevant triples and injects them into LLM prompts – grounding generation in verified facts.
Limitations: KGs are brittle, require manual curation, can't handle uncertainty or nuance well, and struggle with temporal facts. Hybrid approaches (KG + LLM) are an active research frontier.
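The TransE idea in miniature, with hand-picked 2-D embeddings (hypothetical values; a trained model would learn these so that plausible triples score near zero):

```python
import numpy as np

emb = {
    "Paris":       np.array([1.0, 2.0]),
    "France":      np.array([3.0, 3.0]),
    "isCapitalOf": np.array([2.0, 1.0]),
}

# TransE scores a triple (s, r, o) by || emb[s] + emb[r] - emb[o] ||.
good = np.linalg.norm(emb["Paris"] + emb["isCapitalOf"] - emb["France"])
bad = np.linalg.norm(emb["France"] + emb["isCapitalOf"] - emb["Paris"])
print(good)  # ~0: a triple the model believes
print(bad)   # large: an implausible triple
```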
Why symbolic AI died: The search space for reasoning grows exponentially with problem size – the combinatorial explosion. A game tree for Go has ~10¹⁷⁰ positions. Brute-force search is impossible. This is why statistical (learning-based) methods replaced rule-based AI.
For ML practitioners, Big-O notation matters most when reasoning about scaling. Standard Transformer attention is O(n²) in sequence length – the main bottleneck for long-context models. Linear attention variants reduce this to O(n) but often sacrifice quality.
| Operation | Complexity | Example in AI |
|---|---|---|
| Embedding lookup | O(1) | Token embedding in LLMs |
| Matrix multiply (n×n) | O(n³) | Dense layer forward pass |
| Self-attention | O(n²·d) | Transformer – main bottleneck |
| Linear attention | O(n·d²) | Mamba, RWKV, RetNet |
| Sorting | O(n log n) | Top-k token selection |
| k-NN search (brute) | O(n·d) | Vector DB exact search |
| ANN search (HNSW) | O(log n) | Vector DB approximate search |
- Graphs encode relational structure – nodes, edges, and their properties represent social networks, molecules, computation flows, and knowledge bases.
- Message passing is the universal GNN framework: aggregate neighbour information to update node representations – repeat for K layers to see K hops.
- GNNs excel at molecular graphs (AlphaFold), knowledge graphs, and recommendation systems – self-attention is a GNN on a fully-connected graph.
- Knowledge graphs store facts as (subject, predicate, object) triples – can augment LLM retrieval via RAG with structured facts.
- Standard Transformer attention is O(n²) in sequence length – the main scaling bottleneck driving research into linear attention alternatives.
- Linear algebra: neural net layers = Wx + b; attention = QKᵀ; embeddings are vectors; tensors shape everything.
- Calculus: backprop = chain rule applied recursively; gradient = direction of steepest ascent on loss surface.
- Probability: model outputs are distributions; LLM tokens are sampled from a categorical distribution over 50K+ items.
- Bayesian inference: MSE = Gaussian MLE; cross-entropy = Bernoulli/categorical MLE; L2 regularisation = Gaussian prior (MAP).
- Information theory: CE loss = −log(p_correct); minimising cross-entropy = minimising KL divergence from model to data.
- Optimisation: Adam = momentum + adaptive LR; warmup + cosine annealing is the standard LLM training recipe.
- Graphs: GNNs use message passing; knowledge graphs store (s, p, o) triples; self-attention = GNN on complete graph.
- These are not prerequisites – they are the explanation for why ML works.