AI Foundation Β· Domain 02

Mathematics & Statistics for AI

Linear algebra, calculus, gradients, and the mathematical language that powers every machine learning model.

2.1
Chapter 2.1
Linear Algebra β€” The Language of Data

Every neural network layer is a matrix multiplication. Every word embedding is a vector. Every attention score is a dot product. Linear algebra is not a prerequisite for machine learning β€” it is machine learning. If you understand how vectors transform under matrix operations, every model architecture in this course will click into place.

Before any algorithm, you need to understand the three fundamental objects of linear algebra. Think of them as increasing levels of generality:

πŸ”’

Scalar

A single number. Examples: temperature (21.5Β°C), learning rate (0.001), a pixel intensity (127). Denoted by a lowercase italic: x.

πŸ“Š

Vector

An ordered list of numbers β€” a point in n-dimensional space. Examples: an RGB pixel [255, 128, 0], a 512-dim word embedding. Denoted in bold: x.

πŸ“‹

Matrix

A 2D table of numbers β€” rows and columns. Examples: a 28Γ—28 grayscale image, a weight matrix in a neural network. Denoted in bold uppercase: X.

A 512-dimensional word embedding is just a vector with 512 numbers. When GPT processes a prompt of 100 tokens, it works with a 100 Γ— 512 matrix β€” each row is one token's embedding. A dataset of 1,000 images (28Γ—28 pixels) becomes a 1,000 Γ— 784 matrix.

Notation convention: Scalars are lowercase italic (x), vectors are bold lowercase (x), matrices are bold uppercase (X). This convention is universal across ML papers and textbooks. When you see W in a paper, you instantly know it's a matrix.
Figure: a scalar (3.14) generalises to a 4-dim vector [0.2, 0.8, 0.4, 0.1], which generalises to a 3×3 matrix — each object is a higher-dimensional generalisation of the previous one.

Why does matrix multiplication matter? Because a neural network layer is literally: output = WΒ·x + b. Every forward pass through a model is a sequence of matrix multiplications punctuated by nonlinear activation functions. If you can visualise matrix multiplication, you can visualise what a neural network does.

The shape rule: To multiply an (m Γ— n) matrix by an (n Γ— p) matrix, the inner dimensions must match. The result is (m Γ— p). This is why you get "dimension mismatch" errors in PyTorch β€” it means the inner dimensions don't agree.

Matrix multiplication works by taking the dot product of each row of the left matrix with each column of the right matrix. For element C[i,j]: you multiply corresponding entries of row i and column j, then sum them up.

Worked Example: (2Γ—3) Β· (3Γ—2) = (2Γ—2)
01
Matrix A = [[1, 2, 3], [4, 5, 6]] β†’ shape (2Γ—3)
02
Matrix B = [[7, 10], [8, 11], [9, 12]] β†’ shape (3Γ—2)
03
C[0,0] = 1Γ—7 + 2Γ—8 + 3Γ—9 = 7 + 16 + 27 = 50
04
C[0,1] = 1Γ—10 + 2Γ—11 + 3Γ—12 = 10 + 22 + 36 = 68
05
C[1,0] = 4Γ—7 + 5Γ—8 + 6Γ—9 = 28 + 40 + 54 = 122
06
C[1,1] = 4Γ—10 + 5Γ—11 + 6Γ—12 = 40 + 55 + 72 = 167
07
Result C = [[50, 68], [122, 167]] β†’ shape (2Γ—2) βœ“
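The worked example above can be verified in a couple of lines (a NumPy sketch — `@` is Python's matrix-multiply operator):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 10],
              [8, 11],
              [9, 12]])          # shape (3, 2)

C = A @ B                        # inner dims (3) match -> result is (2, 2)
print(C)
# [[ 50  68]
#  [122 167]]
```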
The Neural Network Layer Equation
y = Wx + b
W = weight matrix (output_size × input_size), x = input vector (input_size × 1), b = bias vector (output_size × 1), y = output vector (output_size × 1). This single equation IS a neural network layer.

Transpose flips rows and columns: if A is (m × n), then Aᵀ is (n × m). You see it constantly — attention computes QKᵀ, meaning each query vector is dot-producted with each key vector. The transpose makes the shapes line up.

Hadamard (element-wise) product multiplies corresponding entries. Notation: A βŠ™ B. It appears in LSTM gating (forget gate Γ— cell state), masked attention, and dropout.

Figure: W (2×3) times x (3×1) gives y (2×1). Row 0: 1×7 + 2×8 + 3×9 = 50; Row 1: 4×7 + 5×8 + 6×9 = 122. Inner dims must match: (2×3) · (3×1) = (2×1). Each output element is the dot product of one row of W with the column x. This IS a neural network layer.
import torch

W = torch.randn(4, 3) # Weight matrix: 4 outputs, 3 inputs
x = torch.randn(3, 1) # Input vector: 3 features
b = torch.randn(4, 1) # Bias vector: 4 outputs
y = W @ x + b # @ = matrix multiply in Python
# y.shape = (4, 1) # 4-dimensional output
Operation | Notation | Where in AI
Matrix Multiply | C = AB | Every neural network layer, attention (QKᵀ)
Transpose | Aᵀ | Attention scores: QKᵀ aligns query/key dims
Hadamard | A ⊙ B | LSTM gates, masked attention, dropout
Outer Product | uvᵀ | Attention weight × value in some formulations

The dot product is the single most important operation in modern AI. It measures how aligned two vectors are β€” if they point in the same direction, the dot product is large and positive. If they're perpendicular, it's zero. If they point in opposite directions, it's large and negative.

Dot Product — Two Equivalent Definitions
a · b = Σ(aᵢ · bᵢ) = |a| · |b| · cos(θ)
First form: multiply corresponding entries and sum. Second form: product of lengths times cosine of the angle between them. When vectors are unit-length, the dot product IS the cosine similarity.
🎯

Cosine Similarity

Normalise both vectors to unit length, then take the dot product. Result is between βˆ’1 (opposite) and +1 (identical direction). This is how embedding search works: "Is 'king' similar to 'queen'?" becomes "Is cos(king, queen) close to 1?"

πŸ”

Attention = Dot Products

In Transformers, attention score = QKᵀ / √dₖ. Each query vector is dot-producted with each key vector. A high dot product means "this token should attend to that token." This is the core of GPT, BERT, and every LLM.

Figure: geometric interpretation — cos(0°) = 1 (aligned), cos(90°) = 0 (perpendicular). In embedding space, cos(king, queen) ≈ 0.92 while cos(king, apple) ≈ 0.15.
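A minimal sketch of the computation (the helper name here is illustrative, not a library API):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of lengths = cos(theta)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(cosine_similarity(a, a))                      # 1.0 (identical direction)
print(cosine_similarity(a, b))                      # ~0.707 (45 degrees apart)
print(cosine_similarity(a, np.array([0.0, 1.0])))   # 0.0 (perpendicular)
```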

Norms measure the "size" or "length" of a vector. They appear everywhere in regularisation β€” when you penalise large weights, you're literally penalising the norm of the weight vector.

Norm | Formula | Intuition | Where in AI
L1 (Manhattan) | Σ|xᵢ| | Sum of absolute values — taxicab distance | L1 regularisation → sparse weights (Lasso)
L2 (Euclidean) | √(Σxᵢ²) | Straight-line distance | L2 regularisation (weight decay), gradient clipping
Frobenius | √(Σᵢⱼ Aᵢⱼ²) | L2 norm for matrices | Weight matrix regularisation, LoRA rank analysis
Max (L∞) | max|xᵢ| | Largest single element | Adversarial robustness, gradient clipping by max
Why L2 regularisation works: Adding Ξ»||W||Β² to the loss function penalises large weights. This pushes the model to use smaller, more distributed weights β€” which improves generalisation. In PyTorch, this is the weight_decay parameter in the optimiser.
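The three vector norms from the table, computed on a small example (NumPy sketch):

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.abs(x).sum()        # L1: |3| + |-4| = 7
l2 = np.linalg.norm(x)      # L2: sqrt(9 + 16) = 5
linf = np.abs(x).max()      # L-inf: largest single |entry| = 4

print(l1, l2, linf)         # 7.0 5.0 4.0
```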

Intuition first: Most vectors change direction when you multiply them by a matrix. But certain special vectors only get scaled β€” they keep pointing the same way. These are eigenvectors, and the scale factor is the eigenvalue. Think of them as the "natural axes" of a transformation.

The Eigen Equation
Av = λv
Matrix A times vector v = scalar λ times the same vector v. The vector v is the eigenvector. The scalar λ is the eigenvalue. This means A only stretches v by factor λ — it never rotates it.
πŸ”„

Eigendecomposition

  • A = QΞ›Q⁻¹ where Q = matrix of eigenvectors, Ξ› = diagonal matrix of eigenvalues
  • Only works for square matrices with n linearly independent eigenvectors
  • Symmetric matrices always have real eigenvalues and orthogonal eigenvectors
πŸ“‰

PCA Connection

  • PCA = find eigenvectors of the covariance matrix
  • Eigenvector with largest eigenvalue = direction of most variance
  • Keep top-k eigenvectors β†’ reduce dimensions from n to k
  • Used in data preprocessing, visualisation, noise reduction
Worked Example: Find Eigenvalues of a 2Γ—2 Matrix
01
Given A = [[4, 2], [1, 3]]
02
Solve det(A βˆ’ Ξ»I) = 0 β†’ det([[4βˆ’Ξ», 2], [1, 3βˆ’Ξ»]]) = 0
03
(4βˆ’Ξ»)(3βˆ’Ξ») βˆ’ 2Γ—1 = 0 β†’ λ² βˆ’ 7Ξ» + 10 = 0
04
Factor: (Ξ»βˆ’5)(Ξ»βˆ’2) = 0 β†’ λ₁ = 5, Ξ»β‚‚ = 2
05
For λ₁=5: (Aβˆ’5I)v = 0 β†’ [[-1,2],[1,-2]]v = 0 β†’ v₁ = [2, 1]
06
For Ξ»β‚‚=2: (Aβˆ’2I)v = 0 β†’ [[2,2],[1,1]]v = 0 β†’ vβ‚‚ = [-1, 1]
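NumPy can confirm the hand calculation (note that `np.linalg.eig` may return the eigenvalues in any order):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))        # [2. 5.]

# Check Av = lambda*v for the hand-derived eigenvector v1 = [2, 1]
v1 = np.array([2.0, 1.0])
print(A @ v1)                  # [10.  5.] = 5 * v1
```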
Figure: eigenvectors v₁ (λ=5) and v₂ (λ=2) are only stretched by A, while a regular vector is rotated as well as scaled. In PCA, the eigenvectors of the covariance matrix give PC₁ (most variance) and PC₂ (less variance).
Why eigenvalues of the Hessian matter: At a critical point of the loss function, the eigenvalues of the Hessian matrix (second-derivative matrix) tell you whether it's a minimum (all positive), maximum (all negative), or saddle point (mixed signs). In deep learning, most critical points are saddle points β€” which is why SGD still works well.

SVD is the generalised eigendecomposition that works for any matrix β€” not just square ones. It decomposes any mΓ—n matrix into three factors: rotation Γ— scale Γ— rotation.

SVD — The Universal Matrix Factorisation
A = UΣVᵀ
U = left singular vectors (m×m, orthogonal), Σ = singular values on the diagonal (m×n), Vᵀ = right singular vectors (n×n, orthogonal). Singular values σ₁ ≥ σ₂ ≥ … ≥ 0 are always non-negative and sorted.

Low-rank approximation: Keep only the top-k singular values (set the rest to zero). This gives the best rank-k approximation of the original matrix β€” provably optimal. This is how you compress a matrix while losing the least information.

πŸ“

Dimensionality Reduction

Truncated SVD reduces features while preserving the most important structure. LSA (Latent Semantic Analysis) applies SVD to term-document matrices for text.

🎬

Recommender Systems

The Netflix Prize was won by SVD-based methods. Factorise the user-movie rating matrix into user preferences Γ— movie features.

πŸ”—

Connection to PCA

PCA of centered data = SVD of the data matrix. The right singular vectors (V) are the principal components. Same result, different computation.

Figure: A (m×n) = U (m×m) × Σ (m×n, diagonal) × Vᵀ (n×n). Low-rank approximation: keep the top-k singular values σ₁ ≥ σ₂ ≥ … ≥ σₖ, set σₖ₊₁ … σₙ = 0 — the best rank-k approximation (Eckart–Young).
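A small sketch of truncated SVD on random data, including a check of the Eckart–Young property (the spectral-norm error of the best rank-k approximation equals the first discarded singular value):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 approximation: keep the top-2 singular values, zero the rest
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Eckart-Young: spectral-norm error = first discarded singular value
err = np.linalg.norm(A - A_k, ord=2)
print(np.isclose(err, S[k]))   # True
```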

A tensor is the N-dimensional generalisation of vectors (1D) and matrices (2D). In deep learning, almost all data lives in tensors. Understanding tensor shapes is the single most practical skill for debugging neural networks.

πŸ“¦

Common Tensor Shapes

  • Images: (B, C, H, W) β€” batch, channels, height, width
  • Text: (B, T, D) β€” batch, sequence length, embedding dim
  • Attention: (B, heads, T, T) β€” batch, heads, query, key
  • Video: (B, T, C, H, W) β€” 5D tensor!
πŸ“‘

Broadcasting

  • PyTorch can operate on tensors of different shapes
  • Rules: dimensions compared right to left
  • Size 1 is "broadcast" (stretched) to match the other
  • Adding bias (4,1) to output (4,32) works via broadcasting
0D scalar () → 1D vector (512,) → 2D matrix (32, 512) → 3D tensor (B, T, D) → 4D images (B, C, H, W) → 5D video (B, T, C, H, W)
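The broadcasting rules above in action (a NumPy sketch; PyTorch follows the same right-to-left rules):

```python
import numpy as np

y = np.zeros((4, 32))      # layer output: 4 units x 32 batch columns
b = np.ones((4, 1))        # bias: one value per output unit

out = y + b                # size-1 dim is stretched: (4,1) -> (4,32)
print(out.shape)           # (4, 32)

# Shapes are compared right to left; size-1 dims broadcast
a = np.ones((32, 1, 512))
c = np.ones((1, 128, 512))
print((a + c).shape)       # (32, 128, 512)
```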
import torch

# Scalar (0D tensor)
x = torch.tensor(3.14) # shape: ()

# Vector (1D tensor) β€” one word embedding
emb = torch.randn(512) # shape: (512,)

# Matrix (2D tensor) β€” batch of embeddings
batch_emb = torch.randn(32, 512) # shape: (32, 512) = batch Γ— dim

# 3D tensor β€” sequence of embeddings
seq = torch.randn(32, 128, 512) # shape: (batch, seq_len, d_model)

# 4D tensor β€” image batch (B, C, H, W)
images = torch.randn(16, 3, 224, 224) # 16 RGB images, 224Γ—224
The .shape mental model: In PyTorch, every tensor has three critical attributes: .shape (dimensions), .dtype (float32, int64, etc.), and .device (cpu or cuda). When debugging, the first thing to check is always tensor.shape. Most bugs are shape mismatches.
Chapter 2.1 β€” Key Takeaways
  • Neural network layers = matrix multiplication: output = Wx + b. This is the single most important equation in deep learning.
  • Dot product measures alignment between vectors β€” it powers attention (QΒ·Kα΅€) and cosine similarity in embedding search.
  • L1/L2 norms of weight matrices are what regularisation literally penalises β€” smaller norms β†’ better generalisation.
  • Eigenvectors = natural axes of a transformation β€” PCA finds them for data, the Hessian's eigenvalues classify critical points.
  • SVD decomposes any matrix into UΞ£Vα΅€ β€” enables low-rank approximation, compression, and recommendation systems.
  • Tensors generalise matrices to N dimensions β€” shape (B, T, D) is the universal LLM data format; (B, C, H, W) for images.
2.2
Chapter 2.2
Calculus & Gradients β€” How Models Learn

Backpropagation β€” the algorithm that trains every neural network β€” is four lines of calculus applied recursively. Understand the chain rule deeply and everything else follows. Every time you call loss.backward() in PyTorch, the machine is doing exactly the calculus described in this chapter.

A derivative answers one question: "If I nudge this input by a tiny amount, how much does the output change?" In machine learning, that becomes: "If I nudge this weight by Ξ΅, how much does the loss change?" The answer tells you exactly how to adjust the weight to reduce the error.

πŸ“

Derivative (Single Variable)

  • Rate of change of f(x) at a specific point
  • Geometrically: the slope of the tangent line
  • f'(x) = lim(Ξ΅β†’0) [f(x+Ξ΅) βˆ’ f(x)] / Ξ΅
  • If f'(x) > 0: function is increasing at x
  • If f'(x) < 0: function is decreasing at x
🧭

Gradient (Multi-Variable)

  • Vector of ALL partial derivatives
  • βˆ‡L = [βˆ‚L/βˆ‚w₁, βˆ‚L/βˆ‚wβ‚‚, …, βˆ‚L/βˆ‚wβ‚™]
  • Points in the direction of steepest ascent
  • We subtract the gradient to go downhill
  • A 1.8-trillion-parameter model has ~1.8 trillion gradient entries per step

Partial derivative: when a function has many inputs (like millions of weights), the partial derivative βˆ‚L/βˆ‚wα΅’ holds all other weights fixed and measures how L changes with respect to wα΅’ alone. The gradient stacks all of these into one vector.

Worked Example: Gradient of MSE for a Single Weight
01
Model: Ε· = wx (one weight, one input)
02
Loss: L = (Ε· βˆ’ y)Β² = (wx βˆ’ y)Β²
03
Apply chain rule: βˆ‚L/βˆ‚w = 2(wx βˆ’ y) Β· x
04
With x=3, y=7, w=1: Ε· = 3, L = (3βˆ’7)Β² = 16
05
βˆ‚L/βˆ‚w = 2(3βˆ’7)Β·3 = 2(βˆ’4)(3) = βˆ’24
06
Gradient is negative β†’ loss decreases if we increase w βœ“
07
Update: w ← 1 βˆ’ 0.01Β·(βˆ’24) = 1.24 (closer to correct w=7/3)
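PyTorch's autograd reproduces this hand calculation exactly (same numbers as the worked example):

```python
import torch

x, y = 3.0, 7.0
w = torch.tensor(1.0, requires_grad=True)

loss = (w * x - y) ** 2     # L = (wx - y)^2
loss.backward()

print(w.grad)               # tensor(-24.) -- matches 2(wx - y)*x
w_new = w.item() - 0.01 * w.grad.item()
print(w_new)                # 1.24
```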
Gradient Descent Update Rule
w ← w − α · ∂L/∂w
w = weight, α = learning rate (typically 0.001–0.01), ∂L/∂w = partial derivative of loss with respect to weight. The minus sign is critical: we go OPPOSITE to the gradient (downhill, not uphill).
Figure: gradient descent on a loss landscape — compute the slope ∂L/∂w at w₀, then step opposite to the gradient (w₁, w₂, w₃, …) toward the minimum: w ← w − α · ∂L/∂w.

A neural network is a chain of composed functions: input β†’ linear β†’ activation β†’ linear β†’ activation β†’ … β†’ loss. The chain rule lets us compute how the loss depends on any weight, no matter how deep the network. It's the mathematical foundation of backpropagation.

The Chain Rule
d/dx[ f(g(x)) ] = f'(g(x)) · g'(x)
In words: the derivative of a composed function = the derivative of the outer function evaluated at the inner, times the derivative of the inner function. For multiple layers, just keep multiplying.
Three-layer example: Given L(a(z(x))), the chain rule gives us βˆ‚L/βˆ‚x = (βˆ‚L/βˆ‚a) Β· (βˆ‚a/βˆ‚z) Β· (βˆ‚z/βˆ‚x). Each term is a local gradient β€” computed at one node. Backpropagation simply evaluates these terms from right to left (output to input), multiplying as it goes.
Backprop Through a Single Layer: z = Wx + b β†’ a = ReLU(z) β†’ L = (a βˆ’ y)Β²
β†’
Forward pass: x=2, W=0.5, b=0.1, y=1.0
01
z = Wx + b = 0.5Β·2 + 0.1 = 1.1
02
a = ReLU(1.1) = 1.1 (positive, so passes through)
03
L = (1.1 βˆ’ 1.0)Β² = 0.01
←
Backward pass:
04
βˆ‚L/βˆ‚a = 2(a βˆ’ y) = 2(0.1) = 0.2
05
βˆ‚a/βˆ‚z = 1 (ReLU derivative, z > 0) β†’ βˆ‚L/βˆ‚z = 0.2 Β· 1 = 0.2
06
βˆ‚z/βˆ‚W = x = 2 β†’ βˆ‚L/βˆ‚W = 0.2 Β· 2 = 0.4
07
Update: W ← 0.5 βˆ’ 0.01Β·0.4 = 0.496
Figure: computational graph, forward and backward pass. Forward: x=2, W=0.5 → z = Wx+b = 1.1 → ReLU(z) = 1.1 → L = (a−y)² = 0.01. Backward: ∂L/∂a = 0.2 → ∂L/∂z = 0.2 → ∂L/∂W = 0.4. Chain rule: ∂L/∂W = ∂L/∂a × ∂a/∂z × ∂z/∂W = 0.2 × 1 × 2 = 0.4.
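The single-layer backward pass above can be replayed with autograd (same forward values; gradients agree up to floating-point rounding):

```python
import torch

x = torch.tensor(2.0)
y = torch.tensor(1.0)
W = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = W * x + b               # 1.1
a = torch.relu(z)           # 1.1 (positive, passes through)
L = (a - y) ** 2            # 0.01
L.backward()

print(W.grad)               # ~0.4, matches the hand calculation
print(b.grad)               # ~0.2 = dL/dz * dz/db
```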
πŸ”—

Why This Works for Deep Networks

A 100-layer network is 100 composed functions. The chain rule produces a product of 100 local gradients. Backprop computes all of them in one backward pass β€” same cost as one forward pass. This is why deep learning is computationally tractable.

⚠️

Vanishing & Exploding Gradients

If each local gradient is < 1, the product shrinks exponentially through 100 layers β†’ vanishing gradients. If > 1, it explodes. Solutions: residual connections (ResNet), LayerNorm, gradient clipping, careful initialisation (He/Xavier).

When your function takes a vector in and produces a vector out, the derivative is no longer a single number or even a vector β€” it's a matrix. The Jacobian matrix captures how every output component changes with respect to every input component.

Jacobian Matrix
J[i,j] = ∂fᵢ / ∂xⱼ
For f: ℝⁿ → ℝᵐ, the Jacobian is an m×n matrix. Row i = gradient of the i-th output. Column j = how the j-th input affects all outputs. In PyTorch, torch.autograd.functional.jacobian() computes this.
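A small example of the PyTorch call mentioned above, for a toy f: ℝ² → ℝ² (the function itself is illustrative):

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # f(x) = [x0^2, x0 * x1]
    return torch.stack([x[0] ** 2, x[0] * x[1]])

x = torch.tensor([2.0, 3.0])
J = jacobian(f, x)
print(J)
# tensor([[4., 0.],    <- row 0: gradient of x0^2  = [2*x0, 0]
#         [3., 2.]])   <- row 1: gradient of x0*x1 = [x1, x0]
```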
Hessian Matrix — Second-Order Information
H[i,j] = ∂²L / ∂wᵢ∂wⱼ
The Hessian captures the curvature of the loss surface. For a model with n parameters, the Hessian is n×n. For GPT-3 (175B params), that's 175B × 175B entries — never computed explicitly.
🟒

H Positive Definite

All eigenvalues positive β†’ all directions curve upward β†’ local minimum. The loss increases no matter which direction you move. This is where we want the optimiser to converge.

🟑

H Indefinite

Mixed positive and negative eigenvalues β†’ saddle point. Curves up in some directions, down in others. Most critical points in deep learning are saddle points, not local minima.

πŸ”΅

H β‰ˆ 0 (Flat)

Near-zero eigenvalues β†’ flat region. Loss barely changes. Research suggests flat minima generalise better than sharp ones (SAM optimiser exploits this).

Figure: loss surface critical points — minimum (H positive definite), saddle point (H indefinite), flat region (H ≈ 0).

The Taylor series approximates any smooth function locally with a polynomial. The key insight: gradient descent is the first-order Taylor approximation. It assumes the loss surface is locally linear β€” which is only accurate when the learning rate is small. This is why large learning rates cause training to diverge: the linear approximation breaks down.

Taylor Expansion of the Loss Function
f(x + δ) ≈ f(x) + ∇f(x)ᵀδ + ½δᵀHδ + …
First-order term (∇f(x)ᵀδ) → gradient descent: assume the loss is locally linear. Second-order term (½δᵀHδ) → Newton's method: account for curvature. Higher-order terms are almost never used in practice.
Method | Taylor Order | Cost per Step | Convergence
Gradient Descent | 1st order | O(n) — one gradient | Slow but cheap
Newton's Method | 2nd order | O(n³) — Hessian inverse | Fast but prohibitive for n > 10K
L-BFGS | Quasi-2nd order | O(n) — approximate Hessian | Good for small models
Adam | Adaptive 1st order | O(n) — per-param rates | Default for deep learning

Computing gradients for a model with millions of parameters by hand is impossible. Automatic differentiation (autodiff) solves this: the computer applies the chain rule automatically by recording every operation during the forward pass, then replaying them in reverse to compute all gradients simultaneously.

➑️

Forward Mode

  • Compute derivative alongside the function value
  • One pass gives derivatives w.r.t. one input
  • Efficient when: few inputs, many outputs
  • Cost: O(n) passes for n input variables
  • Rarely used in deep learning
⬅️

Reverse Mode (Backprop)

  • Record operations, then replay in reverse
  • One pass gives derivatives w.r.t. ALL inputs
  • Efficient when: many inputs, few outputs
  • Cost: O(1) backward passes (same cost as forward)
  • This is backpropagation
Why reverse mode wins for ML: A loss function has 1 output (the loss) and millions of weight inputs. Forward mode would need millions of passes (one per weight). Reverse mode needs exactly one backward pass to get all gradients. This is why loss.backward() is one call, not a loop.
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

# Forward pass: build computational graph
z = w * x ** 2 + 5 # z = 3xΒ² + 5
# dz/dx = 6x = 12, dz/dw = xΒ² = 4

# Backward pass: compute ALL gradients via reverse-mode autodiff
z.backward()

print(x.grad) # tensor(12.) ← dz/dx = 6x evaluated at x=2
print(w.grad) # tensor(4.) ← dz/dw = xΒ² evaluated at x=2
Figure: forward pass x=2, w=3 → x²=4 → wx²=12 → z=17. Backward pass: ∂z/∂z = 1, ∂z/∂(wx²) = 1, ∂(wx²)/∂(x²) = w = 3, giving ∂z/∂w = x² = 4 and ∂z/∂x = 6x = 12. One backward pass yields all gradients simultaneously.

In PyTorch, every tensor with requires_grad=True records operations in a computational graph. When you call .backward(), PyTorch walks the graph in reverse, applying the chain rule at each node. The gradients accumulate in each leaf tensor's .grad attribute. This is the "gradient tape" β€” record forward, replay backward.

PyTorch Concept | What It Does | Calculus Equivalent
requires_grad=True | Tell PyTorch to track this tensor's operations | Mark as a variable to differentiate w.r.t.
.backward() | Trigger reverse-mode autodiff | Apply chain rule from output to all inputs
.grad | Access the computed gradient | ∂L/∂w evaluated at current values
.detach() | Stop gradient tracking | Treat as a constant (no derivative)
torch.no_grad() | Disable gradient computation (inference) | Skip all derivative bookkeeping
Chapter 2.2 β€” Key Takeaways
  • Derivative = rate of change; gradient = vector of all partial derivatives pointing in the direction of steepest ascent on the loss surface β€” we subtract it to descend.
  • Chain rule: d/dx[f(g(x))] = f'(g(x))Β·g'(x) β€” the mathematical foundation of backpropagation, applied recursively through every layer.
  • Jacobian generalises gradient to vector functions; Hessian captures curvature β€” expensive (O(nΒ²)), rarely computed explicitly for large models.
  • Gradient descent is a first-order Taylor approximation β€” valid only when learning rate is small enough that the linear approximation holds.
  • PyTorch autograd uses reverse-mode autodiff β€” one backward pass computes ALL gradients simultaneously, making million-parameter training tractable.
2.3
Chapter 2.3
Probability Theory β€” Reasoning Under Uncertainty

Machine learning models are probability machines. A classifier outputs P(cat | image). An LLM outputs P(next token | context). A diffusion model reverses a probabilistic noise process. Every loss function, every prediction, and every training signal is grounded in probability theory. Without it, you're flying blind.

A probability P(A) is a number between 0 and 1 that measures how likely an event A is. The sample space Ξ© is the set of all possible outcomes; an event is a subset of Ξ©. Three axioms define the entire theory: P(Ξ©) = 1, P(A) β‰₯ 0 for any event, and for disjoint events, P(A βˆͺ B) = P(A) + P(B).

🎲

Joint Probability

P(A ∩ B) β€” the probability that both A and B occur. If independent: P(A ∩ B) = P(A) Β· P(B). Naive Bayes assumes feature independence to use this simplification.

πŸ”—

Conditional Probability

P(A | B) = P(A ∩ B) / P(B) β€” probability of A given that B occurred. Every autoregressive LLM computes P(next_token | all_previous_tokens) at each step.

πŸ”€

Independence

A and B are independent if P(A | B) = P(A). Knowing B tells you nothing about A. In practice, few real features are truly independent β€” but assuming they are often works (Naive Bayes).

When GPT-4 generates the next token, it computes a probability for every word in its vocabulary β€” typically 50,000+ values that must sum to 1. This is a categorical distribution over tokens. The softmax function transforms raw model outputs (logits) into this valid probability distribution. Every chapter in this course connects back to this fundamental mechanism.

A random variable X is a function that maps outcomes to numbers. Flip a coin β†’ X = 1 (heads) or X = 0 (tails). A distribution describes the probabilities of all possible values of X. Distributions come in two flavours:

πŸ“Š

Discrete

  • PMF: P(X = x) for each possible value
  • All probabilities must sum to 1
  • Examples: coin flips, dice rolls, token IDs
  • Classification output is discrete
πŸ“ˆ

Continuous

  • PDF: f(x) such that area under curve = 1
  • P(X = exact value) = 0; use P(a ≀ X ≀ b)
  • Examples: weight values, pixel intensities, latent codes
  • CDF: F(x) = P(X ≀ x), always non-decreasing
Distribution | Type | Parameters | AI Application
Bernoulli | Discrete | p (success probability) | Binary classification: spam/not-spam, dropout mask
Categorical | Discrete | p₁, p₂, …, pₖ | Softmax output — LLM next-token prediction over vocab
Gaussian | Continuous | μ (mean), σ² (variance) | Weight init, noise injection, VAE latent space, diffusion
Uniform | Continuous | a (min), b (max) | Random init, uniform noise, exploration in RL
Poisson | Discrete | λ (rate) | Event count modelling: clicks, failures, arrivals
Figure: four distributions — Bernoulli (binary classification: spam / not-spam), Categorical (LLM softmax output, P(token | context)), Gaussian (weight initialisation, VAE latent space), Uniform (random initialisation, dropout masks).

Expected value E[X] is the weighted average of all possible outcomes. "If I ran the experiment infinitely many times, what's the average?" For discrete X: E[X] = Ξ£ xα΅’ P(xα΅’). For continuous X: E[X] = ∫ x f(x) dx. In ML, the loss function is an expectation: L = E[β„“(Ε·, y)] averaged over the data distribution.

The Key Quantities
E[X] = Σ xᵢ · P(xᵢ) — expected value: the weighted average of all outcomes.
Var[X] = E[(X − μ)²] = E[X²] − (E[X])² — variance: average squared deviation from the mean. High variance = spread out.
Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)] — covariance: how X and Y move together. Positive = same direction. Negative = opposite. Zero = uncorrelated.
🎯

Correlation

Normalised covariance: ρ = Cov(X,Y) / (Οƒβ‚“ Β· Οƒα΅§). Always between βˆ’1 and +1. Correlation of +1 means perfect linear relationship. Zero correlation β‰  independence (only for Gaussians).

πŸ“

Covariance Matrix

For n features, the covariance matrix Ξ£ is nΓ—n, where Ξ£α΅’β±Ό = Cov(Xα΅’, Xβ±Ό). Diagonal = variances, off-diagonal = covariances. PCA = eigendecomposition of Ξ£ (Ch 2.1).

Figure: three scatter plots — Cov(X,Y) > 0 (positive correlation), Cov(X,Y) ≈ 0 (no correlation), Cov(X,Y) < 0 (negative correlation).

Bayes' theorem is the most important equation in probabilistic reasoning. It tells you how to update your beliefs when you see new data. Start with a prior belief, observe evidence, compute a posterior belief. This is the core logic of Bayesian machine learning, spam filters, medical diagnosis, and A/B testing.

Bayes' Theorem
P(θ | D) = P(D | θ) · P(θ) / P(D)
Posterior = (Likelihood × Prior) / Evidence. θ = model/hypothesis, D = observed data. The posterior is your updated belief about θ after seeing data D.
πŸ“‹

The Four Terms

  • Prior P(ΞΈ): What you believed before data
  • Likelihood P(D|ΞΈ): How probable the data is under this model
  • Evidence P(D): Total probability of the data (normaliser)
  • Posterior P(ΞΈ|D): Updated belief after seeing data
πŸ€–

In ML Terms

  • MLE: Find ΞΈ maximising P(D|ΞΈ) β€” ignores prior
  • MAP: Find ΞΈ maximising P(ΞΈ|D) β€” includes prior
  • L2 regularisation = Gaussian prior on weights
  • L1 regularisation = Laplace prior on weights
Worked Example: Medical Test (The Base Rate Fallacy)
01
Disease prevalence (prior): P(disease) = 1% = 0.01
02
Test accuracy: P(positive | disease) = 99% = 0.99
03
False positive rate: P(positive | no disease) = 1% = 0.01
04
Total P(positive) = 0.99 Γ— 0.01 + 0.01 Γ— 0.99 = 0.0099 + 0.0099 = 0.0198
05
P(disease | positive) = (0.99 Γ— 0.01) / 0.0198 = 50%
06
A "99% accurate" test gives only 50% certainty! The low prior overwhelms the evidence.
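The arithmetic of the worked example, reproduced directly:

```python
# Base-rate calculation: 99%-accurate test, 1% disease prevalence
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.01

# Total probability of a positive result (the evidence)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))       # 0.0198

# Bayes' theorem: posterior = likelihood * prior / evidence
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)                              # 0.5
```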
Naive Bayes classifier: Assume all features are conditionally independent given the class label. Then P(class | features) ∝ P(class) · ∏ P(featureᡒ | class). Despite the strong (and usually wrong) independence assumption, Naive Bayes works surprisingly well for text classification, spam filtering, and sentiment analysis.
Figure: Bayesian update, prior → data → posterior. A broad prior P(θ) (uncertain), combined with observed data D, yields a narrow posterior P(θ|D) (confident). More data → sharper posterior.

The Gaussian (normal) distribution is the most important continuous distribution in machine learning. It appears in weight initialisation, noise injection, latent spaces, diffusion models, and the Central Limit Theorem guarantees it emerges from the sum of many independent random variables.

Gaussian Probability Density Function
f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
μ = mean (centre of the bell), σ² = variance (width of the bell). The standard normal has μ=0, σ=1. Xavier/He initialisation draws weights from scaled Gaussians.
βš–οΈ

68-95-99.7 Rule

68% of values within Β±1Οƒ, 95% within Β±2Οƒ, 99.7% within Β±3Οƒ. Values beyond 3Οƒ are extreme outliers. This is why gradient clipping often clips at 3–5Γ— the standard deviation.

πŸ—οΈ

Weight Initialisation

Xavier: W ~ N(0, 1/n_in). He: W ~ N(0, 2/n_in). Proper init keeps activations and gradients at consistent scale through layers β€” prevents vanishing/exploding gradients.

🌫️

Gaussian Noise

Diffusion models add Gaussian noise progressively, then learn to reverse it. VAEs encode data as ΞΌ + σ·Ρ where Ξ΅ ~ N(0,1). Dropout is Bernoulli noise; data augmentation often adds Gaussian noise.

Figure: 1D Gaussians with σ = 0.5, 1.0, 2.0 — larger σ widens the bell; 68% of the mass lies within ±1σ of μ. A 2D multivariate Gaussian forms an ellipse whose axes are the eigenvectors of Σ.

Standardisation (z-score normalisation) transforms features to zero mean and unit variance: z = (x βˆ’ ΞΌ) / Οƒ. Almost every ML pipeline starts with this step β€” it ensures all features contribute equally and speeds up gradient descent convergence.
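The transformation in one line (NumPy sketch — the resulting features have zero mean and unit variance up to floating-point rounding):

```python
import numpy as np

x = np.array([10.0, 12.0, 8.0, 14.0, 6.0])

z = (x - x.mean()) / x.std()   # z-score normalisation

print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))  # True True
```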

When an LLM computes P(next_token | context), it still needs to choose a token. The simplest approach β€” always pick the most probable token (greedy decoding) β€” produces repetitive, boring text. Instead, we sample from the distribution, with several control knobs.

🌑️

Temperature Ο„

Divide logits by Ο„ before softmax. Ο„ < 1 sharpens (more confident). Ο„ > 1 flattens (more random). Ο„ β†’ 0 = greedy. Ο„ β†’ ∞ = uniform random.

πŸ”

Top-k Sampling

Only consider the k most probable tokens. Redistributes probability among top-k and samples from that subset. k=1 = greedy, k=50 is common.

🎯

Top-p (Nucleus)

Sample from the smallest set of tokens whose cumulative probability β‰₯ p. Adapts k dynamically β€” fewer tokens when confident, more when uncertain. p=0.9 is standard.

Temperature Effect: Token "The" with Base P = 0.6
τ | Effect on P("The")
0.1 | P ≈ 0.99 — almost deterministic, always picks "The"
0.5 | P ≈ 0.85 — still strongly favours "The"
1.0 | P = 0.60 — original distribution (balanced)
2.0 | P ≈ 0.35 — flattened, other tokens get more chance
5.0 | P ≈ 0.22 — nearly uniform, very "creative" / chaotic
Figure: τ = 0.1 → predictable (one token dominates); τ = 1.0 → balanced (original distribution); τ = 2.0 → creative (flattened, more diverse).
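A minimal sketch of temperature-scaled softmax (toy logits, not a real model's output):

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Divide logits by temperature, then apply the usual softmax
    z = np.asarray(logits) / tau
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(softmax(logits, tau=0.1))   # sharp: the top token dominates
print(softmax(logits, tau=1.0))   # original distribution
print(softmax(logits, tau=5.0))   # flat: close to uniform
```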
Monte Carlo estimation: Any expectation E[f(X)] can be approximated by averaging samples: E[f(X)] β‰ˆ (1/N) Ξ£ f(xα΅’) where xα΅’ ~ P(X). This is the foundation of Monte Carlo methods. In RL, policy gradient methods estimate the expected reward via Monte Carlo sampling of trajectories.

Law of Large Numbers (LLN): As sample size grows, the sample mean converges to the true mean. This is why test set accuracy is a reliable estimate of true model performance β€” given enough test samples. It's also why mini-batch gradient is an unbiased estimator of the full-batch gradient: E[βˆ‡L_batch] = βˆ‡L_full.

Central Limit Theorem (CLT): The sum (or average) of many independent random variables approaches a Gaussian distribution, regardless of the original distribution. This is why Gaussians appear everywhere β€” any quantity that arises from the aggregate of many small, independent effects will be approximately normal. Practical implication: larger mini-batches produce lower-variance gradient estimates (variance decreases as 1/n), giving smoother training.

Central Limit Theorem XΜ„β‚™ = (1/n) Ξ£ Xα΅’ β†’ N(ΞΌ, σ²/n) as n β†’ ∞ The sample mean of n i.i.d. random variables approaches a Gaussian centered on the true mean ΞΌ, with variance shrinking as 1/n. Double the batch size β†’ halve the gradient variance.
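The σ²/n claim is easy to verify empirically. A sketch using a deliberately non-Gaussian source (exponential, which has mean 1 and variance 1) β€” the trial counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_mean_variance(n, trials=20_000):
    # Draw `trials` sample means, each averaging n exponential(1) draws
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    return means.var()

v32 = sample_mean_variance(32)
v64 = sample_mean_variance(64)
print(v32 / v64)  # β‰ˆ 2: doubling n halves the variance of the mean
```

This is exactly the mini-batch story: a batch of 64 gives a gradient estimate with half the variance of a batch of 32.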
Chapter 2.3 β€” Key Takeaways
  • LLM token generation = sampling from a categorical distribution over 50,000+ vocabulary items β€” softmax converts logits to valid probabilities.
  • Distributions: Bernoulli for binary, Categorical/Softmax for multi-class, Gaussian for continuous β€” know which one your model outputs.
  • Bayes' theorem: posterior ∝ likelihood Γ— prior β€” Bayesian learning in one equation. L2 regularisation = Gaussian prior on weights.
  • Covariance matrix captures feature relationships β€” the foundation of PCA. Diagonal = variances, off-diagonal = covariances.
  • Temperature controls softmax sharpness: low Ο„ = greedy/predictable, high Ο„ = diverse/creative. Top-p adapts dynamically.
  • CLT: why mini-batch gradient descent works β€” sample averages converge to true expectations; larger batches = lower variance estimates.
2.4
Chapter 2.4
Bayesian Inference & Statistical Estimation

Every time you train a neural network, you are doing maximum likelihood estimation. Your choice of loss function is not arbitrary β€” it follows directly from your assumption about how outputs are distributed. Understanding MLE, MAP, and full Bayesian inference reveals why loss functions are what they are, and why regularisation works.

Core idea: Given observed data D = {x₁, xβ‚‚, …, xβ‚™}, find the parameters ΞΈ that make the data most probable. The likelihood function L(ΞΈ) = P(D | ΞΈ) measures how well ΞΈ explains the data. MLE finds ΞΈ* = argmax L(ΞΈ).

In practice we maximise the log-likelihood instead: log L(ΞΈ) = Ξ£ log P(xα΅’ | ΞΈ). This is equivalent (log is monotonic) but easier β€” products become sums, which are numerically stable and analytically simpler. Minimising the negative log-likelihood = maximising the log-likelihood. This is your loss function.

Maximum Likelihood Estimation ΞΈ* = argmax Ξ£ log P(xα΅’ | ΞΈ) Find parameters ΞΈ that maximise the sum of log-probabilities of observed data. Equivalently: minimise the negative log-likelihood (NLL). Training a neural network with any standard loss function IS MLE.
The punchline: Training a neural network IS maximum likelihood estimation. The loss function is not arbitrary β€” it follows directly from your assumption about the output distribution. Assume Gaussian outputs β†’ you get MSE loss. Assume Bernoulli outputs β†’ you get binary cross-entropy. The math decides the loss.
Gaussian Output β†’ MSE Loss
Bernoulli Output β†’ Binary Cross-Entropy
1. Assume y ~ N(f(x; ΞΈ), σ²)
2. P(y | x, ΞΈ) = (1/√2πσ²) exp(βˆ’(y βˆ’ f(x;ΞΈ))Β² / 2σ²)
3. log P = βˆ’(y βˆ’ f(x;ΞΈ))Β² / 2σ² βˆ’ const
4. Maximise log P ⟺ minimise (y βˆ’ Ε·)Β²
∴ MSE loss = MLE under Gaussian assumption
1. Assume y ~ Bernoulli(p), where p = Οƒ(f(x; ΞΈ))
2. P(y | x, ΞΈ) = p^y Β· (1βˆ’p)^(1βˆ’y)
3. log P = yΒ·log(p) + (1βˆ’y)Β·log(1βˆ’p)
4. Maximise log P ⟺ minimise βˆ’[y log p + (1βˆ’y)log(1βˆ’p)]
∴ BCE loss = MLE under Bernoulli assumption
Output Assumption | MLE Loss Function | When Used
Gaussian (continuous) | MSE / L2 loss | Regression: predict price, temperature, age
Bernoulli (binary) | Binary cross-entropy | Binary classification: spam/not-spam
Categorical (multi-class) | Categorical cross-entropy | Multi-class: ImageNet, LLM next-token
Laplace (heavy tails) | L1 / MAE loss | Robust regression, outlier-resistant
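The Gaussian β†’ MSE derivation above can be checked numerically: the Gaussian negative log-likelihood equals a scaled sum of squared errors plus a ΞΈ-independent constant, so minimising one minimises the other. The synthetic predictions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)   # imperfect "model" outputs

sigma2 = 1.0  # assumed fixed output variance σ²

# Gaussian negative log-likelihood, summed over samples
nll = np.sum(0.5 * np.log(2 * np.pi * sigma2)
             + (y_true - y_pred) ** 2 / (2 * sigma2))

# Sum of squared errors scaled by 1/(2σ²), plus the σ-dependent constant
sse = np.sum((y_true - y_pred) ** 2)
reconstructed = 0.5 * len(y_true) * np.log(2 * np.pi * sigma2) + sse / (2 * sigma2)

print(np.isclose(nll, reconstructed))  # True: NLL = scaled SSE + constant
```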

MLE can overfit β€” it finds the parameters that best explain the training data, with no constraint on how "reasonable" those parameters are. MAP adds a prior belief: what kind of parameter values do you expect? The prior acts as a regulariser.

MAP = MLE + Prior ΞΈ_MAP = argmax [ log P(D|ΞΈ) + log P(ΞΈ) ] The first term is the log-likelihood (data fit). The second term is the log-prior (what you believe about ΞΈ before seeing data). MAP balances fitting the data with staying close to your prior belief.

Regularisation is not a trick β€” it's a Bayesian prior on what you believe about your weights. When you add a weight decay term to the loss, you're implicitly saying "I believe weights should be small." The specific penalty shape reveals the prior distribution:

Gaussian Prior β†’ L2 Regularisation
Laplace Prior β†’ L1 Regularisation
1. Prior: P(w) = N(0, 1/2Ξ») for each weight
2. log P(w) = βˆ’Ξ»wΒ² + const
3. MAP loss = L_data + Ξ» Ξ£ wα΅’Β²
4. This is L2 / Ridge / weight decay
Effect: Shrinks all weights toward zero, keeps them small but non-zero. Smooth, differentiable everywhere.
1. Prior: P(w) = Laplace(0, 1/Ξ») for each weight
2. log P(w) = βˆ’Ξ»|w| + const
3. MAP loss = L_data + Ξ» Ξ£ |wα΅’|
4. This is L1 / Lasso
Effect: Drives some weights to exactly zero β†’ feature selection. Produces sparse models.
🎯

L2 in Practice

  • PyTorch: weight_decay=0.01 in AdamW
  • Every LLM uses weight decay (typically 0.01–0.1)
  • Keeps weights small β†’ better generalisation
  • AdamW decouples weight decay from Adam updates
βœ‚οΈ

L1 in Practice

  • Produces sparse weights β€” many exactly zero
  • Acts as automatic feature selection
  • Elastic Net combines L1 + L2
  • Less common in deep learning, common in classical ML
MLE β†’ MAP β†’ Full Bayesian: increasing sophistication
  β€’ MLE β€” argmax P(D|ΞΈ): no prior on weights; point estimate (single ΞΈ*); can overfit with small data. Loss = NLL only.
  β€’ MAP β€” argmax P(D|ΞΈ)Β·P(ΞΈ): adds a prior (regularisation); still a point estimate; better generalisation. Loss = NLL + Ξ»||W||Β².
  β€’ Full Bayesian β€” compute P(ΞΈ|D): full posterior distribution; uncertainty quantification; computationally expensive. P(ΞΈ|D) via MCMC/VI.
Most neural network training uses MLE or MAP. Full Bayesian is used for safety-critical & small-data settings.

MLE and MAP both produce point estimates β€” a single "best" ΞΈ. Full Bayesian inference goes further: it computes the entire posterior distribution P(ΞΈ | D). Instead of saying "the best weight is 0.42", Bayesian inference says "the weight is probably between 0.35 and 0.49, with 95% probability." This is uncertainty quantification.

🎲

Conjugate Priors

Special prior-likelihood pairs where the posterior has the same family as the prior. Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-Categorical. These give closed-form posteriors β€” no sampling needed.

πŸ”—

MCMC Sampling

When the posterior has no closed form, draw samples from it using Markov Chain Monte Carlo. Metropolis-Hastings, Hamiltonian MC (HMC). Gold standard for accuracy, but slow for large models.

⚑

Variational Inference

Approximate the true posterior with a simpler distribution (e.g., Gaussian). Optimise parameters of the approximation to minimise KL divergence. Fast and scalable β€” used in VAEs.
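Conjugacy makes the posterior a one-line update. A sketch of the Beta-Bernoulli pair β€” the prior strength and coin-flip counts are illustrative:

```python
# Conjugate update: Beta(Ξ±, Ξ²) prior on a coin's bias p, Bernoulli likelihood.
# The posterior after observing h heads and t tails is Beta(Ξ± + h, Ξ² + t).
alpha, beta = 2.0, 2.0      # prior: mild belief the coin is fair
heads, tails = 7, 3         # observed data

alpha_post = alpha + heads  # 9
beta_post = beta + tails    # 5

posterior_mean = alpha_post / (alpha_post + beta_post)
print(posterior_mean)  # 9/14 β‰ˆ 0.643: between the MLE (0.7) and the prior mean (0.5)
```

No sampling needed β€” the data simply add pseudo-counts to the prior. With more data, the posterior mean moves toward the MLE.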

Method | Output | Cost | When to Use
MLE | Point estimate ΞΈ* | O(training) | Large datasets, standard deep learning
MAP | Point estimate ΞΈ* | O(training) | When regularisation helps (almost always)
MCMC | Samples from P(ΞΈ|D) | Very expensive | Small datasets, need calibrated uncertainty
Variational Inference | Approximate P(ΞΈ|D) | Moderate | VAEs, Bayesian neural nets, latent variable models
When to go full Bayesian: (1) Small datasets where overfitting is severe, (2) Safety-critical applications (medical, autonomous driving) where you need calibrated uncertainty β€” "I don't know" is a valid and necessary output, (3) Active learning β€” decide which data to label next based on model uncertainty.

Two philosophical camps interpret probability differently. Frequentists see probability as long-run frequency of events; parameters are fixed but unknown. Bayesians treat parameters as random variables with distributions that represent degrees of belief. Most deep learning is frequentist MLE in practice, but Bayesian ideas (priors, posterior, uncertainty) are increasingly important.

Aspect | Frequentist | Bayesian
Parameters | Fixed but unknown constants | Random variables with distributions
Probability | Long-run frequency of events | Degree of belief / uncertainty
Estimation | MLE β€” maximise P(data|ΞΈ) | Compute posterior P(ΞΈ|data)
Uncertainty | Confidence intervals (frequentist coverage) | Credible intervals (direct probability)
Hypothesis testing | p-values, null hypothesis | Model comparison via Bayes factors
Regularisation | Ad-hoc penalty term | Natural consequence of the prior
Deep learning | Standard SGD + weight decay | BNNs, MC Dropout, ensembles

Probabilistic Graphical Models (PGMs) represent complex joint distributions as graphs. Nodes are random variables, edges encode conditional dependencies. PGMs make the structure of a probabilistic model explicit β€” you can literally see which variables depend on which.

➑️

Bayesian Networks (Directed)

  • Directed acyclic graph (DAG)
  • Arrows encode causal/conditional relationships
  • P(A,B,C) = P(A) Β· P(B|A) Β· P(C|A,B)
  • D-separation determines conditional independence
  • Examples: medical diagnosis, spam filters, causal reasoning
↔️

Markov Random Fields (Undirected)

  • Undirected edges = symmetric correlations
  • No causal direction implied
  • Clique potentials define the joint
  • Markov blanket: a node is independent of all others given its neighbours
  • Examples: image segmentation, physics simulations
[Figure: a Bayesian network DAG (Season β†’ Rain, Season β†’ Sprinkler, Rain/Sprinkler β†’ Wet Grass, with Wet Grass observed) alongside a Hidden Markov Model (hidden states h₁–hβ‚„ emitting observations o₁–oβ‚„) β€” the HMM is a precursor to RNNs]
Modern connection: An autoregressive LLM is an implicit PGM β€” it factors the joint probability of a sequence as P(x₁, xβ‚‚, …, xβ‚™) = P(x₁) Β· P(xβ‚‚|x₁) Β· P(x₃|x₁,xβ‚‚) Β· … This is exactly the chain rule of probability applied to a directed graphical model where each token depends on all previous tokens.
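This factorisation is exactly how a sequence's log-probability is computed in practice: sum the per-step conditional log-probabilities. A sketch β€” the random logits below are a stand-in for a real model's outputs at each position:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq = torch.tensor([3, 1, 4, 1])       # token ids x₁…xβ‚„ (illustrative)
logits = torch.randn(len(seq), 10)     # stand-in for P(xα΅’ | x₁…xᡒ₋₁) logits

# log P(x₁,…,xβ‚™) = Ξ£ log P(xα΅’ | x₁…xᡒ₋₁) β€” the chain rule of probability
log_probs = F.log_softmax(logits, dim=-1)
seq_logprob = log_probs[torch.arange(len(seq)), seq].sum()

print(seq_logprob)  # joint log-probability of the whole sequence (negative)
```

Negate and divide by sequence length, and you have the per-token cross-entropy that perplexity is built from.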

EM solves MLE when there are latent (hidden) variables that prevent direct optimisation. The classic example: fitting a Gaussian Mixture Model (GMM) when you don't know which Gaussian generated each data point. EM alternates between two steps until convergence:

E-step: Given current parameters, compute the expected value of the latent variables (soft assignments β€” "this point is 80% from Gaussian 1, 20% from Gaussian 2"). M-step: Given the soft assignments, update parameters (means, variances, mixing weights) to maximise the expected log-likelihood. Repeat until convergence. Guaranteed to increase likelihood at each step (but may converge to local optimum).

Initialise ΞΈ β†’ E-step: compute E[latent | data, ΞΈ] β†’ M-step: update ΞΈ to maximise E[log L] β†’ converged? β†’ if not, repeat; if yes, done.
Algorithm | Latent Variable | Application
GMM + EM | Cluster assignment | Clustering, density estimation, speaker diarisation
HMM + Baum-Welch | Hidden states | Speech recognition (pre-deep-learning), genomics
LDA + EM/VI | Topic assignment | Topic modelling in text
VAE training | Latent code z | Image generation, representation learning
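The E-step/M-step loop fits in a dozen lines for a 1-D, two-component GMM. A sketch assuming two well-separated components at βˆ’2 and +2 (synthetic data, illustrative initialisation):

```python
import numpy as np

rng = np.random.default_rng(3)
# Mixture data: two 1-D Gaussians centred at -2 and +2
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

mu = np.array([-1.0, 1.0])          # initial means (deliberately wrong)
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])           # mixing weights

def gauss(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: soft assignments r[i, k] = P(component k | x_i)
    r = pi * gauss(x[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, variances, mixing weights from soft assignments
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(np.sort(mu))  # β‰ˆ [-2, 2]: EM recovers the component means
```

Each iteration provably does not decrease the likelihood, but a bad initialisation can still land in a local optimum β€” which is why GMM fitting is often restarted from several initialisations.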
Chapter 2.4 β€” Key Takeaways
  • Neural network training IS MLE: your choice of loss function encodes your distributional assumption β€” MSE = Gaussian, cross-entropy = categorical/Bernoulli.
  • MSE loss = Gaussian output assumption; cross-entropy = categorical/Bernoulli assumption β€” the math decides the loss, not convention.
  • L2 regularisation = Gaussian prior on weights (MAP); L1 = Laplace prior β€” regularisation is not a trick, it's a Bayesian statement about what you believe.
  • Full Bayesian inference gives posterior distributions β€” enables uncertainty quantification for safety-critical and small-data settings.
  • PGMs encode conditional independence; HMMs were the sequence model before RNNs β€” autoregressive LLMs are implicit directed graphical models.
2.5
Chapter 2.5
Information Theory β€” Measuring Knowledge

Every loss function in deep learning has its roots in information theory. Cross-entropy loss β€” the training objective for every LLM β€” is literally the cross-entropy between the true data distribution and the model's predicted distribution. Understanding entropy, cross-entropy, and KL divergence reveals why we train the way we do.

Entropy measures uncertainty, surprise, or unpredictability. Claude Shannon defined it in 1948: given a random variable X with possible outcomes, how much "information" does each outcome carry? The less likely an event, the more surprising (informative) it is when it occurs.

πŸ’‘

Information Content

  • I(x) = βˆ’logβ‚‚ P(x) bits
  • P = 0.5 (fair coin): I = 1 bit
  • P = 0.99 (almost certain): I = 0.015 bits
  • P = 0.001 (rare event): I = 9.97 bits
  • Rare events carry more information
πŸ“Š

Entropy = Expected Information

  • H(X) = βˆ’Ξ£ P(x) Β· log P(x)
  • Average surprise across all outcomes
  • Uniform distribution β†’ maximum entropy
  • Peaked distribution β†’ low entropy
  • Measured in bits (logβ‚‚) or nats (ln)
Shannon Entropy H(X) = βˆ’Ξ£ P(x) Β· log P(x) The expected information content of a random variable. High entropy = uncertain, spread-out distribution. Low entropy = predictable, peaked distribution. For a uniform distribution over n outcomes: H = logβ‚‚(n) bits (maximum).
LLM Perplexity: Perplexity = e^(cross-entropy loss) when the loss is measured in nats (equivalently 2^H for entropy in bits). A language model with perplexity 20 is "as uncertain as if choosing uniformly among 20 tokens" at each step. Lower perplexity = better model. GPT-4 has perplexity around 8–12 on standard benchmarks.
[Figure: three entropy examples β€” a uniform distribution over 8 outcomes (H = 3 bits, the maximum), a peaked distribution with P = 0.9 on one outcome (H β‰ˆ 0.5 bits), and the Bernoulli entropy curve H(p), which peaks at 1 bit at p = 0.5]
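Shannon's formula is a one-liner. A sketch covering the standard cases β€” the peaked distribution's probabilities are illustrative:

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                # 0 Β· log 0 = 0 by convention
    return -(p * np.log(p)).sum() / np.log(base)

print(entropy(np.ones(8) / 8))                  # uniform over 8: 3.0 bits (maximum)
print(entropy([0.9, 0.05, 0.03, 0.02]))         # peaked: low entropy
print(entropy([0.5, 0.5]))                      # fair coin: 1.0 bit
```

Switching `base` to `np.e` gives entropy in nats, the unit PyTorch losses use.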

Cross-entropy H(P, Q) measures how many bits distribution Q needs to encode samples from distribution P. If Q is a perfect model of P (Q = P), then H(P, Q) = H(P) β€” the minimum. If Q is wrong, it wastes extra bits. Training a model minimises cross-entropy between the true labels and model predictions.

Cross-Entropy β€” The Universal Training Loss H(P, Q) = βˆ’Ξ£ P(x) Β· log Q(x) P = true distribution (labels), Q = model's predicted distribution. For one-hot labels, this simplifies to L = βˆ’log Q(correct class) = βˆ’log(p_correct). This IS the classification loss for every neural network.
Worked Example: Cross-Entropy for One Prediction
01
True label: class 2 β†’ one-hot y = [0, 0, 1, 0]
02
Model softmax output: Q = [0.1, 0.2, 0.6, 0.1]
03
H(P, Q) = βˆ’(0Β·log 0.1 + 0Β·log 0.2 + 1Β·log 0.6 + 0Β·log 0.1)
04
H(P, Q) = βˆ’log(0.6) = 0.51 nats
05
If model had predicted P(class 2) = 0.99 β†’ loss = βˆ’log(0.99) = 0.01 βœ“
06
If model had predicted P(class 2) = 0.01 β†’ loss = βˆ’log(0.01) = 4.60 βœ— harsh!
πŸ“¦

Categorical Cross-Entropy

  • Multi-class: one-hot labels, softmax outputs
  • L = βˆ’Ξ£ yα΅’ log(pα΅’) = βˆ’log(p_correct)
  • Used for: ImageNet, LLM next-token, NER
  • PyTorch: F.cross_entropy(logits, labels)
πŸ”˜

Binary Cross-Entropy

  • Binary: one sigmoid output
  • L = βˆ’[y log(p) + (1βˆ’y)log(1βˆ’p)]
  • Used for: spam detection, multi-label tasks
  • PyTorch: F.binary_cross_entropy_with_logits
import torch
import torch.nn.functional as F

logits = torch.tensor([1.2, 0.5, 3.1, 0.8]) # raw model outputs
y_true = torch.tensor(2) # correct class index = 2

loss = F.cross_entropy(logits.unsqueeze(0), y_true.unsqueeze(0))
# Internally: softmax(logits) β†’ [-log(p_class_2)]

p_class2 = F.softmax(logits, dim=0)[2]
print(f"P(correct class) = {p_class2:.3f}") # higher = lower loss
print(f"Cross-entropy loss = {loss:.3f}")
Why cross-entropy penalises confident wrong predictions so harshly: βˆ’log(0.99) = 0.01 (confident and correct β†’ tiny loss); βˆ’log(0.01) = 4.60 (confident and wrong β†’ massive loss). The logarithm makes the penalty asymmetric β€” the cost of being confidently wrong (4.60) is 460Γ— larger than the loss remaining when confidently right (0.01). This is exactly the behaviour we want.

Kullback-Leibler divergence measures the "extra cost" of using distribution Q when the true distribution is P. It's the gap between cross-entropy and true entropy β€” the inefficiency penalty for using the wrong model.

KL Divergence & The Fundamental Relationship D_KL(P β€– Q) = Ξ£ P(x) Β· log(P(x) / Q(x)) Always β‰₯ 0. Equals 0 only when P = Q exactly. NOT symmetric: D_KL(Pβ€–Q) β‰  D_KL(Qβ€–P). H(P, Q) = H(P) + D_KL(P β€– Q) Cross-entropy = true entropy + KL divergence. Since H(P) is fixed, minimising cross-entropy = minimising KL divergence between model and data.
πŸ”‘

Key Properties

  • D_KL β‰₯ 0 always (Gibbs' inequality)
  • D_KL = 0 iff P = Q
  • Not a true metric: asymmetric, no triangle inequality
  • Forward KL (D_KL(Pβ€–Q)): mode-covering β€” Q tries to cover all of P
  • Reverse KL (D_KL(Qβ€–P)): mode-seeking β€” Q locks onto one mode of P
πŸ€–

Where KL Appears in AI

  • VAE loss: reconstruction + KL(posterior β€– prior)
  • RLHF/PPO: KL penalty between policy and reference
  • Knowledge distillation: KL(teacher β€– student)
  • Diffusion models: KL between forward/reverse
  • Training objective: minimising CE = minimising KL
[Figure: KL divergence grows as the model Q drifts from the true P β€” Q β‰ˆ P gives D_KL β‰ˆ 0.02 (good model fit), a shifted Q gives D_KL β‰ˆ 0.5 (model partially wrong), a very different Q gives D_KL β‰ˆ 3.2 (model very wrong)]
Forward vs Reverse KL: Forward KL D_KL(Pβ€–Q) penalises Q for putting zero probability where P has mass β†’ Q covers all modes of P (mode-covering); maximum-likelihood training minimises forward KL to the data distribution. Reverse KL D_KL(Qβ€–P) penalises Q for putting mass where P has zero β†’ Q concentrates on one mode (mode-seeking); variational inference β€” including the VAE's KL term, D_KL(q(z|x) β€– p(z)) β€” uses reverse KL, which is why it can miss modes.
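Both the asymmetry and the fundamental identity H(P, Q) = H(P) + D_KL(Pβ€–Q) can be verified on small discrete distributions. The numbers below are illustrative:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))          # nats

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([0.8, 0.1, 0.1])              # "true" distribution
Q = np.array([0.4, 0.3, 0.3])              # model's distribution

print(kl(P, Q))                            # always β‰₯ 0
print(kl(Q, P))                            # a different value: KL is not symmetric
# The fundamental identity: H(P, Q) = H(P) + D_KL(P β€– Q)
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```

Since H(P) is a constant of the data, minimising cross-entropy and minimising KL are the same optimisation β€” numerically as well as algebraically.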

Mutual information I(X;Y) measures how much knowing Y reduces your uncertainty about X. Unlike correlation, MI captures any statistical relationship β€” not just linear ones.

Mutual Information I(X; Y) = H(X) βˆ’ H(X|Y) = H(Y) βˆ’ H(Y|X) How many bits of information Y provides about X (and vice versa). I(X;Y) = 0 means X and Y are independent. I(X;Y) = H(X) means Y completely determines X.
πŸ”

Feature Selection

Select features with highest MI to the target label. Unlike correlation, MI detects non-linear relationships. Used in preprocessing for classical ML.

πŸ–ΌοΈ

Contrastive Learning

SimCLR, CLIP, and BYOL maximise MI between representations of augmented views of the same data. InfoNCE loss is a lower bound on MI.

πŸ”¬

Information Bottleneck

Compress input X into representation Z while preserving MI with target Y. The theoretical framework for why deep networks learn good representations.

[Figure: information-theoretic Venn diagram β€” H(X) and H(Y) overlap in the mutual information I(X;Y); the non-overlapping parts are H(X|Y) (uncertainty in X after seeing Y) and H(Y|X). I(X;Y) = H(X) βˆ’ H(X|Y) = H(Y) βˆ’ H(Y|X); joint entropy H(X,Y) = H(X) + H(Y) βˆ’ I(X;Y) is the union area]
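For discrete variables, MI can be computed directly from a joint distribution table. A sketch over two dependent binary variables β€” the joint probabilities are illustrative:

```python
import numpy as np

# Joint distribution P(X, Y) over two binary variables (illustrative numbers)
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

P_x = P_xy.sum(axis=1)     # marginal of X
P_y = P_xy.sum(axis=0)     # marginal of Y

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))   # bits

# I(X;Y) = Ξ£ P(x,y) logβ‚‚ [ P(x,y) / (P(x)P(y)) ]
mi = sum(P_xy[i, j] * np.log2(P_xy[i, j] / (P_x[i] * P_y[j]))
         for i in range(2) for j in range(2))

print(mi)  # > 0: X and Y are dependent
print(np.isclose(mi, H(P_x) + H(P_y) - H(P_xy.ravel())))  # I = H(X)+H(Y)βˆ’H(X,Y)
```

If the joint factorised as P(x)P(y), every log term would be 0 and MI would be exactly zero β€” independence.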

Shannon's source coding theorem (1948): the optimal average code length for messages from a source with entropy H is at least H bits. You cannot compress below the entropy limit. This means a better probability model = better compression. Language models are, in a deep sense, compressors β€” a model with low perplexity assigns high probability to real text, which means it can encode that text in fewer bits.

Minimum Description Length (MDL) formalises model selection: the best model is the one that gives the shortest total description (model complexity + data encoded by model). This connects directly to the bias-variance tradeoff: a too-complex model has a long description; a too-simple model encodes data poorly. Bits-back coding shows VAEs are theoretically equivalent to lossless compression schemes β€” the ELBO loss is the expected code length.

Chapter 2.5 β€” Key Takeaways
  • Entropy H(X) = expected surprise: high entropy = uniform = uncertain; low entropy = peaked = predictable.
  • Cross-entropy loss = βˆ’log P(correct class) β€” the model is penalised for assigning low probability to the truth. Being confidently wrong is catastrophically expensive.
  • H(P, Q) = H(P) + D_KL(Pβ€–Q): minimising cross-entropy = minimising KL divergence from model to data distribution.
  • KL divergence appears in VAE loss, RLHF PPO penalty, and knowledge distillation β€” it's the universal measure of distributional mismatch.
  • LLM perplexity = e^(cross-entropy) β€” directly measures how well the model predicts text. Lower perplexity = better language model.
2.6
Chapter 2.6
Optimisation β€” Finding the Minimum

Given a loss function L(ΞΈ), find the parameters ΞΈ that minimise it. In deep learning, loss landscapes are non-convex with millions of dimensions β€” we need smart optimisers, adaptive learning rates, and carefully tuned schedules. The learning rate is the single most important hyperparameter. Get it wrong and nothing else matters.

The loss landscape is a high-dimensional surface β€” we want the lowest point. The gradient βˆ‡L tells us which direction is "uphill" at our current position. Gradient descent takes a step in the opposite direction: downhill. Repeat until the gradient is near zero (we've reached a minimum or saddle point).

Gradient Descent Update Rule ΞΈ ← ΞΈ βˆ’ Ξ± Β· βˆ‡L(ΞΈ) ΞΈ = all parameters, Ξ± = learning rate (step size), βˆ‡L = gradient of loss. The minus sign = we go OPPOSITE to the gradient (downhill). Convergence criterion: β€–βˆ‡Lβ€– < Ξ΅.
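The update rule in five lines, on a toy 1-D loss with a known minimum (the function and learning rate are illustrative):

```python
# Minimise L(ΞΈ) = (ΞΈ βˆ’ 3)Β² with gradient descent; the true minimum is ΞΈ = 3
def grad(theta):
    return 2 * (theta - 3)            # βˆ‡L(ΞΈ)

theta = 0.0
alpha = 0.1                           # learning rate: the critical hyperparameter
for step in range(100):
    theta = theta - alpha * grad(theta)   # ΞΈ ← ΞΈ βˆ’ Ξ± Β· βˆ‡L(ΞΈ)

print(theta)  # β‰ˆ 3.0
```

Try alpha = 1.1 and the iterates diverge; try alpha = 0.001 and 100 steps barely move β€” the three regimes described above.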
βœ…

Ξ± Just Right

Smooth convergence toward minimum. Steps shrink naturally as gradient flattens near the bottom. Training loss decreases steadily.

πŸ’₯

Ξ± Too Large

Overshoot past the minimum. Loss oscillates or explodes. Training diverges β€” you see NaN in your loss. The #1 training failure mode.

🐌

Ξ± Too Small

Tiny steps, extremely slow progress. May get stuck in a saddle point or flat region. Training takes forever (or budget runs out first).

[Figure: gradient descent trajectories on a 1-D loss curve for three learning rates β€” Ξ± just right converges to the minimum, Ξ± too large overshoots, Ξ± too small barely moves]

Computing the gradient over the entire dataset each step is accurate but slow. The solution: estimate the gradient from a random subset (mini-batch). The noise from subsampling is not a bug β€” it's a feature.

Variant | Batch Size | Gradient Quality | Speed | When Used
Full-Batch GD | N (all data) | Exact gradient | Very slow per step | Small datasets, convex problems
Stochastic GD | 1 sample | Very noisy estimate | Fast per step | Online learning, streaming data
Mini-Batch GD | 32–512 | Good estimate | Best trade-off | Industry standard for all DL
🎲

Why Noise Helps

  • Mini-batch noise = implicit regularisation
  • Helps escape saddle points and sharp minima
  • Smaller batches β†’ flatter minima β†’ better generalisation
  • Large-batch training needs special tricks (LARS, LAMB)
πŸ’Ύ

Gradient Accumulation

  • GPU can only fit batch size B in memory
  • Accumulate gradients over K steps before updating
  • Effective batch size = B Γ— K
  • Simulates large batch without extra memory
[Figure: loss vs training steps β€” full-batch descent is smooth, SGD with batch = 1 is noisy, mini-batch is balanced between the two]

Vanilla SGD has two problems: it oscillates in steep dimensions and moves slowly in flat ones. Momentum fixes the first. Adaptive learning rates fix the second. Adam combines both β€” and it's the default for almost all deep learning.

Adam Optimiser β€” The Default Choice
mβ‚œ = β₁·mβ‚œβ‚‹β‚ + (1βˆ’Ξ²β‚)Β·gβ‚œ β€” first moment estimate (exponential moving average of gradients β€” momentum).
vβ‚œ = Ξ²β‚‚Β·vβ‚œβ‚‹β‚ + (1βˆ’Ξ²β‚‚)Β·gβ‚œΒ² β€” second moment estimate (exponential moving average of squared gradients β€” adaptive rate).
ΞΈβ‚œβ‚Šβ‚ = ΞΈβ‚œ βˆ’ Ξ± Β· mΜ‚β‚œ / (√vΜ‚β‚œ + Ξ΅) β€” update with bias-corrected moments mΜ‚β‚œ = mβ‚œ/(1βˆ’Ξ²β‚α΅—), vΜ‚β‚œ = vβ‚œ/(1βˆ’Ξ²β‚‚α΅—). Defaults: β₁=0.9, Ξ²β‚‚=0.999, Ξ΅=1e-8.
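The three update equations translate line-for-line into code. A minimal scalar sketch β€” the quadratic objective, learning rate, and step count are illustrative, and in practice you would use `torch.optim.Adam` rather than hand-rolling this:

```python
import numpy as np

def adam_minimise(grad_fn, theta0, alpha=0.05, b1=0.9, b2=0.999, eps=1e-8,
                  steps=2000):
    theta = float(theta0)
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g           # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g       # second moment (adaptive scale)
        m_hat = m / (1 - b1 ** t)           # bias correction (m, v start at 0)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimise (ΞΈ βˆ’ 3)Β², gradient 2(ΞΈ βˆ’ 3)
theta_star = adam_minimise(lambda th: 2 * (th - 3), theta0=0.0)
print(theta_star)  # close to 3.0
```

Note the per-step size is roughly Ξ± Β· m/√v regardless of gradient magnitude β€” that normalisation is what makes Adam robust across problems.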
Optimiser | Key Idea | Pros | When to Use
SGD | Basic gradient step | Simple, well-understood | Rarely used alone
SGD + Momentum | Accumulate velocity Ξ²Β·v + βˆ‡L | Dampens oscillations, faster | CNNs, ResNets, fine-tuning
RMSProp | Per-parameter adaptive LR via √(avg g²) | Handles sparse gradients | RNNs, non-stationary objectives
Adam | Momentum + RMSProp combined | Fast, robust, low tuning | Default for most DL tasks
AdamW | Adam + decoupled weight decay | Better generalisation than Adam | All LLM/Transformer training
Adafactor | Adam with factored second moments | Memory efficient (no per-param state) | Very large models (T5, PaLM)
[Figure: optimiser trajectories on 2D loss contours from a common start β€” SGD oscillates, SGD + Momentum follows a smoother path, Adam takes a direct path to the minimum]
AdamW vs Adam: Original Adam applies weight decay inside the adaptive learning rate, which couples it with gradient statistics. AdamW decouples the weight decay β€” applying it directly to weights. This seemingly minor change significantly improves generalisation. Every major LLM (GPT, LLaMA, Gemini) uses AdamW.

A constant learning rate is rarely optimal. The standard practice: warm up from a very small LR, then gradually decay it. Warmup prevents early training instability (large initial gradients). Decay allows fine-grained convergence near the end.

Schedule | Shape | When Used
Constant | Flat line | Baselines, very short training
Step Decay | Staircase drops every N epochs | ResNet training (divide by 10 at epoch 30, 60)
Cosine Annealing | Smooth cosine curve from Ξ±_max to Ξ±_min | LLM pre-training, standard modern choice
Warmup + Cosine | Linear rise then cosine decay | All Transformer training from scratch
OneCycleLR | Rise to peak then decay to near-zero | Fast training with super-convergence
[Figure: learning rate vs training steps for the schedules above β€” constant, step decay, cosine annealing, and warmup + cosine]
Why LLMs need warmup: At initialisation, gradients can be very large and unstable. Starting with a tiny learning rate (e.g., 1e-7) and linearly increasing to the target (e.g., 3e-4) over the first 1–5% of training steps prevents the model from making catastrophic early updates. After warmup, cosine annealing smoothly decays the LR. This recipe (warmup + cosine) is used for essentially all LLM pre-training.
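The warmup + cosine recipe is a small piece of arithmetic. A sketch β€” the step counts and LR endpoints (3e-4 peak, 3e-5 floor) are illustrative choices, not universal constants:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total, warmup = 10_000, 500
print(lr_schedule(0, total, warmup))        # tiny: start of warmup
print(lr_schedule(warmup, total, warmup))   # lr_max: warmup just finished
print(lr_schedule(total, total, warmup))    # lr_min: end of training
```

In PyTorch the same shape is usually assembled from `torch.optim.lr_scheduler` components or a `LambdaLR` wrapping a function like this one.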

A convex function has a single global minimum β€” gradient descent is guaranteed to find it. Linear regression (MSE) and logistic regression have convex losses. But deep neural networks are non-convex β€” the loss surface is filled with local minima, saddle points, and flat plateaus.

βœ…

The Good News

  • Deep networks rarely get stuck in bad local minima
  • Saddle points are far more common than local minima in high dimensions
  • Momentum and Adam help escape saddle points
  • Many local minima have similar loss values (the landscape is "benign")
πŸ“

Flat Minima Hypothesis

  • Flat (wide) minima generalise better than sharp (narrow) ones
  • Small perturbations in weights don't hurt performance
  • SAM optimiser: explicitly seeks flat minima
  • Small batch SGD implicitly finds flatter minima
[Figure: a convex loss curve with a single global minimum (guaranteed convergence) vs the non-convex deep-learning reality β€” a surface with local minima, saddle points, flat regions, and the global minimum]

Lagrange multipliers solve the problem of optimising f(x) subject to constraints g(x) = 0. The key insight: at the optimal point, the gradient of f must be parallel to the gradient of g. Introduce multiplier Ξ» and solve βˆ‡f = Ξ»βˆ‡g. The KKT conditions extend this to inequality constraints (g(x) ≀ 0).

Where constrained optimisation appears in AI: Support Vector Machines maximise the classification margin subject to correct classification constraints (a quadratic program). RLHF constrains the new policy to stay close to the reference policy (KL constraint). Safety-constrained RL limits reward optimisation to stay within safety bounds. Optimal transport problems are constrained linear programs.

Lagrangian β€” The Core Idea L(x, Ξ») = f(x) βˆ’ Ξ» Β· g(x) Optimise f(x) subject to g(x) = 0. At the solution: βˆ‚L/βˆ‚x = 0 and βˆ‚L/βˆ‚Ξ» = 0. The multiplier Ξ» measures how much the constraint "costs" in terms of the objective.

Newton's method uses the Hessian (second derivative matrix) to account for curvature. Instead of following the gradient blindly, it finds the optimal step size by solving βˆ‡Β²L Β· Ξ΄ = βˆ’βˆ‡L. In theory, it converges in far fewer steps. In practice, the Hessian is nΓ—n for n parameters β€” for GPT-3 with 175B parameters, that's 175B Γ— 175B entries. Completely infeasible.

Quasi-Newton methods (L-BFGS) approximate the Hessian from gradient history, avoiding the O(nΒ²) cost. They work well for small models (up to ~millions of parameters) and are sometimes used in fine-tuning. The natural gradient uses the Fisher information matrix as a metric β€” accounting for the geometry of probability distributions rather than Euclidean parameter space. K-FAC and Shampoo are practical approximations used in some large-scale training runs.

Chapter 2.6 β€” Key Takeaways
  • Gradient descent: step in βˆ’βˆ‡L direction; learning rate controls step size β€” the most critical hyperparameter to tune.
  • Mini-batch GD (B=32–512) is the industry standard: balances gradient accuracy with speed, and noise helps generalisation.
  • Adam = momentum + adaptive learning rates β€” the default choice for most deep learning tasks.
  • AdamW (Adam + decoupled weight decay) is preferred for all LLM and Transformer training.
  • Warmup + cosine annealing schedule is standard for training large models from scratch β€” prevents early instability.
  • Deep networks are non-convex, but saddle points (not local minima) are the main obstacle β€” momentum and adaptive methods help escape them.
2.7
Chapter 2.7
Graph Theory & Discrete Mathematics

Graphs encode relational structure β€” who connects to whom, what depends on what, which atoms bond to which. Graph neural networks extend deep learning to molecules, social networks, and knowledge bases. Self-attention in Transformers is a GNN on a fully-connected graph. This chapter is a concise reference for the graph concepts that matter in modern AI.

A graph G = (V, E) consists of vertices (nodes) V and edges (links) E. Edges can be directed (A β†’ B) or undirected (A β€” B), weighted or unweighted. Graphs are represented in code as adjacency matrices (dense, good for small graphs) or adjacency lists (sparse, good for large graphs).

πŸ‘₯

Social Networks

Undirected: users are nodes, friendships are edges. Community detection, influence propagation, friend recommendation.

🧬

Molecular Graphs

Atoms are nodes, bonds are edges. Drug discovery, protein structure prediction (AlphaFold), material science.

πŸ”„

Computation Graphs

Operations are nodes, data flow is edges. DAGs (no cycles). PyTorch autograd builds one every forward pass.

[Figure: three graph types β€” an undirected social network over nodes A–E; a directed, weighted graph of web-page links (edge weights 0.8, 0.3, 0.5, 0.2); and a DAG computation graph (x, W β†’ Wx; Wx, b β†’ Wx+b) as built by PyTorch autograd]
Algorithm | Strategy | Complexity | AI Application
BFS | Explore level by level | O(V + E) | Shortest path (unweighted), social network distance
DFS | Explore deep first, backtrack | O(V + E) | Cycle detection, topological sort
Dijkstra | Greedy shortest path (weighted) | O((V+E) log V) | Robotics path planning, network routing
A* | Dijkstra + heuristic | O(b^d) worst case | Game AI, planning agents, maze solving
Topological Sort | Linearise a DAG | O(V + E) | Execution order in computation graphs
Graph Colouring | Assign colours, no adjacent match | NP-hard (general) | Scheduling, register allocation
# Topological sort via Kahn's algorithm (repeatedly remove zero in-degree nodes)
from collections import deque

def topo_sort(graph): # graph = adjacency list
    in_deg = {n: 0 for n in graph}
    for u in graph:
        for v in graph[u]:
            in_deg[v] += 1
    q = deque(n for n in in_deg if in_deg[n] == 0)
    order = []
    while q:
        u = q.popleft(); order.append(u)
        for v in graph[u]:
            in_deg[v] -= 1
            if in_deg[v] == 0: q.append(v)
    return order

# DAG: x→Wx, W→Wx, Wx→Wx+b, b→Wx+b
dag = {"x": ["Wx"], "W": ["Wx"], "Wx": ["Wx+b"], "b": ["Wx+b"], "Wx+b": []}
print(topo_sort(dag)) # ['x', 'W', 'b', 'Wx', 'Wx+b']
PyTorch uses topological sort internally. Every forward pass builds a DAG of operations. When you call .backward(), PyTorch topologically sorts this DAG and walks it in reverse order, computing gradients at each node via the chain rule. This is why the order of operations matters for autograd.
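That reverse walk can be sketched by hand for the Wx + b graph above (pure-Python scalars with illustrative names; real autograd does the same bookkeeping over tensors):

```python
# Forward pass: compute each node of the DAG in topological order.
W, x, b = 3.0, 2.0, 1.0
Wx = W * x   # node "Wx"
y = Wx + b   # node "Wx+b" (the output)

# Backward pass: walk the nodes in REVERSE topological order,
# applying the chain rule at each one.
grads = {"Wx+b": 1.0}              # dy/dy = 1
grads["Wx"] = grads["Wx+b"] * 1.0  # d(Wx+b)/dWx = 1
grads["b"] = grads["Wx+b"] * 1.0   # d(Wx+b)/db  = 1
grads["W"] = grads["Wx"] * x       # d(Wx)/dW = x
grads["x"] = grads["Wx"] * W       # d(Wx)/dx = W

print(grads["W"], grads["x"], grads["b"])  # -> 2.0 3.0 1.0
```

The reverse order matters: the gradient at "Wx" must exist before "W" and "x" can use it — exactly the dependency that topological sorting guarantees.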

Standard neural networks expect fixed-size inputs (vectors, grids). Graphs have variable numbers of nodes and edges with no fixed ordering. GNNs solve this with the message passing framework: each node updates its representation by aggregating information from its neighbours.

Message Passing — The Universal GNN Framework
01
Message: Each node computes a message from its features and sends to neighbours
02
Aggregate: Each node collects messages from all neighbours (sum, mean, or max)
03
Update: Each node updates its representation using aggregated messages + its own features
04
Repeat for K layers → each node "sees" K hops of neighbourhood information
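The four steps above can be sketched for scalar node features (a toy sketch: mean aggregation, a single weight `w`, and a ReLU update are all illustrative choices, not a specific published GNN):

```python
def gnn_layer(adj, h, w=0.5):
    """One message-passing layer over scalar node features.
    adj: adjacency-list dict, h: dict node -> feature."""
    new_h = {}
    for v in adj:
        # 1. Message: each neighbour sends its current feature.
        msgs = [h[u] for u in adj[v]]
        # 2. Aggregate: mean of the incoming messages.
        agg = sum(msgs) / len(msgs) if msgs else 0.0
        # 3. Update: combine with own feature, apply ReLU.
        new_h[v] = max(0.0, w * agg + h[v])
    return new_h

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h = {"a": 1.0, "b": 0.0, "c": 2.0}
h1 = gnn_layer(adj, h)   # after 1 layer, b has seen a and c
h2 = gnn_layer(adj, h1)  # 4. Repeat: after 2 layers, a has seen c (2 hops)
print(h1["b"])  # -> 0.75, i.e. 0.5 * mean(1.0, 2.0) + 0.0
```

Note how information from c only reaches a in `h2`: each layer extends every node's receptive field by one hop.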
GNN Variant | Key Idea | Application
GCN | Spectral convolution, normalised adjacency | Node classification, citation networks
GraphSAGE | Sample + aggregate — scales to large graphs | Pinterest recommendation (billions of nodes)
GAT | Attention weights on edges | Heterogeneous graphs, knowledge graphs
MPNN | General message passing framework | Molecular property prediction
GIN | Maximally expressive — as powerful as WL test | Graph isomorphism, graph classification
[Figure: one layer of message passing in a GNN. (1) Messages: neighbours n₁, n₂, n₃ send messages m₁, m₂, m₃ to node v. (2) Aggregate: the messages are combined (Σ) and used to update v. (3) Updated node: v' = σ(W · AGG(neighbours) + v). After K layers, each node "sees" information from K hops away.]
Self-attention = GNN on a fully-connected graph. In a Transformer, every token attends to every other token — this is equivalent to message passing on a complete graph where the attention weights are the edge weights. GNNs generalise Transformers to arbitrary graph structures.

Knowledge graphs store facts as (subject, predicate, object) triples. For example: (Paris, isCapitalOf, France), (France, isPartOf, Europe). Major KGs include Wikidata (100M+ entities), Google Knowledge Graph, and domain-specific KGs for medicine, law, and science.

KG embeddings like TransE learn vector representations where subject + relation ≈ object in embedding space. This enables link prediction ("if we know Paris is in France and France is in Europe, can we infer Paris is in Europe?"). RAG with KGs retrieves relevant triples and injects them into LLM prompts — grounding generation in verified facts.
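The TransE scoring idea can be sketched with toy 2-d embeddings (all vector values here are made up for illustration; real KG embeddings are learned and typically have 100+ dimensions):

```python
def transe_score(s, r, o):
    """TransE: a triple (s, r, o) is plausible when s + r lands
    near o. Score = negative Euclidean distance (higher = better)."""
    diff = [si + ri - oi for si, ri, oi in zip(s, r, o)]
    return -(sum(d * d for d in diff) ** 0.5)

# Toy 2-d embeddings (illustrative values only)
paris = [1.0, 0.0]
is_capital_of = [0.0, 1.0]
france = [1.0, 1.0]
london = [5.0, 0.0]

# Paris + isCapitalOf lands exactly on France here...
good = transe_score(paris, is_capital_of, france)  # distance 0
# ...but far from London, so the true triple scores higher.
bad = transe_score(paris, is_capital_of, london)
print(good > bad)  # -> True
```

Training pushes true triples toward distance 0 and corrupted triples (random subject or object swaps) away, so ranking candidate objects by this score performs link prediction.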

Limitations: KGs are brittle, require manual curation, can't handle uncertainty or nuance well, and struggle with temporal facts. Hybrid approaches (KG + LLM) are an active research frontier.

[Figure: a small knowledge graph of (subject, predicate, object) triples: (Paris, isCapitalOf, France), (France, isPartOf, Europe), (London, isCapitalOf, UK), (UK, isPartOf, Europe). Highlighted triple: (Paris, isCapitalOf, France). TransE: Paris + isCapitalOf ≈ France.]

Why symbolic AI died: The search space for reasoning grows exponentially with problem size — the combinatorial explosion. Go has roughly 10¹⁷⁰ legal board positions. Brute-force search is impossible. This is why statistical (learning-based) methods replaced rule-based AI.

For ML practitioners, Big-O notation matters most when reasoning about scaling. Standard Transformer attention is O(n²) in sequence length — the main bottleneck for long-context models. Linear attention variants reduce this to O(n) but often sacrifice quality.

Operation | Complexity | Example in AI
Embedding lookup | O(1) | Token embedding in LLMs
Matrix multiply (n×n) | O(n³) | Dense layer forward pass
Self-attention | O(n²·d) | Transformer — main bottleneck
Linear attention | O(n·d²) | Mamba, RWKV, RetNet
Sorting | O(n log n) | Top-k token selection
k-NN search (brute) | O(n·d) | Vector DB exact search
ANN search (HNSW) | O(log n) | Vector DB approximate search
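The n²·d vs n·d² trade-off in the table is easy to check with arithmetic (a rough operation count ignoring constant factors; the function name is illustrative):

```python
def attention_cost(n, d):
    """Rough op counts for one attention layer over n tokens of
    dimension d: standard is O(n^2 * d), linear is O(n * d^2)."""
    return {"standard": n * n * d, "linear": n * d * d}

# Short sequence (n < d): standard attention is actually cheaper.
short_seq = attention_cost(n=256, d=1024)
# Long sequence (n >> d): the n^2 term dominates.
long_seq = attention_cost(n=100_000, d=1024)

print(short_seq["standard"] < short_seq["linear"])  # -> True
print(long_seq["standard"] / long_seq["linear"])    # -> 97.65625 (the n/d ratio)
```

The crossover sits at n = d: linear attention only pays off once the sequence is longer than the model dimension, which is exactly the long-context regime.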
Chapter 2.7 — Key Takeaways
  • Graphs encode relational structure — nodes, edges, and their properties represent social networks, molecules, computation flows, and knowledge bases.
  • Message passing is the universal GNN framework: aggregate neighbour information to update node representations — repeat for K layers to see K hops.
  • GNNs excel at molecular graphs (AlphaFold), knowledge graphs, and recommendation systems — self-attention is a GNN on a fully-connected graph.
  • Knowledge graphs store facts as (subject, predicate, object) triples, which can augment LLM generation via RAG with structured facts.
  • Standard Transformer attention is O(n²) in sequence length — the main scaling bottleneck driving research into linear attention alternatives.
Domain 2 — What You Now Know
  • Linear algebra: neural net layers = Wx + b; attention = QKᵀ; embeddings are vectors; tensors shape everything.
  • Calculus: backprop = chain rule applied recursively; gradient = direction of steepest ascent on the loss surface.
  • Probability: model outputs are distributions; LLM tokens are sampled from a categorical distribution over 50K+ items.
  • Bayesian inference: MSE = Gaussian MLE; cross-entropy = Bernoulli/categorical MLE; L2 regularisation = Gaussian prior (MAP).
  • Information theory: CE loss = −log(p_correct); minimising cross-entropy = minimising KL(data ‖ model).
  • Optimisation: Adam = momentum + adaptive LR; warmup + cosine annealing is the standard LLM training recipe.
  • Graphs: GNNs use message passing; knowledge graphs store (s, p, o) triples; self-attention = GNN on a complete graph.
  • These are not prerequisites — they are the explanation for why ML works.