AI Foundation · Domain 04 · Chapter 4.1

Neural Networks: Perceptron to MLP

How a single artificial neuron scales into the multi-layer networks that power modern AI

4.1
Chapter 4.1
Neural Networks: From Perceptron to MLP

A neural network is not a brain simulation. It is a function approximator: a mathematical machine that learns to map inputs to outputs by adjusting millions of numerical parameters. The biological metaphor is useful for intuition; the mathematics is what actually works.

Long before computers existed, scientists observed that the human brain processes information through a vast network of interconnected cells called neurons. Each biological neuron receives chemical signals through branching fibres called dendrites, integrates those signals in its cell body (the soma), and, if the combined signal exceeds an internal threshold, fires an electrical impulse along its axon to downstream cells. This "integrate and fire" mechanism, repeated across roughly 86 billion neurons with trillions of connections, gives rise to everything from reflex actions to abstract reasoning.

In 1943, McCulloch and Pitts created the first mathematical model of a neuron: a binary threshold unit that sums its inputs and outputs 1 if the sum exceeds a fixed threshold, 0 otherwise. The mapping from biology to mathematics is direct: dendrites become numeric inputs, synaptic strengths become weights, the soma becomes a weighted summation, and the axon firing becomes an activation function. This abstraction (inputs → weighted sum → activation) is still the foundation of every neural network today.

The analogy has important limits. Biological neurons communicate via discrete spikes; artificial neurons use continuous real-valued outputs. Biological learning involves complex biochemical processes; artificial networks learn by gradient descent on a loss function. The phrase "inspired by, not modelled after" is exactly right. Deep learning borrowed the high-level architecture of layered computation and discarded almost everything else in favour of mathematical tractability.

[Figure: Biological vs artificial neuron, the inspiration and the abstraction. Biological neuron: dendrites (inputs) → soma (Σ + threshold) → axon terminals (output). Artificial neuron: inputs x₁, x₂, x₃ with weights w₁, w₂, w₃ → weighted sum plus bias → activation f(·) → ŷ, where output = f(w₁x₁ + w₂x₂ + w₃x₃ + b).]

In 1957, Frank Rosenblatt at the Cornell Aeronautical Laboratory built the Perceptron, the first machine specifically designed to learn from examples. The idea was elegantly simple: represent a decision-making unit as a weighted sum of inputs passed through a step function. If the total weighted input exceeds a threshold, the unit fires (outputs 1); otherwise it stays silent (outputs 0). Crucially, the weights could be adjusted automatically when the unit made a mistake: this was the first learning algorithm for a neural model.

The perceptron structure has four components. First, numeric inputs x₁, x₂, …, xₙ: these could be pixel intensities, sensor readings, or any measurable feature. Second, a weight wᵢ for each input, representing how important that feature is. Third, a bias b that shifts the decision boundary independently of the inputs. Fourth, a step activation function that converts the raw weighted sum into a binary decision.

The perceptron learning rule is the ancestor of gradient descent. After every prediction, if the prediction was correct, do nothing. If the network predicted 0 but the true label was 1, increase each weight by a small fraction of the corresponding input. If the network predicted 1 but should have predicted 0, do the reverse. This simple rule has a remarkable theoretical guarantee: if the training data is linearly separable, the perceptron will converge to a correct solution in a finite number of steps. This is the Perceptron Convergence Theorem.

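Tracing the AND logic gate concretely: AND outputs 1 only when both inputs are 1. Start with all weights and the bias at 0. Under the rule "output 1 if w·x + b ≥ 0", the untrained unit outputs 1 for every input, so the first mistakes are on the negative cases, and each of those lowers the bias (and the weights of any active inputs). Once the bias is negative, (1,1) is misclassified as 0 and the update adds the inputs to the weights. After a few cycles the perceptron settles on a separating line. The exact values depend on the learning rate and the order of presentation; a representative solution is w₁=1, w₂=1, b=−1.5, which correctly separates AND's one positive case from the three negatives by the line x₁ + x₂ = 1.5.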

Perceptron: Decision Rule
  ŷ = step(w · x + b) = { 1 if w · x + b ≥ 0, 0 otherwise }
  w = weight vector · x = input vector · b = bias scalar · ŷ = predicted class

Update Rule (Perceptron Learning)
  wᵢ ← wᵢ + α(y − ŷ)xᵢ   for each weight i
  b ← b + α(y − ŷ)
  α = learning rate · y = true label · ŷ = predicted label · (y − ŷ) ∈ {−1, 0, +1}
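[Figure: Perceptron, AND gate implementation with learned weights. Left panel: perceptron structure with inputs x₁, x₂, weights w₁ = 1, w₂ = 1, bias b = −1.5, weighted sum z = w·x + b, activation step(z) (1 if z ≥ 0), output ŷ. Right panel: AND gate decision boundary w·x + b = 0 in the (x₁, x₂) plane, separating the AND = 0 region from the AND = 1 region.]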
```python
import numpy as np


class Perceptron:
    def __init__(self, lr=0.1, n_epochs=10):
        self.lr = lr              # learning rate (alpha in the update rule)
        self.n_epochs = n_epochs  # hard cap on passes over the data

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for epoch in range(self.n_epochs):
            for xi, yi in zip(X, y):
                y_hat = self.predict(xi)
                delta = self.lr * (yi - y_hat)  # zero when the prediction is correct
                self.w += delta * xi  # update weights
                self.b += delta       # update bias

    def predict(self, X):
        return np.where(np.dot(X, self.w) + self.b >= 0, 1, 0)


# AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(lr=0.1, n_epochs=10)
p.fit(X, y)
print(p.predict(X))  # [0 0 0 1] ✓
```
⚠ Common Pitfall: Perceptron

The convergence theorem only guarantees convergence if the data is linearly separable. For non-separable data, the algorithm loops forever, cycling through updates that never stabilise. Always set a maximum epoch limit and check whether the number of misclassified examples has stopped decreasing.

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis of what single-layer networks could and could not compute. Their central result was devastating: a single-layer perceptron cannot learn the XOR function. XOR outputs 1 when exactly one of two inputs is 1: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. The proof is geometric: there is no single straight line that can separate the two positive cases from the two negative cases. They form a checkerboard that is impossible to bisect with one hyperplane.

The Minsky-Papert result is widely credited with helping to trigger the first AI winter: funding dried up and neural network research sat largely dormant for roughly 15 years. The irony is that the book itself pointed toward the solution: adding hidden layers could overcome these limitations, but the authors doubted an efficient learning algorithm for such networks could be found. That algorithm, backpropagation, popularised by Rumelhart, Hinton, and Williams in 1986, became the key that unlocked the field.

The geometric resolution is illuminating. With a hidden layer, the network first learns two intermediate linear boundaries: one that isolates (1,1) and one that isolates (0,0). The hidden layer outputs encode whether the input is in each region. The output layer then combines these hidden representations to produce the XOR decision, a task that is linearly separable in the transformed space. This is the core insight of deep learning: each layer transforms the data into a representation where the next layer's job becomes easier.
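The two-boundary construction can be sketched with hand-picked weights. This is a minimal illustration, not a trained network: the thresholds 0.5 and 1.5 and the names `h_or`, `h_and`, and `xor_mlp` are assumptions chosen for readability.

```python
def step(z):
    """Step activation from the perceptron rule: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def xor_mlp(x1, x2):
    # Hidden unit 1: outputs 0 only for (0,0), so it isolates that corner.
    h_or = step(x1 + x2 - 0.5)
    # Hidden unit 2: outputs 1 only for (1,1), so it isolates that corner.
    h_and = step(x1 + x2 - 1.5)
    # In (h_or, h_and) space XOR is linearly separable:
    # fire when h_or is on and h_and is off.
    return step(h_or - h_and - 0.5)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), "->", xor_mlp(a, b))
```

Each hidden unit is an ordinary perceptron; only their composition produces the non-linear XOR decision.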

[Figure: XOR problem, why a single-layer perceptron is fundamentally limited. Three panels: AND (linearly separable), OR (linearly separable), XOR (not linearly separable; annotated as the cause of the 15-year AI winter).]

The XOR problem is not just a failure of the perceptron: it is a proof that any single linear classifier has a fundamental expressiveness limit. The solution is not a better linear classifier. The solution is composition: learn intermediate nonlinear representations, then combine them.

⚠ Common Pitfall: Linear Stacking

Even a deep stack of linear layers with no activation functions cannot solve XOR. A stack of linear transformations is itself a single linear transformation: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Non-linear activation functions between layers are the essential ingredient: without them, depth buys you nothing.
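The collapse identity can be checked numerically. The sketch below uses small made-up matrices (all values are illustrative assumptions) and plain Python lists, so it needs no libraries.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by column vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product A @ B for lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Illustrative parameters for two stacked linear layers (no activation).
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]
x = [1.5, -0.5]

# Two layers applied one after the other: W2(W1 x + b1) + b2
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: (W2 W1) x + (W2 b1 + b2)
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)  # the two results agree up to rounding
```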

The Multi-Layer Perceptron (MLP), also called a feedforward network or fully connected network, adds one or more hidden layers between the input and output. Every neuron in each layer connects to every neuron in the next layer. The key architectural decision is that a non-linear activation function is applied after each layer's weighted sum, breaking the linear chain that would otherwise collapse the whole network into a single affine transformation.

Each hidden layer acts as a feature detector. The first hidden layer learns combinations of the raw inputs: in image recognition, this might correspond to edges or local contrasts. The second hidden layer learns combinations of those features, perhaps corners and junctions. Deeper layers learn increasingly abstract concepts. This hierarchical representation learning is why depth is valuable: rather than memorising the training set, the network learns what to look for.

Architecture notation is typically given as a list of layer sizes. [4, 8, 8, 3] means 4 inputs, two hidden layers of 8 neurons, and 3 outputs. The total parameter count: for each layer, (inputs to that layer) × (neurons in that layer) weights plus one bias per neuron. A [4, 5, 4, 3] network has (4×5+5) + (5×4+4) + (4×3+3) = 25 + 24 + 15 = 64 parameters.
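The counting rule above is easy to mechanise. A minimal sketch, where `count_parameters` is a hypothetical helper name and the input is the layer-size list notation just described:

```python
def count_parameters(layer_sizes):
    """Total weights plus biases for a fully connected network.

    layer_sizes uses the list notation from the text, e.g. [4, 5, 4, 3].
    """
    total = 0
    # Pair each layer's input size with its output size.
    for d_in, d_out in zip(layer_sizes, layer_sizes[1:]):
        total += d_in * d_out + d_out  # weights + one bias per neuron
    return total


print(count_parameters([4, 5, 4, 3]))  # 64
print(count_parameters([4, 8, 8, 3]))  # 139
```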

[Figure: MLP [4→5→4→3] with all connections and one highlighted forward path (w = 0.8, 0.6, 0.9). Layers: Input (d = 4; x₁ to x₄), Hidden₁ (n = 5), Hidden₂ (n = 4), Output (k = 3; ŷ₁ to ŷ₃), with ReLU and Softmax activation labels. Parameter count: W₁: 4×5 = 20, W₂: 5×4 = 20, W₃: 4×3 = 12, giving 52 weights in total (64 parameters once the 12 biases are included).]
⚠ Common Pitfall: Forgetting Activation Functions

The most common beginner mistake when building an MLP is stacking nn.Linear layers without activation functions between them. Without non-linearity, no matter how many layers you add, the network can only learn linear decision boundaries. Always add a non-linear activation such as nn.ReLU() between every pair of linear layers.

The forward pass is the computation that transforms an input vector into a prediction by passing it sequentially through each layer. Understanding the forward pass precisely, including the shapes of every matrix and vector, is essential for debugging, designing architectures, and reasoning about computational cost.

For each layer l, the computation has two steps. First, compute the pre-activation Z by multiplying the previous layer's output A by the weight matrix W and adding a bias b. Second, apply the activation function f element-wise to Z to produce the output A of this layer. The output of the final layer is the network's prediction.

Layer l: Forward Pass
  Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾
  A⁽ˡ⁾ = f⁽ˡ⁾(Z⁽ˡ⁾)
  A⁽ˡ⁻¹⁾ = output of previous layer (or input X for l = 1) · W⁽ˡ⁾ = weight matrix [in × out]
  b⁽ˡ⁾ = bias vector [out] · f⁽ˡ⁾ = activation function (ReLU, Softmax, etc.)

Numerical Forward Pass Example: 2-layer MLP
Input: x = [2.0, 3.0]
W₁ = [[0.5, −0.3], [0.1, 0.8]],  b₁ = [0.1, −0.2]
Z₁ = x · W₁ + b₁
   = [2.0×0.5 + 3.0×0.1 + 0.1,  2.0×(−0.3) + 3.0×0.8 + (−0.2)]
   = [1.0 + 0.3 + 0.1,  −0.6 + 2.4 − 0.2]
Z₁ = [1.4, 1.6]
A₁ = ReLU(Z₁) = [max(0,1.4), max(0,1.6)]
A₁ = [1.4, 1.6] (both positive, ReLU passes through)
W₂ = [[0.4, 0.6], [−0.2, 0.3]],  b₂ = [0.0, 0.1]
Z₂ = A₁ · W₂ + b₂
   = [1.4×0.4 + 1.6×(−0.2) + 0.0,  1.4×0.6 + 1.6×0.3 + 0.1]
   = [0.56 − 0.32,  0.84 + 0.48 + 0.1]
Z₂ = [0.24, 1.42]
A₂ = Softmax([0.24, 1.42]) = exp([0.24,1.42]) / 5.408
A₂ = [0.235, 0.765] → 76.5% probability class 2, 23.5% class 1
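The worked example can be reproduced in a few lines of plain Python. The helper names `dense`, `relu`, and `softmax` are assumptions for this sketch; the numbers are the ones from the example above, using the same row-vector convention (Z = A · W + b).

```python
import math


def dense(a, W, b):
    """Row-vector layer: z_j = sum_i a_i * W[i][j] + b_j."""
    return [sum(a[i] * W[i][j] for i in range(len(a))) + b[j]
            for j in range(len(b))]


def relu(z):
    return [max(0.0, v) for v in z]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


x = [2.0, 3.0]
W1, b1 = [[0.5, -0.3], [0.1, 0.8]], [0.1, -0.2]
W2, b2 = [[0.4, 0.6], [-0.2, 0.3]], [0.0, 0.1]

z1 = dense(x, W1, b1)
a1 = relu(z1)
z2 = dense(a1, W2, b2)
a2 = softmax(z2)

print([round(v, 3) for v in z1])  # [1.4, 1.6]
print([round(v, 3) for v in z2])  # [0.24, 1.42]
print([round(v, 3) for v in a2])  # [0.235, 0.765]
```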
```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # W1: (input_dim x hidden_dim)
            nn.ReLU(),                          # non-linearity
            nn.Linear(hidden_dim, hidden_dim),  # W2: (hidden_dim x hidden_dim)
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),  # W3: (hidden_dim x output_dim)
        )

    def forward(self, x):
        return self.net(x)  # sequential forward pass


model = MLP(input_dim=784, hidden_dim=256, output_dim=10)
x = torch.randn(32, 784)  # batch of 32 MNIST-style images
output = model(x)         # shape: (32, 10) -- 10 class logits
print(f"Output shape: {output.shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output shape: torch.Size([32, 10])
# Parameters: 269,322
```
⚠ Common Pitfall: Shape Mismatches

The most common forward pass error is a matrix dimension mismatch. If layer l has input dimension d and output dimension k, then W has shape [d × k] and the output has shape [batch × k]. A common mistake is transposing weight matrices incorrectly or confusing input/output sizes when building layers manually without nn.Linear. Note that nn.Linear itself stores its weight as [k × d] and computes x · Wᵀ + b, so a weight copied out of a PyTorch layer must be transposed to match the [d × k] convention used here.

In 1989, George Cybenko proved a remarkable result about feedforward networks: a network with a single hidden layer using sigmoid-like activation functions, and sufficiently many neurons, can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. Hornik (1991) extended this to any non-constant, bounded, continuous activation function. This result, the Universal Approximation Theorem, gives MLPs their theoretical power: they are, in principle, general-purpose function approximators.

The intuition is geometric. A single neuron with a step function carves out a half-space. With many neurons, you can approximate arbitrary regions. With smooth activations, you build up smooth functions by summing many "bumps". As you add more neurons, the approximation gets finer: the diagram below shows a coarse 2-neuron step-wise approximation converging to the true function as width increases to 32.

The theorem has important caveats practitioners often overlook. It says a solution exists; it says nothing about whether gradient descent will find it, how many samples are needed, or whether the required network is computationally feasible. In practice, a single very wide layer may require exponentially more neurons than a deeper network to represent the same function. This is the practical motivation for depth: for many function classes, depth enables exponentially more efficient representation. In CNNs, this manifests as hierarchical feature detection: edges in layer 1, textures in layer 2, object parts in layer 3.

[Figure: Universal approximation, MLPs approximate any continuous function given sufficient width. Left panel: target function f(x) against x. Right panel: MLP approximations with 2, 8, and 32 neurons; more neurons give a smoother approximation.]

The Universal Approximation Theorem says any continuous function can be represented; it does not say gradient descent will find it. Expressiveness and learnability are different things. This is why depth, regularisation, and data quantity matter in practice.

Understanding the key design choices of an MLP, and the consequences of setting them poorly, is essential for practitioners. The table below summarises the principal hyperparameters, their typical ranges, and the effects of extreme values. Each will be explored in depth in subsequent chapters; develop an intuition for the tradeoffs now.

| Hyperparameter | Definition | Typical Range | Effect if Too Small | Effect if Too Large |
| --- | --- | --- | --- | --- |
| Number of layers (depth) | How many hidden layers | 2–50+ (hundreds with residual connections) | Underfits; limited representational power | Vanishing gradients; harder to train without tricks |
| Width (neurons per layer) | Nodes per hidden layer | 64–4096 | Underfits; insufficient capacity | Memory intensive; increased overfitting risk |
| Activation function | Non-linearity applied between layers | ReLU, GELU, Tanh, Sigmoid | n/a | n/a (see Chapter 4.2) |
| Batch size | Samples per gradient update | 16–2048 | Noisy gradients; slow wall-clock time | Sharp minima; poor generalisation to test set |
| Learning rate | Gradient step size (α) | 1e-4 to 1e-2 | Very slow convergence; appears stuck | Divergence; NaN loss; oscillating training |
| Dropout rate | Fraction of neurons randomly zeroed each step | 0.1–0.5 | No regularisation; model memorises training data | Too much information loss; underfitting |
🔢

Parameter Count Formula

For each layer with d_in inputs and d_out outputs:

  • Weights: d_in × d_out
  • Biases: d_out
  • Total: ∑ (d_in × d_out + d_out), summed over all layers
🧱

Modern Architecture Scale

Reference points for context:

  • MNIST MLP: ~0.5M params
  • ResNet-50: ~25M params
  • GPT-2 (smallest variant): ~117M params; the largest GPT-2 has ~1.5B
  • GPT-4 (est.): ~1.8T params
🎯

Where MLPs Appear Today

MLPs are fundamental building blocks:

  • Feed-forward layers in Transformers
  • Classification heads in CNNs
  • Value/policy networks in RL
  • Embedding projections

∑ Chapter 4.1 Summary: Neural Networks, Perceptron to MLP

  • Biological inspiration: dendrites (inputs) → soma (weighted sum + threshold) → axon (output); artificial neurons abstract this as inputs → weighted sum → activation function
  • Perceptron: ŷ = step(w·x + b) learns linear decision boundaries only; update rule wᵢ ← wᵢ + α(y−ŷ)xᵢ; guaranteed to converge only if the data is linearly separable
  • XOR problem (Minsky & Papert, 1969): a single-layer network cannot solve non-linearly separable problems; this result is widely credited with helping to trigger the roughly 15-year first AI winter
  • Solution: add hidden layers with non-linear activation functions, so that each layer learns intermediate representations; without non-linearity, stacked linear layers collapse to a single linear layer
  • Forward pass: Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾, then A⁽ˡ⁾ = f(Z⁽ˡ⁾), repeated layer by layer from input to output
  • Universal Approximation Theorem: an MLP with sufficient width can represent any continuous function on a compact domain, but expressiveness ≠ learnability; depth can make representation exponentially more efficient
  • Parameter count = ∑ (layer_in × layer_out + layer_out); even small networks have thousands of trainable parameters, and modern models have billions
4.2
Chapter 4.2
Activation Functions

Without non-linear activation functions, stacking any number of linear layers produces exactly one linear transformation. It is the activation function, applied element-wise after every layer, that gives neural networks their ability to learn arbitrarily complex mappings. Choosing the right activation is one of the most consequential architectural decisions you will make.

Consider two linear layers stacked directly: Z = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). The result is another linear function with a combined weight matrix W = W₂W₁ and a combined bias. No matter how many linear layers you stack, the composition remains a single affine (linear + shift) transformation. This means a deep linear network is no more expressive than a logistic regression: it can only learn straight hyperplane decision boundaries.

An activation function f applied between layers breaks this collapse: A₂ = W₂ · f(W₁x + b₁) + b₂. Now the result is genuinely non-linear and the two layers are no longer collapsible into one. A good activation function must satisfy three practical requirements: it must be non-linear (obviously), differentiable almost everywhere (so gradients can flow during backpropagation), and computationally cheap (it is applied millions of times per forward pass).

[Figure: Non-linearity is essential, linear layers without activation collapse into one. No activation: Linear → Linear gives W₂(W₁x) = (W₂W₁)x = Wx, with no new expressive power. With ReLU: Linear → ReLU → Linear gives W₂·ReLU(W₁x), which is non-linear and can learn non-linear functions. Third panel: output vs input for the linear and activated cases.]

The sigmoid function was the default activation in neural networks throughout the 1980s and 1990s. It takes any real-valued input and squashes it into the range (0, 1), which made it a natural fit for modelling probabilities. The S-shaped curve rises steeply near z = 0 and flattens toward 0 for very negative inputs and toward 1 for very positive inputs. This flattening is the source of its central problem in deep networks.

The derivative of σ(z) is elegantly expressed as σ(z)(1 − σ(z)). This has a maximum of 0.25 at z = 0, and falls to near-zero as |z| grows large. In a deep network, gradients are multiplied together as they propagate backward through layers. If most neurons saturate (i.e., z is large in magnitude), each multiplication by a derivative near 0 shrinks the gradient exponentially: this is the vanishing gradient problem. Even in the best case, where every derivative sits at its maximum of 0.25, a network with 10 sigmoid layers scales the gradient by 0.25¹⁰ ≈ 0.000001 (about 10⁻⁶) before it reaches the first layer.
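A short numerical check makes the scale of the problem concrete. This is a sketch of the activation-derivative factor only; it ignores the weight matrices, which also multiply into the real gradient. The function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))


# The derivative peaks at z = 0 and decays quickly as |z| grows.
for z in (0.0, 2.0, 4.0, 6.0):
    print(f"z = {z:>4}: sigma'(z) = {sigmoid_grad(z):.4f}")

# Best-case product of ten sigmoid derivatives (all at the 0.25 maximum).
best_case = 0.25 ** 10
print(best_case)  # 9.5367431640625e-07
```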

A second, subtler problem is that sigmoid outputs are never negative: they are always in (0, 1). As a result, the gradients for all weights feeding into a given neuron of the next layer share the same sign (all positive or all negative), which causes zig-zag updates in weight space. Tanh, which we examine next, solves this by being zero-centred. Today, sigmoid is used almost exclusively at the output layer of binary classifiers (where you genuinely want a probability) and in the gating mechanisms of LSTMs.

Sigmoid
  σ(z) = 1 / (1 + e^(−z))
  Output range: (0, 1) · σ'(z) = σ(z)(1 − σ(z)) · max derivative = 0.25 at z = 0

[Figure: Sigmoid, S-curve output in (0, 1) with near-zero gradient in the saturation zones. Left panel: σ(z) for z from −6 to +6, with σ(−4) ≈ 0.018, σ(0) = 0.5, σ(4) ≈ 0.982 and saturation (gradient ≈ 0) in both tails. Right panel: derivative σ'(z), maximum 0.25 at z = 0 and near zero in both tails, leading to vanishing gradients in deep nets.]
⚠ Common Pitfall: Sigmoid in Hidden Layers

Using sigmoid as the activation for hidden layers in deep networks very often causes the vanishing gradient problem. If your training loss stops decreasing very early and the gradients of the first layers are near zero when inspected, this is the most likely cause. Switch to ReLU or GELU for all hidden layers; reserve sigmoid for binary classification output only.

Tanh (hyperbolic tangent) is a scaled and shifted version of sigmoid: tanh(z) = 2σ(2z) − 1. It squashes inputs to (−1, +1) instead of (0, 1). The critical improvement over sigmoid is that tanh is zero-centred: its outputs are balanced around zero. When the activation outputs are always positive (as in sigmoid), the gradient updates to all weights feeding a given neuron in the next layer always have the same sign. This forces the optimiser into a zig-zag path through weight space. Tanh's zero-centred outputs allow positive and negative gradients, enabling more direct paths toward the minimum.

Tanh still suffers from the vanishing gradient problem for large |z|, where the derivative tanh'(z) = 1 − tanh²(z) approaches zero. The maximum derivative is 1.0 at z = 0, four times larger than sigmoid's maximum of 0.25, which makes it somewhat less prone to gradient collapse. Tanh remains the standard activation for the candidate and cell-state updates inside LSTM and GRU cells (the gates themselves use sigmoid), where its zero-centred outputs help regulate state updates.

Tanh
  tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)) = 2σ(2z) − 1
  Output range: (−1, +1) · tanh'(z) = 1 − tanh²(z) · max derivative = 1.0 at z = 0
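The identity relating tanh and sigmoid, and the derivative formula, can both be verified numerically. A minimal sketch using only the standard library; the sample points are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2  # tanh'(z) = 1 - tanh^2(z)


# Largest difference between tanh(z) and 2*sigmoid(2z) - 1 over a few points.
points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_diff = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in points)
print(max_diff)  # tiny: floating-point rounding only

print(tanh_grad(0.0))  # 1.0, the maximum derivative
print(tanh_grad(3.0))  # close to zero: tanh saturates too
```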
[Figure: Tanh vs sigmoid, zero-centred tanh has better gradient flow. Curves over z from −6 to +6: tanh(z) with range (−1, 1), zero-centred, and σ(z) with range (0, 1), always positive. Annotation: tanh(z) = 2σ(2z) − 1; outputs are balanced around 0, so gradients can be positive or negative.]

The Rectified Linear Unit (ReLU) is disarmingly simple: pass the input through unchanged if it is positive, otherwise output zero. This single function, brought to deep learning at scale by AlexNet in 2012, transformed the field. Before ReLU, training deep networks beyond 5 to 6 layers was very difficult due to vanishing gradients from sigmoid and tanh. ReLU's constant gradient of 1 for positive inputs means gradients flow freely through activated neurons, which, together with later techniques such as residual connections, enabled networks of 50, 100, or even 1000 layers.

ReLU also introduces sparse activation: when pre-activations are roughly centred on zero, about half of the neurons output exactly zero for any given input. This sparsity provides implicit regularisation, since only the "relevant" neurons participate in each forward pass. However, this same property creates the Dead ReLU problem: if a neuron's pre-activation is always negative (e.g., because the bias drifts negative during training), its gradient is permanently zero and it never recovers. In poorly initialised or high-learning-rate networks this can kill a substantial fraction of neurons (figures of 10 to 40% are commonly quoted).

Leaky ReLU fixes dead neurons by allowing a small negative slope α (typically 0.01) for z < 0, ensuring the gradient is never exactly zero. ELU (Exponential Linear Unit) goes further with a smooth exponential curve for negative inputs, producing outputs closer to zero-mean, which can improve convergence. PReLU (Parametric ReLU) treats α as a learnable parameter, letting the network decide the optimal negative slope per channel.

ReLU Family Formulas
  ReLU:        f(z) = max(0, z)               f'(z) = 1 if z > 0 else 0
  Leaky ReLU:  f(z) = max(αz, z), α = 0.01    f'(z) = 1 if z > 0 else α
  ELU:         f(z) = z if z > 0 else α(e^z − 1), α typically 1.0
  PReLU:       same as Leaky ReLU, but α is a learned parameter
  α = negative slope coefficient

[Figure: ReLU family, comparing ReLU, Leaky ReLU (α = 0.01), ELU (smooth), and PReLU (learned α) over z from −4 to +4. For z < 0, ReLU has a dead-neuron zone, while Leaky ReLU and ELU remain active.]
```python
import torch
import torch.nn as nn

# All ReLU variants in PyTorch
relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)
elu = nn.ELU(alpha=1.0)
prelu = nn.PReLU()  # alpha is a learned parameter

x = torch.randn(4)
print("Input:     ", x.tolist())
print("ReLU:      ", relu(x).tolist())   # negatives become 0
print("LeakyReLU: ", leaky(x).tolist())  # negatives scaled by 0.01
print("ELU:       ", elu(x).tolist())    # negatives -> alpha * (e^z - 1)


# Dead ReLU detection: measure zero activations after training
def count_dead_relu(model, x_sample):
    dead = 0
    total = 0
    hooks = []

    def hook(m, inp, out):
        nonlocal dead, total
        dead += (out == 0).sum().item()
        total += out.numel()

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(x_sample)
    for h in hooks:
        h.remove()
    # Fraction of zero ReLU outputs on this batch. A neuron is truly dead only
    # if it is zero for every input, so treat this as an upper bound.
    return dead / total if total else 0.0
```
⚠ Common Pitfall: Dead ReLU

If you use a high learning rate or bad weight initialisation, a large fraction of ReLU neurons can get stuck with permanently negative pre-activations: the "dead ReLU" problem. Gradients through these neurons are exactly zero, so they never recover. Signs: training loss stops improving but there is no NaN; inspecting neuron outputs shows many always-zero activations. Fix: use Leaky ReLU, reduce learning rate, or use proper He initialisation (nn.init.kaiming_normal_).

The Gaussian Error Linear Unit (GELU) was introduced by Hendrycks and Gimpel (2016) and quickly became the dominant activation in Transformer-based models. The key motivation: ReLU has a hard kink at z = 0, where the derivative jumps discontinuously from 0 to 1. GELU replaces this with a smooth curve by weighting the input by the probability that a standard Gaussian variable falls below it: f(z) = z · Φ(z), where Φ is the Gaussian CDF.

In practice, GELU is often computed via a fast approximation: f(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)]). This smooth transition means GELU has a continuous gradient everywhere, which empirically improves training stability for deep Transformers. GPT-2, GPT-3, BERT, and BART use GELU in their feed-forward sublayers, and many later large language models use GELU or closely related gated variants such as GEGLU and SwiGLU (the original T5 used ReLU).

Swish (also called SiLU, Sigmoid Linear Unit) is another smooth variant: f(z) = z · σ(z). The input gates itself: neurons with large positive values pass through at full strength, while negative values are softly suppressed. Swish is used in EfficientNet, MobileNetV3, and several LLM variants. Mish extends this idea: f(z) = z · tanh(softplus(z)), and has been reported to perform strongly on some computer vision benchmarks. All three share the property of being smooth, non-monotonic, and having a small negative dip for moderately negative inputs (a minimum of about −0.17 near z ≈ −0.75 for GELU and about −0.28 near z ≈ −1.28 for Swish), which is sometimes credited with a weak self-normalising effect.

GELU & Swish
  GELU:       f(z) = z · Φ(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])
  Swish/SiLU: f(z) = z · σ(z) = z / (1 + e^(−z))
  Φ = Gaussian CDF · σ = sigmoid · both are smooth and differentiable everywhere
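The formulas above translate directly into code. The sketch below compares exact GELU (written with the error function, since Φ(z) = 0.5(1 + erf(z/√2))) against the tanh approximation, and locates the negative dip of GELU and Swish on a grid. The grid spacing of 0.01 and the function names are assumptions.

```python
import math


def gelu_exact(z):
    # z * Phi(z), with the Gaussian CDF expressed through erf
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_tanh(z):
    # Fast tanh-based approximation from the text
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


def swish(z):
    return z / (1.0 + math.exp(-z))  # z * sigmoid(z)


grid = [i / 100.0 for i in range(-600, 601)]  # z from -6 to 6

# Worst-case gap between the exact form and the approximation on the grid.
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - approx| = {max_gap:.6f}")

# Depth and location of the negative dip for each function.
z_gelu = min(grid, key=gelu_exact)
z_swish = min(grid, key=swish)
print(f"GELU  min {gelu_exact(z_gelu):.3f} at z = {z_gelu}")
print(f"Swish min {swish(z_swish):.3f} at z = {z_swish}")
```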
[Figure: GELU vs ReLU, smooth activation improves Transformer training. Curves over z from −4 to +4 for ReLU, GELU (GPT/BERT), and Swish/SiLU (EfficientNet), with a zoom on z ∈ [−1, 0] contrasting ReLU's sharp kink with GELU's smooth curve.]

GELU is to Transformers what ReLU is to CNNs: the empirically dominant choice. Its gradient is smooth everywhere and non-zero for moderately negative inputs, which makes dead neurons far less likely and allows stable training at great depth. If you are building any Transformer-based model (language, vision, or multimodal), start with GELU.

Softmax is not an activation function in the same sense as ReLU or GELU: it is not applied element-wise independently to each neuron. Instead, it is a normalisation operation over an entire output vector, converting a vector of raw logits (unbounded real numbers) into a valid probability distribution. Every output is positive, and all outputs sum to exactly 1.0, making the output directly interpretable as class probabilities.

Softmax amplifies differences between logits. The largest logit receives a disproportionately high probability, because the exponentiation makes differences exponential before normalisation. With logits [3.0, 1.0, 0.5], the first class receives about 82% of the probability mass, against roughly 11% and 7% for the other two. Subtracting the maximum logit before computing exponentials (the max trick) prevents numerical overflow without changing the output: softmax(z − max(z)) = softmax(z).

The temperature parameter T controls the sharpness of the distribution: softmax(z/T). With T → 0, the distribution collapses to a one-hot (greedy) selection of the highest logit. With T → ∞, it becomes uniform. In LLM token sampling, temperature is a key knob: a value such as T = 0.7 typically gives varied but coherent text, while T = 1.5 gives more random, diverse outputs. At training time T = 1 is almost always used.

Softmax
  p(k) = e^(z_k) / Σⱼ e^(z_j)   for k = 1, …, K
  With temperature T: p(k) = e^(z_k / T) / Σⱼ e^(z_j / T)
  z = logit vector (raw network output) · outputs ∈ (0, 1) · Σ p(k) = 1
  Numerical stability: compute softmax(z − max(z)) so that e^(large) cannot overflow
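A minimal sketch of softmax with temperature and the max trick, using only the standard library. The logits [3.0, 1.0, 0.5] come from the text; the temperature values and the large logits in the last line are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax(z / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # max trick: shift so the largest is 0
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.5]
print([round(p, 3) for p in softmax(logits)])  # [0.821, 0.111, 0.067]

# Lower temperature sharpens the distribution; higher temperature flattens it.
for T in (0.1, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax(logits, T)])

# Without the shift, math.exp(1000.0) would raise OverflowError.
print(softmax([1000.0, 999.0]))
```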
[Figure: Softmax temperature over four classes C1 to C4. T = 0.1: very sharp, near greedy (top probability 0.97). T = 1.0: standard softmax (top probabilities 0.60 and 0.22). T = 2.0: flatter, more diverse (top probabilities 0.40 and 0.28).]

In PyTorch, never apply softmax before passing logits to nn.CrossEntropyLoss: this loss already applies log-softmax internally for numerical stability. Applying softmax beforehand makes the loss compute log_softmax(softmax(logits)), a double softmax that squeezes its inputs into (0, 1), flattens the predicted distribution, and weakens the gradients. Always pass raw logits to CrossEntropyLoss.

⚠ Common Pitfall: Softmax + CrossEntropyLoss Double Application

A very common bug: applying torch.softmax(logits, dim=-1) in the model's forward method, then passing the result to nn.CrossEntropyLoss. Since CrossEntropyLoss internally calls log_softmax, you end up computing log_softmax(softmax(logits)) instead of log_softmax(logits). Training still runs without errors, but the loss cannot approach zero and the gradients are much smaller than they should be, so convergence is slow or stalls. Always output raw logits from the model.
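The effect of the double application can be seen without PyTorch. The sketch below re-implements a log-softmax-based cross-entropy in plain Python (a stand-in for the real loss, not PyTorch's code) and feeds it raw logits versus already-softmaxed values. The logits are an illustrative assumption.

```python
import math


def log_softmax(z):
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]


def softmax(z):
    return [math.exp(v) for v in log_softmax(z)]


def cross_entropy(inputs, target):
    # Stand-in for a loss that applies log-softmax to whatever it receives.
    return -log_softmax(inputs)[target]


logits = [5.0, 0.0, 0.0]  # a confident, correct prediction for class 0
correct = cross_entropy(logits, target=0)
double = cross_entropy(softmax(logits), target=0)  # the bug

print(f"raw logits:     loss = {correct:.4f}")
print(f"double softmax: loss = {double:.4f}")

# Even a perfect one-hot prediction cannot push the buggy loss to zero.
floor = cross_entropy([1.0, 0.0, 0.0], target=0)
print(f"lowest reachable buggy loss for 3 classes = {floor:.4f}")
```

Because the softmax outputs are confined to (0, 1), the second softmax can never become sharply peaked, which is why the buggy loss has a floor well above zero.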

The choice of activation function is one of the most important and most misunderstood hyperparameters. The practical rule is simple: use ReLU as your baseline for CNNs and general MLPs, switch to GELU for anything Transformer-based, use Sigmoid only at binary classification output, and use Softmax only at multi-class output. The table below summarises when and why to use each.

| Activation | Range | Vanishing Gradient | Zero-Centred | Where Used | Default Choice? |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | Yes, severe | No | Binary output, LSTM gates | Only for binary output |
| Tanh | (−1, 1) | Yes, moderate | Yes | LSTM/GRU state updates, RNNs | Legacy RNNs only |
| ReLU | [0, ∞) | No (for positive inputs) | No | CNNs, MLPs (pre-2018) | ✓ CNNs still |
| Leaky ReLU | (−∞, ∞) | No | Near | When dead neurons are a problem | Good fallback |
| GELU | (−0.17, ∞) | No | Near | GPT, BERT, most Transformers | ✓ Transformers |
| Swish/SiLU | (−0.28, ∞) | No | Near | EfficientNet, some LLMs | ✓ Modern CNNs |
| Softmax | (0, 1), Σ = 1 | n/a | No | Multi-class output layer only | Only for output |

∑ Chapter 4.2 Summary: Activation Functions

  • Without non-linear activations, stacking layers is still a single linear transformation, so depth buys nothing in expressive power
  • Sigmoid σ(z) = 1/(1+e⁻ᶻ): saturates → vanishing gradients; not zero-centred; use only for binary classification output
  • Tanh: same saturation problem but zero-centred, so gradient flow is better; still used inside LSTM/GRU cells
  • ReLU: max(0, z); fast, no saturation for positive inputs, default for CNNs; suffers from Dead ReLU (permanently zero neurons)
  • Leaky ReLU fixes dead neurons with a small negative slope α; good fallback when ReLU causes training issues
  • GELU: smooth ReLU variant f(z) = z·Φ(z); used in GPT, BERT, and virtually all modern Transformers; smooth gradient everywhere
  • Softmax: multi-class output only; temperature T controls the sharpness of the probability distribution; never apply before CrossEntropyLoss
4.3
Chapter 4.3
Backpropagation & Gradient Flow

Backpropagation is not magic: it is the chain rule of calculus applied systematically to a computational graph. The genius is not the mathematics (which dates to Leibniz) but the engineering insight that all gradients in a network can be computed in a single backward pass, at a cost comparable to one forward pass. Without this, deep learning would be computationally impossible.

The central question of training is: "For every weight in the network, how much does the loss change if I nudge that weight by a tiny amount?" This quantity, the partial derivative of the loss with respect to each weight, is the gradient. To reduce the loss, we move each weight in the direction opposite to its gradient.

The naive approach is finite differences: for each weight w, compute (loss(w + ε) − loss(w)) / ε. This gives an approximate gradient for that weight. The problem is scale. GPT-4 has an estimated (unofficial) 1.8 trillion parameters. Computing one gradient update this way requires 1.8 trillion forward passes; at, say, 1 second per pass on a cluster, that is about 57,000 years per update step. Completely impossible.

Backpropagation solves this by computing all gradients in a single backward pass through the computational graph. The backward pass costs about the same as the forward pass (within a small constant factor), because it visits the same operations in reverse. The key ingredient is the chain rule, which tells us how to compose local gradients as they flow backward from the loss to the inputs.

Finite differences: O(W) forward passes for W weights. Backpropagation: ONE backward pass for all W weights simultaneously. This efficiency gap, many orders of magnitude, is what makes modern deep learning possible.
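The two approaches can be compared directly on a toy problem. The following is a minimal sketch with a hypothetical five-weight loss function (`loss_fn` is an illustrative name); it uses float64 so the finite-difference estimate is not dominated by rounding error.

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, dtype=torch.float64, requires_grad=True)
x = torch.randn(5, dtype=torch.float64)

def loss_fn(weights):
    # Toy scalar loss: squared error of a single linear unit against target 1.0
    return (torch.dot(weights, x) - 1.0) ** 2

# Backpropagation: one backward pass yields every partial derivative.
loss_fn(w).backward()
grad_backprop = w.grad.clone()

# Finite differences: one extra forward pass PER weight.
eps = 1e-6
grad_fd = torch.zeros_like(grad_backprop)
with torch.no_grad():
    base = loss_fn(w)
    for i in range(w.numel()):
        w_shift = w.clone()
        w_shift[i] += eps
        grad_fd[i] = (loss_fn(w_shift) - base) / eps

print(grad_backprop)
print(grad_fd)
```

With five weights the loop is harmless; the point is that its cost grows linearly with the number of weights, while the backward pass does not.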

Every computation a neural network performs can be represented as a directed acyclic graph (DAG). Each node in the graph is a mathematical operation: addition, multiplication, exp, sigmoid, max. Each directed edge carries a tensor value from one operation to the next. The leaf nodes on the left are the inputs and weights; the single root node on the right is the scalar loss.

The forward pass is data flowing left to right through this graph: compute z₁ = w × x, then z₂ = z₁ − y, then L = z₂². Each intermediate value is stored (this is why training uses more memory than inference). The backward pass is gradients flowing right to left, starting with ∂L/∂L = 1 and applying the chain rule at each node. Every node knows how to compute its local gradient (e.g., the gradient through a multiplication node is the other operand), and backprop just multiplies local gradients together along each path.

PyTorch builds this graph dynamically as you execute Python code: every operation on a tensor with requires_grad=True records itself into the graph. When you call loss.backward(), PyTorch traverses the graph in reverse topological order and accumulates gradients into each leaf tensor's .grad attribute. JAX uses a slightly different approach (function transformation) but the computational graph concept is identical.

Computational Graph: forward values and backward gradients for L = (w·x − y)²
[Figure: forward pass, left to right: w = 2, x = 3, y = 5; z₁ = w × x = 6; z₂ = z₁ − y = 1; L = z₂² = 1. Backward pass, right to left: ∂L/∂L = 1; ∂L/∂z₂ = 2z₂ = 2; ∂L/∂z₁ = 2; ∂L/∂y = −2; ∂L/∂w = 2·x = 6; ∂L/∂x = 2·w = 4.]
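The numbers in the graph can be checked with autograd. This short sketch reproduces the same values of w, x and y and lets PyTorch run the backward pass.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)

# Forward pass: each operation is recorded in the graph
z1 = w * x          # multiplication node
z2 = z1 - y         # subtraction node
L = z2 ** 2         # squaring node (scalar loss)

# Backward pass: the chain rule applied from L back to the leaves
L.backward()

print(L.item(), w.grad.item(), x.grad.item(), y.grad.item())
```

The gradient through the multiplication node is the other operand scaled by the upstream gradient, which is why ∂L/∂w depends on x and ∂L/∂x depends on w.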

The chain rule is calculus's rule for differentiating composed functions: if L = f(g(x)), then dL/dx = (dL/dg) · (dg/dx). In a neural network, every layer is a composed function. The loss is a composition of all the layer operations stacked together. Backprop is simply the chain rule applied methodically in reverse order through every layer.

For a single layer l with pre-activation Z⁽ˡ⁾ = W⁽ˡ⁾ · A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ (column-vector convention, as used in the key equations of this section) and output A⁽ˡ⁾ = f(Z⁽ˡ⁾), the gradient of the loss with respect to the weights W⁽ˡ⁾ decomposes into three factors by the chain rule: how the loss changes with the activation, how the activation changes with the pre-activation (the derivative of the activation function), and how the pre-activation changes with the weights (which is simply A⁽ˡ⁻¹⁾). Multiplied together, these give the weight gradient for that layer.

The error signal δ⁽ˡ⁾ is the gradient of the loss with respect to the pre-activation Z⁽ˡ⁾. It packages the chain rule product up to layer l. To propagate backward one more layer, we multiply δ⁽ˡ⁾ by the weight matrix W⁽ˡ⁾ transposed (to "route" gradients back to the correct inputs), then element-wise multiply by f'(Z⁽ˡ⁻¹⁾), the local derivative of the activation, to obtain δ⁽ˡ⁻¹⁾. This recursion continues all the way to the first layer.

Backpropagation: Key Equations
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(A^{(l-1)}\right)^{\top} \quad \text{(weight gradient)}$$
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} \quad \text{(bias gradient, summed over the batch)}$$
$$\delta^{(l)} = \left(\left(W^{(l+1)}\right)^{\top} \delta^{(l+1)}\right) \odot f'\!\left(Z^{(l)}\right) \quad \text{(error signal)}$$
δ⁽ˡ⁾ = error signal at layer l · ⊙ = element-wise multiply · f' = activation derivative
Full chain:
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial A^{(l)}} \cdot \frac{\partial A^{(l)}}{\partial Z^{(l)}} \cdot \frac{\partial Z^{(l)}}{\partial W^{(l)}}$$
Gradient Flow: backward signals multiply local gradients at each layer
[Figure: network Input A⁽⁰⁾ → Hidden 1 (Z⁽¹⁾, A⁽¹⁾) → Hidden 2 (Z⁽²⁾, A⁽²⁾) → Output → Loss L, with weights W⁽¹⁾, W⁽²⁾, W⁽³⁾. Forward pass in blue; backward pass in red, with opacity showing gradient magnitude, starting from ∂L/∂L = 1 and fading with each layer. Error signals: δ⁽²⁾ = (W⁽³⁾)ᵀδ⁽³⁾ ⊙ f', δ⁽¹⁾ = (W⁽²⁾)ᵀδ⁽²⁾ ⊙ f'. Weight gradients: ∂L/∂W⁽³⁾ = δ⁽³⁾·(A⁽²⁾)ᵀ, ∂L/∂W⁽²⁾ = δ⁽²⁾·(A⁽¹⁾)ᵀ, ∂L/∂W⁽¹⁾ = δ⁽¹⁾·(A⁽⁰⁾)ᵀ.]
import torch

# Simple 2-layer network: verify PyTorch autograd
x = torch.tensor([[1.0, 2.0]])                 # input (1×2)
W1 = torch.randn(2, 3, requires_grad=True)
b1 = torch.randn(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

# Forward pass: PyTorch silently builds the computational graph
z1 = x @ W1 + b1       # (1×3) pre-activation
a1 = torch.relu(z1)    # (1×3) activation
z2 = a1 @ W2 + b2      # (1×1) output logit
loss = z2.sum()        # scalar loss

# Backward pass: ONE call computes ALL gradients
loss.backward()

print(f"dL/dW1 shape: {W1.grad.shape}")   # torch.Size([2, 3]), same as W1
print(f"dL/dW2 shape: {W2.grad.shape}")   # torch.Size([3, 1]), same as W2
print(f"dL/db1 shape: {b1.grad.shape}")   # torch.Size([3]), same as b1

# Gradients accumulate across backward() calls, so in training loops
# always call optimizer.zero_grad() before loss.backward()!
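To connect the key equations to what autograd does, the backward pass of a small network can also be written by hand and compared. This is a sketch under stated assumptions: a two-layer ReLU network with a mean-squared-error loss, random data, and the row-vector convention (Z = A·W + b) that PyTorch code normally uses, so the transposes appear on the opposite side compared with the column-vector equations.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 2)                        # batch of 4 inputs, 2 features
y = torch.randn(4, 1)                        # regression targets (made up)
W1 = torch.randn(2, 3, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

# Forward pass
z1 = x @ W1 + b1
a1 = torch.relu(z1)
z2 = a1 @ W2 + b2
loss = ((z2 - y) ** 2).mean()
loss.backward()                              # autograd reference gradients

# Manual backward pass using the error-signal recursion
with torch.no_grad():
    delta2 = 2.0 * (z2 - y) / y.numel()      # dL/dz2 for the mean squared error
    dW2 = a1.T @ delta2                      # input activations (transposed) times error signal
    db2 = delta2.sum(dim=0)                  # bias gradient: sum over the batch
    delta1 = (delta2 @ W2.T) * (z1 > 0).float()   # route back through W2, then ReLU derivative
    dW1 = x.T @ delta1
    db1 = delta1.sum(dim=0)

print(torch.allclose(dW1, W1.grad), torch.allclose(dW2, W2.grad))
```

The ReLU derivative is taken as 0 at exactly z = 0, which matches PyTorch's choice.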
⚠ Common Pitfall: Gradient Accumulation Bug

PyTorch accumulates (adds) gradients into .grad by default; it does not overwrite them. If you call loss.backward() twice without calling optimizer.zero_grad() in between, the gradients double. The canonical training loop order is always: zero_grad → forward → loss → backward → step. Gradient accumulation over multiple mini-batches is an intentional use of this behaviour, but it must be explicit.
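The accumulation behaviour is easy to reproduce on a single scalar parameter. The sketch below clears the gradient manually with `.grad.zero_()`, which stands in for what `optimizer.zero_grad()` achieves in a real loop.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()          # d(w^2)/dw = 2w

(w ** 2).backward()            # no zeroing in between
doubled = w.grad.item()        # the second gradient was ADDED to the first

w.grad.zero_()                 # clear the stored gradient by hand
(w ** 2).backward()
after_reset = w.grad.item()

print(first, doubled, after_reset)
```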

Gradients propagate backward by multiplication. If the gradient at each layer is a number less than 1, repeated multiplication makes the product shrink exponentially. Sigmoid's maximum derivative is 0.25. In a 10-layer sigmoid network, the gradient arriving at layer 1 has been multiplied by at most 0.25 per layer, giving 0.25¹⁰ ≈ 9.5 × 10⁻⁷, effectively zero. The first layers receive no gradient signal and learn nothing while the last few layers update normally.

This is why networks deeper than 5–6 layers were impractical before 2012. The symptom is clear in training: the loss decreases at first but then plateaus far above the optimum, and inspecting per-layer gradients shows near-zero values in the early layers. The activations in these layers also collapse: either all outputs are near 0 or near 1 (for sigmoid), with near-zero variance.

The primary solutions in order of importance: (1) ReLU activations: the gradient is exactly 1 for positive inputs, breaking the exponential decay. (2) Residual connections (ResNet, Ch 4.5): add a "skip" path that carries gradients directly from the loss to early layers, bypassing the layer multiplications entirely. (3) Batch Normalisation (Ch 4.4): normalises activations to prevent saturation. (4) He initialisation: initialises weights to maintain gradient scale across layers.

Gradient Magnitude: Sigmoid vs ReLU across 10 layers
Sigmoid (max derivative = 0.25):
  Layer 10 (near output): gradient ≈ 0.25¹ = 0.250
  Layer 8: gradient ≈ 0.25³ = 0.016
  Layer 5: gradient ≈ 0.25⁶ = 2.4 × 10⁻⁴
  Layer 1 (first layer): gradient ≈ 0.25¹⁰ = 9.5 × 10⁻⁷  ← effectively zero
ReLU (derivative = 1 for active neurons):
  Layer 10: gradient ≈ 1.0
  Layer 1: gradient ≈ 1.0  ← same order of magnitude → learning in all layers
Vanishing Gradients: why sigmoid kills deep network training
[Figure: gradient magnitude (0 to 1.0) plotted from layer 1 (early) to layer 10 (near output). Sigmoid: gradient collapses to ~0 by layer 4. ReLU: gradient stays roughly constant at about 1.0 across all layers.]
⚠ Common Pitfall: Diagnosing Vanishing Gradients

The telltale sign: training loss stops improving very early, even with sufficient model capacity and data. To confirm, log the gradient norm per layer: for name, p in model.named_parameters(): print(name, p.grad.norm()). If early-layer norms are 10⁻⁶ or smaller while final-layer norms are ~1.0, you have a vanishing gradient problem. First fix: switch sigmoid → ReLU. Second fix: add residual connections.
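The diagnosis described above can be tried on a synthetic network. The sketch below is an illustration only: the helper `first_layer_grad_norm`, the depth and width, the random input, and the pairing of Xavier init with sigmoid and He init with ReLU are all assumptions made for the experiment.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation, init_fn, depth=10, width=32, seed=0):
    """Build a deep MLP, run one backward pass, return the first layer's gradient norm."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, activation()]
    model = nn.Sequential(*layers, nn.Linear(width, 1))
    x = torch.randn(64, width)
    model(x).sum().backward()
    return model[0].weight.grad.norm().item()

sigmoid_norm = first_layer_grad_norm(nn.Sigmoid, nn.init.xavier_normal_)
relu_norm = first_layer_grad_norm(nn.ReLU, nn.init.kaiming_normal_)

print(f"sigmoid first-layer grad norm: {sigmoid_norm:.3e}")
print(f"relu    first-layer grad norm: {relu_norm:.3e}")
```

The sigmoid network's first-layer gradient should come out several orders of magnitude smaller than the ReLU network's, which is the signature to look for in real training logs.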

The opposite pathology occurs when the gradient magnitudes grow exponentially as they propagate backward: if the weight matrices have large singular values, each multiplication amplifies rather than shrinks the gradient. This is especially common in Recurrent Neural Networks (RNNs) processing long sequences: for a 100-step sequence, the gradient at time step 1 is the product of 100 Jacobian matrices, and if each has norm slightly above 1, the product grows exponentially.

The symptom is unmistakable: the loss goes to NaN within the first few training steps, and weights become inf. The standard fix is gradient clipping: compute the global norm of all gradients, and if it exceeds a threshold, scale all gradients down proportionally. This preserves the direction of the gradient update but caps its magnitude. A clipping value of 1.0 is a widely used default.

Gradient Clipping
$$\text{if } \lVert g \rVert_2 > c: \quad g \leftarrow g \cdot \frac{c}{\lVert g \rVert_2}$$
g = concatenation of all gradient tensors as a flat vector · ‖g‖₂ = L2 norm · c = clip_value
Direction preserved; only the magnitude is capped. clip_value = 1.0 is the standard default.
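The clipping rule can be written out directly. This is a minimal NumPy sketch of global-norm clipping, not the PyTorch implementation; the function name `clip_by_global_norm` and the small constant added to the denominator are assumptions for the example.

```python
import numpy as np

def clip_by_global_norm(grads, clip_value):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_value."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, clip_value / (total_norm + 1e-6))   # never scale up
    return [g * scale for g in grads], total_norm

# Two hypothetical gradient tensors whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, clip_value=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before, norm_after)
```

All tensors are scaled by the same factor, so the ratio between any two gradient components is unchanged.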
import torch
import torch.nn as nn

# Canonical training loop with gradient clipping.
# `dataloader` is assumed to yield (x, y) batches with x of shape
# (batch, seq_len, 64) and y of shape (batch, seq_len, 128).
model = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()   # assumed loss, chosen for illustration

for x, y in dataloader:
    optimizer.zero_grad()              # 1. clear old gradients
    output, _ = model(x)
    loss = criterion(output, y)
    loss.backward()                    # 2. compute gradients

    # 3. clip before step to prevent exploding-gradient NaNs.
    #    clip_grad_norm_ returns the total norm measured BEFORE clipping.
    total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()                   # 4. update weights

    # Monitor the pre-clip norm to detect explosion early
    print(f"Gradient norm: {float(total_norm):.4f}")   # >10 → suspicious; >100 → exploding

Symptoms of Explosion

  • Loss jumps to NaN
  • Weights become inf
  • Gradient norm > 100
  • Loss erratic, huge oscillations

Solutions

  • Gradient clipping (max_norm=1.0)
  • Lower learning rate
  • LSTM/GRU gating (Ch 4.6)
  • Layer normalisation

Monitoring Gradients

  • Log gradient norm per step
  • Use WandB/TensorBoard
  • Check for inf/NaN in params
  • Early layers vs late layers

∑ Chapter 4.3 Summary: Backpropagation & Gradient Flow

  • Backprop answers: how does the loss change w.r.t. every single weight? It does so in one backward pass, at a cost comparable to one forward pass
  • Computational graph: every operation is a node; the forward pass computes values; gradients flow backward through edges via the chain rule
  • Chain rule: ∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ, i.e. upstream error signal × input activations transposed
  • Vanishing: sigmoid derivatives multiply to near-zero in deep networks (0.25¹⁰ ≈ 10⁻⁶) → ReLU, residual connections, BatchNorm solve this
  • Exploding: large weight matrices multiply gradients until the loss becomes NaN → clip_grad_norm_(max_norm=1.0) is the standard fix
  • PyTorch autograd: dynamic computation graph; .backward() computes all gradients; always call zero_grad() before each backward pass
4.4
Chapter 4.4
Training Deep Networks

A neural network architecture is only half the story. The other half is the engineering that makes it trainable: how weights are initialised, how activations are kept stable, how overfitting is controlled, and how the optimiser navigates the loss landscape. These techniques are what separate a network that converges from one that never learns at all.

Before a single training example is shown, every weight must be given a starting value. This choice has enormous consequences. If all weights start at zero, every neuron in a layer computes exactly the same function and receives exactly the same gradient; no matter how many epochs you train, all neurons in a layer remain identical forever. This is the symmetry breaking problem: weights must differ to learn different features.

Initialising with random values breaks symmetry, but the variance of those values is critical. If weights are too small, activations shrink exponentially with depth: by layer 10, inputs have collapsed to near zero and there is no gradient signal. If weights are too large, activations explode exponentially: inputs saturate sigmoid/tanh and gradients vanish, or the network numerically overflows. The goal is to choose a variance that keeps activation magnitudes approximately stable across all layers.

Xavier/Glorot initialisation (Glorot & Bengio, 2010) derives the optimal variance analytically for linear activations and symmetric non-linearities like Tanh. It sets the weight variance to 2/(nᵢₙ + nₒᵤₜ), balancing the signal variance across both forward and backward passes. He/Kaiming initialisation (He et al., 2015) adjusts for the fact that ReLU zeroes roughly half of all activations, which halves the effective variance. He init compensates by scaling the standard deviation up by √2, using variance 2/nᵢₙ. For any ReLU-based network, He initialisation is the correct default.

Weight Initialisation Formulas
$$\text{Xavier (Glorot):}\quad W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \quad \text{for Tanh / Sigmoid}$$
$$\text{He (Kaiming):}\quad W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right) \quad \text{for ReLU (most common)}$$
$$\text{LeCun:}\quad W \sim \mathcal{N}\!\left(0,\ \frac{1}{n_{\text{in}}}\right) \quad \text{for SELU}$$
nᵢₙ = fan-in (inputs to neuron) · nₒᵤₜ = fan-out (outputs from neuron) · the second argument of N is the variance
Weight Initialisation: variance stability across 10 layers
[Figure: activation variance (log scale, 0 to 100+, target 1.0) against layer depth 1 to 10. Too large init → exploding activations. Too small init → vanishing activations. Zeros init → symmetry failure. Xavier init → stable variance. He init (ReLU) → stable variance.]
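The stability claim can be checked numerically. The sketch below pushes random data through ten bias-free ReLU layers under three weight scales; the helper name `final_activation_std`, the layer width, and the specific "too small" and "too large" standard deviations are assumptions for the experiment.

```python
import numpy as np

def final_activation_std(weight_std_fn, depth=10, width=256, seed=0):
    """Standard deviation of the activations after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))                 # unit-variance inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0.0, a @ W)                        # linear layer (no bias) + ReLU
    return a.std()

too_small = final_activation_std(lambda n: 0.01)              # fixed tiny std
he_init = final_activation_std(lambda n: np.sqrt(2.0 / n))    # He: variance 2 / fan_in
too_large = final_activation_std(lambda n: 1.0)               # fixed large std

print(f"too small: {too_small:.3e}")
print(f"He init:   {he_init:.3e}")
print(f"too large: {too_large:.3e}")
```

Only the He-scaled weights keep the activation scale within an order of magnitude of the input; the other two drift exponentially with depth.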
import torch.nn as nn

layer = nn.Linear(256, 128)

# The three initialisers below are alternatives; each call overwrites the previous one.

# Xavier uniform: suited to linear/tanh layers
nn.init.xavier_uniform_(layer.weight)

# He / Kaiming: correct for ReLU networks (most common in practice)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# GPT-2 style: small normal, used by many modern Transformers
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# Apply He init to all Linear layers in a model
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Example model so the snippet runs on its own
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)   # applies recursively to all submodules
⚠ Common Pitfall: Wrong Init for Activation

Using Xavier init with ReLU or He init with Tanh gives suboptimal results. The mismatch causes activation variance to drift across layers. Rule: He/Kaiming for ReLU/LeakyReLU/GELU networks, Xavier/Glorot for Tanh/Sigmoid networks. When in doubt, use He, since most modern networks use ReLU-family activations.

Ioffe and Szegedy (2015) diagnosed a key training instability they called internal covariate shift: as the parameters of layer l change during training, the distribution of inputs seen by layer l+1 shifts. The later layer must constantly readjust to its changing input distribution, slowing convergence. Their solution, Batch Normalisation, normalises each layer's pre-activation values across the mini-batch, forcing them to approximately zero mean and unit variance regardless of what the previous layer learned.

The normalisation has four steps. First, compute the mean μ_B of the current mini-batch. Second, compute its variance σ²_B. Third, subtract the mean and divide by the standard deviation to get x̂, a zero-mean, unit-variance vector. Fourth, and critically, apply a learnable scale γ and shift β: y = γx̂ + β. These learned parameters let the network undo the normalisation if that is optimal; without them, BatchNorm would permanently constrain every layer's activations to zero mean and unit variance, which is too restrictive.

At inference time there may be no mini-batch, so BatchNorm uses running statistics (exponential moving averages of μ_B and σ²_B accumulated during training) to normalise. This is why you must call model.eval() before inference: it switches BatchNorm from batch statistics to running statistics. Forgetting this is one of the most common and damaging bugs in deep learning practice.

Batch Normalisation: Forward Pass
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{(batch mean, } m = \text{batch size)}$$
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \quad \text{(batch variance)}$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \text{(normalise; } \epsilon \approx 10^{-5} \text{ for numerical stability)}$$
$$y_i = \gamma\,\hat{x}_i + \beta \quad \text{(scale and shift; } \gamma, \beta \text{ are learned)}$$
γ and β are initialised to 1 and 0; the network learns the optimal scale and shift during training.
Batch Normalisation Position and Effect on Layer Activations
[Figure: pipeline Linear (Z = XW + b) → Batch Norm (μ_B, σ²_B, x̂, γx̂ + β, with γ and β learned) → Activation (ReLU / GELU) → Next Layer. Pre-BN: distributions vary between layers. Post-BN: stable, roughly N(0,1), consistent across layers.]
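The four formulas can be checked against PyTorch's own layer. The sketch below assumes a freshly constructed `nn.BatchNorm1d` (so γ = 1 and β = 0) and a made-up input batch; it also shows that the same layer gives different outputs in eval mode, where running statistics replace batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4) * 3.0 + 5.0             # batch of 8, 4 features, shifted and scaled

bn = nn.BatchNorm1d(4)                         # gamma = 1, beta = 0 at initialisation
bn.train()
y_module = bn(x)                               # uses batch statistics

# The same computation written out by hand
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)             # biased variance, as in the formula
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight * x_hat + bn.bias

bn.eval()                                      # switch to running statistics
y_eval = bn(x)

print(torch.allclose(y_module, y_manual, atol=1e-5))
print(y_module.mean(dim=0), y_eval.mean(dim=0))
```

After a single training-mode call the running statistics have barely moved from their initial values, so the eval-mode output is far from normalised; in real training they converge over many batches.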
⚠ Common Pitfall: model.train() vs model.eval()

Forgetting to call model.eval() before inference causes BatchNorm to use the statistics of the inference batch itself (which may be size 1) instead of the running statistics accumulated during training. With batch size 1 on a fully connected layer, the batch mean equals the input, so the normalised output would always be zero (PyTorch's BatchNorm1d raises an error in this case rather than computing it), and with small batches in general the statistics are noisy and predictions are garbage. Always: model.train() during training, model.eval() during evaluation and inference.

Srivastava et al. (2014) introduced dropout as a computationally cheap approximation to training an ensemble of exponentially many networks. During each forward pass, every neuron is independently deactivated with probability p. The remaining (1−p) fraction of neurons process the input and update normally. At inference, all neurons are active, but since the network was trained with only a (1−p) fraction of neurons active on average, the outputs are scaled down by (1−p) to keep the expected activation magnitude consistent. In practice, inverted dropout is used: scale activations up by 1/(1−p) during training so no adjustment is needed at inference.

The theoretical justification has three complementary perspectives. The ensemble view: with N neurons, there are 2^N possible sub-networks; dropout samples a different one each forward pass, and inference approximates their average. The co-adaptation view: neurons cannot rely on specific other neurons being present, so they learn more independent, redundant features. The noise injection view: randomly zeroing neurons adds multiplicative noise, acting like a data augmentation that prevents the network from memorising specific training patterns.
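Inverted dropout fits in a few lines. The following is a minimal NumPy sketch; the function name `inverted_dropout` and the all-ones input are assumptions chosen so the effect on the mean activation is easy to see.

```python
import numpy as np

def inverted_dropout(a, p, rng, training=True):
    """Zero each unit with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return a                              # inference: identity, no rescaling needed
    mask = rng.random(a.shape) >= p           # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                      # hypothetical activations
train_out = inverted_dropout(a, p=0.5, rng=rng, training=True)
eval_out = inverted_dropout(a, p=0.5, rng=rng, training=False)

print(train_out.mean(), eval_out.mean())
```

During training each surviving unit is doubled (for p = 0.5), so the expected activation matches the inference-time value without any change to the inference code.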

Practical guidance: dropout rates of 0.1–0.2 work well for earlier layers; 0.3–0.5 for large fully connected layers. Standard dropout is rarely applied to convolutional feature maps (DropBlock is preferred there). In Transformer models, dropout is applied after attention and after the feed-forward sublayer, with a rate of 0.1 being standard. For very large models, lower dropout rates (0.05–0.1) are preferred as the model already has strong regularisation from scale.

Dropout: random deactivation during training, full network at inference
[Figure: left panel, TRAINING (p = 0.5 dropout): input layer feeding a hidden layer in which a random 50% of units are dropped on each forward pass, then output ŷ. Right panel, INFERENCE (no dropout): all hidden units h₁ to h₅ active, then output ŷ; inverted dropout applies the scaling during training.]
⚠ Common Pitfall: Dropout at Wrong Places

Do not apply standard Dropout after every layer indiscriminately. Applying it after BatchNorm can interfere with BN's running statistics. Applying it in convolutional layers often hurts performance (use DropBlock instead). Applying it at the output layer is always wrong. Standard rule: dropout in the fully connected classifier head only (or after transformer attention layers at p=0.1).

Stochastic Gradient Descent (SGD) updates each weight by subtracting a fraction of its gradient: θ ← θ − α∇L. Plain SGD oscillates badly in directions with high curvature (the narrow valleys common in deep loss landscapes) and moves too slowly in flat directions. SGD with Momentum adds a velocity term that accumulates gradient history, smoothing oscillations and accelerating through flat regions. It remains the preferred optimiser for training ResNets and CNNs on image classification.

Adam (Kingma & Ba, 2014) computes an adaptive learning rate per parameter: it tracks the first moment (mean of gradients) and second moment (mean of squared gradients) and divides the first by the square root of the second to scale each parameter's update independently. A parameter whose gradient has been consistently large gets a smaller effective step; a parameter with small, consistent gradients gets a larger step. This makes Adam much faster to converge on many problems and less sensitive to the global learning rate choice than plain SGD.

AdamW (Loshchilov & Hutter, 2019) fixes a subtle flaw in how Adam handles weight decay. In Adam, L2 regularisation (weight decay) is added to the gradient before the adaptive scaling, which means the actual weight penalty is scaled by the adaptive term and varies per parameter. AdamW decouples weight decay from the gradient update, applying it directly to the parameters: θ ← θ − αλθ (separately from the Adam gradient term). It is now the de facto default for training large language models and Transformers.

AdamW Update Rule
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \quad \text{(1st moment: mean of gradients)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \quad \text{(2nd moment: mean of squared gradients)}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad \text{(bias correction)}$$
$$\theta_t = \theta_{t-1} - \alpha\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right)$$
β₁ = 0.9 · β₂ = 0.999 · ε = 1e-8 · λ = weight decay (typical 1e-2) · α = learning rate
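One way to make the update rule concrete is to compute a single step by hand and compare it with `torch.optim.AdamW`. This is a sketch for the first step only (t = 1, moments starting at zero), with a toy loss chosen for illustration.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 1e-2

opt = torch.optim.AdamW([theta], lr=lr, betas=(beta1, beta2), eps=eps, weight_decay=wd)
theta0 = theta.detach().clone()               # parameters before the step

loss = (theta ** 2).sum()                     # toy loss
loss.backward()
g = theta.grad.detach().clone()
opt.step()                                    # PyTorch's AdamW update

# Manual first step (t = 1), moments initialised to zero
m = (1 - beta1) * g
v = (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** 1)                  # bias correction
v_hat = v / (1 - beta2 ** 1)
theta_manual = theta0 - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta0)

print(torch.allclose(theta.detach(), theta_manual, atol=1e-6))
```

Note that the decay term λθ is added outside the adaptive ratio, which is exactly the decoupling that distinguishes AdamW from Adam with an L2 penalty.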
Optimiser Trajectories: SGD vs Momentum vs Adam on ill-conditioned surface
[Figure: contour plot with a start point and a minimum (★). Trajectories: SGD (zig-zag oscillations), SGD + Momentum, Adam (adaptive per-parameter LR). Adam handles the elongated surface better and needs fewer steps. Elongated contours indicate an ill-conditioned surface, common in deep networks.]
Optimiser | Best For | Typical LR | Weight Decay | Notes
SGD | Legacy CNNs | 0.01–0.1 | via L2 penalty | Requires careful LR schedule
SGD + Momentum | CV fine-tuning, ResNets | 0.01–0.1 | via L2 | momentum=0.9 standard
Adam | Prototyping, NLP | 1e-4 to 3e-4 | Coupled with adaptive scaling; use AdamW | Default for quick experiments
AdamW | Transformers, LLMs | 1e-4 to 3e-4 | Decoupled (1e-2) | De facto standard for modern models

A fixed learning rate is rarely optimal throughout training. Early in training, large steps are desirable: the network is far from a good solution and can afford rough updates. Later in training, large steps overshoot the loss minimum and cause oscillation, so smaller steps are needed for fine-grained convergence. Learning rate schedules adjust the learning rate automatically over the course of training.

The warmup + cosine annealing schedule has become the dominant approach for Transformer training. For the first few percent of training steps (the "warmup"), the learning rate increases linearly from near-zero to the target learning rate. This protects the model from large, destabilising gradient updates at the start of training, when the parameters are random and gradients are noisy. After warmup, the learning rate follows a cosine curve from the peak down to a small minimum, providing a smooth, continuous decay that typically outperforms staircase schedules.

LR Schedules: Warmup+Cosine is the Transformer standard
[Figure: learning rate against training steps for four schedules: Constant, Step decay (CNN fine-tuning), Cosine annealing, and Warmup+Cosine (Transformer standard), with the end of warmup marked and cosine decay afterwards.]
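The warmup + cosine shape is simple enough to write as a plain function of the step index. The sketch below is one common formulation; the function name `warmup_cosine_lr` and the example values (1,000 steps, 50 warmup steps, peak 3e-4, floor 3e-5) are assumptions for illustration.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [
    warmup_cosine_lr(s, total_steps=1000, peak_lr=3e-4, warmup_steps=50, min_lr=3e-5)
    for s in range(1000)
]

print(schedule[0], schedule[49], schedule[500], schedule[-1])
```

In PyTorch the same function can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the value divided by the optimiser's base learning rate.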

Putting it all together: a complete PyTorch training loop that incorporates gradient clipping, the model.train()/eval() switch, a validation loop, best-model checkpointing, and a cosine learning rate schedule. Every line is intentional; understanding why each piece is there is as important as knowing what it does.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_epoch(model, loader, optimizer, criterion, device, clip_grad=1.0):
    model.train()                                   # dropout ON, BN uses batch stats
    total_loss, correct = 0.0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()                       # 1. clear grads
        logits = model(X)                           # 2. forward
        loss = criterion(logits, y)
        loss.backward()                             # 3. backprop
        nn.utils.clip_grad_norm_(model.parameters(), clip_grad)   # 4. clip
        optimizer.step()                            # 5. update
        total_loss += loss.item()
        correct += (logits.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

def evaluate(model, loader, criterion, device):
    model.eval()                                    # dropout OFF, BN uses running stats
    total_loss, correct = 0.0, 0
    with torch.no_grad():                           # disable grad tracking → saves memory
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            logits = model(X)
            total_loss += criterion(logits, y).item()
            correct += (logits.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

class MLP(nn.Module):
    """Minimal stand-in for an MLP(in, hidden, out) classifier."""
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))

    def forward(self, x):
        return self.net(x)

# Placeholder random data so the script runs end to end; swap in real datasets
train_loader = DataLoader(TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,))),
                          batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(torch.randn(128, 784), torch.randint(0, 10, (128,))),
                        batch_size=64)

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(784, 256, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val_loss, best_epoch = float('inf'), 0
for epoch in range(50):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    if val_loss < best_val_loss:                    # checkpoint the best model so far
        best_val_loss, best_epoch = val_loss, epoch
        torch.save(model.state_dict(), 'best_model.pt')
    if epoch % 5 == 0:
        print(f"Epoch {epoch:3d}: train={train_loss:.4f} val_acc={val_acc:.4f}")
Training vs Validation Loss: detecting overfitting and early stopping
[Figure: loss (0.0 to 2.5) against epoch (0 to 50). Training loss keeps decreasing; validation loss falls through the "good fit" zone, then rises once overfitting begins. The best model is saved at the validation minimum, just before the overfitting zone.]
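The "save best model, stop when validation loss stops improving" logic in the figure can be isolated into a small helper. This is a sketch in plain Python; the class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the validation-loss curve are all made up for the example.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience          # epochs to wait without improvement
        self.min_delta = min_delta        # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0           # improvement: reset the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation curve: improves, then rises as overfitting begins
val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.60, 0.66, 0.70, 0.75]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss, epoch):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In a real loop, the checkpoint would be written whenever `best_epoch` changes, and the saved weights restored after stopping.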

Beyond dropout and BatchNorm, a range of regularisation techniques are routinely used in modern deep learning. The right combination depends on the task, architecture, and dataset size. The table below summarises the most important ones with their use cases and PyTorch APIs.

Technique | Mechanism | When to Use | PyTorch API
L2 / Weight Decay | Add λ‖W‖² penalty to loss → shrinks weights | Always, with small λ (1e-4 to 1e-2) | weight_decay= in optimizer
Dropout | Randomly zero neurons during training | FC layers, Transformers (p=0.1–0.5) | nn.Dropout(p=0.3)
Batch Normalisation | Normalise activations per mini-batch | After linear/conv, before activation | nn.BatchNorm1d/2d
Layer Normalisation | Normalise across features (not batch) | Transformers; no batch-size dependency | nn.LayerNorm
Data Augmentation | Random transforms of training inputs | Image tasks (flip, crop, colour jitter) | torchvision.transforms
Early Stopping | Stop when validation loss stops improving | Always; monitor val_loss with patience | Manual or PyTorch Lightning
Label Smoothing | Soften hard 0/1 targets (in PyTorch: true class 1 − ε + ε/K, other classes ε/K) | Classification; prevents overconfidence | nn.CrossEntropyLoss(label_smoothing=0.1)
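As an example of one row in the table, the label-smoothing target used by `nn.CrossEntropyLoss` can be reproduced by hand: the one-hot target is mixed with a uniform distribution over the K classes. The logits below are made up for the check.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])     # hypothetical logits, K = 3 classes
target = torch.tensor([0])
eps, K = 0.1, logits.shape[1]

loss_builtin = nn.CrossEntropyLoss(label_smoothing=eps)(logits, target)

# Smoothed target: (1 - eps) * one_hot + eps / K
smoothed = torch.full((1, K), eps / K)
smoothed[0, target.item()] = 1.0 - eps + eps / K
loss_manual = -(smoothed * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(loss_builtin.item(), loss_manual.item())
```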

∑ Chapter 4.4 Summary: Training Deep Networks

  • He init for ReLU: W ~ N(0, 2/nᵢₙ), i.e. standard deviation √(2/nᵢₙ); prevents vanishing/exploding activations before training even starts
  • BatchNorm: normalise per mini-batch → stable training, higher LR tolerance, less sensitivity to initialisation; always call model.eval() at inference
  • Dropout: randomly drop a fraction p of neurons each forward pass → ensemble effect, prevents co-adaptation; use in FC layers at p=0.1–0.5
  • AdamW: Adam with decoupled weight decay; the de facto standard optimiser for Transformer and LLM training (β₁=0.9, β₂=0.999, λ=1e-2)
  • Warmup + cosine annealing: guard against early training instability, then smoothly decay the LR; standard for large-scale training runs
  • Training loop order: zero_grad → forward → loss → backward → clip → step; run evaluate() separately with model.eval() and torch.no_grad()
4.5
Chapter 4.5
Convolutional Neural Networks

Convolutional Neural Networks are not a minor variation on the MLP. They encode a powerful prior about visual data โ€” that meaningful patterns are local and translation-invariant โ€” directly into the architecture. This inductive bias, combined with weight sharing, reduces parameters by orders of magnitude while improving generalisation. The result transformed computer vision from hand-crafted features to end-to-end learning.

A standard 224ร—224 RGB image contains 224 ร— 224 ร— 3 = 150,528 individual pixel values. A single hidden layer of 1,024 neurons in a fully connected MLP requires 150,528 ร— 1,024 = 154 million weights โ€” for the first layer alone, before any useful representation has been learned. Scale this to the thousands of neurons in a real network and you have a parameter count that dwarfs the available training data, making the network impossible to train effectively.

The parameter count is only the first problem. A deeper issue is that the MLP treats every pixel as equally related to every other pixel โ€” it has no concept of spatial locality. A cat's eye in the top-left corner and a cat's eye in the bottom-right corner are unrelated to the MLP; it must learn to recognise them as independent patterns, requiring separate learned features for every possible position. Images have three structural properties the MLP ignores: local structure (nearby pixels are more related than distant ones), translation invariance (a cat is a cat wherever it is), and compositionality (parts compose into objects).

CNNs address all three with a single idea: weight sharing via convolution. Instead of connecting each pixel to each neuron, a CNN applies a small learned filter (e.g., 3×3) across the entire image. The same 27 weights (3×3×3 channels) are reused at every spatial position. This reduces the first layer's parameters from 154 million to 27 per filter (1,792 for 64 filters including biases), while each filter learns to detect the same feature (an edge, a colour gradient) wherever it appears.

MLP vs CNN: weight sharing cuts first-layer parameters by orders of magnitude
[Figure: an MLP connecting a 224×224×3 input to 1,024 neurons needs 154,140,672 weights in the first layer alone; a CNN slides one 3×3 filter (3×3×3 = 27 weights) over every position of the same input, roughly 5,700,000× fewer weights per filter.]

A convolution applies a small learnable filter, called a kernel, to an input feature map by sliding it across every position and computing a dot product (element-wise multiply, then sum) at each location. At position (i, j), the output value is the sum of all products between the kernel weights and the corresponding input patch. If the kernel has learned to detect horizontal edges, positions where horizontal edges are present produce large activations; other positions produce small ones. With 64 different kernels, you get 64 different feature maps, each detecting a different pattern.

Three hyperparameters control the output size. Kernel size K (typically 3×3 in modern networks): larger kernels see more context but use more parameters. Stride S: how many pixels the kernel jumps between applications; stride 2 halves the spatial resolution. Padding P: zeros added around the input. "Same" padding (P = (K−1)/2 for odd K at stride 1) preserves the input spatial size, which is standard for 3×3 convolutions.

The parameter count scales with kernel size and channel counts, not with image size. This is the core efficiency of CNNs. A 3×3 conv layer with 64 input channels and 128 output channels has 3 × 3 × 64 × 128 + 128 = 73,856 parameters regardless of whether the input is 32×32 or 512×512. This size-invariance is why the convolutional layers of a CNN trained on 224×224 images can be applied to other input resolutions at inference (a fully connected head still expects a fixed size unless it is preceded by global pooling).

Convolution Formulas
Output size: W_out = ⌊(W_in − K + 2P) / S⌋ + 1
Params per layer: K × K × C_in × C_out + C_out (bias)
K = kernel size · P = padding · S = stride · C_in/C_out = input/output channels
Example: 3×3 conv, 64→128 ch, no bias: 3×3×64×128 = 73,728 weights
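The two formulas are easy to check by hand. The sketch below is plain Python with illustrative function names (they are not part of PyTorch); it reproduces the figures quoted in this chapter.

```python
def conv_output_size(w_in: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """W_out = floor((W_in - K + 2P) / S) + 1."""
    return (w_in - k + 2 * padding) // stride + 1


def conv_param_count(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """K * K * C_in * C_out weights, plus one bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)


same_size = conv_output_size(224, k=3, stride=1, padding=1)  # 'same' padding
halved = conv_output_size(224, k=3, stride=2, padding=1)     # stride 2
params_64_128 = conv_param_count(3, 64, 128)                 # with bias
first_conv = conv_param_count(3, 3, 64)                      # RGB input, 64 filters
mlp_first_layer = 224 * 224 * 3 * 1024                       # dense layer weights

print(same_size, halved, params_64_128, first_conv, mlp_first_layer)
```

Changing the first argument of `conv_output_size` changes the output size but never the parameter count, which is the size-invariance described in the text.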
Convolution: sliding a 3×3 filter across the input to produce a feature map
[Figure: a 5×5 input with rows (1 2 0 1 0), (3 1 2 0 1), (0 1 1 2 0), (2 0 1 3 1), (1 2 0 1 2) is convolved with a 3×3 kernel with rows (0 1 0), (0 1 0), (0 1 0), which sums the middle column of each patch and so responds to vertical lines. At patch (0,0) the result is 2 + 1 + 1 = 4. With no padding and stride 1, W_out = (5 − 3 + 0)/1 + 1 = 3, giving the 3×3 output (4 3 3), (2 4 5), (3 2 6). The same 9 kernel weights are used at all 9 positions: this is weight sharing.]
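To make the sliding-window arithmetic concrete, here is a minimal pure-Python sketch that applies the figure's 3×3 middle-column kernel to its 5×5 input. The helper `conv2d_valid` is an illustrative name, and it assumes a square input, stride 1 and no padding. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Dot product between the kernel and the patch whose top-left corner is (i, j).
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 0, 1, 0],
    [3, 1, 2, 0, 1],
    [0, 1, 1, 2, 0],
    [2, 0, 1, 3, 1],
    [1, 2, 0, 1, 2],
]
kernel = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 3]
# [2, 4, 5]
# [3, 2, 6]
```

Every output value uses the same nine kernel weights; only the patch changes.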
import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride, padding)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)  # 'same' padding

# Parameter count: 3×3×3×64 + 64 = 1,792 (vs 154M for MLP on 224×224×3)
params = sum(p.numel() for p in conv.parameters())

x = torch.randn(1, 3, 224, 224)  # batch=1, C=3, H=224, W=224
out = conv(x)                    # shape: (1, 64, 224, 224), same spatial size
print(f"Input: {x.shape}")       # [1, 3, 224, 224]
print(f"Output: {out.shape}")    # [1, 64, 224, 224]
print(f"Params: {params}")       # 1,792

# Stride=2 halves spatial resolution
conv_s2 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
out2 = conv_s2(out)
print(f"Stride=2 output: {out2.shape}")  # [1, 128, 112, 112]

After convolution extracts local features, the spatial dimensions of the feature maps are often larger than necessary, and carrying large feature maps through many layers is expensive. Pooling layers reduce the spatial size while retaining the most important information. Max pooling (the dominant choice) partitions the feature map into non-overlapping windows and takes the maximum value in each. With a 2×2 window and stride 2, max pooling halves both height and width, reducing the number of activations by 4× while making features more invariant to small spatial shifts.

The intuition behind max pooling: each feature map cell contains an activation measuring "how strongly is this feature present at this position?" The maximum within a 2×2 region answers "was this feature present anywhere in this region?", a coarser, position-invariant question that is still useful for recognition. An eye is an eye whether it is 2 pixels to the left or right. Max pooling discards that 2-pixel difference.

Global Average Pooling (GAP) is a widely used alternative: instead of pooling 2×2 regions, it averages the entire spatial extent of each channel into a single scalar. Applied after the last convolutional block, GAP converts a [B, C, H, W] tensor into [B, C], replacing the large fully-connected layers that were responsible for most parameters in early CNN architectures. ResNet and subsequent models use GAP as the bridge between convolutional features and the classification head.

Max Pooling 2×2: halves each spatial dimension
[Figure: a 4×4 input with rows (1 3 5 6), (2 4 1 2), (3 2 1 4), (1 3 2 1) passes through 2×2 max pooling with stride 2. Each non-overlapping window keeps its maximum: max(1,3,2,4)=4, max(5,6,1,2)=6, max(3,2,1,3)=3, max(1,4,2,1)=4, giving the 2×2 output (4 6), (3 4). Benefits: 4×4 becomes 2×2 (75% fewer activations); tolerant to small shifts within a window; retains the strongest activation; no learned parameters. In PyTorch: nn.MaxPool2d(kernel_size=2, stride=2).]
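Both pooling operations fit in a few lines of plain Python. The sketch below runs 2×2 max pooling and global average pooling on the 4×4 grid from the max-pooling figure; the function names are illustrative, and a single channel with even height and width is assumed.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling with stride 2 on a 2D list."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            window = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def global_average_pool(feature_map):
    """Average every value of one channel into a single scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 3, 5, 6],
    [2, 4, 1, 2],
    [3, 2, 1, 4],
    [1, 3, 2, 1],
]
pooled = max_pool_2x2(fmap)
gap = global_average_pool(fmap)
print(pooled)  # [[4, 6], [3, 4]]
print(gap)     # 2.5625
```

Max pooling keeps a coarse spatial layout (2×2 here), whereas global average pooling collapses the whole channel to one number, which is why it can feed a classifier head directly.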

A typical CNN follows a regular pattern: alternating convolution blocks and pooling layers, progressively reducing the spatial dimensions while increasing the number of feature channels. The spatial compression concentrates local features into increasingly compact representations. The channel expansion gives the network more "vocabulary" for describing what it sees. The final stage converts the 3D feature tensor into a class prediction via either a series of fully-connected layers or Global Average Pooling.

The hierarchical feature learning in CNNs is perhaps their most important property. Visualisation studies show that early layers (Layers 1–2) learn to detect simple patterns: oriented edges, colour gradients, and textures. Middle layers (Layers 3–4) detect parts: corners, curves, texture patches that resemble scales, fur, or brickwork. Late layers detect whole objects or object parts: faces, wheels, paws. This hierarchy emerges from training alone; it is not hand-crafted. The network arrives at it because gradient descent on the classification loss rewards such features.

CNN Architecture: spatial compression + feature depth expansion
[Figure: AlexNet-style layout. Feature maps go from the 224×224×3 input through 55²×64, 27²×128, 13²×256 and 7²×512 conv + ReLU / max-pooling stages, then GAP, FC 4096, FC 4096 and a 1000-way softmax. Spatial dimensions shrink while channel depth grows.]
CNN Feature Hierarchy: from pixels to objects through learned abstractions
[Figure: Layer 1, edges (horizontal, vertical, diagonal, colour gradient) → Layer 2, textures (corner, grid texture, curve, stripes) → Layer 3, parts (eye, wheel, ear, snout) → Layer 4, objects (dog, face, car). Abstraction and receptive field both increase from left to right.]

By 2014, the empirical pattern seemed clear: deeper networks should perform better, because more layers can represent more complex functions. Yet simply stacking more plain layers eventually produced worse results, not just on validation but on the training set; He et al. reported, for example, a 56-layer plain network with higher training error than a 20-layer one on CIFAR-10. This degradation problem was not overfitting. It meant the optimiser struggled to train very deep plain networks, even when the additional capacity should have helped.

He et al.'s insight was deceptively simple. If a shallower network achieves some accuracy A, then a deeper network that copies the shallower network's layers and sets all additional layers to the identity (output = x) should achieve at least accuracy A. But gradient descent does not easily find the identity mapping through a stack of non-linear layers: driving the weights toward zero, their natural resting point under small initialisation and weight decay, produces an output of zero, not the input x. The residual block makes this easy by reformulating the learning objective: instead of learning the desired mapping H(x) directly, the block learns the residual F(x) = H(x) − x. The shortcut connection adds the original input back: output = F(x) + x. Now learning the identity is trivial: just drive F(x) to 0.

The practical impact was enormous. An ensemble of ResNets with up to 152 layers (2015) achieved 3.57% Top-5 error on ImageNet, below the commonly cited human estimate of about 5%. The skip connection also dramatically improves gradient flow: gradients can propagate directly from the loss to any earlier layer through the shortcut path, bypassing the multiplicative chain that causes vanishing gradients. This is why skip connections appear in virtually every modern architecture; Transformers include them as a core component.

Residual Block
Standard: y = F(x, W)        (learn the full mapping)
Residual: y = F(x, W) + x    (learn the change only)
If F(x, W) = 0: y = x        (identity, trivial to learn)
F(x, W) = two conv layers with BN and ReLU between them
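The "identity for free" argument can be checked on a toy example. The sketch below is a simplification: it uses small dense layers in place of convolutions, drops batch norm and the final ReLU, and uses illustrative helper names. It shows only that zero weights erase the input in a plain block but pass it through unchanged in a residual block.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def plain_block(x, W1, W2):
    """y = F(x): two linear layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))


def residual_block(x, W1, W2):
    """y = F(x) + x: the same F, plus the shortcut."""
    return [f + xi for f, xi in zip(plain_block(x, W1, W2), x)]


x = [1.5, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zeros, zeros)        # all zeros: the input is lost
residual_out = residual_block(x, zeros, zeros)  # equals x: identity for free
print(plain_out, residual_out)
```

With the shortcut in place, extra layers start out as harmless pass-throughs and only need to learn a correction when one is useful.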
Residual Block: skip connections address the deep-network degradation problem
[Figure: input x passes through Conv 3×3 → BN → ReLU → Conv 3×3 → BN to give F(x, W), the learned residual; the shortcut adds x, and a final ReLU is applied to F(x) + x. Why it works: if the optimal change is 0, setting F(x) = 0 gives y = x; gradients flow directly back through the shortcut. ResNet-50 stacks residual blocks into a 50-layer network with roughly 25.6M parameters.]
ResNet vs Plain CNN: skip connections enable depth without degradation
[Figure: error (0–30%) against network depth (0–152 layers), with AlexNet, VGG, Inception, ResNet-50 and ResNet-152 marked. The plain-CNN curve gets worse with depth; the ResNet curve keeps improving.]
import torch
import torch.nn as nn
import torchvision.models as models


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                    # save input for skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity            # F(x) + x, the residual connection
        out = self.relu(out)
        return out


# Use pre-trained ResNet-50 from torchvision (downloads weights on first use)
resnet = models.resnet50(weights='IMAGENET1K_V2')
# Replace classifier head for custom task (e.g. 10 classes)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)
resnet.eval()                           # inference mode for BatchNorm

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(f"ResNet-50 output: {resnet(x).shape}")  # [1, 10]
⚠ Common Pitfall: Mismatched Skip Connection Dimensions

The skip connection adds x directly to F(x). This requires x and F(x) to have the same shape. When a residual block changes the number of channels or uses stride > 1 (to downsample), the shortcut must include a 1×1 convolution (called a "projection shortcut") to match dimensions: self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride). Forgetting this causes a shape mismatch error at the addition step.
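The mismatch can be seen from shape arithmetic alone. The sketch below tracks (C, H, W) shapes through a hypothetical downsampling block (64 → 128 channels, stride 2 on a 56×56 input; the sizes are chosen for illustration) and compares the identity shortcut with a 1×1 projection shortcut. `conv_shape` is an illustrative helper, not a PyTorch function.

```python
def conv_shape(shape, c_out, k, stride=1, padding=0):
    """Output (C, H, W) of a conv layer applied to an input of shape (C, H, W)."""
    _, h, w = shape
    h_out = (h - k + 2 * padding) // stride + 1
    w_out = (w - k + 2 * padding) // stride + 1
    return (c_out, h_out, w_out)


x_shape = (64, 56, 56)  # hypothetical block input: 64 channels, 56x56

# Main path F(x): 3x3 conv with stride 2 (downsample), then 3x3 conv with stride 1.
f_shape = conv_shape(x_shape, c_out=128, k=3, stride=2, padding=1)
f_shape = conv_shape(f_shape, c_out=128, k=3, stride=1, padding=1)

# Projection shortcut: 1x1 conv with the same stride and output channels.
shortcut_shape = conv_shape(x_shape, c_out=128, k=1, stride=2, padding=0)

identity_ok = (x_shape == f_shape)            # False: plain x cannot be added
projection_ok = (shortcut_shape == f_shape)   # True: shapes now match
print(x_shape, f_shape, shortcut_shape, identity_ok, projection_ok)
```

In a real block the projection is usually followed by batch normalisation, which does not change the shape.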

The history of CNNs on ImageNet is a story of architectural innovations compounding: each milestone introduced one key idea that is now ubiquitous. AlexNet proved GPU-trained deep networks could work. VGG showed that depth built from only 3×3 convolutions was sufficient and more principled. Inception introduced parallel multi-scale filters. ResNet introduced skip connections enabling extreme depth. EfficientNet used Neural Architecture Search to find a baseline network and a compound scaling rule to scale depth, width, and resolution jointly. Vision Transformers (ViT) ultimately replaced convolutions entirely with attention, showing that the convolutional inductive bias, while useful, is not necessary given enough data.

| Architecture | Year | Depth (layers) | Key Innovation | ImageNet Top-5 error |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | First successful CNN for digits | n/a (MNIST era, predates ImageNet) |
| AlexNet | 2012 | 8 | GPU training, ReLU, Dropout | 15.3% (about 11 points ahead of the runner-up) |
| VGG-16/19 | 2014 | 16–19 | Only 3×3 convolutions throughout | 7.3% |
| GoogLeNet/Inception | 2014 | 22 | Inception modules, global avg pool | 6.7% |
| ResNet-50/152 | 2015 | 50–152 | Residual skip connections | 3.57% (ensemble), below the ~5% human estimate |
| DenseNet-121 | 2017 | 121–264 | Dense connections (all-to-all) | 3.46% |
| EfficientNet-B7 | 2019 | Variable | NAS baseline + compound scaling | 2.9% |
| Vision Transformer (ViT) | 2020 | Variable | Pure self-attention, no convolution | 2.0%+ |

The receptive field of a neuron in layer l is the region of the original input image that can influence that neuron's activation. A neuron in the first conv layer with a 3×3 kernel sees a 3×3 region. A neuron in the second conv layer sees a 5×5 region (each of its 9 input cells saw a 3×3 region, overlapping to cover 5×5). With each additional 3×3 conv layer at stride 1, the receptive field grows by 2 in each dimension. After k layers of 3×3 convolutions: receptive field = 2k + 1 pixels per side.

Pooling layers and strided convolutions multiply the receptive field growth. After a 2× pooling layer, subsequent convolutional layers grow the receptive field twice as fast. This is why deep CNNs develop neurons in later layers that respond to large, complex objects: they have receptive fields spanning the entire image. The final convolutional layer in ResNet-50 has a theoretical receptive field of 483×483, larger than the 224×224 input, so every output cell can in principle draw on the full input (the effective receptive field, where most of the influence concentrates, is typically smaller).
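The growth rule can be written as a short recurrence: each layer adds (kernel − 1) × jump pixels, where the jump is the product of all earlier strides. The sketch below is a minimal pure-Python version with an illustrative function name; layers are described as (kernel, stride) pairs and padding is ignored because it does not affect the size of the receptive field.

```python
def receptive_field(layers):
    """Receptive field (in input pixels per side) after a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


conv3 = (3, 1)  # 3x3 convolution, stride 1
pool2 = (2, 2)  # 2x2 pooling, stride 2

rf_1 = receptive_field([conv3])                            # 3
rf_3 = receptive_field([conv3] * 3)                        # 7 = 2*3 + 1
rf_pool = receptive_field([conv3, pool2, conv3])           # 8
rf_pool_2 = receptive_field([conv3, pool2, conv3, conv3])  # 12
print(rf_1, rf_3, rf_pool, rf_pool_2)
```

After the pooling layer each extra 3×3 convolution adds 4 pixels instead of 2, which is the doubling of the growth rate described in the text.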

Receptive Field Growth

  • 1 conv (3×3): RF = 3×3
  • 2 convs (3×3): RF = 5×5
  • 3 convs (3×3): RF = 7×7
  • k convs: RF = (2k+1)×(2k+1)
  • Pooling 2×: doubles growth rate
Why Large RF Matters

  • Small RF → misses global context
  • Object recognition needs full object
  • Dilated conv: large RF without depth
  • Attention (ViT): global RF from layer 1
  • ResNet-50 RF > input size ✓
PyTorch CNN in 10 Lines

import torch.nn as nn
num_classes = 10  # example value; set to your task
model = nn.Sequential(
 nn.Conv2d(3, 64, 3, padding=1),
 nn.BatchNorm2d(64),
 nn.ReLU(),
 nn.MaxPool2d(2),
 nn.AdaptiveAvgPool2d(1),
 nn.Flatten(),
 nn.Linear(64, num_classes))

∑ Chapter 4.5 Summary: Convolutional Neural Networks

  • Convolution: slide a learned filter across input → weight sharing = same feature detector everywhere; 154M MLP weights → 27 weights per 3×3×3 filter in the first layer
  • K×K×C_in×C_out weights per layer (plus C_out biases); size-invariant: same params for 32×32 or 512×512 inputs
  • CNN hierarchy: edges (L1) → textures (L2) → parts (L3) → objects (L4+), all learned automatically
  • Max pooling: keep maximum per 2×2 window → translation invariance + spatial compression; Global Average Pooling replaces FC layers
  • ResNet skip connections: y = F(x) + x; residual learning addresses degradation, enables 150+ layer networks, and eases gradient flow
  • ResNet (2015) → EfficientNet → ViT (2020): Transformers now rival CNNs on vision; attention replaces convolution with a global receptive field from layer 1
4.6
Chapter 4.6
Recurrent Neural Networks & LSTMs

The RNN and LSTM are not obsolete: they are the conceptual bedrock of sequence modelling. Understanding why RNNs struggle with long-range dependencies, and how LSTMs address this with gating, is the essential preparation for understanding why the Transformer replaced them. Many of the ideas behind attention mechanisms trace directly back to this chapter.

An MLP processes each input independently. Feed it the word "bank" and it produces a prediction, but it has no way to know whether the previous words were "river" or "money". CNNs add local spatial context via convolution windows, but they still process a fixed-size input with no persistent state across positions. Neither architecture is suited to data where order matters and length varies: text, speech, time series, video.

The semantic difference between "The dog bit the man" and "The man bit the dog" lies entirely in word order: same vocabulary, opposite meaning. Processing each word independently destroys this information. A model needs to carry a memory of what it has already seen as it processes each new token. This is the core motivation for the Recurrent Neural Network: maintain a hidden state hₜ that accumulates information from all previous time steps.

RNN vs MLP: hidden state enables sequential memory
[Figure: left, an MLP processes "The", "dog", "bit" with three independent copies, so each token has no context; right, an RNN processes the same tokens in order while hidden states h0 → h1 → h2 carry context forward.]

The vanilla RNN cell takes the current input xₜ and previous hidden state hₜ₋₁, applies a weighted sum, and passes the result through tanh. The same weight matrices Wₕ and Wₓ are used at every single time step: weight sharing across time, analogous to how a CNN shares weights across space. Unrolling through T steps creates an effective computational graph T layers deep: a sequence of 100 words means 100 effective layers and a severe vanishing-gradient risk.

Vanilla RNN
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = W_y·hₜ + b_y
Same Wₕ, Wₓ, W_y at ALL time steps · effective depth = sequence length T
RNN Unrolled: same weights at every time step, hidden state flows forward
[Figure: the RNN unrolled over t=1..5 for the inputs "The", "cat", "sat", "on", "the". The same weights W are used at every step; hidden states h0 → h1 → h2 → h3 → h4 pass from step to step and each step emits an output y1..y5. Unrolled depth = T steps, with the same vanishing-gradient risk as a T-layer deep network.]
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
x = torch.randn(8, 30, 64)           # (batch, seq_len, features)
output, h_n = rnn(x)
print(f"output: {output.shape}")     # [8, 30, 128], all hidden states of the top layer
print(f"h_n: {h_n.shape}")           # [2, 8, 128], final hidden state of each of the 2 layers

Backpropagation Through Time (BPTT) applies backprop to the unrolled RNN graph. To update Wₕ, gradients must flow from the loss at step T back through every time step, multiplying by the Jacobian ∂hₜ/∂hₜ₋₁ = Wₕᵀ · diag(tanh’(·)) at each step. Across T steps this is a product of T such matrices. Because tanh’ ≤ 1, if the largest singular value of Wₕ is below 1 the product shrinks exponentially: vanishing gradients. If it is above 1 the product can grow exponentially: exploding gradients. In practice, vanilla RNNs struggle to learn dependencies beyond roughly 10–20 steps.
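The exponential shrinkage is easiest to see in one dimension, where the Jacobian is just the number w · tanh’(aₜ). The sketch below is a toy scalar RNN in pure Python; the weight values and the constant input sequence are arbitrary choices for illustration. It multiplies the per-step factors over 20 steps.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * tanh'(a_t) = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return grad


inputs = [0.5] * 20
for w in (0.5, 0.9):
    g = gradient_through_time(w, inputs)
    print(f"w={w}: |dh_20/dh_0| = {abs(g):.3e}  (upper bound w**20 = {w ** 20:.3e})")

# A constant per-step factor of 0.25 shrinks just as quickly:
figure_factors = [0.25 ** k for k in range(1, 6)]
print(figure_factors)  # [0.25, 0.0625, 0.015625, 0.00390625, 0.0009765625]
```

Even with w close to 1 the tanh derivative pulls each factor below w, so the gradient reaching the earliest steps is far smaller than the bound w**T suggests.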

BPTT Vanishing Gradients: early time steps receive no learning signal
[Figure: gradient magnitude flowing back from the loss at t=6 towards t=1: full gradient at the loss, then 0.25, 0.06, 0.016, 0.004, ~0 (no learning at the earliest steps). With ‖Wₕ‖ < 1 giving a factor of 0.25 per step, the gradient is essentially zero after 5 steps. Solution: the LSTM cell-state gradient highway (Section 4.6.4).]
⚠ Common Pitfall: Truncated BPTT

Training on very long sequences with full BPTT is expensive (memory scales linearly with sequence length). Truncated BPTT splits sequences into chunks and backpropagates only within each chunk, carrying the hidden state forward without gradients. In PyTorch, detach the hidden state between chunks: h = h.detach() before each new chunk.

Hochreiter & Schmidhuber (1997) added a cell state cₜ, a horizontal "highway" running through all time steps with only element-wise operations. Gradients flowing through cₜ are multiplied element-wise by learned gate values (not by weight matrices), which greatly reduces vanishing. Three sigmoid gates ∈ (0,1) control information flow: forget gate fₜ decides what to erase from cₜ₋₁ (this gate was added by Gers et al. in 2000); input gate iₜ decides what new candidate c̃ₜ to write; output gate oₜ decides what to expose as hₜ.

The cell state update is the key: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ uses only addition and element-wise multiplication, with no matrix multiplication. The forget gate can learn to stay near 1 for important long-range information, creating an almost unimpeded gradient path over hundreds of steps.

LSTM Equations
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)    (Forget gate)
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)    (Input gate)
c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)    (Candidate values)
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ    (Cell state, the GRADIENT HIGHWAY)
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)    (Output gate)
hₜ = oₜ ⊙ tanh(cₜ)    (Hidden state output)
σ = sigmoid ∈ (0,1) · ⊙ = element-wise multiply · [h,x] = concatenation
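The six equations translate line by line into code. The sketch below is a scalar LSTM step in pure Python (hidden size 1, so every W·[h, x] + b reduces to w_h·h + w_x·x + b). The parameter values are hand-picked for illustration, not learned: the forget gate is pinned near 1 and the input gate near 0 to show the cell state carrying a value across 30 unrelated inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state and cell state.

    `p` maps each gate name to (w_h, w_x, b), i.e. W·[h, x] + b written out for scalars.
    """
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("f"))          # forget gate
    i = sigmoid(pre("i"))          # input gate
    c_tilde = math.tanh(pre("c"))  # candidate values
    c = f * c_prev + i * c_tilde   # cell state update (element-wise only)
    o = sigmoid(pre("o"))          # output gate
    h = o * math.tanh(c)           # hidden state output
    return h, c


# Hypothetical hand-picked parameters: forget gate pinned open, input gate pinned shut.
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "c": (0.5, 0.5, 0.0), "o": (0.0, 0.0, 0.0)}

h, c = 0.0, 0.8                  # a value stored in the cell state
for x in [1.0, -2.0, 3.0] * 10:  # 30 steps of unrelated input
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))               # stays close to the stored 0.8
```

A trained LSTM learns when to open and close these gates; the point here is only that the additive update lets information (and gradients) survive many steps when the gates allow it.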
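LSTM Cell: three gates control what to forget, remember, and output
[Figure: LSTM cell. The cell state runs along the top from cₜ₋₁ to cₜ as a highway along which gradients flow with minimal decay. hₜ₋₁ and xₜ feed all gates: the forget gate fₜ (σ) multiplies cₜ₋₁; the input gate iₜ (σ) multiplies the tanh candidate c̃ₜ before it is added to the cell state; the output gate oₜ (σ) multiplies tanh(cₜ) to produce hₜ.]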
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
               batch_first=True, dropout=0.2)
x = torch.randn(8, 50, 64)
output, (h_n, c_n) = lstm(x)
print(f"output: {output.shape}")  # [8, 50, 128]
print(f"h_n: {h_n.shape}")        # [2, 8, 128]
print(f"c_n: {c_n.shape}")        # [2, 8, 128]

# Truncated BPTT: detach BOTH states between chunks
h = (h_n.detach(), c_n.detach())  # stops the graph growing across chunks (avoids OOM)

# Bidirectional LSTM
bilstm = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
out, _ = bilstm(x)
print(f"BiLSTM: {out.shape}")     # [8, 50, 256] = 128x2 (fwd + bwd)
⚠ Common Pitfall: Forgetting to Detach Cell State

When processing long sequences in chunks, detach both h_n AND c_n: h = (h_n.detach(), c_n.detach()). Forgetting to detach c_n keeps the computation graph alive across chunks, causing unbounded memory growth until OOM. This is one of the most common LSTM training bugs.

Cho et al. (2014) introduced the GRU as a simplified LSTM with only 2 gates: an update gate zₜ (blend old vs new hidden state) and a reset gate rₜ (how much past state to use for the candidate). There is no separate cell state, just one hidden vector. Fewer parameters, faster training, often competitive with LSTM.

GRU Equations
zₜ = σ(Wz·[hₜ₋₁, xₜ])    (Update gate)
rₜ = σ(Wr·[hₜ₋₁, xₜ])    (Reset gate)
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])    (Candidate hidden state)
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ    (Final: blend old and new)
2 gates vs LSTM’s 3 · no separate cell state · fewer parameters · use as first choice
| LSTM | GRU |
|---|---|
| 3 gates: forget, input, output | 2 gates: update, reset |
| Separate cell state + hidden state | Single hidden state |
| More parameters per layer | Fewer parameters, faster training |
| Better for very long sequences | Competitive on most tasks |
| Standard NLP 2014–2018 | Use first; switch to LSTM if needed |
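The "fewer parameters" claim follows from counting weight blocks: an LSTM has four transforms of the form W·[h, x] + b (three gates plus the candidate) and a GRU has three. The sketch below counts them for one layer, assuming one bias vector per block (the GRU equations in this chapter omit biases for brevity). The function name is illustrative.

```python
def gated_rnn_params(input_size: int, hidden_size: int, n_blocks: int) -> int:
    """Parameters for `n_blocks` transforms of the form W·[h, x] + b.

    Each block has a (hidden_size x (hidden_size + input_size)) matrix plus a bias.
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block


input_size, hidden_size = 64, 128
lstm_params = gated_rnn_params(input_size, hidden_size, n_blocks=4)  # f, i, o, candidate
gru_params = gated_rnn_params(input_size, hidden_size, n_blocks=3)   # z, r, candidate
print(lstm_params, gru_params, gru_params / lstm_params)  # 98816 74112 0.75
```

PyTorch's nn.LSTM and nn.GRU keep two bias vectors per block, so their reported totals are slightly higher, but the 3:4 ratio is the same.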

Seq2Seq (Sutskever et al. 2014) combined two RNNs: an encoder compressing the input into a fixed context vector, and a decoder generating output conditioned on it. The fundamental flaw: the entire input, regardless of length, is squeezed into one fixed-size vector. Bahdanau et al. (2015) fixed this with attention: at each decoder step t, compute a weighted sum over ALL encoder states, cₜ = ∑ᵢ αₜ,ᵢ · hᵢ (here cₜ is the attention context vector, not the LSTM cell state). This is the direct predecessor of Transformer self-attention in Ch 4.7.
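A context vector of this kind is just a softmax-weighted average of encoder states. The sketch below uses dot-product scoring as a simplification (Bahdanau et al. scored each pair with a small learned feed-forward network) and made-up 2-dimensional encoder states; all names and values are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states, weighted by similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return alphas, context


# Hypothetical 2-d encoder states for "I", "love", "Paris", "!"
encoder_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [0.2, 0.1]]
decoder_state = [0.0, 1.5]  # decoder is about to emit "Paris"

alphas, context = context_vector(decoder_state, encoder_states)
print([round(a, 2) for a in alphas], [round(c, 2) for c in context])
```

Because the weights are recomputed at every decoder step, the decoder can look at a different part of the input each time instead of relying on one fixed summary.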

Seq2Seq with Bahdanau Attention: decoder attends to relevant encoder states
[Figure: the encoder reads "I", "love", "Paris", "!" and produces states h1..h4; the decoder emits "J'aime", "Paris", "!". While generating "Paris" the attention weights over h1..h4 are 0.05, 0.10, 0.75, 0.10, so the decoder attends mainly to h3. Context vector: cₜ = ∑ᵢ αₜ,ᵢ · hᵢ. Direct ancestor of the Ch 4.7 Transformer.]

The attention mechanism did not replace the RNN in 2015; it made the RNN dramatically better. It took Vaswani et al. (2017) to ask: "What if we remove the RNN and use attention exclusively?" The answer was the Transformer. Chapter 4.7 completes this story.

∑ Chapter 4.6 Summary: RNNs & LSTMs

  • RNN: hidden state hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b); same weights each step, memory flows forward through hₜ
  • BPTT: gradients multiply through T Jacobians → vanishing/exploding for long sequences (practical limit ~10–20 steps vanilla RNN)
  • LSTM: cell state cₜ = gradient highway; 3 gates (forget, input, output) control what to erase, write, expose
  • LSTM key: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ; only element-wise ops on the gradient path → far less vanishing through the cell state
  • GRU: 2 gates (update, reset), single hidden state, fewer parameters — competitive with LSTM; use first
  • Seq2Seq + Bahdanau attention: decoder attends to all encoder states — solved the bottleneck; direct ancestor of Transformer self-attention (Ch 4.7)
4.7
Chapter 4.7
The Transformer Architecture

“Attention Is All You Need” (Vaswani et al., 2017) is arguably the most consequential AI paper of the past decade. It removed recurrence entirely and showed that attention alone, applied in parallel across all tokens, could outperform the strongest RNN-based translation systems of the time while training faster. Almost every major language model since 2018 is built on this architecture. Understanding it fully is not optional.

The RNN’s fundamental flaw is its sequential nature. To process token t, you must first finish token t−1. This means training on a sequence of 10,000 tokens requires 10,000 sequential steps, and hardware parallelism cannot shorten that chain along the time axis. Training a GPT-3-scale model (which processes sequences of 2,048 tokens) with an RNN would be impractically slow. The Transformer removes this constraint during training: all tokens are processed simultaneously, turning a sequential problem into parallel matrix multiplications that GPUs excel at.

The second flaw is information decay. Even with LSTM gating, information from 500 tokens ago is weakly represented in the current hidden state. In contrast, the Transformer’s attention mechanism creates a direct path between any two tokens regardless of distance. The word “it” 300 tokens after “the cat” can attend directly to “cat” with no intermediate steps — the path length is always 1. This is why Transformers handle long documents, code files, and entire books in ways that RNNs fundamentally cannot.

| RNN Limitations | Transformer Solutions |
|---|---|
| Sequential computation: O(n) sequential steps required | Fully parallel: all tokens processed simultaneously |
| Vanishing gradients across long sequences | Direct attention: any token to any token |
| Fixed-size context bottleneck | Full context at every layer |
| Maximum path length = n steps | Constant path length = 1 step |

Self-attention is the mechanism that allows each token to gather information from all other tokens in the sequence. For each token, three vectors are computed by applying learned linear projections: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”). The attention weight between token i and token j is computed as the dot product of token i’s Query with token j’s Key, scaled by √dₖ (to prevent the dot products from growing large and saturating the softmax), then normalised via softmax across all tokens. The output for token i is the weighted sum of all Value vectors.

The scaling by √dₖ is critical. If the components of a query and a key are independent with zero mean and unit variance, their dot product has mean 0 and variance dₖ, so for dₖ=64 its typical magnitude (standard deviation) is √64 = 8. Without scaling, scores of this size push the softmax towards a near one-hot output where gradients are close to zero. Dividing by √dₖ = 8 brings the variance back to 1 and keeps gradients healthy. This is why the formula includes the √dₖ denominator.
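The variance argument can be checked with a quick simulation. The sketch below draws random queries and keys with standard-normal components (an assumption standing in for well-initialised projections) and compares the spread of raw and scaled dot products using only the standard library.

```python
import math
import random
import statistics

random.seed(0)
d_k = 64
n_samples = 2000


def random_vector(dim):
    """Components drawn independently from a standard normal (mean 0, variance 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


raw = []
for _ in range(n_samples):
    q, k = random_vector(d_k), random_vector(d_k)
    raw.append(sum(a * b for a, b in zip(q, k)))

scaled = [s / math.sqrt(d_k) for s in raw]

print(f"raw    : mean={statistics.mean(raw):+.2f}  std={statistics.pstdev(raw):.2f}")
print(f"scaled : mean={statistics.mean(scaled):+.2f}  std={statistics.pstdev(scaled):.2f}")
```

The raw scores should show a standard deviation near √64 = 8 and the scaled scores near 1, whatever value of dₖ is chosen.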

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Step 1: Project inputs: Q = XW^Q, K = XW^K, V = XW^V
Step 2: Compute scores: S = QKᵀ / √dₖ    (shape: n×n attention matrix)
Step 3: Normalise: A = softmax(S)    (row-wise, each row sums to 1)
Step 4: Aggregate: Output = A·V    (weighted sum of values)
Self-Attention: Q, K, V projections and attention weight computation
[Figure: the tokens “The”, “cat”, “sat”, “on” are projected by W^Q, W^K, W^V into Q₁…Q₄, K₁…K₄, V₁…V₄. The normalised attention weights softmax(QKᵀ/√dₖ) form a 4×4 matrix with rows (.42 .22 .18 .18), (.12 .65 .10 .13), (.15 .38 .35 .12), (.15 .20 .38 .27), one row per query token. Each output is a weighted sum of ALL values, e.g. Out₂ = 0.12·V₁ + 0.65·V₂ + 0.10·V₃ + 0.13·V₄: “cat” draws mostly on itself (65%) but still reads 12% from “The”. QKᵀ is O(n²), the quadratic cost of attention: n = 2,048 tokens gives about 4M scores per head; n = 128k tokens gives about 16B, which is why long context is hard.]
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, heads, seq_len, d_k)
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # e.g. causal masking
    attn = F.softmax(scores, dim=-1)                  # normalise over keys
    return torch.matmul(attn, V), attn                # output + weights


# Example: seq_len=10, d_k=64, batch=2, heads=8
Q = torch.randn(2, 8, 10, 64)
K = torch.randn(2, 8, 10, 64)
V = torch.randn(2, 8, 10, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output: {out.shape}")       # [2, 8, 10, 64]
print(f"Weights: {weights.shape}")  # [2, 8, 10, 10], a 10×10 attention matrix per head
⚠ Common Pitfall — Forgetting the Causal Mask in Decoder

The decoder must not attend to future tokens during training (it would “cheat” by reading the answer). Apply a causal mask: a lower-triangular matrix where position i can only attend to positions ≤i. In PyTorch: mask = torch.tril(torch.ones(n,n)). Forgetting this mask means the model sees the target during training but not during inference — causing a catastrophic train/eval mismatch where generated text is gibberish.
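The effect of the mask is easiest to see on a tiny example. The sketch below is pure Python with made-up scores and illustrative function names; it builds a lower-triangular mask and applies a row-wise softmax that gives masked positions exactly zero weight.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: mask[i][j] is 1 when position i may attend to j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions receive exactly zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        kept = [s for s, m in zip(score_row, mask_row) if m]
        top = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw attention scores for a 4-token sequence.
scores = [
    [0.5, 2.0, 1.0, 0.1],
    [0.3, 0.8, 3.0, 0.2],
    [1.0, 0.4, 0.6, 2.5],
    [0.2, 0.9, 0.7, 0.3],
]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, so its row is (1, 0, 0, 0) however large the scores for later tokens are; each following row opens up one more position.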

A single attention head can only attend to information from one representational subspace at a time. Multi-head attention runs h independent attention functions in parallel, each with its own learned projection matrices Wᵢ^Q, Wᵢ^K, Wᵢ^V. Each head can learn to attend to a different type of relationship: one head might track syntactic subject-verb agreement, another resolve coreferences (“he” → “John”), another focus on positional proximity, and yet another capture semantic similarity. All h head outputs are concatenated and projected back to d_model via W^O, a learned combination of what each head discovered.

The dimension of each head is dₖ = d_model / h, so the total computation is about the same as a single attention with d_model dimensions. GPT-3 uses 96 attention heads with d_model=12,288, giving each head a 128-dimensional subspace. This is one of the key scaling choices: more heads = more types of relationships the model can simultaneously track.

Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W^O
headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
dₖ = d_model / h · GPT-3: d_model = 12288, h = 96 heads, dₖ = 128 per head
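The bookkeeping behind the head dimension and the Concat step is simple enough to sketch in plain Python. The helpers below are illustrative names; they split a single already-projected token vector into per-head chunks and join them again. In a real layer the split is applied to the projected Q, K and V, and each head runs attention before the concatenation.

```python
def head_dim(d_model: int, num_heads: int) -> int:
    """d_k = d_model / h; the model width must divide evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


def split_heads(vector, num_heads):
    """Split one token's d_model-vector into h contiguous chunks of size d_k."""
    d_k = head_dim(len(vector), num_heads)
    return [vector[i * d_k:(i + 1) * d_k] for i in range(num_heads)]


def concat_heads(heads):
    """Concatenate per-head outputs back into a single d_model-vector."""
    return [value for head in heads for value in head]


gpt3_d_k = head_dim(12288, 96)           # per-head size for GPT-3's published sizes

token = list(range(8))                   # toy token with d_model = 8
heads = split_heads(token, num_heads=4)  # 4 heads of size 2
restored = concat_heads(heads)
print(gpt3_d_k, heads, restored == token)
# 128 [[0, 1], [2, 3], [4, 5], [6, 7]] True
```

Because h × dₖ = d_model, splitting into heads changes how the width is used, not how much computation is spent.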
Multi-Head Attention: h parallel attention mechanisms with different projections
[Figure: the input X feeds four attention heads in parallel, each with its own projections (W₁^Q, W₁^K, W₁^V through W₄^Q, W₄^K, W₄^V). The head outputs are concatenated (head₁ ‖ head₂ ‖ head₃ ‖ head₄, with h × dₖ = d_model) and passed through the W^O projection to give the MHA output. Heads may specialise in syntax, coreference, position or semantics; W^O learns how to combine what each head discovered.]

Self-attention on its own is permutation equivariant: swap any two tokens and the attention scores and outputs are identical, just reordered in the same way. The Transformer has no inherent concept of token order, so “cat sat mat” and “mat sat cat” would produce the same attention weights if not corrected. To inject positional information, the original paper adds a fixed sinusoidal positional encoding to each token embedding before the first attention layer.

The sinusoidal encoding uses different frequencies for different embedding dimensions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Each position gets a unique vector. The model can learn to extract relative position from these encodings because PE(pos+k) can be expressed as a linear function of PE(pos) — the Transformer can learn to do relative position arithmetic via attention.
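Both the encoding and the linear-shift property can be verified numerically. The sketch below is pure Python with illustrative function names and assumes an even d_model; it computes PE(pos) and then shows that PE(pos + k) is obtained from PE(pos) by a fixed rotation of each (sin, cos) pair that depends only on k.

```python
import math


def positional_encoding(pos: int, d_model: int):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe, k: int, d_model: int):
    """Map PE(pos) to PE(pos + k) with a fixed rotation per (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        # Angle-addition identities: the rotation depends on k only, not on pos.
        out.append(s * math.cos(w * k) + c * math.sin(w * k))
        out.append(c * math.cos(w * k) - s * math.sin(w * k))
    return out


d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_8 = positional_encoding(8, d_model)
pe_5_shifted = shift(pe_5, 3, d_model)
max_error = max(abs(a - b) for a, b in zip(pe_8, pe_5_shifted))
print(max_error)  # tiny: floating-point rounding only
```

The same rotation maps position 100 to 103, which is what allows attention layers to learn relative offsets from absolute encodings.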

Modern LLMs use Rotary Positional Embeddings (RoPE) instead — used in LLaMA, GPT-NeoX, and most recent architectures. RoPE encodes position by rotating the Q and K vectors by an angle proportional to position before computing dot products. The key advantage: attention scores naturally depend on the relative position between tokens (pos_i − pos_j), not absolute positions — enabling better generalisation to longer sequences than seen during training.
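The relative-position property of RoPE can be demonstrated with a single 2-dimensional pair. The sketch below is a deliberate simplification: real RoPE rotates many pairs of dimensions, each with its own frequency, whereas here there is one pair and one hypothetical frequency `theta`. The function names and vectors are illustrative.

```python
import math


def rotate(vec, angle):
    """Rotate a 2-d vector by `angle` radians."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def rope_score(q, k, pos_q, pos_k, theta=0.1):
    """Dot product of q and k after rotating each by (its position * theta)."""
    qx, qy = rotate(q, pos_q * theta)
    kx, ky = rotate(k, pos_k * theta)
    return qx * kx + qy * ky


q, k = (1.0, 0.5), (0.3, -0.8)

near_start = rope_score(q, k, pos_q=7, pos_k=4)     # offset 3, early in the sequence
far_along = rope_score(q, k, pos_q=107, pos_k=104)  # offset 3, much later
other_offset = rope_score(q, k, pos_q=7, pos_k=1)   # offset 6
print(near_start, far_along, other_offset)
```

The first two scores agree up to rounding because only the offset between the positions matters, while a different offset gives a different score.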

Sinusoidal Positional Encoding: a unique position fingerprint added to each token
[Figure: heatmap of the sinusoidal PE with embedding dimension on the x-axis (dim 0 to dim 4 shown) and sequence position on the y-axis (pos 0 to pos 18), coloured from sin ≈ +1 (high) through 0 to sin ≈ −1 (low). Early dimensions oscillate at high frequency across positions, later dimensions at low frequency, so each position gets a unique encoding vector. PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). The encoding is added to the token embedding BEFORE the first attention layer. Modern alternative: RoPE (LLaMA) encodes relative positions.]

The complete Transformer stacks the same block N times: Multi-Head Attention → Add&Norm → Feed-Forward Network → Add&Norm. The Feed-Forward Network is a two-layer MLP applied independently to each token position: FFN(x) = max(0, xW₁+b₁)W₂+b₂, expanding from d_model to 4·d_model and then back. This 4× expansion and contraction lets the model compute complex non-linear transformations per token. The Add&Norm step adds the sub-layer’s input to its output as a residual connection and then applies Layer Normalisation, which enables stable training at depth and mitigates the vanishing gradient problem (Ch 4.3).
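The feed-forward half of a block can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the weights are random rather than trained, LayerNorm omits its learned gain and bias, and `x` stands in for the output of the attention sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
d_ff = 4 * d_model                                   # the 4x expansion

x = rng.normal(size=(seq_len, d_model))              # stand-in for the attention output
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = layer_norm(x + ffn(x, W1, b1, W2, b2))         # Add & Norm: residual, then LayerNorm
print(out.shape)

# "Per position": running the FFN on one token alone gives the same row
same = np.allclose(ffn(x, W1, b1, W2, b2)[2], ffn(x[2:3], W1, b1, W2, b2)[0])
print(same)
```

The ordering here (residual first, LayerNorm second) follows the Add&Norm description above; many later models apply the normalisation before the sub-layer instead.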

The original Transformer had two components. The encoder is bidirectional: every token attends to all other tokens in both directions. It produces contextualised representations of the input sequence. The decoder is causal: each output token attends only to previously generated tokens (enforced by the causal mask), plus cross-attention to all encoder outputs. This asymmetry is fundamental: the encoder understands the full input at once, while the decoder generates output token-by-token, attending to what it has already produced.
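The difference between bidirectional and causal attention comes down to a mask applied to the score matrix before the softmax. A minimal NumPy sketch, using random numbers in place of real QKᵀ/√dₖ scores:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)            # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 4
scores = rng.normal(size=(T, T))                     # stand-in for QK^T / sqrt(d_k)

# Encoder: every token attends to every token
enc_weights = softmax(scores)

# Decoder: position i may only attend to positions j <= i
future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
dec_weights = softmax(np.where(future, -np.inf, scores))

print(np.round(dec_weights, 2))                      # upper triangle is exactly zero
```

Setting future scores to −∞ makes their softmax weight exactly zero, so each row still sums to 1 over the allowed positions.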

Full Transformer — Encoder (left) and Decoder (right) with all components
[Figure: full Transformer. Encoder (left, ×N blocks): Token Embedding + Positional Encoding → Multi-Head Self-Attention → Add & Norm (residual + LayerNorm) → Feed-Forward Network (d_model → 4·d_model → d_model, per position) → Add & Norm; the encoder output supplies K and V to cross-attention. Decoder (right, ×N blocks): Target Embedding + Positional Encoding → Masked Multi-Head Self-Attention (causal) → Add & Norm → Cross-Attention (Q from decoder; K, V from encoder) → Add & Norm → Feed-Forward Network → Add & Norm → Linear → Softmax → next-token probabilities.]

The original Transformer (2017) had both an encoder and a decoder. Within a year, two research groups realised that you could use just one half and pre-train it on massive text corpora to create a general-purpose language model. Google AI Language introduced BERT (Bidirectional Encoder Representations from Transformers, 2018) using the encoder only, pre-trained with Masked Language Modelling: randomly mask 15% of tokens and train the model to predict them using bidirectional context. OpenAI introduced GPT (Generative Pre-trained Transformer, 2018) using the decoder only, pre-trained with standard next-token prediction. These two approaches define the landscape of modern NLP.

Transformer Variants — Encoder-only, Decoder-only, Encoder-Decoder
[Figure: three Transformer variants. BERT (encoder-only): bidirectional self-attention over all tokens, e.g. "[CLS] The cat sat [MASK]"; pre-trained to predict [MASK]; used for classification, NER and QA (RoBERTa, DistilBERT, ALBERT). GPT (decoder-only): masked (causal) self-attention, left-to-right only, e.g. "The cat sat on" → predict "the"; used for generation, chatbots and LLMs (GPT-2/3/4, LLaMA, Claude, Gemini). T5 (encoder-decoder): the encoder passes K, V to the decoder, e.g. translate "The cat" → "Le chat"; used for translation and summarisation (T5, BART, mT5).]

| Model | Architecture | Attention | Pre-training Task | Best For |
|---|---|---|---|---|
| BERT / RoBERTa | Encoder only | Bidirectional | Masked Language Model (MLM) | Classification, NER, QA |
| GPT-2/3/4, LLaMA | Decoder only | Causal (L→R) | Next token prediction | Generation, chatbots, LLMs |
| T5 / BART | Encoder-Decoder | Enc: bidirectional, Dec: causal | Text-to-text / denoising | Translation, summarisation |

“Attention Is All You Need” (2017) is arguably the most consequential AI paper of the past decade. BERT and GPT were both published in 2018. In 2020, GPT-3 demonstrated few-shot learning at a scale few had anticipated. Nearly every major AI system since 2018 (GPT-4, Claude, Gemini, LLaMA, Stable Diffusion, AlphaFold 2, Whisper) is built on the Transformer or uses its attention mechanism as a core component.

∑ Chapter 4.7 Summary — The Transformer

  • Self-attention: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)·V — each token attends to all others in one parallel operation
  • Parallel processing: all tokens computed simultaneously → massive GPU efficiency gain over RNNs — enabled training at GPT-3/4 scale
  • Multi-head: h parallel attention heads → each learns different relationship patterns (syntax, coreference, position, semantics)
  • Positional encoding: must be added — attention alone is position-blind; modern LLMs use RoPE for relative position
  • Transformer block: Multi-Head Attention → Add&Norm → FFN → Add&Norm — residual connections at every step
  • BERT=encoder (bidirectional, understanding tasks), GPT=decoder (causal, generation), T5=encoder-decoder (seq2seq tasks)
4.8
Chapter 4.8
Transfer Learning & Fine-Tuning

Pre-training a large model once on vast data, then adapting it cheaply to specific tasks, is the defining paradigm of modern AI. Without transfer learning there would be no GPT-4, no BERT, no Stable Diffusion — the compute required to train each from scratch would be prohibitive. Understanding when to freeze, when to fine-tune, and when to use LoRA separates practitioners from theorists.

Training a large neural network from scratch requires two things that most practitioners do not have: millions of labelled examples and thousands to millions of GPU-hours. Pre-training GPT-3 is estimated to have cost about $4.6M in compute; BERT-Large took 4 days on 16 Cloud TPUs (64 TPU chips). Transfer learning solves this by splitting the problem into two phases. Pre-training: train on a large, general dataset (web-scale text, the 1.2M ImageNet images, all of Wikipedia) until the model learns rich, reusable representations. Fine-tuning (or adaptation): start from those learned weights and update them toward a specific downstream task, with far less data and compute.

The foundational insight is that early layers of neural networks learn general features that transfer across tasks. In CNNs, layer 1 universally detects oriented edges regardless of whether the network was trained for cats, cars, or faces. In language models, early layers build syntactic representations applicable to any NLP task. This hierarchy of generality — general features at the bottom, task-specific at the top — is what makes transfer learning work. The analogy: a radiologist who spent years in medical school (pre-training on general anatomy) can specialise in chest X-ray reading (fine-tuning on specific task data) far faster than someone starting from scratch.

Transfer Learning Strategies — Feature Extraction vs Fine-Tuning
[Figure: three strategies compared. From scratch: all five layers start from random weights; needs millions of examples and GPU-days of compute. Feature extraction: a new trainable head sits on top of frozen layers 1–4; trains <1% of parameters in minutes and works with hundreds of examples. Full fine-tuning: task head at a high LR, with progressively smaller LRs from layer 4 down to layer 1 (minimum LR); best quality with roughly 10× less data than training from scratch.]

Feature extraction freezes every parameter of the pre-trained backbone and trains only a small task-specific head attached to the top. Because no gradients need to flow through the frozen backbone, forward passes do not require gradient tracking — making this dramatically faster and less memory-intensive than fine-tuning. For image tasks, the head is typically a linear classifier or small MLP on top of the backbone’s pooled output. For text tasks with BERT, the [CLS] token embedding (position 0 of the last hidden state) serves as a fixed-size sentence representation, since BERT is trained to pack sentence-level information into it during pre-training (via the Next Sentence Prediction objective).

Feature extraction works best when your task is similar to the pre-training distribution and you have limited labelled data. If your task is quite different from pre-training (e.g., medical imaging from a model pre-trained on natural images), the frozen features may not be informative enough — full fine-tuning is needed. The key question: do the features the backbone learned happen to be useful for your task?

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# FREEZE all BERT parameters: no gradients flow through the backbone
for param in model.parameters():
    param.requires_grad = False
model.eval()  # make sure dropout is off so the frozen features are deterministic

# Only the task head is trainable
classifier = nn.Linear(768, 2)  # binary sentiment: 768-dim BERT output → 2 classes

text = "The model extracts semantic features."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():  # no gradients needed for the frozen backbone
    outputs = model(**inputs)

# [CLS] token embedding = sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
logits = classifier(cls_embedding)                  # shape: (1, 2)

print(f"Embedding: {cls_embedding.shape}")  # torch.Size([1, 768])
print(f"Logits: {logits.shape}")            # torch.Size([1, 2])

# Trainable params: only the 768×2 + 2 = 1,538 classifier params
trainable = sum(p.numel() for p in classifier.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / Total: {frozen+trainable:,}")  # 1,538 / ~110M
```

Full fine-tuning updates all weights of the pre-trained model on the downstream task. The critical constraint is the learning rate. Pre-trained weights encode the training signal from massive datasets, and a large learning rate can destroy this knowledge within a few gradient steps, a process called catastrophic forgetting: the pre-training knowledge disappears as new task gradients overwrite it. The standard remedy is a learning rate roughly an order of magnitude smaller than the pre-training LR (typically 2e-5 to 5e-5 for BERT/GPT-sized models, versus about 1e-4 to 3e-4 for pre-training).

Layer-wise learning rate decay (LLRD) refines this further: assign progressively smaller learning rates to earlier layers. The final layer gets the full (small) LR; each preceding layer gets the LR of the layer above it multiplied by a decay factor (typically around 0.9 per layer). This preserves the most general representations in early layers while allowing later layers to adapt more aggressively. ULMFiT (Howard & Ruder, 2018) pioneered this technique under the name discriminative fine-tuning, and it remains common practice for fine-tuning large Transformers.
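The decay rule is simple enough to compute by hand. The sketch below uses a hypothetical helper, `llrd_learning_rates`, with the values from the text (top LR 2e-5, decay 0.9) and a 12-layer model such as BERT-base.

```python
def llrd_learning_rates(num_layers, top_lr=2e-5, decay=0.9):
    """Return one learning rate per layer; index 0 is the earliest layer."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = llrd_learning_rates(num_layers=12)
for i, lr in enumerate(lrs):
    print(f"layer {i:2d}: lr = {lr:.2e}")
```

With these settings the first layer trains at roughly a third of the top layer's rate (0.9¹¹ ≈ 0.31), so the most general features move least.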

Fine-Tuning Learning Rate — catastrophic forgetting vs stable adaptation
[Figure: two training curves over epochs (y-axis 0% to 100%), each showing task accuracy and pre-train retention. Left, LR = 1e-3 (too high): catastrophic forgetting, pre-train retention falls. Right, LR = 2e-5 (appropriate) with warmup: stable adaptation, pre-train knowledge preserved.]

```python
from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_cosine_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Layer-wise LR decay: earlier layers get smaller LR
base_lr, decay = 2e-5, 0.9
layers = model.bert.encoder.layer          # 12 Transformer layers in bert-base
num_layers = len(layers)

optimizer_grouped_parameters = [
    # Classifier head (and pooler): highest LR (task-specific)
    {'params': list(model.classifier.parameters()) + list(model.bert.pooler.parameters()),
     'lr': 3e-5},
    # Embeddings: smallest LR (most general)
    {'params': model.bert.embeddings.parameters(), 'lr': base_lr * decay ** num_layers},
]
# Last transformer layer gets the full base LR; each earlier layer is scaled by `decay`
for i, layer in enumerate(layers):
    optimizer_grouped_parameters.append(
        {'params': layer.parameters(), 'lr': base_lr * decay ** (num_layers - 1 - i)})

optimizer = AdamW(optimizer_grouped_parameters, weight_decay=0.01)

# Warmup 10% of steps then cosine decay
total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=total_steps)

# Fine-tuning loop (standard; see Ch 4.4 training loop)
# train_loader is assumed to be a DataLoader yielding dicts with
# input_ids, attention_mask and labels; it is not defined in this snippet.
model.train()
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
⚠ Common Pitfall — Catastrophic Forgetting

Using a learning rate ≥ 1e-4 for fine-tuning a pre-trained Transformer will typically cause catastrophic forgetting within 1–2 epochs. The model learns your task quickly but destroys its general language understanding. Signs: training loss drops fast but evaluation on any other task collapses. Fix: use LR ≤ 5e-5, add warmup (100–500 steps), and monitor performance on a held-out validation set from the original task distribution.

Full fine-tuning of a 7B-parameter model requires roughly the same per-parameter GPU memory as pre-training it (weights, gradients and optimiser states), typically 80GB or more even with fp16 weights. For 175B (GPT-3) or 540B (PaLM) models, full fine-tuning is impractical without a large multi-GPU cluster. Parameter-Efficient Fine-Tuning (PEFT) methods address this by updating only a tiny fraction of parameters, usually 0.1–5%, while keeping the rest frozen, achieving near-full-fine-tuning quality at a fraction of the cost.
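A rough back-of-envelope estimate shows where the memory goes. The 16 bytes per parameter used here is an assumption, based on a common mixed-precision Adam setup: fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4) + two fp32 Adam moments (4 + 4). Activations are excluded, so real usage is higher, and the function name is illustrative.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Memory for weights, gradients and optimiser state, in GB (activations excluded)."""
    return num_params * bytes_per_param / 1e9

for name, n in [("7B", 7e9), ("175B", 175e9), ("540B", 540e9)]:
    print(f"{name}: ~{training_memory_gb(n):,.0f} GB")

# For comparison: holding only frozen fp16 weights costs 2 bytes per parameter
print(f"7B frozen fp16 weights: ~{training_memory_gb(7e9, bytes_per_param=2):,.0f} GB")
```

Under these assumptions a 7B model needs on the order of 112 GB for full fine-tuning but only about 14 GB to hold frozen, which is the gap PEFT methods exploit.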

Adapter layers (Houlsby et al., 2019) insert small bottleneck networks inside each Transformer layer: project down to a small dimension, apply a non-linearity, project back up, and add the result to the input. Only adapter parameters are trained. Prefix tuning (Li & Liang, 2021) prepends learnable virtual tokens to the keys and values at every layer; the model sees these as additional context, but they are just learned parameter vectors. Prompt tuning (Lester et al., 2021) prepends soft tokens to the input embeddings only. It is the simplest form, and it matches full fine-tuning mainly for large models (>10B parameters).
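The adapter bottleneck can be sketched in NumPy. This is a simplified illustration, not the published implementation: biases are omitted, ReLU is used as the non-linearity, and the sizes (d_model = 768, bottleneck = 64) and the zero initialisation of the up-projection are assumptions chosen so the adapter starts as an identity map.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 768, 64

W_down = rng.normal(scale=0.02, size=(d_model, bottleneck))  # trainable
W_up = np.zeros((bottleneck, d_model))                       # trainable, zero init

def adapter(h):
    """Down-project, apply ReLU, up-project, then add the input back (skip connection)."""
    return h + np.maximum(0.0, h @ W_down) @ W_up

h = rng.normal(size=(4, d_model))        # hidden states for 4 tokens
print(np.allclose(adapter(h), h))        # identity at initialisation

adapter_params = W_down.size + W_up.size
print(adapter_params)                    # 2 * 768 * 64 = 98304
```

Because of the skip connection, inserting the adapter does not disturb the pre-trained model until its weights start to move.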

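| Method | Trainable Params | Storage per Task | Quality | Inference Overhead |
|---|---|---|---|---|
| Full fine-tuning | 100% | Full model copy | Best | None |
| Adapter layers | 0.5–5% | Small adapter | Good | Small (extra layers in the forward pass) |
| Prefix tuning | 0.1–1% | Prefix vectors | Moderate | Small (extra KV) |
| Prompt tuning | <0.01% | Just prompts | Good (>10B only) | Minimal (a few extra input tokens) |
| LoRA | 0.1–1% | Low-rank matrices | Very good | None (merge at inference) |
| QLoRA | 0.1–1% | Low-rank matrices (base model held in 4-bit) | Good | None (4-bit base model) |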

Hu et al. (2021) built on the observation that large pre-trained language models have a low intrinsic dimension: they can be adapted effectively by optimising in a subspace far smaller than the full parameter space. Their hypothesis is that the weight update ΔW induced by fine-tuning also has low intrinsic rank: the task adaptation doesn’t require updating all d² values independently, because the update lies in a lower-dimensional subspace.

LoRA exploits this by decomposing ΔW = A·B, where A is d×r and B is r×d with r ≪ d (typically r=4, 8, or 16). The original weight W is frozen. Only A and B are trained. At the end, the adaptation is merged: W′ = W + (α/r)·A·B, a simple matrix addition, so inference has zero added latency. For a typical d=4096 matrix: full ΔW = 16.7M parameters; LoRA r=8: A+B = 2×4096×8 = 65,536 parameters (256× smaller).

LoRA: Low-Rank Decomposition
Standard: W′ = W + ΔW   (update full d×d, expensive)
LoRA: W′ = W + (α/r)·A·B   (A: d×r, B: r×d, cheap)
r = rank (4/8/16/32) · α = scaling · A ~ N(0,1) · B = 0 initially (ΔW = 0 at start)
Merge at inference: W′ = W + (α/r)·A·B, zero latency overhead
LoRA: low-rank decomposition trains A and B instead of the full ΔW
[Figure: full fine-tuning updates a d×d matrix ΔW (d=4096: 16.7M parameters). LoRA keeps W frozen (d×d, unchanged) and adds the product of A (d×r, trained) and B (r×d, trained); with r=8, A+B = 65,536 parameters (256× smaller). Parameter comparison: 100% (16.7M) for full fine-tuning, 0.4% (65K) for r=8, 0.2% (32K) for r=4. LoRA merges into W at inference, so there is no latency overhead, and only A and B are stored per adapter (a small file per task).]
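The decomposition and the merge step can be verified numerically with a small NumPy sketch. It follows this chapter's convention (A is d×r with Gaussian init, B is r×d with zero init) and uses a toy d = 64 so it runs instantly; the function name `lora_forward` and the row-vector convention x·W are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8                     # toy sizes; the text uses d = 4096

W = rng.normal(size=(d, d))                # frozen pre-trained weight
A = rng.normal(size=(d, r))                # trainable, Gaussian init
B = np.zeros((r, d))                       # trainable, zero init so A @ B = 0 at the start
x = rng.normal(size=(3, d))                # a batch of 3 input vectors

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus low-rank path: x W + (alpha / r) * (x A) B."""
    return x @ W + (alpha / r) * (x @ A) @ B

# At initialisation the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W)

# Pretend training has moved B away from zero
B = rng.normal(scale=0.01, size=(r, d))

# Merge: fold the update into W once, then inference is a single matmul
W_merged = W + (alpha / r) * (A @ B)
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W_merged)

# Parameter arithmetic from the text (d = 4096, r = 8)
full, lora = 4096 * 4096, 2 * 4096 * 8
print(full, lora, full // lora)            # 16777216 65536 256
```

The second assertion is the reason LoRA adds no inference latency: the two-path forward pass and the merged single matrix give the same outputs.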
```python
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load base model (7B parameters)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank: controls expressivity vs size
    lora_alpha=32,                        # scaling factor α
    target_modules=["q_proj", "v_proj"],  # Q and V in attention only
    lora_dropout=0.1,
    bias="none",
)

# Wrap model: adds A and B matrices, freezes everything else
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622

# QLoRA: LoRA on a 4-bit quantised model, fine-tune 7B on a single 24GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config)

# Then apply LoRA as before: adapters train in higher precision on the 4-bit base
model_4bit = prepare_model_for_kbit_training(model_4bit)
qlora_model = get_peft_model(model_4bit, lora_config)
```
⚠ Common Pitfall — Choosing LoRA Rank Too High or Too Low

Rank r is the main hyperparameter. Too low (r=1–2): insufficient expressivity, and the model may underfit the task. Too high (r=64+): more trainable parameters and memory with diminishing quality gains, and a greater risk of overfitting on small datasets (the adapter can still be merged at inference at any rank). Standard starting point: r=8 for most tasks; try r=4 if memory is tight, r=16–32 if quality is insufficient. Also important: applying LoRA only to Q and V (not K) is a common default, but applying it to all attention projection matrices (Q, K, V, O) often improves quality at modest extra cost.

The final alignment step that transforms a raw language model into a helpful assistant is RLHF — the technique behind ChatGPT, Claude, and Gemini. A pre-trained LLM knows how to predict text distributions but has no concept of “helpful” or “harmful”. RLHF shapes the model’s outputs toward human preferences through a three-stage pipeline: supervised fine-tuning on demonstrations, training a reward model from pairwise human judgements, and then using reinforcement learning to optimise the LLM to maximise the learned reward.

The RL stage uses Proximal Policy Optimisation (PPO) with a KL divergence penalty: the policy (the LLM being trained) is discouraged from straying too far from the SFT baseline. Without this penalty, the model learns to “game” the reward model, generating outputs that score high on the RM but are not actually helpful (reward hacking). The KL term penalises outputs that diverge greatly from the pre-RLHF distribution, maintaining general capability while nudging toward preferred behaviour. Anthropic’s Constitutional AI (CAI) replaces human preference labels for harmlessness in Stage 2 with AI-generated judgements guided by a written set of principles, which makes preference collection far cheaper to scale.
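Two small formulas carry most of the pipeline: the pairwise (Bradley-Terry) loss used to train the reward model, and the KL-penalised reward used in the RL stage. The sketch below is illustrative only; the function names, the per-sample log-ratio form of the penalty, and β = 0.1 are assumptions, and real systems apply the penalty per token inside a full PPO loop.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Reward-model loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """RL-stage reward: RM score minus a penalty for drifting away from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

# Stage 2: the loss is small when the RM scores the preferred response higher
loss_good = bradley_terry_loss(r_chosen=2.0, r_rejected=0.0)
loss_bad = bradley_terry_loss(r_chosen=0.0, r_rejected=2.0)
print(round(loss_good, 4), round(loss_bad, 4))

# Stage 3: same RM score, but the second response is far more likely under the
# policy than under the SFT model, so its effective reward is reduced
reward_close = kl_shaped_reward(1.5, logp_policy=-10.0, logp_sft=-10.0)
reward_drift = kl_shaped_reward(1.5, logp_policy=-4.0, logp_sft=-10.0)
print(reward_close, reward_drift)
```

The penalty term is what makes reward hacking expensive: outputs the SFT model would almost never produce must earn a much higher RM score to be worth generating.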

RLHF Pipeline — SFT → Reward Modelling → PPO Optimisation
[Figure: RLHF pipeline. Stage 1, SFT (supervised fine-tuning): human demonstrations are used to fine-tune the base LLM; output: SFT model; cost: days of human annotation. Stage 2, reward model: humans rank response A vs B and the RM is trained to predict these preferences, RM(prompt, response) → scalar; output: reward model; cost: thousands of pairwise rankings. Stage 3, PPO (RL optimisation): sample responses, score them with the RM, update the policy under a KL constraint, KL(policy ‖ SFT) < ε, which prevents reward hacking; output: an assistant model such as ChatGPT or Claude; cost: GPU-days of RL training.]
📚

SFT Data Requirements

  • High-quality human demos
  • InstructGPT: ~13k examples
  • Diverse tasks and formats
  • Quality >> quantity
⚖️

Reward Model

  • Same architecture as LLM
  • Final layer: scalar score
  • InstructGPT: ~33k comparisons
  • Bradley-Terry ranking model
🤖

Modern Alternatives

  • DPO: skip RL, direct preference
  • Constitutional AI (Claude)
  • RLAIF: AI feedback at scale
  • RLVR: verifiable rewards, no learned RM (reasoning)

∑ Chapter 4.8 Summary — Transfer Learning & Fine-Tuning

  • Transfer learning: pre-train on large data, adapt cheaply — reuse expensive knowledge; early layers learn universal features
  • Feature extraction: freeze backbone, train head only — minutes not days, works with hundreds of labelled examples
  • Fine-tuning: update all weights with small LR (2e-5 to 5e-5) — catastrophic forgetting risk; use warmup + LLRD
  • LoRA: ΔW = A·B with rank r ≪ d; trains ~0.06% of parameters in the 7B example, approaches full fine-tuning quality, zero inference overhead after merging
  • RLHF: SFT → Reward Model → PPO with KL constraint — the recipe behind ChatGPT, Claude, and Gemini
  • QLoRA = LoRA on a 4-bit quantised base model; fine-tunes 7B models on a single 24GB consumer GPU (the QLoRA paper reports 65B on one 48GB GPU)
4.9
Chapter 4.9
Generative Models — VAEs, GANs & Diffusion

Generative models learn the distribution of data well enough to create new data indistinguishable from real examples. Every AI-generated image, synthesised voice, and hallucinated protein structure is the output of a generative model. VAEs introduced structured latent spaces, GANs introduced adversarial training, and diffusion models combined the best of both — producing the current state of the art.

Most models covered so far are discriminative: given an input x, predict a label y. They learn P(y|x) — the conditional distribution of outputs given inputs. A discriminative model draws a decision boundary in input space but has no model of what the input data actually looks like.

Generative models instead learn P(x) — the distribution of the data itself. Once you have a good model of P(x), you can sample from it: draw a new x that was never in the training set but looks like it could have been. This is qualitatively different from classification: you are not deciding which bucket an input belongs to, you are learning what valid inputs look like and manufacturing new ones. Applications span every domain: faces, voices, molecules, code, music, 3D shapes.

| Discriminative Models | Generative Models |
|---|---|
| "Is this a cat?" → Yes/No | "Generate a cat image" → new image |
| Learns P(y∣x): conditional on the input | Learns P(x) or P(x∣y): the data distribution itself |
| One direction: input → label | Creates new data by sampling |
| Simpler, more data-efficient | Harder to train, more expressive |
| Classification, regression, NER | Image synthesis, text gen, drug design |

Kingma & Welling (2013) introduced the VAE, one of the first deep generative models that can be trained end-to-end with backpropagation. A regular autoencoder compresses input x to a latent code z, then reconstructs x. That is useful for compression but not generation, because the latent space has unpredictable gaps: points between training examples decode to garbage. The VAE fixes this by making the encoder probabilistic: instead of outputting a point z, it outputs a Gaussian distribution N(μ, σ²). The network is then trained to keep this distribution close to a standard normal N(0, I) (via a KL divergence penalty), forcing all latent representations to occupy a continuous, organised neighbourhood around the origin.

The reparameterisation trick is the key engineering insight that makes training possible. You cannot backpropagate through a sampling operation z ~ N(μ, σ²) directly, because sampling is stochastic. The trick: write z = μ + σ·ε where ε ~ N(0,1) is sampled externally. Now μ and σ are deterministic outputs of the encoder, gradients can flow through them, and the stochasticity is isolated in ε which has no parameters to update. This simple algebraic rearrangement is what makes VAE training feasible.

VAE Objective (ELBO: Evidence Lower Bound)
ELBO = E[log P(x|z)] − KL(q(z|x) ∥ N(0,I))   (maximise)
Loss = −ELBO = Reconstruction loss + KL(q(z|x) ∥ N(0,I))   (minimise)
KL(q(z|x) ∥ N(0,I)) = −½∑(1 + log σ² − μ² − σ²)
Reconstruction: how faithfully does the decoder recreate x from z?
KL: how close is q(z|x) to N(0,I)? Regularises the latent space.
Reparameterisation: z = μ + σ·ε, ε~N(0,I), makes sampling differentiable
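The loss terms above translate directly into NumPy. This sketch shows only the arithmetic, with no encoder or decoder networks; it assumes the encoder outputs log σ² (a common parameterisation) and uses squared error as the reconstruction term, which corresponds to a Gaussian decoder. The function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); all randomness lives in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: squared-error reconstruction term plus the KL term."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, log_var)

# Two encoder outputs: one exactly at the prior, one shifted away from it
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.zeros((2, 2))                       # sigma = 1 in every dimension

z = reparameterise(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)                               # KL is zero only for the first row
```

With σ = 1 the KL reduces to ½∑μ², so the second row is penalised purely for its distance from the origin.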
VAE — Probabilistic encoder outputs distribution, decoder generates from samples
[Figure: VAE architecture. Input x → Encoder q_φ(z|x) → μ and σ; with ε ~ N(0,I), the latent is z = μ + σ·ε (reparameterisation trick) → Decoder p_θ(x|z) → reconstruction x̂. Losses: reconstruction ‖x − x̂‖² plus KL(q(z|x) ‖ N(0,I)). Generation: (1) sample z ~ N(0,I), (2) decode z → x; no encoder is needed.]

The VAE’s KL penalty forces all class clusters in latent space to overlap near the origin, eliminating gaps. This creates a structured, continuous latent space where every point decodes to a plausible output. You can smoothly interpolate between any two encoded examples: a straight line from the latent code of a young face to the code of an old face passes through intermediate codes that decode to faces of intermediate age. This is qualitatively impossible with regular autoencoders, which have large empty voids between training examples.

VAE Latent Space — structured, continuous, enables smooth interpolation
[Figure: latent spaces compared. Regular autoencoder: class clusters (A, B) are separated by empty regions that decode to garbage. VAE: clusters A, B, C, D overlap near N(0,I) because of the KL penalty, leaving no gaps; every point decodes to a plausible output, so an interpolation path between clusters stays valid.]

Goodfellow et al. (2014) proposed a radically different generative approach: instead of maximising a likelihood, frame generation as a two-player game. A Generator G takes random noise z as input and outputs synthetic data G(z). A Discriminator D receives either a real data point x or a fake G(z) and must decide which is which. G is trained to fool D; D is trained to detect fakes. G never sees real data directly: its only training signal is the gradient that flows back through D’s judgement of its samples. The result of this adversarial dynamic, when it works, is a Generator that produces data indistinguishable from real training examples, because any distinguishable fake will be caught by D and penalised.

The theoretical optimum is a Nash equilibrium: G generates data exactly matching the true distribution P(x), and D can do no better than random guessing (P(real) = 0.5 for all inputs). In practice, reaching this equilibrium is notoriously difficult. The training is unstable, sensitive to hyperparameters, and prone to mode collapse — covered in the next section.

GAN Objective: Minimax Game
min_G max_D [ E_x[log D(x)] + E_z[log(1 − D(G(z)))] ]
D maximises: correctly labelling real x as real, fake G(z) as fake
G minimises: pushes D to output 1 (real) for fake data (i.e., fool D)
Nash equilibrium: D(x) = 0.5 everywhere, G matches the true data distribution
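Plugging numbers into the objective makes the dynamics concrete. The sketch below evaluates the losses for single samples and compares the gradient strength of the minimax generator loss with the commonly used non-saturating alternative, −log D(G(z)). The gradients are taken with respect to the discriminator's logit a, where D = sigmoid(a); the function names are illustrative.

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator loss: -[log D(x) + log(1 - D(G(z)))] for one real and one fake sample."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss_minimax(d_fake):
    """Generator term of the minimax objective: log(1 - D(G(z))), which G minimises."""
    return math.log(1.0 - d_fake)

def g_loss_non_saturating(d_fake):
    """Practical alternative: -log D(G(z))."""
    return -math.log(d_fake)

# A confident, correct discriminator has a low loss
confident = d_loss(d_real=0.95, d_fake=0.05)
# At equilibrium D outputs 0.5 everywhere and the D loss equals log 4
equilibrium = d_loss(d_real=0.5, d_fake=0.5)
print(confident, equilibrium, math.log(4))

# Early in training D rejects fakes easily, e.g. D(G(z)) = 0.01.
# Gradient magnitudes with respect to the logit a:
d_fake = 0.01
grad_minimax = d_fake              # |d/da log(1 - sigmoid(a))| = sigmoid(a)
grad_non_sat = 1.0 - d_fake        # |d/da -log sigmoid(a)|     = 1 - sigmoid(a)
print(grad_minimax, grad_non_sat)
```

When D is winning, the minimax loss gives G almost no gradient while the non-saturating loss gives a strong one, which is why the latter is the usual choice in practice.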
GAN — Generator vs Discriminator in a minimax game
[Figure: GAN training loop. Random noise z ~ N(0,I) → Generator G → fake sample G(z) (G wants D to output 1). Real x from the training data and fake G(z) both go to the Discriminator D, which outputs P(real) ∈ [0, 1] and wants to detect fakes. G loss: −log D(G(z)), whose gradient pushes G to fool D. D loss: −[log D(x) + log(1 − D(G(z)))]. Nash equilibrium: D(x) = 0.5 everywhere.]
🎨

Notable GAN Variants

  • DCGAN (2015) — CNN-based GAN
  • cGAN — conditional generation
  • CycleGAN — image-to-image
  • StyleGAN (2018–22) — faces
⚠️

Training Challenges

  • Mode collapse (next section)
  • Vanishing G gradient (D too strong)
  • D/G balance is fragile
  • No convergence guarantee
🔧

Stabilisation Techniques

  • WGAN — Wasserstein distance
  • Spectral normalisation
  • Progressive growing
  • Gradient penalty (WGAN-GP)

Mode collapse is the most notorious GAN failure mode. If G discovers one type of output that consistently fools D, it stops exploring the rest of the distribution and produces the same output (or a small set) regardless of the input noise. The discriminator adapts, G shifts to another single mode, and training cycles without covering the full data distribution. The GAN loss gives no signal that this is happening: the loss values look normal while G has abandoned most of the training distribution.

Training instability arises from the balance requirement: D and G must improve at similar rates. If D becomes too powerful early in training, G receives near-zero gradients and cannot improve (D correctly labels everything with high confidence, so the loss for G becomes flat). If G outpaces D early, D provides no meaningful feedback. The WGAN (Wasserstein GAN) addresses both problems by replacing the original loss with the Wasserstein distance, which provides smoother gradients and is more robust to D/G imbalance.

GAN Mode Collapse — and recovery with stabilisation techniques
[Figure: generator outputs over training. Epoch 10: six varied outputs (diverse). Epoch 50: mode collapse, G generates only one mode and ignores the rest of the distribution. Epoch 200: well-trained again after stabilisation with WGAN-GP / spectral normalisation.]

Ho et al. (2020) introduced Denoising Diffusion Probabilistic Models (DDPM), building on the diffusion framework of Sohl-Dickstein et al. (2015). Diffusion models now power Stable Diffusion, DALL-E 3, Midjourney, and Sora. The key idea is elegant: define a forward process that gradually corrupts data by adding Gaussian noise over T steps until the image becomes pure noise, then train a neural network to learn the reverse process by predicting what noise was added at each step, and thus denoising one step at a time. Generation is simply running the reverse process starting from random noise.

Unlike GANs, diffusion training is stable: the objective is a simple regression (predict the noise added at step t), there is no adversarial game to balance, and the model sees all noise levels during training. Unlike VAEs, there is no explicit latent space to constrain, so the generative quality is not limited by the bottleneck. The tradeoff is sampling speed: generating one image with the original sampler requires T=100–1000 denoising steps, each requiring a full forward pass through the network. Faster samplers such as DDIM (deterministic sampling) cut this to tens of steps, and distillation methods such as SDXL-Turbo reduce it to 1–4 steps, greatly narrowing the speed gap.

Diffusion Model: Forward and Reverse
Forward: q(xₜ|xₜ₋₁) = N(xₜ; √(1−βₜ)·xₜ₋₁, βₜI)   (add noise, fixed)
Efficient: q(xₜ|x₀) = N(xₜ; √ᾱₜ·x₀, (1−ᾱₜ)I), with αₜ = 1−βₜ and ᾱₜ = α₁·α₂·…·αₜ   (jump to step t)
Reverse: pθ(xₜ₋₁|xₜ) = N(xₜ₋₁; μθ(xₜ,t), Σθ(xₜ,t))   (learned)
Training: minimise E[‖ε − εθ(xₜ, t)‖²], i.e. predict the noise added at step t
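The forward process and the training target can be sketched in NumPy without any neural network. The linear β schedule from 1e-4 to 0.02 over T = 1000 steps follows the DDPM paper; the random 8×8 array standing in for an image and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)              # cumulative product of alphas (0-indexed steps)

def q_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(eps, eps_pred):
    """Training objective: mean squared error between true and predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))    # stand-in for a clean image
eps = rng.standard_normal(x0.shape)

for t in [0, 250, 999]:
    x_t = q_sample(x0, t, eps)
    signal = float(np.sqrt(alpha_bar[t]))   # how much of x0 survives at step t
    print(t, round(signal, 3))

# A network eps_theta(x_t, t) would be trained to output eps; a perfect one gives zero loss
print(noise_prediction_loss(eps, eps))
```

The closed-form jump is what makes training cheap: each batch samples a random t and a random ε, builds xₜ in one step, and regresses the network output onto ε.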
Diffusion Model — forward noise addition and learned reverse denoising
[Figure: diffusion process. Forward process (fixed): q(xₜ|xₜ₋₁) adds Gaussian noise step by step, x₀ (clean) → x₁ → … → x_{T−1} → x_T (pure noise), with noise levels β₁, β₂, …, β_T. Reverse process (learned): pθ(xₜ₋₁|xₜ) removes noise, x_T (sampled noise) → x_{T−1} → … → x₁ → x₀ (generated). A U-Net or Transformer εθ(xₜ, t) predicts the noise at each step.]

| Property | VAE | GAN | Diffusion |
|---|---|---|---|
| Training stability | Stable | Unstable (adversarial) | Stable |
| Sample quality | Blurry (over-smooth) | Sharp (when it works) | State-of-the-art |
| Latent space | Structured, continuous | Less structured | Not explicit |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Slow (T=100–1000) |
| Controllability | Good (interpolation) | Moderate | Excellent (conditioning) |
| Mode coverage | Good | Mode collapse risk | Good |
| Best use today | Compression, anomaly detection | Video gen, GAN editing | Image/video synthesis SOTA |
| Examples | VQ-VAE, VQ-VAE-2 | StyleGAN-3, BigGAN | Stable Diffusion, DALL-E 3, Sora |
Generative Model Timeline — from VAE to Diffusion dominance
[Figure: timeline. VAE (2013) → GAN (2014) → DCGAN (2015) → StyleGAN (2018) → DDPM (2020, milestone) → Stable Diffusion (2022) → Sora video (2024, current SOTA). GAN era: 2014–2020; diffusion era: 2020–present.]

∑ Chapter 4.9 Summary — Generative Models

  • Generative models learn P(x) — the data distribution itself — enabling new data synthesis; discriminative models learn P(y|x) (labels from inputs)
  • VAE: encoder → (μ, σ) → z = μ + σ·ε → decoder; KL penalty forces continuous structured latent space — enables interpolation and generation
  • Reparameterisation trick: z = μ + σ·ε, ε~N(0,1) — makes sampling differentiable for backpropagation
  • GAN: Generator G(z) fools Discriminator D(x) — minimax game; Nash eq: D(x)=0.5 everywhere; notorious for mode collapse and training instability
  • WGAN, spectral normalisation, gradient penalty: stabilisation techniques that made large-scale GAN training far more reliable (StyleGAN-3, BigGAN)
  • Diffusion: forward process adds Gaussian noise over T steps; reverse process (εθ) learns to denoise — training objective: E[||ε − εθ(xₜ,t)||²]
  • Diffusion = current SOTA for image/video generation — Stable Diffusion, DALL-E 3, Sora — stable training, no mode collapse, excellent conditioning

🎓 Domain 4 Complete — Deep Learning & Neural Networks

  • Ch 4.1 Perceptron to MLP: weighted sum + step function. XOR killed neural nets for 15 years; hidden layers + non-linearity solved it by learning hierarchical representations.
  • Ch 4.2 Activation Functions: ReLU for CNNs, GELU for Transformers — sigmoid only for binary output. Non-linearity is what makes stacked layers more powerful than one.
  • Ch 4.3 Backpropagation: chain rule through the computational graph. Vanishing gradients: the sigmoid derivative is at most 0.25, so gradients shrink by up to 0.25ⁿ across n layers; ReLU fixes this with gradient=1 for positive inputs.
  • Ch 4.4 Training Deep Networks: He init, BatchNorm, Dropout, AdamW + warmup–cosine LR — the engineering stack that makes 100+ layer networks trainable in practice.
  • Ch 4.5 CNNs: local receptive fields + weight sharing. ResNet y=F(x)+x solved depth degradation — enabled going from 16 to 152+ layers without degradation.
  • Ch 4.6 RNNs & LSTMs: hidden state = sequential memory. LSTM gating (forget/input/output) solves vanishing gradients; attention preview leads directly to the Transformer.
  • Ch 4.7 Transformer: Attention(Q,K,V)=softmax(QKᵀ/√dₖ)V — parallel, direct long-range dependencies. GPT=decoder, BERT=encoder, T5=both.
  • Ch 4.8 Transfer Learning: pre-train then adapt. LoRA trains 0.1% of parameters with zero inference overhead. RLHF (SFT→RM→PPO) creates aligned helpful LLMs.
  • Ch 4.9 Generative Models: VAE = structured latent space. GAN = adversarial game. Diffusion = learn to reverse Gaussian noise — Stable Diffusion, DALL-E 3, Sora.

Domain 4 is the mathematical engine behind every frontier AI system. The Transformer (Ch 4.7) is the single most important architecture in AI today — GPT-4, Claude, Gemini, DALL-E, AlphaFold, Whisper, and virtually every LLM runs on it. Domain 5 (NLP & LLMs) explores what happens when you scale the Transformer to trillions of tokens. Domain 8 (Agentic AI) shows what happens when you give it tools, memory, and the ability to act in the world.