AI Foundation · Domain 04 · Chapter 4.1

Neural Networks — Perceptron to MLP

How a single artificial neuron scales into the multi-layer networks that power modern AI

4.1

Chapter 4.1

Neural Networks — From Perceptron to MLP

A neural network is not a brain simulation. It is a function approximator — a mathematical machine that learns to map inputs to outputs by adjusting millions of numerical parameters. The biological metaphor is useful for intuition; the mathematics is what actually works.

Biological Inspiration Introductory

Long before computers existed, scientists observed that the human brain processes information through a vast network of interconnected cells called neurons. Each biological neuron receives chemical signals through branching fibres called dendrites, integrates those signals in its cell body (the soma), and — if the combined signal exceeds an internal threshold — fires an electrical impulse along its axon to downstream cells. This "integrate and fire" mechanism, repeated across roughly 86 billion neurons with trillions of connections, gives rise to everything from reflex actions to abstract reasoning.

In 1943, McCulloch and Pitts created the first mathematical model of a neuron: a binary threshold unit that sums its inputs and outputs 1 if the sum exceeds a fixed threshold, 0 otherwise. The mapping from biology to mathematics is direct: dendrites become numeric inputs, synaptic strengths become weights, the soma becomes a weighted summation, and the axon firing becomes an activation function. This abstraction — inputs → weighted sum → activation — is still the foundation of every neural network today.

The analogy has important limits. Biological neurons communicate via discrete spikes; artificial neurons use continuous real-valued outputs. Biological learning involves complex biochemical processes; artificial networks learn by gradient descent on a loss function. The phrase "inspired by, not modelled after" is exactly right. Deep learning borrowed the high-level architecture of layered computation and discarded almost everything else in favour of mathematical tractability.

Biological vs Artificial Neuron — the inspiration and the abstraction

The Perceptron In-depth

In 1957, Frank Rosenblatt at the Cornell Aeronautical Laboratory built the Perceptron — the first machine specifically designed to learn from examples. The idea was elegantly simple: represent a decision-making unit as a weighted sum of inputs passed through a step function. If the total weighted input exceeds a threshold, the unit fires (outputs 1); otherwise it stays silent (outputs 0). Crucially, the weights could be adjusted automatically when the unit made a mistake — this was the first learning algorithm for a neural model.

The perceptron structure has four components. First, numeric inputs x₁, x₂, …, xₙ — these could be pixel intensities, sensor readings, or any measurable feature. Second, a weight wᵢ for each input, representing how important that feature is. Third, a bias b that shifts the decision boundary independently of the inputs. Fourth, a step activation function that converts the raw weighted sum into a binary decision.

The perceptron learning rule is the ancestor of gradient descent. After every prediction, if the prediction was correct, do nothing. If the network predicted 0 but the true label was 1, increase each weight by a small fraction of the corresponding input. If the network predicted 1 but should have predicted 0, do the reverse. This simple rule has a remarkable theoretical guarantee: if the training data is linearly separable, the perceptron will converge to a correct solution in a finite number of steps — the Perceptron Convergence Theorem.

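Tracing the AND logic gate concretely: AND outputs 1 only when both inputs are 1. Start with all weights and the bias at 0. Every weighted sum is then 0, so under the ≥ 0 decision rule the unit initially outputs 1 for every input. The first mistake is therefore on (0,0)→0, which lowers the bias; later mistakes on (1,1)→1 add the inputs to the weights. After a few epochs the perceptron settles on a separating line. The exact values depend on the learning rate and the order of the examples, but every solution classifies the four points the same way as the textbook choice w₁=1, w₂=1, b=−1.5, which separates AND's one positive case from the three negatives by the line x₁ + x₂ = 1.5.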

Perceptron — Decision Rule
ŷ = step(w · x + b) = { 1 if w·x + b ≥ 0, 0 otherwise }
w = weight vector · x = input vector · b = bias scalar · ŷ = predicted class

Update Rule (Perceptron Learning)
wᵢ ← wᵢ + α(y − ŷ)xᵢ for each weight i
b ← b + α(y − ŷ)
α = learning rate · y = true label · ŷ = predicted label · (y−ŷ) ∈ {−1, 0, +1}

Perceptron — AND gate implementation with learned weights

import numpy as np

class Perceptron:
    def __init__(self, lr=0.1, n_epochs=10):
        self.lr = lr
        self.n_epochs = n_epochs

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for epoch in range(self.n_epochs):
            for xi, yi in zip(X, y):
                y_hat = self.predict(xi)
                delta = self.lr * (yi - y_hat)  # zero when the prediction is correct
                self.w += delta * xi            # update weights
                self.b += delta                 # update bias

    def predict(self, X):
        return np.where(np.dot(X, self.w) + self.b >= 0, 1, 0)

# AND gate
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(lr=0.1, n_epochs=10)
p.fit(X, y)
print(p.predict(X))  # [0 0 0 1] ✓

⚠ Common Pitfall — Perceptron

The convergence theorem only guarantees convergence if the data is linearly separable. For non-separable data, the algorithm never stops on its own: it cycles through updates that never stabilise. Always set a maximum epoch limit and check whether the number of misclassifications has stopped decreasing.
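One way to act on this advice is to count misclassifications per epoch and stop either on a clean pass or at a cap. The sketch below assumes the same update rule as the Perceptron class above; `train_perceptron` and `max_epochs` are illustrative names, not part of any library.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, max_epochs=50):
    """Perceptron training with an epoch cap; returns weights, bias and per-epoch error counts."""
    w = np.zeros(X.shape[1])
    b = 0.0
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(xi, w) + b >= 0 else 0
            if y_hat != yi:
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        errors_per_epoch.append(errors)
        if errors == 0:  # a full pass without mistakes: converged
            break
    return w, b, errors_per_epoch

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])  # linearly separable
y_xor = np.array([0, 1, 1, 0])  # not linearly separable

_, _, and_errors = train_perceptron(X, y_and)
_, _, xor_errors = train_perceptron(X, y_xor)

print("AND errors per epoch:", and_errors)  # the last entry is 0
print("XOR ran", len(xor_errors), "epochs; errors in last epoch:", xor_errors[-1])
```

On the separable AND data the error count reaches zero and training stops early; on XOR every epoch contains at least one mistake, so only the cap ends the loop.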

The XOR Problem & MLP Motivation In-depth

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis of what single-layer networks could and could not compute. Their central result was devastating: a single-layer perceptron cannot learn the XOR function. XOR outputs 1 when exactly one of two inputs is 1: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. The proof is geometric — there is no single straight line that can separate the two positive cases from the two negative cases. They form a checkerboard that is impossible to bisect with one hyperplane.

The Minsky-Papert result is widely credited with contributing to the first AI winter: funding dried up and neural network research stayed largely dormant for roughly 15 years. The irony is that the book itself pointed toward the solution: adding hidden layers could overcome these limitations, but the authors doubted an efficient learning algorithm for such networks could be found. That algorithm, backpropagation, popularised by Rumelhart, Hinton, and Williams in 1986, became the key that unlocked the field.

The geometric resolution is illuminating. With a hidden layer, the network first learns two intermediate linear boundaries: one that isolates (1,1) and one that isolates (0,0). The hidden layer outputs encode whether the input is in each region. The output layer then combines these hidden representations to produce the XOR decision — a task that is linearly separable in the transformed space. This is the core insight of deep learning: each layer transforms the data into a representation where the next layer's job becomes easier.
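The two-boundary construction described above can be checked by hand. In this minimal sketch the weights are chosen manually rather than learned (an assumption made for clarity): the first hidden unit is off only for (0,0), the second is on only for (1,1), and the output unit fires when the first is on and the second is off.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 is an OR-like unit, column 1 an AND-like unit
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X @ W1 + b1)

# Output unit: fires when the OR-like unit is on and the AND-like unit is off
W2 = np.array([1.0, -1.0])
b2 = -0.5
y_hat = step(H @ W2 + b2)

print(H.tolist())      # [[0, 0], [1, 0], [1, 0], [1, 1]]
print(y_hat.tolist())  # [0, 1, 1, 0]
```

In the hidden space the four inputs collapse onto three points, and the single line h₁ − h₂ = 0.5 separates the XOR classes.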

XOR Problem — why a single layer perceptron is fundamentally limited

The XOR problem is not just a failure of the perceptron — it is a proof that any single linear classifier has a fundamental expressiveness limit. The solution is not a better linear classifier. The solution is composition: learn intermediate nonlinear representations, then combine them.

⚠ Common Pitfall — Linear Stacking

Even a deep stack of linear layers with no activation functions cannot solve XOR. A stack of linear transformations is itself a single linear transformation: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Non-linear activation functions between layers are the essential ingredient — without them, depth buys you nothing.

Multi-Layer Perceptron (MLP) In-depth

The Multi-Layer Perceptron (MLP) — also called a feedforward network or fully connected network — adds one or more hidden layers between the input and output. Every neuron in each layer connects to every neuron in the next layer. The key architectural decision is that a non-linear activation function is applied after each layer's weighted sum, breaking the linear chain that would otherwise collapse the whole network into a single affine transformation.

Each hidden layer acts as a feature detector. The first hidden layer learns combinations of the raw inputs — in image recognition, this might correspond to edges or local contrasts. The second hidden layer learns combinations of those features — perhaps corners and junctions. Deeper layers learn increasingly abstract concepts. This hierarchical representation learning is why depth is valuable: rather than memorising the training set, the network learns what to look for.

Architecture notation is typically given as a list of layer sizes. [4, 8, 8, 3] means 4 inputs, two hidden layers of 8 neurons, and 3 outputs. The total parameter count: for each layer, (inputs to that layer) × (neurons in that layer) weights plus one bias per neuron. A [4, 5, 4, 3] network has (4×5+5) + (5×4+4) + (4×3+3) = 25 + 24 + 15 = 64 parameters.
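The counting rule is easy to mechanise. The helper below is a small sketch (the name `count_parameters` is illustrative) that takes the list-of-layer-sizes notation used above and assumes every layer is fully connected with one bias per neuron.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network given as a list of layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

print(count_parameters([4, 5, 4, 3]))         # 64
print(count_parameters([4, 8, 8, 3]))         # 139
print(count_parameters([784, 256, 256, 10]))  # 269322
```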

MLP [4→5→4→3] — All connections, one highlighted forward path

⚠ Common Pitfall — Forgetting Activation Functions

The most common beginner mistake when building an MLP is stacking nn.Linear layers without activation functions between them. Without non-linearity, no matter how many layers you add, the network can only learn linear decision boundaries. Always add nn.ReLU() between every pair of linear layers.

The Forward Pass In-depth

The forward pass is the computation that transforms an input vector into a prediction by passing it sequentially through each layer. Understanding the forward pass precisely — including the shapes of every matrix and vector — is essential for debugging, designing architectures, and reasoning about computational cost.

For each layer l, the computation has two steps. First, compute the pre-activation Z by multiplying the previous layer's output A by the weight matrix W and adding a bias b. Second, apply the activation function f element-wise to Z to produce the output A of this layer. The output of the final layer is the network's prediction.

Layer l — Forward Pass
Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾
A⁽ˡ⁾ = f⁽ˡ⁾(Z⁽ˡ⁾)
A⁽ˡ⁻¹⁾ = output of previous layer (or input X for l=1) · W⁽ˡ⁾ = weight matrix [in × out]
b⁽ˡ⁾ = bias vector [out] · f⁽ˡ⁾ = activation function (ReLU, Softmax, etc.)

Numerical Forward Pass Example — 2-layer MLP

Input: x = [2.0, 3.0]

W₁ = [[0.5, −0.3], [0.1, 0.8]], b₁ = [0.1, −0.2]

Z₁ = x · W₁ + b₁

= [2.0×0.5 + 3.0×0.1 + 0.1, 2.0×(−0.3) + 3.0×0.8 + (−0.2)]

= [1.0 + 0.3 + 0.1, −0.6 + 2.4 − 0.2]

Z₁ = [1.4, 1.6]

A₁ = ReLU(Z₁) = [max(0,1.4), max(0,1.6)]

A₁ = [1.4, 1.6] (both positive, ReLU passes through)

W₂ = [[0.4, 0.6], [−0.2, 0.3]], b₂ = [0.0, 0.1]

Z₂ = A₁ · W₂ + b₂

= [1.4×0.4 + 1.6×(−0.2) + 0.0, 1.4×0.6 + 1.6×0.3 + 0.1]

= [0.56 − 0.32, 0.84 + 0.48 + 0.1]

Z₂ = [0.24, 1.42]

A₂ = Softmax([0.24, 1.42]) = exp([0.24,1.42]) / 5.408

A₂ = [0.235, 0.765] → 76.5% probability class 2, 23.5% class 1
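The hand calculation above can be reproduced in a few lines of NumPy. This is only a verification sketch using the same numbers and the row-vector convention Z = A · W + b from the worked example.

```python
import numpy as np

x = np.array([2.0, 3.0])
W1 = np.array([[0.5, -0.3], [0.1, 0.8]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[0.4, 0.6], [-0.2, 0.3]])
b2 = np.array([0.0, 0.1])

z1 = x @ W1 + b1                 # pre-activation of layer 1
a1 = np.maximum(0.0, z1)         # ReLU
z2 = a1 @ W2 + b2                # pre-activation of layer 2 (logits)
exp_z2 = np.exp(z2 - z2.max())   # softmax with the max subtracted for stability
a2 = exp_z2 / exp_z2.sum()

print(np.round(z1, 2))  # [1.4 1.6]
print(np.round(z2, 2))  # [0.24 1.42]
print(np.round(a2, 3))  # [0.235 0.765]
```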

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # W1: (input_dim x hidden_dim)
            nn.ReLU(),                          # non-linearity
            nn.Linear(hidden_dim, hidden_dim),  # W2: (hidden_dim x hidden_dim)
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)   # W3: (hidden_dim x output_dim)
        )

    def forward(self, x):
        return self.net(x)  # sequential forward pass

model = MLP(input_dim=784, hidden_dim=256, output_dim=10)
x = torch.randn(32, 784)  # batch of 32 MNIST-style images
output = model(x)         # shape: (32, 10) -- 10 class logits
print(f"Output shape: {output.shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output shape: torch.Size([32, 10])
# Parameters: 269,322

⚠ Common Pitfall — Shape Mismatches

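The most common forward pass error is a matrix dimension mismatch. If layer l has input dimension d and output dimension k, then W has shape [d × k] and the output has shape [batch × k]. A common mistake is transposing weight matrices incorrectly or confusing input/output sizes when building layers manually without nn.Linear. Note that nn.Linear itself stores its weight as [k × d] (out_features × in_features) and applies the transpose internally, which is a frequent source of confusion when copying weights by hand.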
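A quick way to build intuition for these shapes is to run one layer by hand and deliberately trigger the mismatch. The sizes below (batch of 32, d = 784, k = 256) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d, k = 32, 784, 256
A_prev = rng.normal(size=(batch, d))  # output of the previous layer
W = rng.normal(size=(d, k)) * 0.01    # weight matrix in the [d x k] convention
b = np.zeros(k)

Z = A_prev @ W + b  # (32, 784) @ (784, 256) -> (32, 256); b broadcasts over the batch
print(Z.shape)      # (32, 256)

try:
    A_prev @ W.T    # (32, 784) @ (256, 784): inner dimensions disagree
except ValueError as err:
    print("shape error:", err)
```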

Universal Approximation Theorem Core

In 1989, George Cybenko proved a remarkable result about feedforward networks: a network with a single hidden layer using sigmoid-like activation functions, and sufficiently many neurons, can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. Hornik (1991) extended this to any non-constant, bounded, continuous activation function. This result, the Universal Approximation Theorem, gives MLPs their theoretical power: they are, in principle, general-purpose function approximators.

The intuition is geometric. A single neuron with a step function carves out a half-space. With many neurons, you can approximate arbitrary regions. With smooth activations, you build up smooth functions by summing many "bumps". As you add more neurons, the approximation gets finer — the diagram below shows a coarse 2-neuron step-wise approximation converging to the true function as width increases to 32.

The theorem has important caveats practitioners often overlook. It says a solution exists; it says nothing about whether gradient descent will find it, how many samples are needed, or whether the required network is computationally feasible. In practice, a single very wide layer may require exponentially more neurons than a deeper network to represent the same function. This is the practical motivation for depth: for many function classes, depth allows a far more compact representation than width alone. In CNNs, this manifests as hierarchical feature detection: edges in layer 1, textures in layer 2, object parts in layer 3.
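The "more neurons, finer approximation" intuition can be made concrete without any training. The sketch below is an illustration under stated assumptions, not the construction used in Cybenko's proof: it uses ReLU units instead of sigmoids, and it sets the weights by hand so that a one-hidden-layer network interpolates sin(x) at evenly spaced knots. The function name `relu_network_fit` is hypothetical.

```python
import numpy as np

def relu_network_fit(f, a, b, n_units):
    """One-hidden-layer ReLU network that interpolates f at n_units + 1 evenly spaced knots."""
    knots = np.linspace(a, b, n_units + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)  # slope of each linear segment
    coeffs = np.diff(slopes, prepend=0.0)      # change of slope at each left knot

    def network(x):
        hidden = np.maximum(0.0, x[:, None] - knots[:-1][None, :])  # ReLU(x - knot_i)
        return values[0] + hidden @ coeffs                          # linear output layer

    return network

xs = np.linspace(0.0, 2.0 * np.pi, 1001)
errors = {}
for width in (2, 8, 32):
    net = relu_network_fit(np.sin, 0.0, 2.0 * np.pi, width)
    errors[width] = np.max(np.abs(net(xs) - np.sin(xs)))
    print(f"width={width:2d}  max error={errors[width]:.4f}")
```

The maximum error shrinks as the hidden layer widens, which is the existence claim of the theorem in miniature. It says nothing about whether gradient descent would find these weights.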

Universal Approximation — MLPs approximate any function given sufficient width

The Universal Approximation Theorem says any continuous function on a compact domain can be approximated by a wide enough network. It does not say gradient descent will find that network. Expressiveness and learnability are different things. This is why depth, regularisation, and data quantity matter in practice.

Network Anatomy & Hyperparameters Core

Understanding the key design choices of an MLP — and the consequences of setting them poorly — is essential for practitioners. The table below summarises the principal hyperparameters, their typical ranges, and the effects of extreme values. Each will be explored in depth in subsequent chapters; develop an intuition for the tradeoffs now.

Hyperparameter	Definition	Typical Range	Effect if Too Small	Effect if Too Large
Number of layers (depth)	How many hidden layers	2–50+ (hundreds with residual connections)	Underfits; limited representational power	Vanishing gradients; harder to train without tricks
Width (neurons per layer)	Nodes per hidden layer	64–4096	Underfits; insufficient capacity	Memory intensive; increased overfitting risk
Activation function	Non-linearity applied between layers	ReLU, GELU, Tanh, Sigmoid	—	— (see Chapter 4.2)
Batch size	Samples per gradient update	16–2048	Noisy gradients; slow wall-clock time	Sharp minima; poor generalisation to test set
Learning rate	Gradient step size (α)	1e-4 to 1e-2	Very slow convergence; appears stuck	Divergence; NaN loss; oscillating training
Dropout rate	Fraction of neurons randomly zeroed each step	0.1–0.5	No regularisation; model memorises training data	Too much information loss; underfitting

🔢

Parameter Count Formula

For each layer with dᵢₙ inputs and dₒᵤₜ outputs:

Weights: dᵢₙ × dₒᵤₜ
Biases: dₒᵤₜ
Total: ∑ (dᵢₙ × dₒᵤₜ + dₒᵤₜ)

🧱

Modern Architecture Scale

Reference points for context:

MNIST MLP: ~0.5M params
ResNet-50: ~25M params
GPT-2: ~117M params (smallest variant) to ~1.5B (largest variant)
GPT-4 (unofficial estimate): ~1.8T params

🎯

Where MLPs Appear Today

MLPs are fundamental building blocks:

Feed-forward layers in Transformers
Classification heads in CNNs
Value/policy networks in RL
Embedding projections

∑ Chapter 4.1 Summary — Neural Networks: Perceptron to MLP

Biological inspiration: dendrites (inputs) → soma (weighted sum + threshold) → axon (output); artificial neurons abstract this as inputs → weighted sum → activation function
Perceptron: ŷ = step(w·x + b) — learns linear decision boundaries only; update rule wᵢ ← wᵢ + α(y−ŷ)xᵢ; converges if and only if data is linearly separable
XOR problem (Minsky & Papert, 1969): a single-layer network cannot solve non-linearly separable problems; this result is widely credited with contributing to the first AI winter, which lasted roughly 15 years
Solution: add hidden layers with non-linear activation functions — each layer learns intermediate representations; without non-linearity, stacked linear layers collapse to a single linear layer
Forward pass: Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾, then A⁽ˡ⁾ = f(Z⁽ˡ⁾) — repeated layer by layer from input to output
Universal Approximation Theorem: an MLP with sufficient width can approximate any continuous function on a compact domain, but expressiveness ≠ learnability; depth can make the representation exponentially more compact
Parameter count = ∑ (layer_in × layer_out + layer_out) — even small networks have thousands of trainable parameters; modern models have billions

4.2

Chapter 4.2

Activation Functions

Without non-linear activation functions, stacking any number of linear layers produces exactly one linear transformation. It is the activation function — applied element-wise after every layer — that gives neural networks their ability to learn arbitrarily complex mappings. Choosing the right activation is one of the most consequential architectural decisions you will make.

Why Non-Linearity Matters Core

Consider two linear layers stacked directly: Z = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). The result is another linear function with a combined weight matrix W = W₂W₁ and a combined bias. No matter how many linear layers you stack, the composition remains a single affine (linear + shift) transformation. This means a deep linear network is no more expressive than a logistic regression — it can only learn straight hyperplane decision boundaries.
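The collapse is easy to confirm numerically. This short check uses random matrices (the shapes 3 → 4 → 2 are arbitrary choices for illustration) and the column-vector convention Wx + b used in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: 4 -> 2
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine map
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```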

An activation function f applied between layers breaks this collapse: A₂ = W₂ · f(W₁x + b₁) + b₂. Now the result is genuinely non-linear and the two layers are no longer collapsible into one. A good activation function must satisfy three practical requirements: it must be non-linear (obviously), differentiable almost everywhere (so gradients can flow during backpropagation), and computationally cheap (it is applied millions of times per forward pass).

Non-linearity is essential — linear layers without activation collapse into one

Sigmoid In-depth

The sigmoid function was the default activation in neural networks throughout the 1980s and 1990s. It takes any real-valued input and squashes it into the range (0, 1), which made it a natural fit for modelling probabilities. The S-shaped curve rises steeply near z = 0 and flattens toward 0 for very negative inputs and toward 1 for very positive inputs. This flattening is the source of its central problem in deep networks.

The derivative of σ(z) is elegantly expressed as σ(z)(1 − σ(z)). This has a maximum of 0.25 at z = 0, and falls to near-zero as |z| grows large. In a deep network, gradients are multiplied together as they propagate backward through layers. If most neurons saturate (i.e., z is large in magnitude), each multiplication by a derivative near 0 shrinks the gradient exponentially. This is the vanishing gradient problem. Even in the best case, where every neuron sits at z = 0, a network with 10 sigmoid layers scales the gradient by 0.25¹⁰ ≈ 0.000001 from the activation derivatives alone before it reaches the first layer.
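The numbers in this argument can be checked directly. The sketch below only evaluates the derivative formula; it ignores the weight matrices that also multiply the gradient in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the maximum
print(sigmoid_grad(5.0))  # about 0.0066: already saturating

best_case = sigmoid_grad(0.0) ** 10  # ten layers, every neuron at its steepest point
print(best_case)                     # about 9.5e-07
```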

A second, subtler problem is that sigmoid outputs are never negative: they are always in (0, 1). Because every input to the next layer is positive, the gradients for all of a neuron's incoming weights share the same sign (all positive or all negative), which causes zig-zag updates in weight space. Tanh, which we examine next, solves this by being zero-centred. Today, sigmoid is used almost exclusively at the output layer of binary classifiers (where you genuinely want a probability) and in the gating mechanisms of LSTMs.

Sigmoid
σ(z) = 1 / (1 + e^(−z))
Output range: (0, 1) · σ'(z) = σ(z)(1 − σ(z)) · max derivative = 0.25 at z = 0

Sigmoid: S-curve output (0,1) with near-zero gradient in saturation zones

⚠ Common Pitfall — Sigmoid in Hidden Layers

Using sigmoid as the activation for hidden layers in deep networks almost always causes the vanishing gradient problem. If your training loss stops decreasing very early and the gradients of the first layers are near zero when inspected, this is the most likely cause. Switch to ReLU or GELU for all hidden layers; reserve sigmoid for binary classification output only.

Tanh Core

Tanh (hyperbolic tangent) is a scaled and shifted version of sigmoid: tanh(z) = 2σ(2z) − 1. It squashes inputs to (−1, +1) instead of (0, 1). The critical improvement over sigmoid is that tanh is zero-centred — its outputs are balanced around zero. When the activation outputs are always positive (as in sigmoid), the gradient updates to all weights in the next layer always have the same sign. This forces the optimiser into a zig-zag path through weight space. Tanh's zero-centred outputs allow positive and negative gradients, enabling more direct paths toward the minimum.

Tanh still suffers from the vanishing gradient problem for large |z|, where the derivative tanh'(z) = 1 − tanh²(z) approaches zero. The maximum derivative is 1.0 at z = 0, four times larger than sigmoid's maximum of 0.25, which makes it somewhat less prone to gradient collapse. Tanh remains the standard activation for the candidate state and output in LSTM and GRU cells, where its zero-centred outputs help regulate state updates; the gates themselves use sigmoid.

Tanh
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)) = 2σ(2z) − 1
Output range: (−1, +1) · tanh'(z) = 1 − tanh²(z) · max derivative = 1.0 at z = 0
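The identity tanh(z) = 2σ(2z) − 1 and the derivative formula can be spot-checked with the standard library. The sample points are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

max_gap = 0.0
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    via_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0  # tanh written in terms of sigmoid
    grad = 1.0 - math.tanh(z) ** 2              # tanh'(z)
    max_gap = max(max_gap, abs(math.tanh(z) - via_sigmoid))
    print(f"z={z:+.1f}  tanh={math.tanh(z):+.4f}  2*sigmoid(2z)-1={via_sigmoid:+.4f}  tanh'={grad:.4f}")

print(max_gap < 1e-12)  # True: the two expressions agree to rounding error
```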

Tanh vs Sigmoid — zero-centered tanh has better gradient flow

ReLU Family In-depth

The Rectified Linear Unit (ReLU) is disarmingly simple: pass the input through unchanged if it is positive, otherwise output zero. This single function, popularised at scale by AlexNet in 2012, transformed the field. Before ReLU, training deep networks beyond 5–6 layers was very difficult because of vanishing gradients from sigmoid and tanh. ReLU's constant gradient of 1 for positive inputs means gradients flow freely through activated neurons, which, combined with later techniques such as residual connections and normalisation layers, enabled networks of 50, 100, or even 1000 layers.

ReLU also introduces sparse activation: when pre-activations are roughly symmetric around zero, about half of the neurons output exactly zero for any given input. This sparsity provides implicit regularisation, since only the "relevant" neurons participate in each forward pass. However, this same property creates the Dead ReLU problem: if a neuron's pre-activation is always negative (e.g., because the bias drifts negative during training), its gradient is zero for every example and it can no longer recover through gradient updates. This can kill 10–40% of neurons in poorly initialised or high-learning-rate networks.

Leaky ReLU fixes dead neurons by allowing a small negative slope α (typically 0.01) for z < 0, ensuring the gradient is never exactly zero. ELU (Exponential Linear Unit) goes further with a smooth exponential curve for negative inputs, producing outputs closer to zero-mean — which can improve convergence. PReLU (Parametric ReLU) treats α as a learnable parameter, letting the network decide the optimal negative slope per channel.

ReLU Family Formulas
ReLU: f(z) = max(0, z) · f'(z) = 1 if z > 0 else 0
Leaky ReLU: f(z) = max(αz, z), α = 0.01 · f'(z) = 1 if z > 0 else α
ELU: f(z) = z if z > 0 else α(e^z − 1) · α typically 1.0
α = negative slope coefficient · PReLU: same as Leaky ReLU but α is a learned parameter

ReLU Family — ReLU, Leaky ReLU, ELU, PReLU compared

import torch
import torch.nn as nn

# All ReLU variants in PyTorch
relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)
elu = nn.ELU(alpha=1.0)
prelu = nn.PReLU()  # alpha is a learned parameter

x = torch.randn(4)
print("Input: ", x.tolist())
print("ReLU: ", relu(x).tolist())        # negatives become 0
print("LeakyReLU: ", leaky(x).tolist())  # negatives scaled by 0.01
print("ELU: ", elu(x).tolist())          # negatives → α(e^z - 1)

# Dead ReLU detection: a unit counts as dead if it outputs 0 for every sample in the batch
def count_dead_relu(model, x_sample):
    dead = 0
    total = 0
    hooks = []

    def hook(m, inp, out):
        nonlocal dead, total
        dead += (out == 0).all(dim=0).sum().item()  # units that are zero across the whole batch
        total += out[0].numel()                     # units per sample

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(x_sample)
    for h in hooks:
        h.remove()
    return dead / total  # fraction of units that never fired on x_sample

⚠ Common Pitfall — Dead ReLU

If you use a high learning rate or bad weight initialisation, a large fraction of ReLU neurons can get stuck with permanently negative pre-activations — the "dead ReLU" problem. Gradients through these neurons are exactly zero, so they never recover. Signs: training loss stops improving but there is no NaN; inspecting neuron outputs shows many always-zero activations. Fix: use Leaky ReLU, reduce learning rate, or use proper He initialisation (nn.init.kaiming_normal_).

GELU & Modern Activations Core

The Gaussian Error Linear Unit (GELU) was introduced by Hendrycks and Gimpel (2016) and quickly became the dominant activation in Transformer-based models. The key motivation: ReLU has a hard kink at z = 0, where the derivative jumps discontinuously from 0 to 1. GELU replaces this with a smooth curve by weighting the input by Φ(z), the probability that a standard Gaussian variable falls below z: f(z) = z · Φ(z), where Φ is the Gaussian CDF.

In practice, GELU is often computed via a fast approximation: f(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)]). The smooth transition means GELU has a continuous gradient everywhere, which empirically improves training stability for deep Transformers. GPT-2, GPT-3, BERT, and BART use GELU in their feed-forward sublayers. The original T5 uses ReLU, and several later large language models (for example PaLM and LLaMA) use gated variants such as SwiGLU.

Swish (also called SiLU, Sigmoid Linear Unit) is another smooth variant: f(z) = z · σ(z). The input gates itself: neurons with large positive values pass through at full strength, while negative values are softly suppressed. Swish is used in EfficientNet, in MobileNetV3 (in its piecewise-linear "hard" form), and in several LLM variants. Mish extends this idea: f(z) = z · tanh(softplus(z)), and has achieved state-of-the-art performance on some computer vision benchmarks. All three are smooth and non-monotonic, with a shallow negative dip for moderately negative inputs (a minimum of about −0.17 for GELU near z ≈ −0.75 and about −0.28 for Swish near z ≈ −1.28), so small negative signals are attenuated rather than zeroed outright.

GELU & Swish
GELU: f(z) = z · Φ(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])
Swish/SiLU: f(z) = z · σ(z) = z / (1 + e^(−z))
Φ = Gaussian CDF · σ = sigmoid · both are smooth and differentiable everywhere
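To see how close the tanh approximation is to the exact definition, and where the negative dip of each function bottoms out, the formulas above can be evaluated with the standard library. This is a numerical sketch on a fixed grid, not a proof; the function names are illustrative.

```python
import math

def gelu_exact(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))  # z * Phi(z)

def gelu_tanh(z):
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def swish(z):
    return z / (1.0 + math.exp(-z))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  gelu={gelu_exact(z):+.4f}  tanh approx={gelu_tanh(z):+.4f}  swish={swish(z):+.4f}")

grid = [i / 100.0 for i in range(-500, 501)]  # z from -5 to 5 in steps of 0.01
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
min_gelu = min(gelu_exact(z) for z in grid)
min_swish = min(swish(z) for z in grid)

print(f"largest approximation gap on the grid: {max_gap:.5f}")
print(round(min_gelu, 2), round(min_swish, 2))  # -0.17 -0.28
```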

GELU vs ReLU — smooth activation improves Transformer training

GELU is to Transformers what ReLU is to CNNs: a well-tested default. Its smooth gradient makes permanently dead neurons far less likely and supports stable training at great depth. If you are building any Transformer-based model (language, vision, or multimodal), GELU is a sensible starting point.

Softmax In-depth

Softmax is not an activation function in the same sense as ReLU or GELU — it is not applied element-wise independently to each neuron. Instead, it is a normalisation operation over an entire output vector, converting a vector of raw logits (unbounded real numbers) into a valid probability distribution. Every output is positive, and all outputs sum to exactly 1.0, making the output directly interpretable as class probabilities.

Softmax amplifies differences between logits. The largest logit receives a disproportionately high probability because exponentiation turns additive differences into multiplicative ones before normalisation. With logits [3.0, 1.0, 0.5], the first class receives about 82% of the probability mass. Subtracting the maximum logit before computing exponentials (the max-trick) prevents numerical overflow without changing the output: softmax(z − max(z)) = softmax(z).

The temperature parameter T controls the sharpness of the distribution: softmax(z/T). With T → 0, the distribution collapses to a one-hot (greedy) selection of the highest logit. With T → ∞, it becomes uniform. In LLM token sampling, temperature is a key knob: T = 0.7 gives creative but coherent text; T = 1.5 gives more random, diverse outputs. At training time T = 1 is almost always used.

Softmax
p(k) = e^(z_k) / Σⱼ e^(z_j) for k = 1, …, K
With temperature T: p(k) = e^(z_k/T) / Σⱼ e^(z_j/T)
z = logit vector (raw network output) · outputs ∈ (0,1) · Σ p(k) = 1.0 exactly
Numerical stability: compute softmax(z − max(z)) to avoid overflow in e^(large)
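The formula, the max-trick and the temperature parameter fit in one small function. This is a minimal NumPy sketch; the function name and the `temperature` argument are illustrative, not a library API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # max-trick: same result, no overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [3.0, 1.0, 0.5]
print(np.round(softmax(logits), 3))                   # [0.821 0.111 0.067]
print(np.round(softmax(logits, temperature=0.1), 3))  # [1. 0. 0.]  (nearly one-hot)
print(np.round(softmax(logits, temperature=2.0), 3))  # flatter than T = 1
print(softmax([1000.0, 999.0]))                       # finite values despite huge logits
```

Without the subtraction, np.exp(1000.0) would overflow to infinity and the division would produce NaN.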

Softmax Temperature — T=0.1 (greedy) vs T=1.0 (standard) vs T=2.0 (diverse)

In PyTorch, never apply softmax before passing logits to nn.CrossEntropyLoss: this loss already applies log-softmax internally for numerical stability. Applying softmax beforehand makes the loss compute log_softmax(softmax(logits)), which squashes the differences between classes and weakens the gradients. Always pass raw logits to CrossEntropyLoss.

⚠ Common Pitfall — Softmax + CrossEntropyLoss Double Application

A very common bug: applying torch.softmax(logits) in the model's forward method, then passing the result to nn.CrossEntropyLoss. Since CrossEntropyLoss internally calls log_softmax, you end up computing log_softmax(softmax(logits)) instead of log_softmax(logits). The code still runs, but the doubly normalised distribution is much flatter than intended, so the loss cannot approach zero and the gradients are too small. Always output raw logits from the model.
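The effect of the double application is easy to observe on a single confident prediction. The logits below are made up for illustration; the sketch assumes a recent PyTorch with the default mean reduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[4.0, 0.0, -2.0]])  # one sample, three classes
target = torch.tensor([0])                 # index of the true class
loss_fn = nn.CrossEntropyLoss()

correct = loss_fn(logits, target)                       # raw logits in
double = loss_fn(torch.softmax(logits, dim=1), target)  # bug: softmax applied twice

print(f"raw logits:    {correct.item():.4f}")
print(f"softmax first: {double.item():.4f}")

# CrossEntropyLoss on raw logits equals the negative log-softmax of the true class
manual = -F.log_softmax(logits, dim=1)[0, 0]
print(torch.isclose(correct, manual))  # tensor(True)
```

Because softmax outputs lie in (0, 1), the buggy loss for three classes can never drop below log(e + 2) − 1 ≈ 0.55, no matter how confident the model becomes.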

Choosing an Activation Function Core

The choice of activation function is one of the most important and most misunderstood hyperparameters. The practical rule is simple: use ReLU as your baseline for CNNs and general MLPs, switch to GELU for anything Transformer-based, use Sigmoid only at binary classification output, and use Softmax only at multi-class output. The table below summarises when and why to use each.

Activation	Range	Vanishing Gradient	Zero-Centred	Where Used	Default Choice?
Sigmoid	(0, 1)	Yes — severe	No	Binary output, LSTM gates	Only for binary output
Tanh	(−1, 1)	Yes — moderate	Yes	LSTM/GRU candidate state and output, RNNs	Legacy RNNs only
ReLU	[0, ∞)	No (positive)	No	CNNs, MLPs (pre-2018)	✓ CNNs still
Leaky ReLU	(−∞, ∞)	No	Near	When dead neurons are a problem	Good fallback
GELU	(−0.17, ∞)	No	Near	GPT, BERT, Transformers	✓ Transformers
Swish/SiLU	(−0.28, ∞)	No	Near	EfficientNet, some LLMs	✓ Modern CNNs
Softmax	(0, 1), Σ=1	—	No	Multi-class output layer only	Only for output

∑ Chapter 4.2 Summary — Activation Functions

Without non-linear activations, stacking layers = still a single linear transformation — depth buys nothing expressively
Sigmoid σ(z) = 1/(1+e^−z): saturates → vanishing gradients; not zero-centred — use only for binary classification output
Tanh: same saturation problem but zero-centred, so gradient flow is better; still used for the candidate state and output in LSTM/GRU cells
ReLU: max(0,z) — fast, no saturation for positive inputs, default for CNNs; suffers from Dead ReLU (permanently zero neurons)
Leaky ReLU fixes dead neurons with small negative slope α — good fallback when ReLU causes training issues
GELU: smooth ReLU variant f(z) = z·Φ(z), used in GPT, BERT, and many modern Transformers; smooth gradient everywhere
Softmax: multi-class output only — temperature T controls sharpness of probability distribution; never apply before CrossEntropyLoss

4.3

Chapter 4.3

Backpropagation & Gradient Flow

Backpropagation is not magic: it is the chain rule of calculus applied systematically to a computational graph. The genius is not the mathematics (which dates to Leibniz) but the engineering insight that all gradients in a network can be computed in a single backward pass, at a cost comparable to one forward pass. Without this, deep learning would be computationally impossible.

Intuition Core

The central question of training is: "For every weight in the network, how much does the loss change if I nudge that weight by a tiny amount?" This quantity — the partial derivative of the loss with respect to each weight — is the gradient. To reduce the loss, we move each weight in the direction opposite to its gradient.

The naive approach is finite differences: for each weight w, compute (loss(w + ε) − loss(w)) / ε. This gives an approximate gradient for that weight. The problem is scale. GPT-4 has an estimated 1.8 trillion parameters (an unofficial figure). Computing one gradient update this way requires about 1.8 trillion forward passes; at, say, 1 second per pass on a cluster, that is roughly 57,000 years per update step. Completely impossible.

Backpropagation solves this by computing all gradients simultaneously in a single backward pass through the computational graph. The backward pass costs only a small constant multiple of the forward pass (roughly twice as much in practice), because it visits the same operations in reverse. The key ingredient is the chain rule, which tells us how to compose local gradients as they flow backward from the loss to the inputs.

Finite differences: O(W) forward passes for W weights. Backpropagation: ONE backward pass for all W weights simultaneously. This efficiency gap — many orders of magnitude — is what makes modern deep learning possible.
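The cost gap can be seen on a toy problem. The following is a minimal sketch in plain Python, assuming a three-weight linear model with a squared-error loss; the function names (`loss_fn`, `finite_difference_grad`, `analytic_grad`) are illustrative and not part of any library.

```python
def loss_fn(weights, inputs, target):
    """Squared error of a linear model: L = (sum_i w_i * x_i - y)^2."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    return (prediction - target) ** 2


def finite_difference_grad(weights, inputs, target, eps=1e-6):
    """Approximate dL/dw_i one weight at a time: one extra forward pass per weight."""
    base = loss_fn(weights, inputs, target)
    grads, forward_passes = [], 1
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        grads.append((loss_fn(nudged, inputs, target) - base) / eps)
        forward_passes += 1
    return grads, forward_passes


def analytic_grad(weights, inputs, target):
    """Chain rule: dL/dw_i = 2 * (prediction - y) * x_i, all weights from one pass."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    upstream = 2.0 * (prediction - target)   # dL/dprediction, computed once and reused
    return [upstream * x for x in inputs]


weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
target = 1.0

fd_grads, passes = finite_difference_grad(weights, inputs, target)
exact_grads = analytic_grad(weights, inputs, target)
print("finite differences:", fd_grads, "using", passes, "forward passes")
print("chain rule:        ", exact_grads)
```

The finite-difference loop needs one forward pass per weight (plus one for the baseline), while the chain-rule version reuses a single upstream value for every weight. That reuse is the idea backpropagation scales up to whole networks.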

Computational Graph In-depth

Every computation a neural network performs can be represented as a directed acyclic graph (DAG). Each node in the graph is a mathematical operation — addition, multiplication, exp, sigmoid, max. Each directed edge carries a tensor value from one operation to the next. The leaf nodes on the left are the inputs and weights; the single root node on the right is the scalar loss.

The forward pass is data flowing left to right through this graph — compute z₁ = w × x, then z₂ = z₁ − y, then L = z₂². Each intermediate value is stored (this is why training uses more memory than inference). The backward pass is gradients flowing right to left — starting with ∂L/∂L = 1 and applying the chain rule at each node. Every node knows how to compute its local gradient (e.g., the gradient through a multiplication node is the other operand), and backprop just multiplies local gradients together along each path.

PyTorch builds this graph dynamically as you execute Python code — every tensor operation with requires_grad=True records itself into the graph. When you call loss.backward(), PyTorch traverses the graph in reverse topological order and accumulates gradients into each leaf tensor's .grad attribute. JAX uses a slightly different approach (function transformation) but the computational graph concept is identical.

Computational Graph — forward values and backward gradients for L = (w·x − y)²
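The graph for L = (w·x − y)² is small enough to walk through by hand. The sketch below uses arbitrary example values (w = 3, x = 2, y = 4, chosen only for illustration) and mirrors what an autograd engine does: store intermediates on the way forward, then multiply local gradients on the way back.

```python
# Forward pass: store every intermediate value, as an autograd engine would.
w, x, y = 3.0, 2.0, 4.0
z1 = w * x          # multiply node
z2 = z1 - y         # subtract node
L = z2 ** 2         # square node (scalar loss)

# Backward pass: start from dL/dL = 1 and multiply by each local gradient.
dL_dL = 1.0
dL_dz2 = dL_dL * 2.0 * z2     # local gradient of the square node is 2 * z2
dL_dz1 = dL_dz2 * 1.0         # subtraction passes the gradient to z1 unchanged
dL_dw = dL_dz1 * x            # multiply node: local gradient is the other operand
dL_dx = dL_dz1 * w

print(f"L = {L}, dL/dw = {dL_dw}, dL/dx = {dL_dx}")
```

Note that `dL_dz1` is computed once and reused for both `dL_dw` and `dL_dx`; sharing upstream gradients like this is what keeps the backward pass cheap.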

Chain Rule in Neural Networks In-depth

The chain rule is calculus's rule for differentiating composed functions: if L = f(g(x)), then dL/dx = (dL/dg) · (dg/dx). In a neural network, every layer is a composed function. The loss is a composition of all the layer operations stacked together. Backprop is simply the chain rule applied methodically in reverse order through every layer.

For a single layer l with pre-activation Z⁽ˡ⁾ = W⁽ˡ⁾ · A⁽ˡ⁻¹⁾ + b⁽ˡ⁾ (column-vector convention, matching the key equations of this section) and output A⁽ˡ⁾ = f(Z⁽ˡ⁾), the gradient of the loss with respect to the weights W⁽ˡ⁾ decomposes into three factors by the chain rule: how the loss changes with the activation, how the activation changes with the pre-activation (the derivative of the activation function), and how the pre-activation changes with the weights (which is simply A⁽ˡ⁻¹⁾). Multiplied together, these give the weight gradient for that layer.

The error signal δ⁽ˡ⁾ is the gradient of the loss with respect to the pre-activation Z⁽ˡ⁾. It packages the chain rule product up to layer l. To propagate backward one more layer, we multiply δ⁽ˡ⁾ by the weight matrix W⁽ˡ⁾ transposed (to "route" gradients back to the correct inputs), then element-wise multiply by f'(Z⁽ˡ⁻¹⁾) — the local derivative of the activation. This recursion continues all the way to the first layer.

Backpropagation: Key Equations
Weight gradient: ∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ
Bias gradient: ∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾ (sum over batch)
Error signal: δ⁽ˡ⁾ = ((W⁽ˡ⁺¹⁾)ᵀ · δ⁽ˡ⁺¹⁾) ⊙ f'(Z⁽ˡ⁾)
δ⁽ˡ⁾ = error signal at layer l · ⊙ = element-wise multiply · f' = activation derivative
Full chain: ∂L/∂W⁽ˡ⁾ = ∂L/∂A⁽ˡ⁾ · ∂A⁽ˡ⁾/∂Z⁽ˡ⁾ · ∂Z⁽ˡ⁾/∂W⁽ˡ⁾
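The error-signal recursion can be checked numerically on a network small enough to write out by hand. This is a minimal sketch under stated assumptions: a 2-3-1 network in the column-vector convention (Z = W·A + b), a tanh hidden layer, a linear output unit, and the loss L = ½(z₂ − y)². The weight values and helper names (`forward`, `backward`) are made up for illustration.

```python
import math


def forward(W1, b1, W2, b2, x, y):
    """Two-layer network, column-vector convention: Z = W·A + b."""
    z1 = [sum(W1[i][j] * x[j] for j in range(len(x))) + b1[i] for i in range(len(W1))]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(W2[i] * a1[i] for i in range(len(a1))) + b2   # single linear output unit
    loss = 0.5 * (z2 - y) ** 2
    return z1, a1, z2, loss


def backward(W2, x, y, a1, z2):
    """Apply the delta recursion from the output layer back to layer 1."""
    delta2 = z2 - y                                  # dL/dz2 (identity output activation)
    dW2 = [delta2 * a for a in a1]                   # delta2 · (a1)^T
    # delta1 = (W2^T · delta2) ⊙ f'(z1), using tanh'(z) = 1 - tanh(z)^2
    delta1 = [W2[i] * delta2 * (1.0 - a1[i] ** 2) for i in range(len(a1))]
    dW1 = [[d * xj for xj in x] for d in delta1]     # delta1 · x^T
    return dW1, delta1, dW2, delta2                  # bias gradients equal the deltas


W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [0.3, -0.6, 0.9]
b2 = 0.05
x, y = [1.0, 2.0], 0.5

z1, a1, z2, loss = forward(W1, b1, W2, b2, x, y)
dW1, db1, dW2, db2 = backward(W2, x, y, a1, z2)

# Cross-check one entry of dW1 with a central finite difference.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[2][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[2][1] -= eps
numeric = (forward(W1_plus, b1, W2, b2, x, y)[3]
           - forward(W1_minus, b1, W2, b2, x, y)[3]) / (2 * eps)
print(f"backprop dL/dW1[2][1] = {dW1[2][1]:.6f}, finite difference = {numeric:.6f}")
```

Agreement between the two numbers is the same kind of check that gradient-checking utilities perform before trusting a hand-written backward pass.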

Gradient Flow — backward signals multiply local gradients at each layer

import torch

# Simple 2-layer network: verify PyTorch autograd
x = torch.tensor([[1.0, 2.0]])               # input (1×2)
W1 = torch.randn(2, 3, requires_grad=True)
b1 = torch.randn(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

# Forward pass: PyTorch silently builds the computational graph
z1 = x @ W1 + b1        # (1×3) pre-activation
a1 = torch.relu(z1)     # (1×3) activation
z2 = a1 @ W2 + b2       # (1×1) output logit
loss = z2.sum()         # scalar loss

# Backward pass: ONE call computes ALL gradients
loss.backward()

print(f"dL/dW1 shape: {W1.grad.shape}")   # torch.Size([2, 3]), same as W1
print(f"dL/dW2 shape: {W2.grad.shape}")   # torch.Size([3, 1]), same as W2
print(f"dL/db1 shape: {b1.grad.shape}")   # torch.Size([3]), same as b1

# Check gradient hasn't been accumulated from a previous call
# Always call optimizer.zero_grad() before loss.backward() in training loops!

⚠ Common Pitfall — Gradient Accumulation Bug

PyTorch accumulates (adds) gradients into .grad by default — it does not overwrite them. If you call loss.backward() twice without calling optimizer.zero_grad() in between, the gradients double. The canonical training loop order is always: zero_grad → forward → loss → backward → step. Gradient accumulation over multiple mini-batches is intentional use of this behaviour, but it must be explicit.

Vanishing Gradients In-depth

Gradients propagate backward by multiplication. If the gradient at each layer is a number less than 1, repeated multiplication makes the product shrink exponentially. Sigmoid's maximum derivative is 0.25. In a 10-layer sigmoid network, the gradient arriving at layer 1 has been multiplied by at most 0.25 per layer — giving 0.25¹⁰ ≈ 9.5 × 10⁻⁷, effectively zero. The first layers receive no gradient signal and learn nothing while the last few layers update normally.

This is why networks deeper than 5–6 layers were impractical before 2012. The symptom is clear in training: the loss decreases at first but then plateaus far above the optimum, and inspecting per-layer gradients shows near-zero values in the early layers. The activations in these layers also collapse: for sigmoid, outputs cluster near 0 or near 1, with near-zero variance.

The primary solutions in order of importance: (1) ReLU activations — gradient is exactly 1 for positive inputs, breaking the exponential decay. (2) Residual connections (ResNet, Ch 4.5) — add a "skip" path that carries gradients directly from the loss to early layers, bypassing the layer multiplications entirely. (3) Batch Normalisation (Ch 4.4) — normalises activations to prevent saturation. (4) He initialisation — initialises weights to maintain gradient scale across layers.

Gradient Magnitude — Sigmoid vs ReLU across 10 layers

Sigmoid (max derivative = 0.25):

Layer 10 (near output): gradient ≈ 0.25¹ = 0.250

Layer 8: gradient ≈ 0.25³ = 0.016

Layer 5: gradient ≈ 0.25⁶ = 2.4 × 10⁻⁴

Layer 1 (first layer): gradient ≈ 0.25¹⁰ = 9.5 × 10⁻⁷ ← effectively zero

ReLU (derivative = 1 for active neurons):

Layer 10: gradient ≈ 1.0

Layer 1: gradient ≈ 1.0 ← same order of magnitude → learning in all layers
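The per-layer factors listed above can be reproduced in a few lines. This sketch considers only the activation-derivative factor and ignores the weight matrices (a simplifying assumption), taking the best case for sigmoid where every pre-activation sits at z = 0.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


num_layers = 10
sigmoid_factor = 1.0
relu_factor = 1.0
# Walk backward from the output layer (10) to the first layer (1).
for layer in range(num_layers, 0, -1):
    sigmoid_factor *= sigmoid_derivative(0.0)   # best case for sigmoid
    relu_factor *= relu_derivative(1.0)         # an active ReLU unit
    print(f"layer {layer:2d}: sigmoid path {sigmoid_factor:.3e}, ReLU path {relu_factor:.1f}")
```

In a real network the sigmoid factors are usually smaller than 0.25, because pre-activations rarely sit exactly at zero, so the decay is typically worse than this upper bound.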

Vanishing Gradients — why sigmoid kills deep network training

⚠ Common Pitfall — Diagnosing Vanishing Gradients

The telltale sign: training loss stops improving very early, even with sufficient model capacity and data. To confirm, log the gradient norm per layer: for name, p in model.named_parameters(): print(name, p.grad.norm()). If early-layer norms are 10⁻⁶ or smaller while final-layer norms are ~1.0, you have a vanishing gradient problem. First fix: switch sigmoid → ReLU. Second fix: add residual connections.

Exploding Gradients Core

The opposite pathology occurs when the gradient magnitudes grow exponentially as they propagate backward: if the weight matrices have large singular values, each multiplication amplifies rather than shrinks the gradient. This is especially common in Recurrent Neural Networks (RNNs) processing long sequences: for a 100-step sequence, the gradient at time step 1 is the product of roughly 100 Jacobian matrices, and if each has norm slightly above 1, the product grows exponentially.

The symptom is unmistakable: the loss goes to NaN within the first few training steps, and weights become inf. The standard fix is gradient clipping: compute the global norm of all gradients, and if it exceeds a threshold, scale all gradients down proportionally. This preserves the direction of the gradient update but caps its magnitude. A clipping value of 1.0 is a widely used default.

Gradient Clipping
if ||g|| > clip_value: g ← g × (clip_value / ||g||)
g = concatenation of all gradient tensors as a flat vector · ||g|| = L2 norm
Direction preserved; only magnitude is capped. clip_value = 1.0 is the standard default.
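The clipping rule is short enough to write out directly. The following is a plain-Python sketch of clipping by global norm, with gradients represented as lists of lists; the helper name `clip_by_global_norm` is illustrative, and this is not how any particular library implements it internally.

```python
import math


def clip_by_global_norm(grads, clip_value):
    """Scale a list of gradient vectors so their joint L2 norm is at most clip_value."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= clip_value:
        return grads, global_norm            # already small enough: leave untouched
    scale = clip_value / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm


grads = [[3.0, 4.0], [12.0]]                 # two "parameter tensors"; joint norm = 13
clipped, norm_before = clip_by_global_norm(grads, clip_value=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"norm before: {norm_before}, after: {norm_after:.4f}")
print("clipped:", clipped)
```

Every component is multiplied by the same scale factor, so the ratios between components (and hence the update direction) are unchanged.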

import torch
import torch.nn as nn

# Canonical training loop with gradient clipping.
# `dataloader` and `criterion` are assumed to be defined elsewhere.
model = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x, y in dataloader:
    optimizer.zero_grad()            # 1. clear old gradients
    output, _ = model(x)
    loss = criterion(output, y)
    loss.backward()                  # 2. compute gradients

    # 3. clip before step: prevents exploding-gradient NaN.
    #    clip_grad_norm_ returns the total norm measured BEFORE clipping.
    total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()                 # 4. update weights

    # Monitor the pre-clip norm to detect explosion early
    # (a norm recomputed after clipping could never exceed max_norm)
    print(f"Gradient norm: {total_norm.item():.4f}")   # >10 → suspicious; >100 → exploding

💥

Symptoms of Explosion

Loss jumps to NaN
Weights become inf
Gradient norm > 100
Loss erratic, huge oscillations

🛡️

Solutions

Gradient clipping (max_norm=1.0)
Lower learning rate
LSTM/GRU gating (Ch 4.6)
Layer normalisation

🔍

Monitoring Gradients

Log gradient norm per step
Use WandB/TensorBoard
Check for inf/NaN in params
Early layers vs late layers

∑ Chapter 4.3 Summary — Backpropagation & Gradient Flow

Backprop answers: how does the loss change w.r.t. every single weight? It does so in one backward pass, at a cost comparable to a forward pass
Computational graph: every operation is a node; forward pass computes values; gradients flow backward through edges via the chain rule
Chain rule: ∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ — upstream error signal × input activations transposed
Vanishing: sigmoid derivatives multiply to near-zero in deep networks (0.25¹⁰ ≈ 10⁻⁶) → ReLU, residual connections, BatchNorm address this
Exploding: large weight matrices multiply gradients to NaN loss → clip_grad_norm_(max_norm=1.0) is the standard fix
PyTorch autograd: dynamic computation graph — .backward() computes all gradients; always call zero_grad() before each backward pass

4.4

Chapter 4.4

Training Deep Networks

A neural network architecture is only half the story. The other half is the engineering that makes it trainable: how weights are initialised, how activations are kept stable, how overfitting is controlled, and how the optimiser navigates the loss landscape. These techniques are what separate a network that converges from one that never learns at all.

Weight Initialisation In-depth

Before a single training example is shown, every weight must be given a starting value. This choice has enormous consequences. If all weights start at zero, every neuron in a layer computes exactly the same function and receives exactly the same gradient — no matter how many epochs you train, all neurons in a layer remain identical forever. This is the symmetry breaking problem: weights must differ to learn different features.

Initialising with random values breaks symmetry, but the variance of those values is critical. If weights are too small, activations shrink exponentially with depth — by layer 10, inputs have collapsed to near zero and there is no gradient signal. If weights are too large, activations explode exponentially — inputs saturate sigmoid/tanh and gradients vanish, or the network numerically overflows. The goal is to choose a variance that keeps activation magnitudes approximately stable across all layers.

Xavier/Glorot initialisation (Glorot & Bengio, 2010) derives the optimal variance analytically for linear activations and symmetric non-linearities like Tanh. It sets the weight variance to 2/(nᵢₙ + nₒᵤₜ), balancing the signal variance across both forward and backward passes. He/Kaiming initialisation (He et al., 2015) adjusts for the fact that ReLU kills half of all activations (setting them to zero), which halves the effective variance. He init compensates by scaling up by √2, using variance 2/nᵢₙ. For any ReLU-based network, He initialisation is the correct default.

Weight Initialisation Formulas
Xavier (Glorot): W ~ N(0, 2/(nᵢₙ+nₒᵤₜ)) ← for Tanh / Sigmoid
He (Kaiming): W ~ N(0, 2/nᵢₙ) ← for ReLU (most common)
LeCun: W ~ N(0, 1/nᵢₙ) ← for SELU
nᵢₙ = fan-in (inputs to neuron) · nₒᵤₜ = fan-out (outputs from neuron)
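The effect of the weight variance on activation scale can be simulated without any framework. The sketch below pushes one random input through ten Linear+ReLU layers (no biases) and reports the mean squared activation at the end. The width, depth, seed, and the helper name `mean_square_after` are arbitrary choices for illustration, and a single random draw is noisy, so the numbers should be read as orders of magnitude only.

```python
import math
import random


def mean_square_after(num_layers, width, weight_std, rng):
    """Push one random input through `num_layers` Linear+ReLU layers; return mean(a^2)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(num_layers):
        # Each output unit draws its own fresh row of N(0, weight_std^2) weights.
        z = [sum(rng.gauss(0.0, weight_std) * ai for ai in a) for _ in range(width)]
        a = [max(0.0, zi) for zi in z]       # ReLU
    return sum(ai * ai for ai in a) / width


rng = random.Random(0)
width, depth = 128, 10
he_std = math.sqrt(2.0 / width)              # He: variance 2 / fan_in
results = {
    "too small (std=0.01)": mean_square_after(depth, width, 0.01, rng),
    "He (std=sqrt(2/n))": mean_square_after(depth, width, he_std, rng),
    "too large (std=1.0)": mean_square_after(depth, width, 1.0, rng),
}
for name, value in results.items():
    print(f"{name:24s} mean(a^2) after {depth} layers = {value:.3e}")
```

With the He standard deviation, each layer preserves the mean squared activation in expectation, so the signal stays near its starting scale. The other two settings shrink or grow it by a constant factor per layer, which compounds exponentially with depth.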

Weight Initialisation — variance stability across 10 layers

import torch.nn as nn

layer = nn.Linear(256, 128)

# The three calls below each overwrite layer.weight; pick one in practice.

# Xavier uniform: the usual choice for linear/tanh layers
nn.init.xavier_uniform_(layer.weight)

# He / Kaiming: correct for ReLU networks (most common in practice)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# GPT-2 style: small normal, used by many modern Transformers
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# Apply He init to all Linear layers in a model
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Illustrative model; any nn.Module works here
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(init_weights)   # applies recursively to all submodules

⚠ Common Pitfall — Wrong Init for Activation

Using Xavier init with ReLU or He init with Tanh gives suboptimal results. The mismatch causes activation variance to drift across layers. Rule: He/Kaiming for ReLU/LeakyReLU/GELU networks, Xavier/Glorot for Tanh/Sigmoid networks. When in doubt, use He — most modern networks use ReLU-family activations.

Batch Normalisation In-depth

Ioffe and Szegedy (2015) diagnosed a key training instability they called internal covariate shift: as the parameters of layer l change during training, the distribution of inputs seen by layer l+1 shifts. The later layer must constantly readjust to its changing input distribution, slowing convergence. Their solution — Batch Normalisation — normalises each layer's pre-activation values across the mini-batch, forcing the distribution to approximately N(0,1) regardless of what the previous layer learned.

The normalisation has four steps. First, compute the mean μ_B of the current mini-batch. Second, compute its variance σ²_B. Third, subtract the mean and divide by the standard deviation to get x̂, a zero-mean, unit-variance vector. Fourth, and critically, apply a learnable scale γ and shift β: y = γx̂ + β. These learned parameters let the network undo the normalisation if that is optimal; without them, BatchNorm would permanently constrain every layer's activations to N(0,1), which is too restrictive.

At inference time there is no mini-batch, so BatchNorm uses running statistics — exponential moving averages of μ_B and σ²_B accumulated during training — to normalise. This is why you must call model.eval() before inference: it switches BatchNorm from batch statistics to running statistics. Forgetting this is one of the most common and damaging bugs in deep learning practice.

Batch Normalisation: Forward Pass
μ_B = (1/m) Σ xᵢ   (batch mean, m = batch size)
σ²_B = (1/m) Σ (xᵢ−μ_B)²   (batch variance)
x̂ᵢ = (xᵢ−μ_B) / √(σ²_B+ε)   (normalise, ε≈1e-5 for numerical stability)
yᵢ = γ·x̂ᵢ + β   (scale & shift; γ, β are LEARNED)
γ, β initialised to 1 and 0; the network learns the optimal scale/shift during training
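The forward-pass formulas translate almost line for line into code. This is a plain-Python sketch for a single feature across a mini-batch; the function names are illustrative, and the running statistics are simply copied from one batch here, whereas a real layer maintains them as moving averages over training.

```python
import math


def batchnorm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m        # biased batch variance
    normalised = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in normalised], mean, var


def batchnorm_eval(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference path: use stored running statistics, not the current batch."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


batch = [2.0, 4.0, 6.0, 8.0]
out, mean, var = batchnorm_train(batch)
print("train-mode output:", [round(v, 4) for v in out])

# Hypothetical running statistics, taken from this one batch for illustration.
running_mean, running_var = mean, var
print("eval-mode single sample:", round(batchnorm_eval(7.0, running_mean, running_var), 4))

# A batch of one in train mode normalises to zero, whatever the input value is.
single_out, _, _ = batchnorm_train([123.0])
print("train-mode batch of 1:", single_out)
```

The last line shows why train-mode statistics are unusable for single-sample inference: with one sample, the mean equals the input and the normalised value collapses to zero.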

Batch Normalisation Position and Effect on Layer Activations

⚠ Common Pitfall — model.train() vs model.eval()

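Forgetting to call model.eval() before inference causes BatchNorm to use the mini-batch statistics of a single inference batch (which may be size 1) instead of the running statistics accumulated during training. With one sample per feature, the batch mean equals the input, so the normalised value is zero for every feature and predictions are garbage (PyTorch's BatchNorm1d raises an error in this case rather than silently returning zeros). Small inference batches fail more quietly: the statistics are noisy and each prediction depends on which other samples share the batch. Always: model.train() during training, model.eval() during evaluation and inference.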

Dropout In-depth

Srivastava et al. (2014) introduced dropout as a computationally cheap approximation to training an ensemble of exponentially many networks. During each forward pass, every neuron is independently deactivated with probability p. The remaining (1−p) fraction of neurons process the input and update normally. At inference, all neurons are active — but since the network was trained with only (1−p) of neurons active on average, the outputs are scaled down by (1−p) to keep the expected activation magnitude consistent. In practice, inverted dropout is used: scale activations up by 1/(1−p) during training so no adjustment is needed at inference.
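Inverted dropout as described above fits in a few lines. The sketch below is plain Python on a list of activations; the function name and the explicit `rng` argument are illustrative choices, not a library API.

```python
import random


def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p) in training."""
    if not training or p == 0.0:
        return list(activations)             # inference: identity, no rescaling needed
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 10_000
dropped = inverted_dropout(activations, p=0.5, training=True, rng=rng)

print("mean in training mode:", sum(dropped) / len(dropped))   # close to 1.0 in expectation
print("distinct values:", sorted(set(dropped)))                # 0.0 and 1/(1-p) = 2.0
print("eval mode unchanged:", inverted_dropout([1.0, 2.0], 0.5, False, rng))
```

Because survivors are scaled up during training, the expected activation matches the inference-time value, which is why the inference path can be a plain pass-through.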

The theoretical justification has three complementary perspectives. The ensemble view: with N neurons, there are 2^N possible sub-networks; dropout samples a different one each forward pass, and inference approximates their average. The co-adaptation view: neurons cannot rely on specific other neurons being present, so they learn more independent, redundant features. The noise injection view: randomly zeroing neurons adds multiplicative noise, acting like a data augmentation that prevents the network from memorising specific training patterns.

Practical guidance: dropout rates of 0.1–0.2 work well for earlier layers; 0.3–0.5 for large fully connected layers. Standard dropout is rarely applied to convolutional feature maps (DropBlock is preferred there). In Transformer models, dropout is applied after attention and after the feed-forward sublayer, with a rate of 0.1 being standard. For very large models, lower dropout rates (0.05–0.1) are preferred as the model already has strong regularisation from scale.

Dropout — random deactivation during training, full network at inference

⚠ Common Pitfall — Dropout at Wrong Places

Do not apply standard Dropout after every layer indiscriminately. Applying it after BatchNorm can interfere with BN's running statistics. Applying it in convolutional layers often hurts performance (use DropBlock instead). Applying it at the output layer is always wrong. Standard rule: dropout in the fully connected classifier head only (or after transformer attention layers at p=0.1).

Optimisers for Deep Learning In-depth

Stochastic Gradient Descent (SGD) updates each weight by subtracting a fraction of its gradient: θ ← θ − α∇L. Plain SGD oscillates badly in directions with high curvature (the narrow valleys common in deep loss landscapes) and moves too slowly in flat directions. SGD with Momentum adds a velocity term that accumulates gradient history, smoothing oscillations and accelerating through flat regions. It remains the preferred optimiser for training ResNets and CNNs on image classification.
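The benefit of the velocity term shows up clearly on an ill-conditioned quadratic. The sketch below uses the exact (full-batch) gradient rather than a stochastic one, and the surface f(x, y) = ½(x² + 100y²), the learning rate, and the step count are all assumptions chosen for illustration.

```python
def grad(theta):
    """Gradient of the ill-conditioned quadratic f(x, y) = 0.5 * (x^2 + 100 * y^2)."""
    x, y = theta
    return [x, 100.0 * y]


def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)


def run(lr, momentum, steps):
    theta = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        # Velocity accumulates gradient history; momentum = 0 recovers plain gradient descent.
        velocity = [momentum * v + gi for v, gi in zip(velocity, g)]
        theta = [t - lr * v for t, v in zip(theta, velocity)]
    return loss(theta)


plain = run(lr=0.015, momentum=0.0, steps=200)
with_momentum = run(lr=0.015, momentum=0.9, steps=200)
print(f"plain gradient descent loss: {plain:.3e}")
print(f"with momentum 0.9 loss:      {with_momentum:.3e}")
```

The learning rate has to stay small to keep the steep y direction stable, which leaves plain gradient descent crawling along the flat x direction. Momentum builds up speed along x and ends at a much lower loss in the same number of steps.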

Adam (Kingma & Ba, 2014) computes an adaptive learning rate per parameter: it tracks the first moment (mean of gradients) and second moment (mean of squared gradients) and divides the first by the square root of the second to scale each parameter's update independently. A parameter whose gradient has been consistently large gets a smaller effective step; a parameter with small, consistent gradients gets a larger step. This makes Adam faster to converge on many problems and far less sensitive to the global learning rate choice than plain SGD.

AdamW (Loshchilov & Hutter, 2019) fixes a subtle flaw in how Adam handles weight decay. In Adam, L2 regularisation (weight decay) is added to the gradient before the adaptive scaling, which means the actual weight penalty is divided by the adaptive term and varies per parameter. AdamW decouples weight decay from the gradient update, applying it directly to the parameters alongside the Adam step: θ ← θ − αλθ (separately from the adaptive gradient term). This is now the de facto default for training large language models and Transformers.

AdamW Update Rule
m_t = β₁m_{t-1} + (1−β₁)g_t   (1st moment: running mean of gradients)
v_t = β₂v_{t-1} + (1−β₂)g_t²   (2nd moment: running mean of squared gradients)
m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ)   (bias correction)
θ_t = θ_{t-1} − α(m̂_t/(√v̂_t + ε) + λθ_{t-1})
β₁=0.9 · β₂=0.999 · ε=1e-8 · λ=weight decay (typical 1e-2) · α=learning rate
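The update rule can be traced for a single scalar parameter. This is a minimal sketch, not a replacement for a library optimiser: the function name `adamw_step` and the gradient values are made up, and ε is placed outside the square root, as in the original Adam paper and in PyTorch's implementation.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad             # first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad      # second moment
    m_hat = m / (1.0 - beta1 ** t)                   # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    adam_term = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: lambda * theta is added outside the adaptive scaling.
    theta = theta - lr * (adam_term + weight_decay * theta)
    return theta, m, v


theta, m, v = 2.0, 0.0, 0.0
for t, grad in enumerate([0.5, 0.4, 0.6], start=1):
    theta, m, v = adamw_step(theta, grad, m, v, t)
    print(f"step {t}: theta = {theta:.6f}")
```

On the first step the bias-corrected ratio is essentially the sign of the gradient, so the parameter moves by about α regardless of the gradient's magnitude, plus the small αλθ decay term.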

Optimiser Trajectories — SGD vs Momentum vs Adam on ill-conditioned surface

Optimiser	Best For	Typical LR	Weight Decay	Notes
SGD	Legacy CNNs	0.01–0.1	via L2 penalty	Requires careful LR schedule
SGD + Momentum	CV fine-tuning, ResNets	0.01–0.1	via L2	momentum=0.9 standard
Adam	Prototyping, NLP	1e-4 to 3e-4	Broken — use AdamW	Default for quick experiments
AdamW	Transformers, LLMs	1e-4 to 3e-4	Decoupled (1e-2)	Mandatory for modern models

Learning Rate Schedules Core

A fixed learning rate is rarely optimal throughout training. Early in training, large steps are desirable — the network is far from a good solution and can afford rough updates. Later in training, large steps overshoot the loss minimum and cause oscillation — smaller steps are needed for fine-grained convergence. Learning rate schedules adjust the learning rate automatically over the course of training.

The warmup + cosine annealing schedule has become the dominant approach for Transformer training. For the first few percent of training steps (the "warmup"), the learning rate increases linearly from near-zero to the target learning rate. This protects the model from large, destabilising gradient updates at the start of training, when the parameters are random and gradients are noisy. After warmup, the learning rate follows a cosine curve from the peak down to a small minimum — providing a smooth, continuous decay that typically outperforms staircase schedules.
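The schedule described above can be written as a single function of the step index. This is one common formulation, sketched in plain Python; the exact warmup shape and the choice of minimum learning rate vary between codebases, and the name `warmup_cosine_lr` and the numbers below are illustrative.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps, warmup_steps = 1000, 50
for step in (0, 25, 49, 50, 525, 999, 1000):
    lr = warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

In a training loop the returned value would be written into the optimiser's learning rate before each step; the warmup here covers the first 5% of training.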

LR Schedules — Warmup+Cosine is the Transformer standard

The Complete Training Loop In-depth

Putting it all together: a production-quality PyTorch training loop that incorporates gradient clipping, the model.train()/eval() switch, a validation loop, and a cosine learning rate schedule. Every line is intentional — understanding why each piece is there is as important as knowing what it does.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_epoch(model, loader, optimizer, criterion, device, clip_grad=1.0):
    model.train()                              # dropout ON, BN uses batch stats
    total_loss, correct = 0.0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()                  # 1. clear grads
        logits = model(X)                      # 2. forward
        loss = criterion(logits, y)
        loss.backward()                        # 3. backprop
        nn.utils.clip_grad_norm_(model.parameters(), clip_grad)  # 4. clip
        optimizer.step()                       # 5. update
        total_loss += loss.item()
        correct += (logits.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

def evaluate(model, loader, criterion, device):
    model.eval()                               # dropout OFF, BN uses running stats
    total_loss, correct = 0.0, 0
    with torch.no_grad():                      # disable grad tracking → saves memory
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            logits = model(X)
            total_loss += criterion(logits, y).item()
            correct += (logits.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

# Setup (MLP, train_loader and val_loader are assumed to be defined elsewhere)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(784, 256, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val_loss, best_epoch = float('inf'), 0
for epoch in range(50):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step()
    if val_loss < best_val_loss:               # checkpoint the best model so far
        best_val_loss, best_epoch = val_loss, epoch
        torch.save(model.state_dict(), 'best_model.pt')
    if epoch % 5 == 0:
        print(f"Epoch {epoch:3d}: train={train_loss:.4f} val_acc={val_acc:.4f}")

Training vs Validation Loss — detecting overfitting and early stopping
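Early stopping with patience can be added as a small helper around the validation loss. The sketch below is one possible design in plain Python; the class name `EarlyStopping`, its parameters, and the validation curve are all hypothetical and chosen for illustration.

```python
class EarlyStopping:
    """Track the best validation loss and signal a stop after `patience` bad epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0              # improvement: a checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation curve: improves, then starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.update(epoch, val_loss):
        stopped_at = epoch
        break
print(f"best epoch {stopper.best_epoch} (loss {stopper.best_loss}), stopped at epoch {stopped_at}")
```

The patience window tolerates a few noisy epochs before giving up, and the model restored afterwards is the checkpoint from the best epoch, not the last one.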

Other Regularisation Techniques Reference

Beyond dropout and BatchNorm, a range of regularisation techniques are routinely used in modern deep learning. The right combination depends on the task, architecture, and dataset size. The table below summarises the most important ones with their use cases and PyTorch APIs.

Technique	Mechanism	When to Use	PyTorch API
L2 / Weight Decay	Add λ‖W‖² penalty to loss → shrinks weights	Always — small λ (1e-4 to 1e-2)	`weight_decay=` in optimizer
Dropout	Randomly zero neurons during training	FC layers, Transformers (p=0.1–0.5)	`nn.Dropout(p=0.3)`
Batch Normalisation	Normalise activations per mini-batch	After linear/conv, before activation	`nn.BatchNorm1d/2d`
Layer Normalisation	Normalise across features (not batch)	Transformers — no batch-size dependency	`nn.LayerNorm`
Data Augmentation	Random transforms of training inputs	Image tasks (flip, crop, colour jitter)	`torchvision.transforms`
Early Stopping	Stop when validation loss stops improving	Always — monitor val_loss with patience	Manual or `PyTorch Lightning`
Label Smoothing	Soften hard 0/1 targets by moving ε of the probability mass from the true class to a uniform distribution over all K classes	Classification: prevents overconfidence	`nn.CrossEntropyLoss(label_smoothing=0.1)`

∑ Chapter 4.4 Summary — Training Deep Networks

He init for ReLU: W ~ N(0, 2/nᵢₙ), i.e. standard deviation √(2/nᵢₙ); prevents vanishing/exploding activations before training even starts
BatchNorm: normalise per mini-batch → stable training, higher LR tolerance, less sensitivity to initialisation; always call model.eval() at inference
Dropout: randomly drop p of neurons each forward pass → ensemble effect, prevents co-adaptation; use in FC layers at p=0.1–0.5
AdamW: Adam with decoupled weight decay — the mandatory standard optimiser for Transformer and LLM training (β₁=0.9, β₂=0.999, λ=1e-2)
Warmup + cosine annealing: guard against early training instability, then smoothly decay the LR; standard for large-scale Transformer training runs
Training loop order: zero_grad → forward → loss → backward → clip → step; run evaluate() separately with model.eval() and torch.no_grad()

4.5

Chapter 4.5

Convolutional Neural Networks

Convolutional Neural Networks are not a minor variation on the MLP. They encode a powerful prior about visual data — that meaningful patterns are local and translation-invariant — directly into the architecture. This inductive bias, combined with weight sharing, reduces parameters by orders of magnitude while improving generalisation. The result transformed computer vision from hand-crafted features to end-to-end learning.

Why Not MLP for Images? Core

A standard 224×224 RGB image contains 224 × 224 × 3 = 150,528 individual pixel values. A single hidden layer of 1,024 neurons in a fully connected MLP requires 150,528 × 1,024 = 154 million weights — for the first layer alone, before any useful representation has been learned. Scale this to the thousands of neurons in a real network and you have a parameter count that dwarfs the available training data, making the network impossible to train effectively.

The parameter count is only the first problem. A deeper issue is that the MLP treats every pixel as equally related to every other pixel — it has no concept of spatial locality. A cat's eye in the top-left corner and a cat's eye in the bottom-right corner are unrelated to the MLP; it must learn to recognise them as independent patterns, requiring separate learned features for every possible position. Images have three structural properties the MLP ignores: local structure (nearby pixels are more related than distant ones), translation invariance (a cat is a cat wherever it is), and compositionality (parts compose into objects).

CNNs address all three with a single idea: weight sharing via convolution. Instead of connecting each pixel to each neuron, a CNN applies a small learned filter (e.g., 3×3) across the entire image. The same 27 weights (3×3×3 channels) are reused at every spatial position. This reduces the first layer's parameters from 154 million to 28 per filter (27 weights plus a bias), or 1,792 for a layer of 64 such filters, while each filter learns to detect the same feature (an edge, a colour gradient) wherever it appears.

MLP vs CNN — weight sharing reduces parameters by 10,000×

The Convolution Operation In-depth

A convolution applies a small learnable filter, called a kernel, to an input feature map by sliding it across every position and computing a dot product between the kernel and the input patch at each location. At position (i, j), the output value is the sum of all products between the kernel weights and the corresponding input patch. If the kernel has learned to detect horizontal edges, positions where horizontal edges are present produce large activations; other positions produce small ones. With 64 different kernels, you get 64 different feature maps, each detecting a different pattern.
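The sliding-window computation can be written out with nested loops. This plain-Python sketch handles one channel, stride 1, and no padding, and it does not flip the kernel (deep learning frameworks compute this cross-correlation and call it convolution). The image and the hand-written edge kernel are illustrative; a trained network would learn its kernels.

```python
def conv2d_single(image, kernel):
    """Valid (no padding), stride-1 sliding-window product-and-sum on 2D lists."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the k x k patch at (i, j).
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k)))
        output.append(row)
    return output


# 5x5 image: dark top rows (0), bright bottom rows (1), giving one horizontal edge.
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]]
# Hand-written horizontal-edge detector.
kernel = [[-1, -1, -1],
          [0, 0, 0],
          [1, 1, 1]]

feature_map = conv2d_single(image, kernel)
for row in feature_map:
    print(row)
```

The feature map responds strongly in the rows where the window straddles the dark-to-bright boundary and is zero where the patch is uniform, which is the "large activation where the pattern is present" behaviour described above.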

Three hyperparameters control the output size. Kernel size K (typically 3×3 in modern networks): larger kernels see more context but use more parameters. Stride S: how many pixels the kernel jumps between applications. Stride=2 halves the spatial resolution. Padding P: zeros added around the input. "Same" padding (P=(K-1)/2) preserves the input spatial size, which is standard for 3×3 convolutions.

The parameter count scales with kernel size and channel counts, not with image size; this is the core efficiency of CNNs. A 3×3 conv layer with 64 input channels and 128 output channels has 3 × 3 × 64 × 128 + 128 = 73,856 parameters regardless of whether the input is 32×32 or 512×512. This size-invariance is why the convolutional layers of a CNN trained on 224×224 images can be applied to other input resolutions at inference, provided the classification head does not assume a fixed feature-map size (for example, because it uses global pooling).

Convolution Formulas
Output size: W_out = ⌊(W_in − K + 2P) / S⌋ + 1
Params per layer: K × K × C_in × C_out + C_out (bias)
K = kernel size · P = padding · S = stride · C_in/C_out = input/output channels
Example: 3×3 conv, 64→128 ch, no bias: 3×3×64×128 = 73,728 weights
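Both formulas are easy to wrap in helper functions for sanity-checking layer shapes before building a model. The function names below are illustrative, not library calls.

```python
def conv_output_size(w_in, kernel, stride=1, padding=0):
    """W_out = floor((W_in - K + 2P) / S) + 1."""
    return (w_in - kernel + 2 * padding) // stride + 1


def conv_param_count(kernel, c_in, c_out, bias=True):
    """K * K * C_in * C_out weights, plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


print(conv_output_size(224, kernel=3, stride=1, padding=1))   # 'same' padding keeps 224
print(conv_output_size(224, kernel=3, stride=2, padding=1))   # stride 2 halves to 112
print(conv_param_count(3, 64, 128))                           # 73856 with bias
print(conv_param_count(3, 64, 128, bias=False))               # 73728 without bias
print(conv_param_count(3, 3, 64))                             # 1792 for a first RGB layer
```

Neither function takes the image size as input to the parameter count, which is the point made above: resolution changes the output shape, not the number of weights.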

Convolution: sliding a 3×3 filter across input to produce feature map

import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride, padding)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)   # 'same' padding

# Parameter count: 3×3×3×64 + 64 = 1,792 (vs 154M for MLP on 224×224×3)
params = sum(p.numel() for p in conv.parameters())

x = torch.randn(1, 3, 224, 224)   # batch=1, C=3, H=224, W=224
out = conv(x)                     # shape: (1, 64, 224, 224), same spatial size
print(f"Input: {x.shape}")        # [1, 3, 224, 224]
print(f"Output: {out.shape}")     # [1, 64, 224, 224]
print(f"Params: {params}")        # 1,792

# Stride=2 halves spatial resolution
conv_s2 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
out2 = conv_s2(out)
print(f"Stride=2 output: {out2.shape}")   # [1, 128, 112, 112]

Pooling Core

After convolution extracts local features, the spatial dimensions of the feature maps are often larger than necessary — and carrying large feature maps through many layers is expensive. Pooling layers reduce the spatial size while retaining the most important information. Max pooling (the dominant choice) partitions the feature map into non-overlapping windows and takes the maximum value in each. With a 2×2 window and stride 2, max pooling halves both height and width, reducing the number of activations by 4× while making features more invariant to small spatial shifts.

The intuition behind max pooling: each feature map cell contains an activation measuring "how strongly is this feature present at this position?" The maximum within a 2×2 region answers "was this feature present anywhere in this region?" — a coarser, position-invariant question that is still useful for recognition. An eye is an eye whether it is 2 pixels to the left or right. Max pooling discards that 2-pixel difference.

Global Average Pooling (GAP) is a key modern innovation: instead of pooling 2×2 regions, it averages the entire spatial extent of each channel into a single scalar. Applied after the last convolutional block, GAP converts a [B, C, H, W] tensor into [B, C], replacing the large fully-connected layers that were responsible for most parameters in early CNN architectures. ResNet and subsequent models use GAP as the bridge between convolutional features and the classification head.
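Both pooling operations are simple reductions. The sketch below implements 2×2 max pooling and global average pooling on plain Python lists; the function names and the feature-map values are illustrative, and the max-pool helper assumes even height and width.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (stride 2) on a 2D list with even sides."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]), 2)]
            for i in range(0, len(feature_map), 2)]


def global_average_pool(channels):
    """Collapse each HxW channel to one scalar: [C, H, W] -> [C]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool_2x2(feature_map)
print("max pooled:", pooled)                                  # 4x4 -> 2x2
print("GAP:", global_average_pool([feature_map, pooled]))     # one scalar per channel
```

Max pooling keeps the strongest response in each window and drops its exact position, while GAP reduces every channel to a single number, so the classifier that follows sees a fixed-length vector whatever the spatial size.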

Max Pooling 2×2 — reduces spatial dimensions by 2×

CNN Architecture In-depth

A typical CNN follows a regular pattern: alternating convolution blocks and pooling layers, progressively reducing the spatial dimensions while increasing the number of feature channels. The spatial compression concentrates local features into increasingly compact representations. The channel expansion gives the network more "vocabulary" for describing what it sees. The final stage converts the 3D feature tensor into a class prediction via either a series of fully-connected layers or Global Average Pooling.
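The "spatial compression plus channel expansion" pattern can be traced with the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The three-stage layer list below is a hypothetical example, not an architecture from this chapter.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical 3-stage network: (name, out_channels, kernel, stride, padding)
layers = [
    ("conv1", 64, 3, 1, 1),
    ("pool1", 64, 2, 2, 0),
    ("conv2", 128, 3, 1, 1),
    ("pool2", 128, 2, 2, 0),
    ("conv3", 256, 3, 1, 1),
    ("pool3", 256, 2, 2, 0),
]

size, channels = 224, 3
shapes = []
for name, out_channels, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    channels = out_channels
    shapes.append((name, channels, size, size))
    print(f"{name}: {channels} x {size} x {size}")
```

The spatial side goes 224 → 112 → 56 → 28 while the channel count goes 3 → 64 → 128 → 256.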

The hierarchical feature learning in CNNs is perhaps their most important property. Visualisation studies show that early layers (Layer 1-2) learn to detect simple patterns: oriented edges, colour gradients, and textures. Middle layers (Layer 3-4) detect parts: corners, curves, texture patches that resemble scales, fur, or brickwork. Late layers detect whole objects or object parts: faces, wheels, paws. This hierarchy emerges from training alone — it is not hand-crafted. The network discovers it is useful by virtue of gradient descent on classification loss.

CNN Architecture — Spatial compression + Feature depth expansion

CNN Feature Hierarchy — from pixels to objects through learned abstractions

ResNet & Skip Connections In-depth

By 2014, the expectation was clear: deeper networks should perform better, because more layers can represent more complex functions. In practice, plain networks stopped improving beyond a certain depth. He et al. (2015) reported that a 56-layer plain network had higher error than a 20-layer one, not just on the test set but on the training set as well. This degradation problem was not overfitting, since an overfitted model would show lower training error. It meant the optimiser struggled to train very deep plain networks, even when the additional capacity should have helped.

He et al.'s insight was deceptively simple. If a shallower network achieves some accuracy A, then a deeper network that copies the shallower network's layers and sets all additional layers to the identity (f(x) = x) should achieve at least accuracy A. But gradient descent does not easily find the identity mapping through a stack of nonlinear layers: weights near zero produce an output near zero, not the input x. The residual block makes this easy by reformulating the learning objective. Instead of learning the desired mapping f(x) directly, the block learns the residual r(x) = f(x) − x, and the shortcut connection adds the original input back: output = r(x) + x. Now learning the identity is trivial: drive r(x) to 0.

The practical impact was enormous. An ensemble of ResNets (up to 152 layers, 2015) achieved 3.57% Top-5 error on ImageNet, below the commonly cited human estimate of about 5%. The skip connection also dramatically improves gradient flow: gradients can propagate directly from the loss to any earlier layer through the shortcut path, bypassing the multiplicative chain that causes vanishing gradients. This is why skip connections appear in virtually every modern architecture, and Transformers include them as a core component.
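The gradient-flow claim can be illustrated with a toy scalar model. A plain layer y = f(x) contributes a factor f'(x) to the chain rule, whereas a residual layer y = f(x) + x contributes f'(x) + 1. The sketch below assumes, purely for illustration, that every layer has the same small local derivative of 0.01.

```python
def chain_gradient(local_grads, residual):
    """Product of per-layer derivatives for a stack of scalar layers.

    Plain layer:    y = f(x)      ->  dy/dx = f'(x)
    Residual layer: y = f(x) + x  ->  dy/dx = f'(x) + 1
    """
    total = 1.0
    for g in local_grads:
        total *= (g + 1.0) if residual else g
    return total

local_grads = [0.01] * 50   # assumed: 50 layers, each with a small derivative f'(x)
plain = chain_gradient(local_grads, residual=False)
skip = chain_gradient(local_grads, residual=True)
print(f"plain:    {plain:.3e}")
print(f"residual: {skip:.3f}")
```

The plain product collapses towards zero, while the residual product stays of order 1 because each factor contains the "+ 1" from the shortcut path. Real networks have matrix-valued Jacobians, so this is an intuition aid, not a proof.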

Residual Block
Standard: y = F(x, W) (learn full mapping)
Residual: y = F(x, W) + x (learn the change only)
If F(x, W) = 0: y = x (identity, trivial to learn)
F(x, W) = two conv layers with BN and ReLU between them

Residual Block — skip connections solve the deep network degradation problem

ResNet vs Plain CNN — Skip connections enable depth without degradation

import torch
import torch.nn as nn
import torchvision.models as models

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # save input for skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # F(x) + x: the residual connection
        out = self.relu(out)
        return out

# Use pre-trained ResNet-50 from torchvision
resnet = models.resnet50(weights='IMAGENET1K_V2')
# Replace classifier head for custom task (e.g. 10 classes)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)
x = torch.randn(1, 3, 224, 224)
print(f"ResNet-50 output: {resnet(x).shape}")  # [1, 10]

⚠ Common Pitfall — Mismatched Skip Connection Dimensions

The skip connection adds x directly to F(x). This requires x and F(x) to have the same shape. When a residual block changes the number of channels or uses stride > 1 (to downsample), the shortcut must include a 1×1 convolution (called a "projection shortcut") to match dimensions: self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride). Forgetting this causes a shape mismatch error at the addition step.

Modern CNNs Reference

The history of CNNs on ImageNet is a story of architectural innovations compounding: each milestone introduced one key idea that is now ubiquitous. AlexNet proved GPU-trained deep networks could work. VGG showed depth with only 3×3 convolutions was sufficient and more principled. Inception introduced parallel multi-scale filters. ResNet introduced skip connections enabling extreme depth. EfficientNet took a baseline found by Neural Architecture Search and scaled depth, width, and resolution jointly with a single compound coefficient. Vision Transformers (ViT) ultimately replaced convolutions entirely with attention, showing that the convolutional inductive bias, while useful, is not necessary given enough data.

Architecture	Year	Depth	Key Innovation	ImageNet Top-5
LeNet-5	1998	5	First successful CNN for digits	N/A (evaluated on MNIST, not ImageNet)
AlexNet	2012	8	GPU training, ReLU, Dropout	15.3% (about 11 points better than the runner-up)
VGG-16/19	2014	16–19	Only 3×3 convolutions throughout	7.3%
GoogLeNet/Inception	2014	22	Inception modules, global avg pool	6.7%
ResNet-50/152	2015	50–152	Residual skip connections	3.57% (ensemble), below the human estimate
DenseNet-121	2017	121–264	Dense connections (all-to-all)	3.46%
EfficientNet-B7	2019	Variable	Neural Architecture Search (NAS)	2.9%
Vision Transformer (ViT)	2020	Variable	Pure self-attention — no convolution	2.0%+

Receptive Field Core

The receptive field of a neuron in layer l is the region of the original input image that can influence that neuron's activation. A neuron in the first conv layer with a 3×3 kernel sees a 3×3 region. A neuron in the second conv layer sees a 5×5 region (each of its 9 input cells saw a 3×3 region, overlapping to cover 5×5). With each additional 3×3 conv layer, the receptive field grows by 2 in each dimension. After k layers of 3×3 convolutions: receptive field = 2k + 1 pixels.

Pooling layers and strided convolutions multiply the receptive field growth. After a 2× pooling layer, subsequent convolutional layers grow the receptive field twice as fast. This is why deep CNNs develop neurons in later layers that respond to large, complex objects: they have receptive fields spanning the entire image. The final convolutional layer in ResNet-50 has a theoretical receptive field of 483×483 — larger than the 224×224 input — ensuring every output cell has seen the full input context.
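The growth rules above follow from a simple recurrence: each layer adds (kernel − 1) × jump to the receptive field, where jump is the product of all earlier strides. A minimal sketch (the function name and the layer tuples are illustrative):

```python
def receptive_field(layers):
    """Theoretical receptive field (one side, in pixels) of a stack of layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1   # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

conv, pool = (3, 1), (2, 2)
for k in range(1, 4):
    print(k, receptive_field([conv] * k))          # prints "1 3", "2 5", "3 7"

print(receptive_field([conv, pool, conv]))         # 8
print(receptive_field([conv, pool, conv, conv]))   # 12
```

Before the pooling layer each 3×3 convolution adds 2 pixels; after it each one adds 4, which is the "twice as fast" growth described above.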

📐

Receptive Field Growth

1 conv (3×3): RF = 3×3
2 convs (3×3): RF = 5×5
3 convs (3×3): RF = 7×7
k convs: RF = (2k+1)×(2k+1)
Pooling 2×: doubles growth rate

🔍

Why Large RF Matters

Small RF → misses global context
Object recognition needs full object
Dilated conv: large RF without depth
Attention (ViT): global RF layer 1
ResNet-50 RF > input size ✓

⚡

PyTorch CNN in 10 Lines

import torch.nn as nn
num_classes = 10  # example value; set to your number of classes
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, num_classes))

∑ Chapter 4.5 Summary — Convolutional Neural Networks

Convolution: slide a learned filter across input → weight sharing = same feature detector everywhere; 154M MLP params vs 27 weights per 3×3×3 filter (1,792 parameters for a first layer with 64 filters)
K×K×C_in×C_out parameters per layer — size-invariant: same params for 32×32 or 512×512 inputs
CNN hierarchy: edges (L1) → textures (L2) → parts (L3) → objects (L4+) — all learned automatically
Max pooling: keep maximum per 2×2 window → translation invariance + spatial compression; Global Average Pooling replaces FC layers
ResNet skip connections: y = F(x) + x — reformulate as residual learning → solves degradation, enables 150+ layer networks, gradients flow freely
ResNet (2015) → EfficientNet → ViT (2020): Transformers now rival CNNs on vision — attention replaces convolution with global receptive field from layer 1

4.6

Chapter 4.6

Recurrent Neural Networks & LSTMs

The RNN and LSTM are not obsolete — they are the conceptual bedrock of sequence modelling. Understanding why RNNs struggle with long-range dependencies, and how LSTMs solve this with gating, is the essential preparation for understanding why the Transformer replaced them. Every concept in attention mechanisms traces directly back to this chapter.

Why Sequences Need Memory Core

An MLP processes each input independently. Feed it the word "bank" and it produces a prediction — but it has no way to know whether the previous words were "river" or "money". CNNs add local spatial context via convolution windows, but they still process a fixed-size input with no persistent state across positions. Neither architecture is suited to data where order matters and length varies: text, speech, time series, video.

The semantic difference between "The dog bit the man" and "The man bit the dog" lies entirely in word order: same vocabulary, opposite meaning. Processing each word independently destroys this information. A model needs to carry a memory of what it has already seen as it processes each new token. This is the core motivation for the Recurrent Neural Network: maintain a hidden state hₜ that accumulates information from all previous time steps.

RNN vs MLP — hidden state enables sequential memory

Basic RNN In-depth

The vanilla RNN cell takes the current input xₜ and previous hidden state hₜ₋₁, applies a weighted sum, and passes through tanh. The same weight matrices Wₕ and Wₓ are used at every single time step — weight sharing across time, analogous to how a CNN shares weights across space. Unrolling through T steps creates an effective computational graph T layers deep: a sequence of 100 words = 100 effective layers = severe vanishing gradient risk.

Vanilla RNN
hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = W_y·hₜ + b_y
Same Wₕ, Wₓ, W_y at ALL time steps · effective depth = sequence length T

RNN Unrolled — same weights at every time step, hidden state flows forward

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
x = torch.randn(8, 30, 64)           # (batch, seq_len, features)
output, h_n = rnn(x)
print(f"output: {output.shape}")     # [8, 30, 128]: all hidden states of the top layer
print(f"h_n: {h_n.shape}")           # [2, 8, 128]: final hidden state of each of the 2 layers

Backpropagation Through Time (BPTT) In-depth

BPTT applies backprop to the unrolled RNN graph. To update Wₕ, gradients must flow from the loss at step T back through every time step, multiplying by the Jacobian ∂hₜ/∂hₜ₋₁ = Wₕᵀ · diag(tanh’(·)) at each step. Across T steps this becomes a product of T such matrices. Because tanh’ never exceeds 1, the product shrinks exponentially whenever the largest singular value of Wₕ is below 1 (vanishing gradient). When it is well above 1, the product can grow exponentially (exploding gradient). Practical limit: vanilla RNNs cannot reliably learn dependencies beyond ~10–20 steps.

BPTT Vanishing Gradients — early time steps receive no learning signal
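The product of Jacobians can be seen in a scalar RNN, where each Jacobian reduces to the number w · tanh'(·). The sketch below assumes a toy input sequence and hand-picked weights, chosen only to illustrate the trend.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # scalar Jacobian: w * tanh'(pre-activation)
    return grad

inputs = [0.1] * 20                 # assumed toy input sequence, T = 20
grad_small_w = gradient_through_time(0.5, inputs)
grad_large_w = gradient_through_time(0.9, inputs)
print(f"w = 0.5: {grad_small_w:.3e}")
print(f"w = 0.9: {grad_large_w:.3e}")

# Without the tanh (a purely linear recurrence) the gradient is simply w ** T,
# which grows without bound once w > 1.
print(f"linear, w = 1.5: {1.5 ** 20:.1f}")
```

Even with w = 0.9 the gradient after 20 steps is far below 1, which is why early time steps receive almost no learning signal.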

⚠ Common Pitfall — Truncated BPTT

Training on very long sequences with full BPTT is expensive (memory scales linearly with sequence length). Truncated BPTT splits sequences into chunks and backpropagates only within each chunk, carrying the hidden state forward without gradients. In PyTorch, detach the hidden state between chunks: h = h.detach() before each new chunk.

LSTM — Long Short-Term Memory In-depth

Hochreiter & Schmidhuber (1997) added a cell state cₜ — a horizontal "highway" running through all time steps with only element-wise operations. Gradients flowing through cₜ are multiplied by learned scalar gate values (not weight matrices), dramatically reducing vanishing. Three sigmoid gates ∈ (0,1) control information flow: forget gate fₜ decides what to erase from cₜ₋₁; input gate iₜ decides what new candidate c̃ₜ to write; output gate oₜ decides what to expose as hₜ.

The cell state update is the key: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ — only addition and element-wise multiply, no matrix multiplication. The forget gate can learn to stay near 1 for important long-range information, creating an almost unimpeded gradient path over hundreds of steps.

LSTM Equations
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) (Forget gate)
iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) (Input gate)
c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) (Candidate values)
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ (Cell state: GRADIENT HIGHWAY)
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) (Output gate)
hₜ = oₜ ⊙ tanh(cₜ) (Hidden state output)
σ = sigmoid ∈ (0,1) · ⊙ = element-wise multiply · [h,x] = concatenation

LSTM Cell — three gates control what to forget, remember, and output
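The equations above can be stepped through by hand in the scalar case. The sketch below uses hand-picked (not learned) parameters that hold the forget gate near 1 and the input gate near 0, then tracks how much of the initial cell state, and of the gradient along the direct cell-state path, survives 100 steps. All names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps each gate name to (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x + b
    f = sigmoid(pre("f"))             # forget gate
    i = sigmoid(pre("i"))             # input gate
    c_tilde = math.tanh(pre("c"))     # candidate values
    c = f * c_prev + i * c_tilde      # cell state update
    o = sigmoid(pre("o"))             # output gate
    h = o * math.tanh(c)              # hidden state
    return h, c, f

# Hand-picked parameters: forget gate near 1 (bias +6), input gate near 0 (bias -6).
params = {"f": (0.0, 0.0, 6.0), "i": (0.0, 0.0, -6.0),
          "c": (0.5, 0.5, 0.0), "o": (0.0, 0.0, 0.0)}

h, c = 0.0, 1.0    # store the value 1.0 in the cell state
grad = 1.0         # d c_T / d c_0 along the direct cell-state path
for t in range(100):
    h, c, f = lstm_step(0.3, h, c, params)
    grad *= f      # each step multiplies the gradient by the forget gate only
print(f"c after 100 steps: {c:.3f}")
print(f"gradient through cell state: {grad:.3f}")
```

The gradient here counts only the direct path through cₜ and ignores the indirect dependence via hₜ. Compare its size with the scalar tanh RNN, where a per-step factor below 1 would shrink the gradient to almost nothing over the same 100 steps.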

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
               batch_first=True, dropout=0.2)
x = torch.randn(8, 50, 64)
output, (h_n, c_n) = lstm(x)
print(f"output: {output.shape}")   # [8, 50, 128]
print(f"h_n: {h_n.shape}")         # [2, 8, 128]
print(f"c_n: {c_n.shape}")         # [2, 8, 128]

# Truncated BPTT: detach BOTH states between chunks
h = (h_n.detach(), c_n.detach())   # prevents OOM on long sequences

# Bidirectional LSTM
bilstm = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
out, _ = bilstm(x)
print(f"BiLSTM: {out.shape}")      # [8, 50, 256] = 128x2 (fwd + bwd)

⚠ Common Pitfall — Forgetting to Detach Cell State

When processing long sequences in chunks, detach both h_n AND c_n: h = (h_n.detach(), c_n.detach()). Forgetting to detach c_n keeps the computation graph alive across chunks causing unbounded memory growth until OOM — one of the most common LSTM training bugs.

GRU — Gated Recurrent Unit Core

Cho et al. (2014) introduced the GRU as a simplified LSTM with only 2 gates: an update gate zₜ (blend old vs new hidden state) and a reset gate rₜ (how much past state to use for candidate). No separate cell state — just one hidden vector. Fewer parameters, faster training, often competitive with LSTM.

GRU Equations
zₜ = σ(Wz·[hₜ₋₁, xₜ]) (Update gate)
rₜ = σ(Wr·[hₜ₋₁, xₜ]) (Reset gate)
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) (Candidate hidden state)
hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ (Final: blend old and new)
2 gates vs LSTM’s 3 · no separate cell state · fewer parameters · use as first choice

LSTM	GRU
3 gates: forget, input, output	2 gates: update, reset
Separate cell state + hidden state	Single hidden state
More parameters per layer	Fewer parameters, faster training
Better for very long sequences	Competitive on most tasks
Standard NLP 2014–2018	Use first; switch to LSTM if needed
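The "fewer parameters" claim can be quantified. An LSTM has four weight blocks over [hₜ₋₁, xₜ] (three gates plus the candidate), a GRU has three (two gates plus the candidate). The sketch below assumes one bias vector per block, which is a simplification of the equations in this chapter.

```python
def gate_params(input_size, hidden_size):
    """One gate-sized block: a weight matrix over [h, x] plus one bias vector."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_params(input_size, hidden_size):
    return 4 * gate_params(input_size, hidden_size)   # f, i, o gates + candidate

def gru_params(input_size, hidden_size):
    return 3 * gate_params(input_size, hidden_size)   # z, r gates + candidate

lstm = lstm_params(64, 128)
gru = gru_params(64, 128)
print(f"LSTM: {lstm:,}")            # 98,816
print(f"GRU:  {gru:,}")             # 74,112
print(f"ratio: {gru / lstm:.2f}")   # 0.75
```

For the same input and hidden sizes a GRU layer has three quarters of the LSTM's parameters. PyTorch stores two bias vectors per block, so the counts reported by `nn.LSTM` and `nn.GRU` are slightly higher, but the ratio is unchanged.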

Seq2Seq & Attention Preview Core

Seq2Seq (Sutskever et al. 2014) combined two RNNs: an encoder compressing the input into a fixed context vector, and a decoder generating output conditioned on it. The fundamental flaw: the entire input, regardless of length, is squeezed into one fixed-size vector. Bahdanau et al. (2015) fixed this with attention: at each decoder step t, compute a context vector as a weighted sum over ALL encoder states, cₜ = Σᵢ αₜᵢ · hᵢ (here cₜ is the context vector, not the LSTM cell state). This is the direct predecessor of Transformer self-attention in Ch 4.7.

Seq2Seq with Bahdanau Attention — decoder attends to relevant encoder states
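One decoder step of attention can be sketched in a few lines: score every encoder state against the decoder state, softmax the scores into weights, and take the weighted sum. For brevity this sketch assumes dot-product scoring, whereas Bahdanau et al. score with a small learned network. The toy vectors are made up.

```python
import math

def softmax(scores):
    top = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Attention weights and context vector (weighted sum of encoder states)."""
    # Assumption: dot-product scoring instead of Bahdanau's additive scoring network.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[j] for a, h in zip(alphas, encoder_states)) for j in range(dim)]
    return alphas, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder states h_1..h_3
decoder_state = [0.0, 4.0]                               # toy decoder state
alphas, context = attention_context(decoder_state, encoder_states)
print([round(a, 3) for a in alphas])
print([round(c, 3) for c in context])
```

The first encoder state is almost ignored because it has no overlap with the decoder state, while the other two share the weight equally. The context vector is recomputed at every decoder step, which is what removes the fixed-size bottleneck.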

The attention mechanism did not replace the RNN in 2015 — it made the RNN dramatically better. It took Vaswani et al. (2017) to ask: "What if we remove the RNN and use attention exclusively?" The answer was the Transformer. Chapter 4.7 completes this story.

∑ Chapter 4.6 Summary — RNNs & LSTMs

RNN: hidden state hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b) — same weights each step, memory flows forward through hₜ
BPTT: gradients multiply through T Jacobians → vanishing/exploding for long sequences (practical limit ~10–20 steps vanilla RNN)
LSTM: cell state cₜ = gradient highway; 3 gates (forget, input, output) control what to erase, write, expose
LSTM key: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ — only element-wise ops on gradient path → no vanishing through cell state
GRU: 2 gates (update, reset), single hidden state, fewer parameters — competitive with LSTM; use first
Seq2Seq + Bahdanau attention: decoder attends to all encoder states — solved the bottleneck; direct ancestor of Transformer self-attention (Ch 4.7)

4.7

Chapter 4.7

The Transformer Architecture

“Attention Is All You Need” (Vaswani et al., 2017) is arguably the most consequential paper in modern AI. It removed recurrence entirely and showed that attention alone, applied in parallel across all tokens, could outperform the best RNN-based translation models while training faster. Most major AI systems since 2018 build on this architecture. Understanding it fully is not optional.

Why Transformers Replaced RNNs Core

The RNN’s fundamental flaw is its sequential nature. To process token t, you must first finish token t−1. This means training on a sequence of 10,000 tokens requires 10,000 sequential steps — no amount of hardware parallelism can help. Training GPT-3 (which processes sequences of 2,048 tokens) on an RNN would be computationally impossible at scale. The Transformer abolishes this constraint: all tokens are processed simultaneously, turning a sequential problem into a parallel matrix multiplication problem that GPUs excel at.

The second flaw is information decay. Even with LSTM gating, information from 500 tokens ago is weakly represented in the current hidden state. In contrast, the Transformer’s attention mechanism creates a direct path between any two tokens regardless of distance. The word “it” 300 tokens after “the cat” can attend directly to “cat” with no intermediate steps — the path length is always 1. This is why Transformers handle long documents, code files, and entire books in ways that RNNs fundamentally cannot.

RNN Limitations	Transformer Solutions
Sequential computation: O(n) passes required	Fully parallel: all tokens simultaneously
Vanishing gradients across long sequences	Direct attention: any token to any token
Fixed-size context bottleneck	Full context at every layer
Maximum path length = n steps	Constant path length = 1 step

Scaled Dot-Product Self-Attention In-depth

Self-attention is the mechanism that allows each token to gather information from all other tokens in the sequence. For each token, three vectors are computed by applying learned linear projections: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”). The attention weight between token i and token j is computed as the dot product of token i’s Query with token j’s Key, scaled by √dₖ (to prevent the dot products from growing large and saturating the softmax), then normalised via softmax across all tokens. The output for token i is the weighted sum of all Value vectors.

The scaling by √dₖ is critical. If the components of a Query and a Key are independent with zero mean and unit variance, their dot product has mean 0 and variance dₖ, so for dₖ=64 typical scores have a magnitude of about √64 = 8. Without scaling, scores this large push the softmax towards a one-hot output, where gradients are close to zero. Dividing by √64 = 8 brings the variance back to 1 and keeps gradients healthy. This is why the formula specifically includes the √dₖ denominator.
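The effect of dₖ on score magnitude is easy to check empirically. The sketch below draws random Query and Key vectors with independent standard normal components (an assumption about the distribution, used only for this demonstration) and measures the spread of their dot products.

```python
import math
import random

def dot_product_std(d_k, samples=2000, seed=0):
    """Empirical std of q.k for vectors with i.i.d. standard normal components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / samples)

raw_std = dot_product_std(64)
print(f"std of q.k:             {raw_std:.2f}")                    # close to sqrt(64) = 8
print(f"std after / sqrt(d_k):  {raw_std / math.sqrt(64):.2f}")    # close to 1
```

The measured spread grows like √dₖ, so dividing by √dₖ keeps the softmax inputs at a scale of roughly 1 regardless of the head dimension.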

Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V
Step 1 (project inputs): Q = XW^Q, K = XW^K, V = XW^V
Step 2 (compute scores): S = QKᵀ / √dₖ (shape: n×n attention matrix)
Step 3 (normalise): A = softmax(S) (row-wise, sums to 1)
Step 4 (aggregate): Output = A·V (weighted sum of values)

Self-Attention: Q, K, V projections and attention weight computation

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, heads, seq_len, d_k)
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # causal masking
    attn = F.softmax(scores, dim=-1)                  # normalise over keys
    return torch.matmul(attn, V), attn                # output + weights

# Example: seq_len=10, d_k=64, batch=2, heads=8
Q = torch.randn(2, 8, 10, 64)
K = torch.randn(2, 8, 10, 64)
V = torch.randn(2, 8, 10, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output: {out.shape}")       # [2, 8, 10, 64]
print(f"Weights: {weights.shape}")  # [2, 8, 10, 10]: 10×10 attention matrix

⚠ Common Pitfall — Forgetting the Causal Mask in Decoder

The decoder must not attend to future tokens during training (it would “cheat” by reading the answer). Apply a causal mask: a lower-triangular matrix where position i can only attend to positions ≤i. In PyTorch: mask = torch.tril(torch.ones(n,n)). Forgetting this mask means the model sees the target during training but not during inference — causing a catastrophic train/eval mismatch where generated text is gibberish.
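What the causal mask does to the attention weights can be shown without any tensor library. The sketch below builds the lower-triangular mask and applies a masked softmax row by row. The uniform toy scores are an assumption chosen so the result is easy to read.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: row i has 1s at positions 0..i and 0s after."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is 0."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    top = max(kept)                               # subtract max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
scores = [[1.0] * n for _ in range(n)]            # toy scores: every pair equally similar
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]
for row in weights:
    print([round(w, 2) for w in row])
```

Row 0 can only attend to itself, row 1 splits its weight over positions 0 and 1, and so on. Every entry above the diagonal is exactly zero, so no token receives information from its future.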

Multi-Head Attention In-depth

A single attention head can only attend to information from one representational subspace at a time. Multi-head attention runs h independent attention functions in parallel, each with its own learned projection matrices Wᵢ^Q, Wᵢ^K, Wᵢ^V. Each head learns to attend to a different type of relationship: one head might track syntactic subject-verb agreement, another resolves coreferences (“he” → “John”), another focuses on positional proximity, and yet another captures semantic similarity. All h head outputs are concatenated and projected back to d_model via W^O, a learned combination of what each head discovered.

The dimension of each head is dₖ = d_model / h, so the total computation is the same as a single attention with d_model dimensions. GPT-3 uses 96 attention heads with d_model=12,288, giving each head a 128-dimensional subspace. This is one of the key scaling choices: more heads = more types of relationships the model can simultaneously track.

Multi-Head Attention
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) · W^O
headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
dₖ = d_model/h · GPT-3: d_model=12288, h=96 heads, dₖ=128 per head

Multi-Head Attention — h parallel attention mechanisms with different projections
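The head arithmetic can be checked directly. The sketch below computes the per-head dimension, counts the projection weights (four d_model × d_model matrices, biases ignored), and shows how one token's projected vector is split into per-head chunks. The helper names are illustrative.

```python
def head_dim(d_model, num_heads):
    """Per-head dimension d_k; d_model must divide evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

def attention_projection_params(d_model):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    return 4 * d_model * d_model

def split_heads(vector, num_heads):
    """Split one token's projected vector into num_heads contiguous chunks."""
    d_k = head_dim(len(vector), num_heads)
    return [vector[i * d_k:(i + 1) * d_k] for i in range(num_heads)]

print(head_dim(12288, 96))                          # 128
print(f"{attention_projection_params(12288):,}")    # 603,979,776
print(split_heads(list(range(8)), 4))               # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The projection weight count does not depend on the number of heads: changing h only changes how the same d_model dimensions are partitioned.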

Positional Encoding Core

Self-attention without positional information is permutation equivariant: swap any two input tokens and the outputs are the same vectors, swapped in the same way. The Transformer has no inherent concept of token order, so “cat sat mat” and “mat sat cat” would produce the same set of token representations if not corrected. To inject positional information, the original paper adds a fixed sinusoidal positional encoding to each token embedding before the first attention layer.

The sinusoidal encoding uses different frequencies for different embedding dimensions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Each position gets a unique vector. The model can learn to extract relative position from these encodings because PE(pos+k) can be expressed as a linear function of PE(pos) — the Transformer can learn to do relative position arithmetic via attention.
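The linearity claim follows from the angle-addition identities: each (sin, cos) pair of PE(pos + k) is a rotation of the matching pair of PE(pos) by an angle that depends only on k. A small sketch that checks this numerically (`d_model = 8` and the positions are arbitrary choices, and d_model is assumed even):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Obtain PE(pos + k) from PE(pos) by rotating each (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))   # depends on k only, not on pos
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(theta) + c * math.sin(theta),
                    c * math.cos(theta) - s * math.sin(theta)])
    return out

d_model = 8
direct = positional_encoding(12, d_model)
rotated = shift(positional_encoding(5, d_model), 7, d_model)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(direct, rotated)))  # True
```

Because the rotation is a fixed linear map for each offset k, a linear layer can in principle learn to compare positions by their relative distance.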

Modern LLMs use Rotary Positional Embeddings (RoPE) instead — used in LLaMA, GPT-NeoX, and most recent architectures. RoPE encodes position by rotating the Q and K vectors by an angle proportional to position before computing dot products. The key advantage: attention scores naturally depend on the relative position between tokens (pos_i − pos_j), not absolute positions — enabling better generalisation to longer sequences than seen during training.
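The relative-position property of RoPE can be checked on a single 2-D pair. Real implementations rotate many such pairs, each with its own frequency. The sketch below assumes a single pair with frequency `theta = 1.0`, and the vectors are made up.

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta (one RoPE frequency pair)."""
    angle = pos * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, pos_q, pos_k):
    """Dot product of the rotated query and the rotated key."""
    qr, kr = rotate(q, pos_q), rotate(k, pos_k)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)
near = rope_score(q, k, pos_q=3, pos_k=1)        # relative offset 2
far = rope_score(q, k, pos_q=103, pos_k=101)     # same offset, shifted by 100
other = rope_score(q, k, pos_q=3, pos_k=2)       # different offset
print(math.isclose(near, far, abs_tol=1e-9))     # True
print(math.isclose(near, other, abs_tol=1e-9))   # False
```

Shifting both tokens by the same amount leaves the score unchanged, while changing the distance between them changes it. The score is a function of pos_i − pos_j only.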

Sinusoidal Positional Encoding — unique position fingerprint added to each token

Full Transformer Architecture In-depth

The complete Transformer repeats the same block N times: Multi-Head Attention → Add&Norm → Feed-Forward Network → Add&Norm. The Feed-Forward Network is a two-layer MLP applied independently to each token position: FFN(x) = max(0, xW₁+b₁)W₂+b₂, expanding from d_model to 4·d_model then back. This 4× expansion and contraction lets the model compute complex non-linear transformations per token. The Add&Norm step adds the input as a residual connection and applies Layer Normalisation, which enables stable training at depth and mitigates the vanishing gradient problem (Ch 4.3).
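The 4× expansion has a direct consequence for where a block's parameters sit. The sketch below counts FFN and attention parameters for `d_model = 512` (the base size in the original paper), including biases and ignoring the small LayerNorm parameters. The function names are illustrative.

```python
def ffn_params(d_model, expansion=4):
    """Parameters of FFN(x) = max(0, x W1 + b1) W2 + b2."""
    d_ff = expansion * d_model
    first = d_model * d_ff + d_ff        # W1 and b1
    second = d_ff * d_model + d_model    # W2 and b2
    return first + second

def attention_params(d_model):
    """Q, K, V and output projections, each d_model x d_model plus a bias."""
    return 4 * (d_model * d_model + d_model)

d_model = 512
ffn = ffn_params(d_model)
attn = attention_params(d_model)
print(f"FFN:       {ffn:,}")                    # 2,099,712
print(f"Attention: {attn:,}")                   # 1,050,624
print(f"FFN share: {ffn / (ffn + attn):.2f}")   # 0.67
```

With a 4× expansion, about two thirds of each block's parameters are in the feed-forward network, not in attention.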

The original Transformer had two components. The encoder is bidirectional: every token attends to all other tokens in both directions. It produces contextualised representations of the input sequence. The decoder is causal: each output token attends only to previously generated tokens (enforced by the causal mask), plus cross-attention to all encoder outputs. This asymmetry is fundamental: the encoder understands the full input at once, while the decoder generates output token-by-token, attending to what it has already produced.

Full Transformer — Encoder (left) and Decoder (right) with all components

BERT vs GPT vs T5 In-depth

The original Transformer (2017) had both an encoder and decoder. Within a year, two research groups realised that you could use just one half and pretrain it on massive text corpora to create a general-purpose language model. Google AI Language introduced BERT (Bidirectional Encoder Representations from Transformers, 2018) using the encoder only, pretrained with Masked Language Modelling: randomly select 15% of tokens, mask most of them, and train the model to predict the originals using bidirectional context. OpenAI introduced GPT (Generative Pre-trained Transformer, 2018) using the decoder only, pretrained with standard next-token prediction. These two approaches define the landscape of modern NLP.

Transformer Variants — Encoder-only, Decoder-only, Encoder-Decoder

Model	Architecture	Attention	Pre-training Task	Best For
BERT / RoBERTa	Encoder only	Bidirectional	Masked Language Model (MLM)	Classification, NER, QA
GPT-2/3/4, LLaMA	Decoder only	Causal (L→R)	Next token prediction	Generation, chatbots, LLMs
T5 / BART	Encoder-Decoder	Enc: bidirectional, Dec: causal	Text-to-text / denoising	Translation, summarisation

“Attention Is All You Need” (2017) is arguably the most consequential paper in modern AI. BERT and GPT were both published in 2018. By 2020, GPT-3 demonstrated few-shot learning at a scale few had anticipated. Nearly every major AI system since 2018 (GPT-4, Claude, Gemini, LLaMA, Whisper, and in part Stable Diffusion and AlphaFold 2) is built on the Transformer or on Transformer-style attention.

∑ Chapter 4.7 Summary — The Transformer

Self-attention: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)·V — each token attends to all others in one parallel operation
Parallel processing: all tokens computed simultaneously → massive GPU efficiency gain over RNNs — enabled training at GPT-3/4 scale
Multi-head: h parallel attention heads → each learns different relationship patterns (syntax, coreference, position, semantics)
Positional encoding: must be added — attention alone is position-blind; modern LLMs use RoPE for relative position
Transformer block: Multi-Head Attention → Add&Norm → FFN → Add&Norm — residual connections at every step
BERT=encoder (bidirectional, understanding tasks), GPT=decoder (causal, generation), T5=encoder-decoder (seq2seq tasks)

4.8

Chapter 4.8

Transfer Learning & Fine-Tuning

Pre-training a large model once on vast data, then adapting it cheaply to specific tasks, is the defining paradigm of modern AI. Without transfer learning there would be no GPT-4, no BERT, no Stable Diffusion — the compute required to train each from scratch would be prohibitive. Understanding when to freeze, when to fine-tune, and when to use LoRA separates practitioners from theorists.

What is Transfer Learning? Core

Training a large neural network from scratch requires two things that most practitioners do not have: millions of labelled examples and millions of GPU-hours. GPT-3 cost approximately $4.6M to pre-train; BERT took 4 days on 64 TPU v3 chips. Transfer learning solves this by splitting the problem into two phases. Pre-training: train on a large, general dataset (entire internet text, 1.2M ImageNet images, all of Wikipedia) until the model learns rich, reusable representations. Fine-tuning (or adaptation): start from those learned weights and update them toward a specific downstream task — with far less data and compute.

The foundational insight is that early layers of neural networks learn general features that transfer across tasks. In CNNs, layer 1 universally detects oriented edges regardless of whether the network was trained for cats, cars, or faces. In language models, early layers build syntactic representations applicable to any NLP task. This hierarchy of generality — general features at the bottom, task-specific at the top — is what makes transfer learning work. The analogy: a radiologist who spent years in medical school (pre-training on general anatomy) can specialise in chest X-ray reading (fine-tuning on specific task data) far faster than someone starting from scratch.

Transfer Learning Strategies — Feature Extraction vs Fine-Tuning

Feature Extraction Core

Feature extraction freezes every parameter of the pre-trained backbone and trains only a small task-specific head attached to the top. Because no gradients need to flow through the frozen backbone, forward passes do not require gradient tracking — making this dramatically faster and less memory-intensive than fine-tuning. For image tasks, the head is typically a linear classifier or small MLP on top of the backbone’s pooled output. For text tasks with BERT, the [CLS] token embedding (position 0 of the last hidden state) serves as a fixed-size sentence representation, since BERT is trained to pack sentence-level information into it during pre-training (via the Next Sentence Prediction objective).

Feature extraction works best when your task is similar to the pre-training distribution and you have limited labelled data. If your task is quite different from pre-training (e.g., medical imaging from a model pre-trained on natural images), the frozen features may not be informative enough — full fine-tuning is needed. The key question: do the features the backbone learned happen to be useful for your task?

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# FREEZE all BERT parameters: no gradients flow through backbone
for param in model.parameters():
    param.requires_grad = False

# Only the task head is trainable
classifier = nn.Linear(768, 2)  # binary sentiment: 768-dim BERT output → 2 classes

text = "The model extracts semantic features."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():  # no gradients needed for frozen backbone
    outputs = model(**inputs)

# [CLS] token embedding = sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
logits = classifier(cls_embedding)                  # shape: (1, 2)
print(f"Embedding: {cls_embedding.shape}")  # (1, 768)
print(f"Logits: {logits.shape}")            # (1, 2)

# Trainable params: only the 768×2 + 2 = 1,538 classifier params
trainable = sum(p.numel() for p in classifier.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / Total: {frozen+trainable:,}")  # 1,538 / ~110M

Full Fine-Tuning In-depth

Full fine-tuning updates all weights of the pre-trained model on the downstream task. The critical constraint is the learning rate. Pre-trained weights encode years of training signal from massive datasets — a large learning rate will destroy this knowledge within a few gradient steps in a process called catastrophic forgetting. The pre-training knowledge disappears as new task gradients overwrite it. The standard remedy: use a learning rate 10–100× smaller than the original pre-training LR (typically 2e-5 to 5e-5 for BERT/GPT-sized models, vs 3e-4 for pre-training).

Layer-wise learning rate decay (LLRD) refines this further: assign progressively smaller learning rates to earlier layers. The final layer gets the full (small) LR; each preceding layer gets the LR multiplied by a decay factor (typically 0.9 per layer). This preserves the most general representations in early layers while allowing later layers to adapt more aggressively. ULMFiT (Howard & Ruder, 2018) pioneered this idea under the name discriminative fine-tuning, and it remains common practice for fine-tuning large Transformers.
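The schedule described above is a one-line geometric sequence. The sketch below assumes a 12-layer encoder, a base LR of 2e-5, and the decay factor of 0.9 mentioned in the text. The function name is illustrative.

```python
def layerwise_lrs(base_lr, decay, num_layers):
    """LR per layer: the last layer gets base_lr, each earlier one is scaled by decay."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(base_lr=2e-5, decay=0.9, num_layers=12)
for i, lr in enumerate(lrs):
    print(f"layer {i:2d}: {lr:.2e}")
```

With these values the first layer trains at roughly a third of the rate of the last one (0.9¹¹ ≈ 0.31), so the most general features move the least.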

Fine-Tuning Learning Rate — catastrophic forgetting vs stable adaptation

from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_cosine_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Layer-wise LR decay: earlier layers get smaller LR
base_lr, decay = 2e-5, 0.9
layers = model.bert.encoder.layer  # 12 Transformer blocks in bert-base
num_layers = len(layers)

optimizer_grouped_parameters = [
    # Classifier head (plus pooler): highest LR (task-specific)
    {'params': list(model.classifier.parameters()) + list(model.bert.pooler.parameters()),
     'lr': 3e-5},
    # Embeddings: smallest LR, one decay step below the first encoder layer
    {'params': model.bert.embeddings.parameters(), 'lr': base_lr * decay ** num_layers},
]
for i, layer in enumerate(layers):
    # Last layer (i = 11) gets the full LR; layer 0 gets base_lr * decay**11
    optimizer_grouped_parameters.append(
        {'params': layer.parameters(), 'lr': base_lr * decay ** (num_layers - 1 - i)})

optimizer = AdamW(optimizer_grouped_parameters, weight_decay=0.01)

# Warmup 10% of steps then cosine decay
total_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=total_steps)

# Fine-tuning loop (standard; see Ch 4.4 training loop).
# train_loader is assumed to yield dicts with input_ids, attention_mask and labels.
model.train()
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

⚠ Common Pitfall — Catastrophic Forgetting

Using a learning rate ≥ 1e-4 for fine-tuning a pre-trained Transformer will typically cause catastrophic forgetting within 1–2 epochs. The model learns your task quickly but destroys its general language understanding. Signs: training loss drops fast but evaluation on any other task collapses. Fix: use LR ≤ 5e-5, add warmup (100–500 steps), and monitor performance on a held-out validation set from the original task distribution.
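The warmup-then-decay shape recommended in the fix can be sketched in plain Python. This is a minimal illustration of a linear-warmup, cosine-decay schedule, assuming a peak LR of 2e-5, 100 warmup steps and 1,000 total steps; the function name `lr_at_step` is hypothetical and not a library API.

```python
import math

def lr_at_step(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Inspect the schedule at the start, mid-warmup, peak, mid-decay and end
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at_step(s):.2e}")
```

The early steps use a tiny LR, so the randomly initialised task head cannot push large, noisy gradients into the pre-trained layers before it has settled.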

Parameter-Efficient Fine-Tuning (PEFT) In-depth

Full fine-tuning a 7B-parameter model needs roughly the same GPU memory as pre-training it, typically 80GB+ once gradients and optimiser states are counted alongside the fp16 weights. For 175B (GPT-3) or 540B (PaLM) models, full fine-tuning is out of reach without a large multi-GPU cluster. Parameter-Efficient Fine-Tuning (PEFT) methods address this by updating only a tiny fraction of parameters, usually 0.1–5%, while keeping the rest frozen, and often come close to full fine-tuning quality at a fraction of the cost.
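A back-of-the-envelope estimate makes the memory gap concrete. The sketch below assumes mixed-precision training with Adam at about 16 bytes per trainable parameter (fp16 weights 2, fp16 gradients 2, fp32 master weights 4, Adam momentum 4, Adam variance 4) and 2 bytes per frozen fp16 parameter. Activations are ignored, and both function names are hypothetical.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=16):
    """Weights + gradients + optimiser state for every parameter (activations excluded)."""
    return n_params * bytes_per_param / 1e9

def peft_memory_gb(n_params, trainable_fraction=0.001,
                   frozen_bytes=2, trainable_bytes=16):
    """Frozen fp16 base model plus full training state for the small trainable part."""
    frozen = n_params * frozen_bytes
    trainable = n_params * trainable_fraction * trainable_bytes
    return (frozen + trainable) / 1e9

n = 7e9  # a 7B-parameter model
print(f"Full fine-tuning: {full_finetune_memory_gb(n):.0f} GB")
print(f"PEFT at 0.1%:     {peft_memory_gb(n):.1f} GB")
```

Under these assumptions almost all of the PEFT footprint is the frozen base model itself, which is the part that 4-bit quantisation later shrinks.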

Adapter layers (Houlsby et al., 2019) insert small bottleneck networks between Transformer layers: project down to a small dimension, apply a non-linearity, project back up. Only adapter parameters are trained. Prefix tuning (Li & Liang, 2021) prepends learnable virtual tokens to the Key and Value matrices at every layer; the model sees these as additional context, but they are just learned parameter vectors. Prompt tuning (Lester et al., 2021) simply prepends soft tokens to the input embeddings. It is the simplest form, but it is competitive with full fine-tuning only for large models (>10B parameters).

Method	Trainable Params	Storage per Task	Quality	Inference Overhead
Full fine-tuning	100%	Full model copy	Best	None
Adapter layers	0.5–5%	Small adapter	Good	Small (extra layers in each block)
Prefix tuning	0.1–1%	Prefix vectors	Moderate	Small (extra KV)
Prompt tuning	<0.01%	Just prompts	Good (>10B only)	None
LoRA	0.1–1%	Low-rank matrices	Very good	None (merge at inference)
QLoRA	0.1–1%	Low-rank matrices	Good	Small (4-bit base dequantised on the fly)
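The adapter row of the table can be made concrete with a small NumPy sketch of one bottleneck adapter. The sizes (d = 768, bottleneck m = 64), the ReLU non-linearity and the zero-initialised up-projection are illustrative assumptions, not the exact configuration from Houlsby et al.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 768, 64  # hidden size and bottleneck size (illustrative values)

W_down = rng.normal(0.0, 0.02, size=(d, m))  # project down: d -> m
b_down = np.zeros(m)
W_up = np.zeros((m, d))                      # zero init: the adapter starts as the identity
b_up = np.zeros(d)

def adapter(h):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down + b_down, 0.0)
    return h + z @ W_up + b_up

h = rng.normal(size=(4, d))  # 4 token vectors coming out of a frozen layer
out = adapter(h)
adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
print(out.shape, adapter_params)
```

Because the adapter sits in the forward path, it cannot be folded into the frozen weights, which is where its small inference overhead comes from.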

LoRA — Low-Rank Adaptation In-depth

Hu et al. (2021) built on the observation that fine-tuning large pre-trained language models has a low intrinsic dimension: good task performance can be reached by optimising in a much smaller subspace than the full parameter space. The pre-trained weight matrices themselves are generally full rank. The hypothesis is that the weight update ΔW induced by fine-tuning has low intrinsic rank: the task adaptation doesn't require updating all d² values independently, because the update lies in a lower-dimensional subspace.

LoRA exploits this by decomposing ΔW = A·B, where A is d×r and B is r×d with r ≪ d (typically r=4, 8, or 16). The original weight W is frozen. Only A and B are trained. At the end, the adaptation is merged: W' = W + (α/r)·A·B is a simple matrix addition, so inference has zero added latency. For a typical d=4096 matrix: full ΔW = 16.7M parameters; LoRA r=8: A+B = 2×4096×8 = 65,536 parameters (256× smaller).
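The decomposition and the merge step can be checked numerically. The NumPy sketch below follows the text's convention (A is d×r with Gaussian init, B is r×d with zero init) and uses a small d = 512 so it runs quickly; the random "trained" B and the value α = 16 are stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16  # small d for a quick demo; the text uses d = 4096

W = rng.normal(size=(d, d))  # frozen pre-trained weight
A = rng.normal(size=(d, r))  # trainable, Gaussian init
B = np.zeros((r, d))         # trainable, zero init so the update starts at zero

def lora_forward(x, W, A, B):
    """Unmerged forward pass: frozen path plus scaled low-rank path."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(3, d))
assert np.allclose(lora_forward(x, W, A, B), x @ W)  # no change before training

B = rng.normal(size=(r, d)) * 0.01   # pretend training has updated B
W_merged = W + (alpha / r) * A @ B   # merge once, then serve W_merged alone
same = np.allclose(lora_forward(x, W, A, B), x @ W_merged)

full_params = 4096 * 4096
lora_params = 2 * 4096 * 8
print(same, full_params, lora_params, full_params // lora_params)
```

During training the unmerged form is used so that only A and B receive gradients; the merged matrix is what makes the inference cost identical to the base model.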

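LoRA: Low-Rank Decomposition
Standard: $W' = W + \Delta W$ (update the full $d \times d$ matrix; expensive)
LoRA: $W' = W + \frac{\alpha}{r} A B$ with $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$ (cheap)
Hyperparameters: $r$ = rank (4/8/16/32), $\alpha$ = scaling
Initialisation: $A \sim \mathcal{N}(0, 1)$, $B = 0$, so $\Delta W = 0$ at the start
Merge at inference: $W' = W + \frac{\alpha}{r} A B$, zero latency overhead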

LoRA: Low-rank decomposition trains A and B instead of full ΔW

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load base model (7B parameters)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank: controls expressivity vs size
    lora_alpha=32,                        # scaling factor α
    target_modules=["q_proj", "v_proj"],  # Q and V in attention only
    lora_dropout=0.1,
    bias="none",
)

# Wrap model: adds A and B matrices, freezes everything else
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622

# QLoRA: LoRA on a 4-bit quantised model, fine-tune 7B on a single 24GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config)
# Then apply LoRA as before: trains adapters in fp16 on the 4-bit base

⚠ Common Pitfall — Choosing LoRA Rank Too High or Too Low

Rank r is the main hyperparameter. Too low (r=1–2): insufficient expressivity, so the model may underfit the task. Too high (r=64+): more trainable parameters, optimiser memory and adapter storage, usually for diminishing quality gains; the merge at inference still works at any rank. Standard starting point: r=8 for most tasks; try r=4 if memory is tight and r=16–32 if quality is insufficient. Also important: applying LoRA only to Q and V (not K) is a common default, but applying it to all attention projections (Q, K, V, O) often improves quality at modest extra cost.
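To see how rank and the choice of target modules drive adapter size, the count can be computed directly. This sketch assumes a Llama-2-7B-like shape (d = 4096, 32 layers, square d×d attention projections); `lora_param_count` is a hypothetical helper, not part of any library.

```python
def lora_param_count(d_model, rank, n_target_modules, n_layers):
    """Each adapted d×d projection adds A (d×r) and B (r×d)."""
    return n_layers * n_target_modules * 2 * d_model * rank

for rank in (4, 8, 16, 64):
    qv = lora_param_count(4096, rank, n_target_modules=2, n_layers=32)
    qkvo = lora_param_count(4096, rank, n_target_modules=4, n_layers=32)
    print(f"r={rank:>2}  Q,V: {qv:>11,}  Q,K,V,O: {qkvo:>11,}")
```

The count grows linearly in both rank and number of adapted matrices, so doubling the target modules at r=8 costs the same as keeping Q and V only and moving to r=16.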

RLHF — Reinforcement Learning from Human Feedback Core

The final alignment step that transforms a raw language model into a helpful assistant is RLHF — the technique behind ChatGPT, Claude, and Gemini. A pre-trained LLM knows how to predict text distributions but has no concept of “helpful” or “harmful”. RLHF shapes the model’s outputs toward human preferences through a three-stage pipeline: supervised fine-tuning on demonstrations, training a reward model from pairwise human judgements, and then using reinforcement learning to optimise the LLM to maximise the learned reward.

The RL stage uses Proximal Policy Optimisation (PPO) with a KL divergence constraint: the policy (the LLM being trained) cannot stray too far from the SFT baseline. Without this constraint, the model learns to “game” the reward model, generating outputs that score high on the RM but are not actually helpful (reward hacking). The KL term penalises outputs that diverge greatly from the pre-RLHF distribution, maintaining general capability while nudging toward preferred behaviour. Anthropic's Constitutional AI (CAI) replaces much of the human preference labelling in Stage 2 with AI-generated critiques and comparisons guided by a written set of principles, which makes preference data far cheaper to scale.
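The effect of the KL term can be shown with a few lines of arithmetic. The sketch below uses a common formulation, the reward-model score minus β times the log-probability ratio between the policy and the frozen SFT reference. The value β = 0.1 and all scores are made-up numbers for illustration, and the function name is hypothetical.

```python
def kl_penalised_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward for the RL stage: RM score minus beta times a per-sample KL estimate.

    logp_policy and logp_ref are the summed token log-probabilities of the same
    response under the current policy and under the frozen SFT reference.
    """
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# Response A: modest RM score, still plausible under the SFT model
a = kl_penalised_reward(rm_score=1.0, logp_policy=-20.0, logp_ref=-21.0)
# Response B: higher RM score, but the policy has drifted far from the SFT model
b = kl_penalised_reward(rm_score=1.5, logp_policy=-5.0, logp_ref=-40.0)
print(a, b)
```

Response B wins on raw RM score but loses after the penalty, which is how the constraint discourages reward hacking.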

RLHF Pipeline — SFT → Reward Modelling → PPO Optimisation

📚

SFT Data Requirements

High-quality human demos
InstructGPT: ~13k examples
Diverse tasks and formats
Quality >> quantity

⚖️

Reward Model

Same architecture as LLM
Final layer: scalar score
InstructGPT: ~33k comparisons
Bradley-Terry ranking model

🤖

Modern Alternatives

DPO: skip RL, direct preference
Constitutional AI (Claude)
RLAIF: AI feedback at scale
Reward-model-free: RLVR (reasoning)
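The Bradley-Terry ranking model listed under Reward Model reduces to a simple pairwise loss: the reward model should score the human-preferred response above the rejected one. A minimal sketch, with made-up scalar scores and a hypothetical function name:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the preferred response scores higher."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

print(round(bradley_terry_loss(2.0, 0.0), 4))  # preferred response ranked higher
print(round(bradley_terry_loss(0.0, 0.0), 4))  # tie: loss equals log 2
print(round(bradley_terry_loss(0.0, 2.0), 4))  # ranking is wrong: large loss
```

Only the difference between the two scores matters, so the reward model learns a relative scale rather than absolute values.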

∑ Chapter 4.8 Summary — Transfer Learning & Fine-Tuning

Transfer learning: pre-train on large data, adapt cheaply — reuse expensive knowledge; early layers learn universal features
Feature extraction: freeze backbone, train head only — minutes not days, works with hundreds of labelled examples
Fine-tuning: update all weights with small LR (2e-5 to 5e-5) — catastrophic forgetting risk; use warmup + LLRD
LoRA: ΔW = A·B with rank r ≪ d; train about 0.06% of parameters, approach full fine-tuning quality, zero inference overhead after merging
RLHF: SFT → Reward Model → PPO with KL constraint — the recipe behind ChatGPT, Claude, and Gemini
QLoRA = LoRA on a 4-bit quantised base model: fine-tune a 7B model on a single 24GB consumer GPU

4.9

Chapter 4.9

Generative Models — VAEs, GANs & Diffusion

Generative models learn the distribution of data well enough to create new data indistinguishable from real examples. Every AI-generated image, synthesised voice, and hallucinated protein structure is the output of a generative model. VAEs introduced structured latent spaces, GANs introduced adversarial training, and diffusion models combined the best of both — producing the current state of the art.

Generative vs Discriminative Models Core

Most models covered so far are discriminative: given an input x, predict a label y. They learn P(y|x) — the conditional distribution of outputs given inputs. A discriminative model draws a decision boundary in input space but has no model of what the input data actually looks like.

Generative models instead learn P(x) — the distribution of the data itself. Once you have a good model of P(x), you can sample from it: draw a new x that was never in the training set but looks like it could have been. This is qualitatively different from classification: you are not deciding which bucket an input belongs to, you are learning what valid inputs look like and manufacturing new ones. Applications span every domain: faces, voices, molecules, code, music, 3D shapes.

Discriminative Models	Generative Models
"Is this a cat?" → Yes/No	"Generate a cat image" → new image
Learns P(y|x) (conditional on the input)	Learns P(x) or P(x|y) (the data distribution, optionally conditioned)
One direction: input → label	Creates new data by sampling
Simpler, more data-efficient	Harder to train, more expressive
Classification, regression, NER	Image synthesis, text gen, drug design

Variational Autoencoders (VAEs) In-depth

Kingma & Welling (2013) introduced the VAE, one of the first deep generative models that can be trained end-to-end with backpropagation. A regular autoencoder compresses input x to a latent code z, then reconstructs x. That is useful for compression but not for generation, because the latent space has unpredictable gaps: points between training examples decode to garbage. The VAE fixes this by making the encoder probabilistic: instead of outputting a point z, it outputs a Gaussian distribution N(μ, σ²). The network is then trained to keep this distribution close to a standard normal N(0, I) (via a KL divergence penalty), forcing all latent representations to occupy a continuous, organised neighbourhood around the origin.

The reparameterisation trick is the key engineering insight that makes training possible. You cannot backpropagate through a sampling operation z ~ N(μ, σ²) directly, because sampling is stochastic. The trick: write z = μ + σ·ε where ε ~ N(0,1) is sampled externally. Now μ and σ are deterministic outputs of the encoder, gradients can flow through them, and the stochasticity is isolated in ε which has no parameters to update. This simple algebraic rearrangement is what makes VAE training feasible.
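A short NumPy sketch shows the rearrangement at work. The values of μ and σ stand in for encoder outputs on a single latent dimension and are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5  # stand-ins for encoder outputs (one latent dimension)

eps = rng.standard_normal(100_000)  # noise sampled outside the network
z = mu + sigma * eps                # deterministic function of mu and sigma

# Per-sample gradients are now trivial: dz/dmu = 1 and dz/dsigma = eps
dz_dmu = np.ones_like(eps)
dz_dsigma = eps

print(round(z.mean(), 2), round(z.std(), 2))
```

The samples still follow N(μ, σ²), but every operation between the encoder outputs and z is now an ordinary differentiable one.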

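VAE Loss (ELBO: Evidence Lower Bound)
ELBO (maximised): $\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x) \,\|\, \mathcal{N}(0, I)\big)$
Training loss (minimised): $-\mathcal{L}_{\text{ELBO}} = \text{reconstruction loss} + \mathrm{KL}$, where $\mathrm{KL} = -\tfrac{1}{2} \sum_j \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$
Reconstruction: how faithfully does the decoder recreate x from z?
KL: how close is q(z|x) to N(0, I)? Regularises the latent space.
Reparameterisation: $z = \mu + \sigma \cdot \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$, which makes sampling differentiable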
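The closed-form KL term in the loss above is easy to compute directly. The sketch below writes it in the equivalent rearranged form ½∑(μ² + σ² − log σ² − 1) and pairs it with a squared-error reconstruction term, which is one common choice and an assumption here; both function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO per example, using squared error as the reconstruction term."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, log_var)

mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.zeros((2, 2))  # sigma = 1 in every dimension
kl = kl_to_standard_normal(mu, log_var)
print(kl)
```

An encoding that already equals N(0, I) pays no penalty, and the penalty grows as μ moves away from the origin or σ moves away from 1.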

VAE — Probabilistic encoder outputs distribution, decoder generates from samples

Latent Space Properties Core

The VAE's KL penalty pulls every encoding toward the origin, so class clusters in latent space pack together with few gaps between them. This creates a structured, continuous latent space where points near the encoded data decode to plausible outputs. You can smoothly interpolate between any two encoded examples: a straight line from the latent code of a young face to the code of an old face passes through intermediate codes that decode to faces of intermediate age. This generally fails with regular autoencoders, which leave large empty voids between training examples.
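Interpolation itself is just a straight line between two latent codes. In the sketch below the 2-D codes are hypothetical; in a real model each row of the path would be passed through the trained decoder.

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Evenly spaced points on the straight line from z_a to z_b (endpoints included)."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - ts) * z_a + ts * z_b

z_young = np.array([-1.0, 0.5])  # hypothetical 2-D latent codes
z_old = np.array([1.0, -0.5])
path = interpolate(z_young, z_old)
print(path)
```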

VAE Latent Space — structured, continuous, enables smooth interpolation

Generative Adversarial Networks (GANs) In-depth

Goodfellow et al. (2014) proposed a radically different generative approach: instead of maximising a likelihood, frame generation as a two-player game. A Generator G takes random noise z as input and outputs synthetic data G(z). A Discriminator D receives either a real data point x or a fake G(z) and must decide which is which. G is trained to fool D; D is trained to detect fakes. G never sees real data directly: its only training signal is the gradient that flows back through D's judgement of its samples. The result of this adversarial dynamic, when it works, is a Generator that produces data indistinguishable from real training examples, because any distinguishable fake will be caught by D and penalised.

The theoretical optimum is a Nash equilibrium: G generates data exactly matching the true distribution P(x), and D can do no better than random guessing (P(real) = 0.5 for all inputs). In practice, reaching this equilibrium is notoriously difficult. The training is unstable, sensitive to hyperparameters, and prone to mode collapse — covered in the next section.

GAN Objective: Minimax Game
$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
D maximises: correctly labelling real x as real and fake G(z) as fake
G minimises: pushes D to output 1 (real) for fake data, i.e. fools D
Nash equilibrium: D(x) = 0.5 everywhere; G matches the true data distribution
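The value of the minimax objective can be evaluated for a few hand-picked discriminator outputs. This is only a numerical illustration with made-up D outputs and a hypothetical function name; no networks are trained.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], estimated from batches of D outputs."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Confident, correct discriminator: value close to 0, its upper bound
strong_d = gan_value(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
# Equilibrium: D outputs 0.5 everywhere, so the value is -log 4
equilibrium = gan_value(np.full(4, 0.5), np.full(4, 0.5))
print(round(strong_d, 3), round(equilibrium, 3))
```

D pushes the value up toward 0 while G pulls it down; at the equilibrium described above it settles at −log 4 ≈ −1.386.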

GAN — Generator vs Discriminator in a minimax game

🎨

Notable GAN Variants

DCGAN (2015) — CNN-based GAN
cGAN — conditional generation
CycleGAN — image-to-image
StyleGAN (2018–22) — faces

⚠️

Training Challenges

Mode collapse (next section)
Vanishing generator gradients
D/G balance is fragile
No convergence guarantee

🔧

Stabilisation Techniques

WGAN — Wasserstein distance
Spectral normalisation
Progressive growing
Gradient penalty (WGAN-GP)

GAN Training Challenges & Mode Collapse Core

Mode collapse is the most notorious GAN failure mode. If G discovers one type of output that consistently fools D, it stops exploring the rest of the distribution and produces the same output (or a small set) regardless of the input noise. The discriminator adapts, G shifts to another single mode, and training cycles without covering the full data distribution. The GAN loss gives little direct signal that this is happening: the loss values can look normal while G has abandoned most of the training distribution.

Training instability arises from the balance requirement: D and G must improve at similar rates. If D becomes too powerful early in training, G receives near-zero gradients and cannot improve (D correctly labels everything with high confidence, so the loss for G becomes flat). If G outpaces D early, D provides no meaningful feedback. The WGAN (Wasserstein GAN) addresses both problems by replacing the original loss with the Wasserstein distance, which provides smoother gradients and is more robust to D/G imbalance.
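The flat-loss problem can be verified with a one-line derivative. Writing D(G(z)) = sigmoid(a) for a discriminator logit a, the generator's term log(1 − D(G(z))) has gradient −sigmoid(a) with respect to a, which vanishes as D becomes confident that a sample is fake. A minimal sketch with illustrative logit values:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_grad_wrt_logit(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a), where D(G(z)) = sigmoid(a)."""
    return -sigmoid(a)

# More negative logit = D is more confident the sample is fake
for a in (0.0, -2.0, -5.0, -10.0):
    print(f"D(G(z)) = {sigmoid(a):.5f}   gradient = {generator_grad_wrt_logit(a):.5f}")
```

The gradient magnitude equals D's own output on the fake sample, so a discriminator that is nearly certain leaves G with almost nothing to learn from.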

GAN Mode Collapse — and recovery with stabilisation techniques

Diffusion Models In-depth

Ho et al. (2020) popularised diffusion models with Denoising Diffusion Probabilistic Models (DDPM), building on Sohl-Dickstein et al. (2015); the approach now underpins systems such as Stable Diffusion, DALL-E 3, Midjourney, and Sora. The key idea is elegant: define a forward process that gradually corrupts data by adding Gaussian noise over T steps until the image becomes pure noise, then train a neural network to learn the reverse process by predicting what noise was added at each step, and thus denoising one step at a time. Generation is simply running the reverse process starting from random noise.

Unlike GANs, diffusion training is stable: the objective is a simple regression (predict the noise added at step t), there is no adversarial game to balance, and the model sees all noise levels during training. Unlike VAEs, there is no explicit latent space to constrain, so the generative quality is not limited by the bottleneck. The tradeoff is sampling speed: generating one image requires T=100–1000 denoising steps, each requiring a full forward pass through the network. Faster samplers such as DDIM (deterministic sampling) cut this to tens of steps, and distilled models such as SDXL-Turbo reduce it to 1–4 steps, greatly narrowing the speed gap.

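Diffusion Model: Forward and Reverse
Forward: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$ (add noise, fixed)
Efficient: $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$ with $\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$ (jump to step t)
Reverse: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$ (learned)
Training: minimise $\mathbb{E}\big[\lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2\big]$, i.e. predict the noise added at step t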
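The closed-form jump to step t and the noise-prediction loss fit in a few lines of NumPy. The sketch assumes a linear β schedule from 1e-4 to 0.02 over T = 1000 steps (treat the exact range as illustrative), a random 8×8 array as a toy image, and a placeholder predictor that always outputs zero noise.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (illustrative assumption)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

def q_sample(x0, t, eps):
    """Jump straight to step t (1-indexed): sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_true, eps_pred):
    """Training objective: mean squared error between true and predicted noise."""
    return np.mean((eps_true - eps_pred) ** 2)

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image"
eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, t=500, eps=eps)

# Placeholder "model" that predicts zero noise, just to exercise the loss
loss = noise_prediction_loss(eps, np.zeros_like(eps))
print(alpha_bar[0], alpha_bar[-1], loss)
```

At t = 1 almost all of the signal survives, while at t = T the cumulative product is close to zero, so x_T is effectively pure Gaussian noise and can serve as the starting point for generation.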

Diffusion Model — forward noise addition and learned reverse denoising

Generative Model Comparison Core

Property	VAE	GAN	Diffusion
Training stability	Stable	Unstable (adversarial)	Stable
Sample quality	Blurry (over-smooth)	Sharp (when works)	State-of-the-art
Latent space	Structured, continuous	Less structured	Not explicit
Sampling speed	Fast (1 pass)	Fast (1 pass)	Slow (T=100-1000)
Controllability	Good (interpolation)	Moderate	Excellent (conditioning)
Mode coverage	Good	Mode collapse risk	Good
Best use today	Compression, anomaly detection	Video gen, GAN editing	Image/video synthesis SOTA
Examples	VQ-VAE, VQ-VAE-2	StyleGAN-3, BigGAN	Stable Diffusion, DALL-E 3, Sora

Generative Model Timeline — from VAE to Diffusion dominance

∑ Chapter 4.9 Summary — Generative Models

Generative models learn P(x) — the data distribution itself — enabling new data synthesis; discriminative models learn P(y|x) (labels from inputs)
VAE: encoder → (μ, σ) → z = μ + σ·ε → decoder; KL penalty forces continuous structured latent space — enables interpolation and generation
Reparameterisation trick: z = μ + σ·ε, ε~N(0,1) — makes sampling differentiable for backpropagation
GAN: Generator G(z) fools Discriminator D(x) — minimax game; Nash eq: D(x)=0.5 everywhere; notorious for mode collapse and training instability
WGAN, spectral normalisation, gradient penalty: stabilisation techniques that made large-scale GAN training practical (StyleGAN-3, BigGAN)
Diffusion: forward process adds Gaussian noise over T steps; reverse process (εθ) learns to denoise — training objective: E[||ε − εθ(xₜ,t)||²]
Diffusion = current SOTA for image/video generation — Stable Diffusion, DALL-E 3, Sora — stable training, no mode collapse, excellent conditioning

🎓 Domain 4 Complete — Deep Learning & Neural Networks

Ch 4.1 Perceptron to MLP: weighted sum + step function. XOR killed neural nets for 15 years; hidden layers + non-linearity solved it by learning hierarchical representations.
Ch 4.2 Activation Functions: ReLU for CNNs, GELU for Transformers — sigmoid only for binary output. Non-linearity is what makes stacked layers more powerful than one.
Ch 4.3 Backpropagation: chain rule through the computational graph. Vanishing gradients: the sigmoid derivative is at most 0.25, so gradients can shrink by a factor of 0.25ⁿ across n layers; ReLU fixes this with gradient=1 for positive inputs.
Ch 4.4 Training Deep Networks: He init, BatchNorm, Dropout, AdamW + warmup–cosine LR — the engineering stack that makes 100+ layer networks trainable in practice.
Ch 4.5 CNNs: local receptive fields + weight sharing. ResNet y=F(x)+x solved depth degradation, enabling networks to grow from 16 to 152+ layers.
Ch 4.6 RNNs & LSTMs: hidden state = sequential memory. LSTM gating (forget/input/output) solves vanishing gradients; attention preview leads directly to the Transformer.
Ch 4.7 Transformer: Attention(Q,K,V)=softmax(QKᵀ/√dₖ)V — parallel, direct long-range dependencies. GPT=decoder, BERT=encoder, T5=both.
Ch 4.8 Transfer Learning: pre-train then adapt. LoRA trains 0.1% of parameters with zero inference overhead. RLHF (SFT→RM→PPO) creates aligned helpful LLMs.
Ch 4.9 Generative Models: VAE = structured latent space. GAN = adversarial game. Diffusion = learn to reverse Gaussian noise — Stable Diffusion, DALL-E 3, Sora.

Domain 4 is the mathematical engine behind every frontier AI system. The Transformer (Ch 4.7) is the single most important architecture in AI today: GPT-4, Claude, Gemini, DALL-E, AlphaFold, Whisper, and virtually every LLM run on it. Domain 5 (NLP & LLMs) explores what happens when you scale the Transformer to trillions of tokens. Domain 8 (Agentic AI) shows what happens when you give it tools, memory, and the ability to act in the world.