Neural Networks: Perceptron to MLP
How a single artificial neuron scales into the multi-layer networks that power modern AI
A neural network is not a brain simulation. It is a function approximator: a mathematical machine that learns to map inputs to outputs by adjusting millions of numerical parameters. The biological metaphor is useful for intuition; the mathematics is what actually works.
Biological Inspiration Introductory
Long before computers existed, scientists observed that the human brain processes information through a vast network of interconnected cells called neurons. Each biological neuron receives chemical signals through branching fibres called dendrites, integrates those signals in its cell body (the soma), and, if the combined signal exceeds an internal threshold, fires an electrical impulse along its axon to downstream cells. This "integrate and fire" mechanism, repeated across roughly 86 billion neurons with trillions of connections, gives rise to everything from reflex actions to abstract reasoning.
In 1943, McCulloch and Pitts created the first mathematical model of a neuron: a binary threshold unit that sums its inputs and outputs 1 if the sum exceeds a fixed threshold, 0 otherwise. The mapping from biology to mathematics is direct: dendrites become numeric inputs, synaptic strengths become weights, the soma becomes a weighted summation, and the axon firing becomes an activation function. This abstraction (inputs → weighted sum → activation) is still the foundation of every neural network today.
The analogy has important limits. Biological neurons communicate via discrete spikes; artificial neurons use continuous real-valued outputs. Biological learning involves complex biochemical processes; artificial networks learn by gradient descent on a loss function. The phrase "inspired by, not modelled after" is exactly right. Deep learning borrowed the high-level architecture of layered computation and discarded almost everything else in favour of mathematical tractability.
The Perceptron In-depth
In 1957, Frank Rosenblatt at the Cornell Aeronautical Laboratory built the Perceptron, the first machine specifically designed to learn from examples. The idea was elegantly simple: represent a decision-making unit as a weighted sum of inputs passed through a step function. If the total weighted input exceeds a threshold, the unit fires (outputs 1); otherwise it stays silent (outputs 0). Crucially, the weights could be adjusted automatically when the unit made a mistake; this was the first learning algorithm for a neural model.
The perceptron structure has four components. First, numeric inputs x₁, x₂, …, xₙ, which could be pixel intensities, sensor readings, or any measurable feature. Second, a weight wᵢ for each input, representing how important that feature is. Third, a bias b that shifts the decision boundary independently of the inputs. Fourth, a step activation function that converts the raw weighted sum into a binary decision.
The perceptron learning rule is the ancestor of gradient descent. After every prediction, if the prediction was correct, do nothing. If the network predicted 0 but the true label was 1, increase each weight by a small fraction of the corresponding input (and increase the bias by the same fraction). If the network predicted 1 but should have predicted 0, do the reverse. This simple rule has a remarkable theoretical guarantee: if the training data is linearly separable, the perceptron will converge to a correct solution in a finite number of steps. This is the Perceptron Convergence Theorem.
Tracing the AND logic gate concretely: AND outputs 1 only when both inputs are 1. Start with all weights and the bias at 0. When we show (1,1)→1, the weighted sum is 0, which does not exceed the threshold of 0, so the network predicts 0 and we add the inputs to the weights. After a few cycles the perceptron settles on a separating solution. One valid solution is w₁=1, w₂=1, b=−1.5, which separates AND's one positive case from the three negatives by the line x₁ + x₂ = 1.5.
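The learning rule and the AND trace above can be sketched in a few lines of plain Python. This is a minimal illustration, not Rosenblatt's original procedure: the function names, the learning rate of 1.0, and the epoch cap are assumptions chosen for readability, and the weights it settles on need not equal the w₁=1, w₂=1, b=−1.5 solution quoted above.

```python
def step(z):
    # Fires only when the weighted sum strictly exceeds the threshold of 0.
    return 1 if z > 0 else 0


def train_perceptron(data, lr=1.0, max_epochs=100):
    """Perceptron rule: w_i <- w_i + lr * (y - y_hat) * x_i (hypothetical helper)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):          # epoch cap guards against non-separable data
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = step(w[0] * x1 + w[1] * x2 + b)
            error = y - y_hat            # +1, 0, or -1
            if error != 0:
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                b += lr * error
                mistakes += 1
        if mistakes == 0:                # a full clean pass means we have converged
            break
    return w, b


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_DATA)
predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in AND_DATA]
print(w, b, predictions)
```

Any weights that put (1,1) on one side of the line and the other three points on the other side are equally valid; the rule stops at the first such solution it reaches.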
The convergence theorem only guarantees convergence if the data is linearly separable. For non-separable data, the algorithm loops forever, cycling through updates that never stabilise. Always set a maximum epoch limit and check whether loss has stopped decreasing.
The XOR Problem & MLP Motivation In-depth
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis of what single-layer networks could and could not compute. Their central result was devastating: a single-layer perceptron cannot learn the XOR function. XOR outputs 1 when exactly one of two inputs is 1: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. The proof is geometric: there is no single straight line that can separate the two positive cases from the two negative cases. The four points form a checkerboard pattern that no single hyperplane can split correctly.
The Minsky-Papert result triggered the first AI winter: funding dried up and neural network research sat dormant for roughly 15 years. The irony is that the paper itself pointed toward the solution: adding hidden layers could overcome these limitations, but the authors doubted an efficient learning algorithm for such networks could be found. That algorithm, backpropagation, popularised by Rumelhart, Hinton, and Williams in 1986, became the key that unlocked the field.
The geometric resolution is illuminating. With a hidden layer, the network first learns two intermediate linear boundaries: one that isolates (1,1) and one that isolates (0,0). The hidden layer outputs encode whether the input is in each region. The output layer then combines these hidden representations to produce the XOR decision, a task that is linearly separable in the transformed space. This is the core insight of deep learning: each layer transforms the data into a representation where the next layer's job becomes easier.
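The two-boundary construction described above can be written out by hand. The weights below are one hand-picked choice made for illustration (they are not learned, and the name xor_mlp is hypothetical): one hidden unit isolates (1,1), the other isolates (0,0), and the output unit fires when neither is active.

```python
def step(z):
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 1.5)       # hidden unit 1: fires only for (1, 1)
    h2 = step(-x1 - x2 + 0.5)      # hidden unit 2: fires only for (0, 0)
    # In (h1, h2) space the problem is linearly separable:
    # output 1 exactly when neither hidden unit fired.
    return step(-h1 - h2 + 0.5)


for a in (0, 1):
    for b in (0, 1):
        print((a, b), "->", xor_mlp(a, b))
```

The step function in the hidden layer is what makes this work; replacing it with the identity would reduce the whole network to a single linear classifier again.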
The XOR problem is not just a failure of the perceptron; it is a proof that any single linear classifier has a fundamental expressiveness limit. The solution is not a better linear classifier. The solution is composition: learn intermediate nonlinear representations, then combine them.
Even a deep stack of linear layers with no activation functions cannot solve XOR. A stack of linear transformations is itself a single linear transformation: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). Non-linear activation functions between layers are the essential ingredient: without them, depth buys you nothing.
Multi-Layer Perceptron (MLP) In-depth
The Multi-Layer Perceptron (MLP), also called a feedforward network or fully connected network, adds one or more hidden layers between the input and output. Every neuron in each layer connects to every neuron in the next layer. The key architectural decision is that a non-linear activation function is applied after each layer's weighted sum, breaking the linear chain that would otherwise collapse the whole network into a single affine transformation.
Each hidden layer acts as a feature detector. The first hidden layer learns combinations of the raw inputs; in image recognition, this might correspond to edges or local contrasts. The second hidden layer learns combinations of those features, perhaps corners and junctions. Deeper layers learn increasingly abstract concepts. This hierarchical representation learning is why depth is valuable: rather than memorising the training set, the network learns what to look for.
Architecture notation is typically given as a list of layer sizes. [4, 8, 8, 3] means 4 inputs, two hidden layers of 8 neurons, and 3 outputs. The total parameter count: for each layer, (inputs to that layer) × (neurons in that layer) weights plus one bias per neuron. A [4, 5, 4, 3] network has (4×5+5) + (5×4+4) + (4×3+3) = 25 + 24 + 15 = 64 parameters.
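The counting rule above is easy to check mechanically. The helper below is a small sketch (count_parameters is a hypothetical name) that walks consecutive pairs of layer sizes and adds weights plus biases for each layer.

```python
def count_parameters(layer_sizes):
    """Total weights + biases for a fully connected network given as a size list."""
    total = 0
    for d_in, d_out in zip(layer_sizes, layer_sizes[1:]):
        total += d_in * d_out + d_out    # weight matrix plus one bias per neuron
    return total


print(count_parameters([4, 5, 4, 3]))   # the worked example from the text
print(count_parameters([4, 8, 8, 3]))   # the [4, 8, 8, 3] architecture
```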
The most common beginner mistake when building an MLP is stacking nn.Linear layers without activation functions between them. Without non-linearity, no matter how many layers you add, the network can only learn linear decision boundaries. Always add nn.ReLU() between every pair of linear layers.
The Forward Pass In-depth
The forward pass is the computation that transforms an input vector into a prediction by passing it sequentially through each layer. Understanding the forward pass precisely, including the shapes of every matrix and vector, is essential for debugging, designing architectures, and reasoning about computational cost.
For each layer l, the computation has two steps. First, compute the pre-activation Z by multiplying the previous layer's output A by the weight matrix W and adding a bias b. Second, apply the activation function f element-wise to Z to produce the output A of this layer. The output of the final layer is the network's prediction.
The most common forward pass error is a matrix dimension mismatch. If layer l has input dimension d and output dimension k, then W has shape [d × k] and the output has shape [batch × k]. A common mistake is transposing weight matrices incorrectly or confusing input/output sizes when building layers manually without nn.Linear.
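To make the two steps and the shape bookkeeping concrete, here is a dependency-free sketch using nested lists in place of tensors. The helper names (matmul, dense, relu) and the toy weights are assumptions made for illustration; in practice a library such as PyTorch does this for you. It follows the convention used in this chapter: activations are [batch × d] and W is [d × k].

```python
def matmul(a, b):
    # a: [n x d], b: [d x k] -> [n x k]; fail loudly on a dimension mismatch.
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(z):
    return [[max(0.0, v) for v in row] for row in z]


def dense(a_prev, w, b, activation=None):
    # Step 1: pre-activation Z = A_prev . W + b (bias broadcast over the batch).
    z = [[v + b[j] for j, v in enumerate(row)] for row in matmul(a_prev, w)]
    # Step 2: element-wise activation A = f(Z).
    return activation(z) if activation else z


X = [[1.0, 2.0], [0.0, -1.0]]                 # [batch=2 x d=2]
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]]      # [2 x 3]
b1 = [0.0, 1.0, -1.0]
W2 = [[1.0], [1.0], [1.0]]                    # [3 x 1]
b2 = [0.5]

H = dense(X, W1, b1, relu)                    # hidden activations, [2 x 3]
Y = dense(H, W2, b2)                          # network output, [2 x 1]
print(H, Y)
```

Passing W1 with its rows and columns swapped would trip the assertion in matmul, which is the same class of error a framework reports as a shape mismatch.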
Universal Approximation Theorem Core
In 1989, George Cybenko proved a remarkable result about feedforward networks: a network with a single hidden layer using sigmoid-like activation functions, and sufficiently many neurons, can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. Hornik (1991) extended this to any non-constant, bounded, continuous activation function. This result, the Universal Approximation Theorem, gives MLPs their theoretical power: they are, in principle, general-purpose function approximators.
The intuition is geometric. A single neuron with a step function carves out a half-space. With many neurons, you can approximate arbitrary regions. With smooth activations, you build up smooth functions by summing many "bumps". As you add more neurons, the approximation gets finer: the diagram below shows a coarse 2-neuron step-wise approximation converging to the true function as width increases to 32.
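The width-versus-accuracy intuition can be reproduced numerically in one dimension. The sketch below builds a single hidden layer of step neurons by hand (no training) so that their weighted sum forms a staircase following the target function. The construction, the function names, and the choice of sin on [0, π] as the target are illustrative assumptions, not the construction used in Cybenko's proof.

```python
import math


def step(z):
    return 1.0 if z >= 0 else 0.0


def staircase_network(f, lo, hi, n_hidden):
    """One hidden layer of n_hidden step neurons approximating f on [lo, hi]."""
    pieces = n_hidden + 1
    width = (hi - lo) / pieces
    values = [f(lo + (i + 0.5) * width) for i in range(pieces)]   # target level per piece
    thresholds = [lo + i * width for i in range(1, pieces)]       # one per hidden neuron
    jumps = [values[i] - values[i - 1] for i in range(1, pieces)] # output-layer weights

    def network(x):
        return values[0] + sum(j * step(x - t) for j, t in zip(jumps, thresholds))

    return network


def max_error(f, network, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - network(x)) for x in xs)


coarse = max_error(math.sin, staircase_network(math.sin, 0.0, math.pi, 2), 0.0, math.pi)
fine = max_error(math.sin, staircase_network(math.sin, 0.0, math.pi, 32), 0.0, math.pi)
print(coarse, fine)
```

The error shrinks as the hidden layer widens, which is the existence claim of the theorem in miniature; note that the weights here were written down directly rather than found by gradient descent.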
The theorem has important caveats practitioners often overlook. It says a solution exists; it says nothing about whether gradient descent will find it, how many samples are needed, or whether the required network is computationally feasible. In practice, a single very wide layer may require exponentially more neurons than a deeper network to represent the same function. This is the practical motivation for depth: depth can enable exponentially more efficient representations. In CNNs, this manifests as hierarchical feature detection: edges in layer 1, textures in layer 2, object parts in layer 3.
The Universal Approximation Theorem says any continuous function can be approximated; it does not say gradient descent will find that approximation. Expressiveness and learnability are different things. This is why depth, regularisation, and data quantity matter in practice.
Network Anatomy & Hyperparameters Core
Understanding the key design choices of an MLP, and the consequences of setting them poorly, is essential for practitioners. The table below summarises the principal hyperparameters, their typical ranges, and the effects of extreme values. Each will be explored in depth in subsequent chapters; develop an intuition for the tradeoffs now.
| Hyperparameter | Definition | Typical Range | Effect if Too Small | Effect if Too Large |
|---|---|---|---|---|
| Number of layers (depth) | How many hidden layers | 2–50+ (hundreds with residual connections) | Underfits; limited representational power | Vanishing gradients; harder to train without tricks |
| Width (neurons per layer) | Nodes per hidden layer | 64–4096 | Underfits; insufficient capacity | Memory intensive; increased overfitting risk |
| Activation function | Non-linearity applied between layers | ReLU, GELU, Tanh, Sigmoid | n/a | n/a (see Chapter 4.2) |
| Batch size | Samples per gradient update | 16–2048 | Noisy gradients; slow wall-clock time | Sharp minima; poor generalisation to test set |
| Learning rate | Gradient step size (α) | 1e-4 to 1e-2 | Very slow convergence; appears stuck | Divergence; NaN loss; oscillating training |
| Dropout rate | Fraction of neurons randomly zeroed each step | 0.1–0.5 | No regularisation; model memorises training data | Too much information loss; underfitting |
Parameter Count Formula
For each layer with d_in inputs and d_out outputs:
- Weights: d_in × d_out
- Biases: d_out
- Total: ∑ (d_in × d_out + d_out), summed over all layers
Modern Architecture Scale
Reference points for context:
- MNIST MLP: ~0.5M params
- ResNet-50: ~25M params
- GPT-2 (smallest variant): ~117M params
- GPT-4 (est.): ~1.8T params
Where MLPs Appear Today
MLPs are fundamental building blocks:
- Feed-forward layers in Transformers
- Classification heads in CNNs
- Value/policy networks in RL
- Embedding projections
∑ Chapter 4.1 Summary: Neural Networks, Perceptron to MLP
- Biological inspiration: dendrites (inputs) → soma (weighted sum + threshold) → axon (output); artificial neurons abstract this as inputs → weighted sum → activation function
- Perceptron: ŷ = step(w·x + b); learns linear decision boundaries only; update rule wᵢ ← wᵢ + α(y−ŷ)xᵢ; converges if and only if data is linearly separable
- XOR problem (Minsky & Papert, 1969): a single-layer network cannot solve non-linearly separable problems; this triggered the first AI winter, which lasted roughly 15 years
- Solution: add hidden layers with non-linear activation functions; each layer learns intermediate representations; without non-linearity, stacked linear layers collapse to a single linear layer
- Forward pass: Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾, then A⁽ˡ⁾ = f(Z⁽ˡ⁾), repeated layer by layer from input to output
- Universal Approximation Theorem: an MLP with sufficient width can approximate any continuous function on a compact domain, but expressiveness ≠ learnability; depth can make representation exponentially more efficient
- Parameter count = ∑ (layer_in × layer_out + layer_out); even modest networks have thousands of trainable parameters; modern models have billions
Without non-linear activation functions, stacking any number of linear layers produces exactly one linear transformation. It is the activation function, applied element-wise after every layer, that gives neural networks their ability to learn arbitrarily complex mappings. Choosing the right activation is one of the most consequential architectural decisions you will make.
Why Non-Linearity Matters Core
Consider two linear layers stacked directly: Z = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). The result is another linear function with a combined weight matrix W = W₂W₁ and a combined bias. No matter how many linear layers you stack, the composition remains a single affine (linear + shift) transformation. This means a deep linear network is no more expressive than a logistic regression: it can only learn straight hyperplane decision boundaries.
An activation function f applied between layers breaks this collapse: A₂ = W₂ · f(W₁x + b₁) + b₂. Now the result is genuinely non-linear and the two layers are no longer collapsible into one. A good activation function must satisfy three practical requirements: it must be non-linear (obviously), differentiable almost everywhere (so gradients can flow during backpropagation), and computationally cheap (it is applied millions of times per forward pass).
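The collapse identity can be verified numerically. The sketch below uses small hand-picked 2×2 matrices (arbitrary values chosen so the arithmetic is exact) and plain-Python helpers whose names are assumptions; it checks that two stacked linear layers match the single combined layer, and that inserting ReLU breaks the equivalence.

```python
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, -1.0]
W2, b2 = [[2.0, 1.0], [1.0, 3.0]], [0.5, 0.0]

# Combined single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)


def two_linear(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)


def one_linear(x):
    return add(matvec(W, x), b)


def with_relu(x):
    hidden = [max(0.0, v) for v in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)


x = [-3.0, 0.0]
print(two_linear(x), one_linear(x), with_relu(x))
```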
Sigmoid In-depth
The sigmoid function was the default activation in neural networks throughout the 1980s and 1990s. It takes any real-valued input and squashes it into the range (0, 1), which made it a natural fit for modelling probabilities. The S-shaped curve rises steeply near z = 0 and flattens toward 0 for very negative inputs and toward 1 for very positive inputs. This flattening is the source of its central problem in deep networks.
The derivative of σ(z) is elegantly expressed as σ(z)(1 − σ(z)). This has a maximum of 0.25 at z = 0, and falls to near-zero as |z| grows large. In a deep network, gradients are multiplied together as they propagate backward through layers. If most neurons saturate (i.e., z is large in magnitude), each multiplication by a derivative near 0 shrinks the gradient exponentially; this is the vanishing gradient problem. Even in the best case, a network with 10 sigmoid layers scales the gradient by at most 0.25¹⁰ ≈ 0.000001 before it reaches the first layer.
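The arithmetic above is quick to confirm. This small sketch (function names are assumptions) evaluates the sigmoid derivative at its peak and in a saturated region, then raises each to the tenth power to mimic ten layers of multiplication.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # sigma'(z) = sigma(z) * (1 - sigma(z))


best_case = sigmoid_grad(0.0) ** 10      # every layer sits exactly at the peak
saturated = sigmoid_grad(5.0) ** 10      # every layer has |z| = 5
print(sigmoid_grad(0.0), best_case, saturated)
```

The best case is already below one millionth; once neurons saturate, the product is many orders of magnitude smaller still.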
A second, subtler problem is that sigmoid outputs are never negative: they are always in (0, 1). This means the gradients for all weights feeding a given neuron in the next layer share the same sign (all positive or all negative), which causes zig-zag updates in weight space. Tanh, which we examine next, solves this by being zero-centred. Today, sigmoid is used almost exclusively at the output layer of binary classifiers (where you genuinely want a probability) and in the gating mechanisms of LSTMs.
Using sigmoid as the activation for hidden layers in deep networks almost always causes the vanishing gradient problem. If your training loss stops decreasing very early and the gradients of the first layers are near zero when inspected, this is the most likely cause. Switch to ReLU or GELU for all hidden layers; reserve sigmoid for binary classification output only.
Tanh Core
Tanh (hyperbolic tangent) is a scaled and shifted version of sigmoid: tanh(z) = 2σ(2z) − 1. It squashes inputs to (−1, +1) instead of (0, 1). The critical improvement over sigmoid is that tanh is zero-centred: its outputs are balanced around zero. When the activation outputs are always positive (as in sigmoid), the gradient updates to all weights in the next layer always have the same sign. This forces the optimiser into a zig-zag path through weight space. Tanh's zero-centred outputs allow positive and negative gradients, enabling more direct paths toward the minimum.
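Both claims, the identity and the zero-centring, can be checked with the standard library. The symmetric list of inputs is an arbitrary choice for illustration, and tanh_via_sigmoid is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0   # tanh(z) = 2 * sigma(2z) - 1


inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]      # symmetric around zero
mean_sigmoid = sum(sigmoid(z) for z in inputs) / len(inputs)
mean_tanh = sum(math.tanh(z) for z in inputs) / len(inputs)
print(mean_sigmoid, mean_tanh)
```

For inputs balanced around zero, sigmoid outputs average 0.5 while tanh outputs average 0, which is what "zero-centred" means in practice.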
Tanh still suffers from the vanishing gradient problem for large |z|, where the derivative tanh'(z) = 1 − tanh²(z) approaches zero. The maximum derivative is 1.0 at z = 0, four times larger than sigmoid's maximum of 0.25, which makes it somewhat less prone to gradient collapse. Tanh remains the standard activation for the candidate and output states inside LSTM and GRU cells (the gates themselves use sigmoid), where its zero-centred outputs help regulate cell state updates.
ReLU Family In-depth
The Rectified Linear Unit (ReLU) is disarmingly simple: pass the input through unchanged if it is positive, otherwise output zero. This single function, introduced to deep learning at scale by AlexNet in 2012, transformed the field. Before ReLU, training deep networks beyond 5–6 layers was nearly impossible due to vanishing gradients from sigmoid and tanh. ReLU's constant gradient of 1 for positive inputs means gradients flow freely through activated neurons, which, together with later techniques such as residual connections, enables networks of 50, 100, or even 1000 layers.
ReLU also introduces sparse activation: when pre-activations are roughly centred on zero, about 50% of neurons output exactly zero for any given input. This sparsity provides implicit regularisation, since only the "relevant" neurons participate in each forward pass. However, this same property creates the Dead ReLU problem: if a neuron's pre-activation is always negative (e.g., because the bias drifts negative during training), its gradient is permanently zero and it never recovers. This can kill 10–40% of neurons in poorly initialised or high-learning-rate networks.
Leaky ReLU fixes dead neurons by allowing a small negative slope α (typically 0.01) for z < 0, ensuring the gradient is never exactly zero. ELU (Exponential Linear Unit) goes further with a smooth exponential curve for negative inputs, producing outputs closer to zero-mean, which can improve convergence. PReLU (Parametric ReLU) treats α as a learnable parameter, letting the network decide the optimal negative slope per channel.
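The members of this family differ only in what they do for negative inputs, which a few scalar definitions make plain. These are illustrative scalar versions (names and the default alpha values are assumptions matching the typical settings mentioned above), not a framework's vectorised implementations.

```python
import math


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0          # zero gradient for negative inputs: neurons can die


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha        # small but never exactly zero


def elu(z, alpha=1.0):
    # Smooth exponential branch that saturates at -alpha for very negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


for z in (-5.0, -2.0, 0.5):
    print(z, relu(z), leaky_relu(z), elu(z), relu_grad(z), leaky_relu_grad(z))
```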
If you use a high learning rate or bad weight initialisation, a large fraction of ReLU neurons can get stuck with permanently negative pre-activations: the "dead ReLU" problem. Gradients through these neurons are exactly zero, so they never recover. Signs: training loss stops improving but there is no NaN; inspecting neuron outputs shows many always-zero activations. Fix: use Leaky ReLU, reduce learning rate, or use proper He initialisation (nn.init.kaiming_normal_).
GELU & Modern Activations Core
The Gaussian Error Linear Unit (GELU) was introduced by Hendrycks and Gimpel (2016) and quickly became the dominant activation in Transformer-based models. The key motivation: ReLU has a hard kink at z = 0, where the derivative jumps discontinuously from 0 to 1. GELU replaces this with a smooth curve by weighting the input by the probability that a standard Gaussian variable falls below it: f(z) = z · Φ(z), where Φ is the Gaussian CDF.
In practice, GELU is often computed via a fast approximation: f(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)]). This smooth transition means GELU has a continuous gradient everywhere, which empirically improves training stability for deep Transformers. GPT-2, GPT-3, BERT, and BART use GELU in their feed-forward sublayers, and many later large language models use GELU or a closely related gated variant.
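The exact definition and the tanh approximation can be compared directly, since Φ(z) = 0.5(1 + erf(z/√2)) is available through the standard library's error function. The function names and the grid of test points are assumptions made for this sketch.

```python
import math


def gelu_exact(z):
    # z * Phi(z), with the Gaussian CDF written in terms of erf.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_tanh(z):
    # The tanh-based approximation quoted in the text.
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


grid = [i / 100.0 for i in range(-500, 501)]          # -5.00 ... 5.00
worst_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
lowest = min(gelu_exact(z) for z in grid)
print(worst_gap, lowest)
```

On this grid the two curves stay very close, and the lowest value of the exact curve is the small negative dip that distinguishes GELU from ReLU.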
Swish (also called SiLU, Sigmoid Linear Unit) is another smooth variant: f(z) = z · σ(z). The input gates itself: neurons with large positive values pass through at full strength, while negative values are softly suppressed. Swish is used in EfficientNet, MobileNetV3, and several LLM variants. Mish extends this idea: f(z) = z · tanh(softplus(z)), and has achieved state-of-the-art performance on some computer vision benchmarks. All three share the property of being smooth, non-monotonic, and having a small negative dip (a minimum of roughly −0.2 to −0.3) for moderately negative inputs, which is sometimes credited with a weak self-normalising effect.
GELU is to Transformers what ReLU is to CNNs: the empirically dominant choice. Its smooth gradient everywhere avoids dead neurons and allows stable training at great depth. If you are building any Transformer-based model (language, vision, or multimodal), start with GELU.
Softmax In-depth
Softmax is not an activation function in the same sense as ReLU or GELU: it is not applied element-wise independently to each neuron. Instead, it is a normalisation operation over an entire output vector, converting a vector of raw logits (unbounded real numbers) into a valid probability distribution. Every output is positive, and all outputs sum to exactly 1.0, making the output directly interpretable as class probabilities.
Softmax amplifies differences between logits. The largest logit receives a disproportionately high probability, because the exponentiation makes differences exponential before normalisation. With logits [3.0, 1.0, 0.5], the first class dominates the probability. Subtracting the maximum logit before computing exponentials (the max trick) prevents numerical overflow without changing the output: softmax(z − max(z)) = softmax(z).
The temperature parameter T controls the sharpness of the distribution: softmax(z/T). With T → 0, the distribution collapses to a one-hot (greedy) selection of the highest logit. With T → ∞, it becomes uniform. In LLM token sampling, temperature is a key knob: T = 0.7 gives creative but coherent text; T = 1.5 gives more random, diverse outputs. At training time T = 1 is almost always used.
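The max trick and the temperature knob fit in one short function. This is a plain-Python sketch for a single logit vector; the function name and signature are assumptions, and a framework implementation would operate on batched tensors.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # max trick: shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.5]
print(softmax(logits))                    # first class dominates
print(softmax(logits, temperature=0.1))   # sharper, close to one-hot
print(softmax(logits, temperature=100.0)) # flatter, close to uniform
print(softmax([1000.0, 999.0]))           # no overflow thanks to the shift
```

Without the shift, math.exp(1000.0) would raise an OverflowError; with it, the largest exponent is always exp(0) = 1.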
In PyTorch, never apply softmax before passing logits to nn.CrossEntropyLoss: this loss already applies log-softmax internally for numerical stability. Applying softmax beforehand means the loss effectively computes log_softmax(softmax(logits)), so softmax is applied twice and the loss no longer measures what you intended. Always pass raw logits to CrossEntropyLoss.
A very common bug: applying torch.softmax(logits, dim=-1) in the model's forward method, then passing the result to nn.CrossEntropyLoss. Since CrossEntropyLoss internally calls log_softmax, you end up computing log_softmax(softmax(logits)) instead of log_softmax(logits). The probabilities are squeezed into (0, 1) before the second softmax, so even a confident, correct prediction yields a large loss and weak gradients, and training stalls without raising any error. Always output raw logits from the model.
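The effect of the double softmax can be seen without any framework. In the sketch below, cross_entropy_from_logits is a hypothetical stand-in that mimics the behaviour described above (log-softmax followed by the negative log-probability of the target class); it is not PyTorch's implementation, and the logits are made-up values representing a confident, correct prediction.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]


def cross_entropy_from_logits(logits, target):
    # Stand-in for a loss that expects raw logits and applies log-softmax itself.
    return -log_softmax(logits)[target]


logits = [8.0, 0.0, 0.0]      # confident and correct for class 0
target = 0

correct = cross_entropy_from_logits(logits, target)
buggy = cross_entropy_from_logits(softmax(logits), target)   # softmax applied twice
print(correct, buggy)
```

With raw logits the loss is close to zero, as it should be for a confident correct answer. With the extra softmax the loss stays above 0.5 no matter how confident the model becomes, which is why training appears to plateau.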
Choosing an Activation Function Core
The choice of activation function is one of the most important and most misunderstood hyperparameters. The practical rule is simple: use ReLU as your baseline for CNNs and general MLPs, switch to GELU for anything Transformer-based, use Sigmoid only at binary classification output, and use Softmax only at multi-class output. The table below summarises when and why to use each.
| Activation | Range | Vanishing Gradient | Zero-Centred | Where Used | Default Choice? |
|---|---|---|---|---|---|
| Sigmoid | (0, 1) | Yes, severe | No | Binary output, LSTM gates | Only for binary output |
| Tanh | (−1, 1) | Yes, moderate | Yes | LSTM/GRU candidate states, RNNs | Legacy RNNs only |
| ReLU | [0, ∞) | No (positive inputs) | No | CNNs, MLPs (pre-2018) | Yes, still for CNNs |
| Leaky ReLU | (−∞, ∞) | No | Near | When dead neurons are a problem | Good fallback |
| GELU | (−0.17, ∞) | No | Near | GPT, BERT, Transformers | Yes, for Transformers |
| Swish/SiLU | (−0.28, ∞) | No | Near | EfficientNet, some LLMs | Yes, for modern CNNs |
| Softmax | (0, 1), Σ = 1 | n/a | No | Multi-class output layer only | Only for output |
∑ Chapter 4.2 Summary: Activation Functions
- Without non-linear activations, stacking layers = still a single linear transformation; depth buys nothing expressively
- Sigmoid σ(z) = 1/(1 + e^(−z)): saturates → vanishing gradients; not zero-centred → use only for binary classification output
- Tanh: same saturation problem but zero-centred → better gradient flow; still used for candidate states in LSTM/GRU cells
- ReLU: max(0, z); fast, no saturation for positive inputs, default for CNNs; suffers from Dead ReLU (permanently zero neurons)
- Leaky ReLU fixes dead neurons with small negative slope α; good fallback when ReLU causes training issues
- GELU: smooth ReLU variant f(z) = z·Φ(z); used in GPT, BERT, and many modern Transformers; smooth gradient everywhere
- Softmax: multi-class output only; temperature T controls sharpness of probability distribution; never apply before CrossEntropyLoss
Backpropagation is not magic; it is the chain rule of calculus applied systematically to a computational graph. The genius is not the mathematics (which dates to Leibniz) but the engineering insight that all gradients in a network can be computed in a single backward pass, at a cost comparable to one forward pass. Without this, deep learning would be computationally impossible.
Intuition Core
The central question of training is: "For every weight in the network, how much does the loss change if I nudge that weight by a tiny amount?" This quantity, the partial derivative of the loss with respect to each weight, is the gradient. To reduce the loss, we move each weight in the direction opposite to its gradient.
The naive approach is finite differences: for each weight w, compute (loss(w + ε) − loss(w)) / ε. This gives an approximate gradient for that weight. The problem is scale. GPT-4 has an estimated 1.8 trillion parameters. Computing one gradient update this way requires 1.8 trillion forward passes; at, say, 1 second per pass on a cluster, that is 57,000 years per update step. Completely impossible.
Backpropagation solves this by computing all gradients simultaneously in a single backward pass through the computational graph. The backward pass costs only a small constant multiple of the forward pass, because it visits the same operations in reverse. The key ingredient is the chain rule, which tells us how to compose local gradients as they flow backward from the loss to the inputs.
Finite differences: O(W) forward passes for W weights. Backpropagation: ONE backward pass for all W weights simultaneously. This efficiency gap of many orders of magnitude is what makes modern deep learning possible.
Computational Graph In-depth
Every computation a neural network performs can be represented as a directed acyclic graph (DAG). Each node in the graph is a mathematical operation: addition, multiplication, exp, sigmoid, max. Each directed edge carries a tensor value from one operation to the next. The leaf nodes on the left are the inputs and weights; the single root node on the right is the scalar loss.
The forward pass is data flowing left to right through this graph: compute z₁ = w × x, then z₂ = z₁ − y, then L = z₂². Each intermediate value is stored (this is why training uses more memory than inference). The backward pass is gradients flowing right to left, starting with ∂L/∂L = 1 and applying the chain rule at each node. Every node knows how to compute its local gradient (e.g., the gradient through a multiplication node is the other operand), and backprop just multiplies local gradients together along each path.
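The three-node graph above is small enough to differentiate by hand, which makes it a useful sanity check. The sketch below runs the forward pass, walks the graph backward node by node, and compares the result with the finite-difference estimate. The function names and the sample values w = 3, x = 2, y = 1 are assumptions made for illustration.

```python
def forward(w, x, y):
    z1 = w * x            # multiplication node
    z2 = z1 - y           # subtraction node
    loss = z2 ** 2        # squaring node
    return loss, (z1, z2) # cache intermediates for the backward pass


def backward(x, cache):
    _, z2 = cache
    dL_dL = 1.0                       # seed gradient at the root
    dL_dz2 = dL_dL * 2.0 * z2         # local gradient of z2**2 is 2*z2
    dL_dz1 = dL_dz2 * 1.0             # local gradient of (z1 - y) w.r.t. z1 is 1
    dL_dw = dL_dz1 * x                # multiplication node: gradient is the other operand
    return dL_dw


def finite_difference(w, x, y, eps=1e-6):
    return (forward(w + eps, x, y)[0] - forward(w, x, y)[0]) / eps


w, x, y = 3.0, 2.0, 1.0
loss, cache = forward(w, x, y)
analytic = backward(x, cache)
numeric = finite_difference(w, x, y)
print(loss, analytic, numeric)
```

The analytic gradient needs one backward walk regardless of how many weights there are, whereas the finite-difference estimate needs an extra forward pass per weight.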
PyTorch builds this graph dynamically as you execute Python code: every tensor operation with requires_grad=True records itself into the graph. When you call loss.backward(), PyTorch traverses the graph in reverse topological order and accumulates gradients into each leaf tensor's .grad attribute. JAX uses a slightly different approach (function transformation) but the computational graph concept is identical.
Chain Rule in Neural Networks In-depth
The chain rule is calculus's rule for differentiating composed functions: if L = f(g(x)), then dL/dx = (dL/dg) · (dg/dx). In a neural network, every layer is a composed function. The loss is a composition of all the layer operations stacked together. Backprop is simply the chain rule applied methodically in reverse order through every layer.
For a single layer l with pre-activation Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ · W⁽ˡ⁾ + b⁽ˡ⁾ and output A⁽ˡ⁾ = f(Z⁽ˡ⁾), the gradient of the loss with respect to the weights W⁽ˡ⁾ decomposes into three factors by the chain rule: how the loss changes with the activation, how the activation changes with the pre-activation (the derivative of the activation function), and how the pre-activation changes with the weights (which is simply A⁽ˡ⁻¹⁾). Multiplied together, these give the weight gradient for that layer.
The error signal δ⁽ˡ⁾ is the gradient of the loss with respect to the pre-activation Z⁽ˡ⁾. It packages the chain rule product up to layer l. To propagate backward one more layer, we multiply δ⁽ˡ⁾ by the weight matrix W⁽ˡ⁾ transposed (to "route" gradients back to the correct inputs), then element-wise multiply by f'(Z⁽ˡ⁻¹⁾), the local derivative of the activation. This recursion continues all the way to the first layer.
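Written out as equations, the three-factor decomposition and the backward recursion look as follows. This assumes the row-vector convention used for the forward pass in this chapter, where activations have shape [batch × d] and W⁽ˡ⁾ has shape [d × k]; with column vectors the transposes move to the other side.

```latex
\begin{aligned}
Z^{(l)} &= A^{(l-1)} W^{(l)} + b^{(l)}, \qquad A^{(l)} = f\!\left(Z^{(l)}\right) \\[4pt]
\delta^{(l)} &\equiv \frac{\partial L}{\partial Z^{(l)}}
   = \frac{\partial L}{\partial A^{(l)}} \odot f'\!\left(Z^{(l)}\right) \\[4pt]
\frac{\partial L}{\partial W^{(l)}} &= \left(A^{(l-1)}\right)^{\top} \delta^{(l)},
   \qquad
\frac{\partial L}{\partial b^{(l)}} = \sum_{\text{batch}} \delta^{(l)} \\[4pt]
\delta^{(l-1)} &= \left(\delta^{(l)} \left(W^{(l)}\right)^{\top}\right) \odot f'\!\left(Z^{(l-1)}\right)
\end{aligned}
```

Here ⊙ denotes element-wise multiplication. A quick shape check confirms the weight gradient: (A⁽ˡ⁻¹⁾)ᵀ is [d × batch] and δ⁽ˡ⁾ is [batch × k], so their product is [d × k], the same shape as W⁽ˡ⁾.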
PyTorch accumulates (adds) gradients into .grad by default; it does not overwrite them. If you call loss.backward() twice without calling optimizer.zero_grad() in between, the gradients double. The canonical training loop order is always: zero_grad → forward → loss → backward → step. Gradient accumulation over multiple mini-batches is intentional use of this behaviour, but it must be explicit.
Vanishing Gradients In-depth
Gradients propagate backward by multiplication. If the gradient at each layer is a number less than 1, repeated multiplication makes the product shrink exponentially. Sigmoid's maximum derivative is 0.25. In a 10-layer sigmoid network, the gradient arriving at layer 1 has been multiplied by at most 0.25 per layer, giving 0.25¹⁰ ≈ 9.5 × 10⁻⁷, effectively zero. The first layers receive no gradient signal and learn nothing while the last few layers update normally.
This is why networks deeper than 5–6 layers were impractical before 2012. The symptom is clear in training: the loss decreases at first but then plateaus far above the optimal, and inspecting per-layer gradients shows near-zero values in the early layers. The activations in these layers also collapse: either all outputs are near 0 or near 1 (for sigmoid), with near-zero variance.
The primary solutions in order of importance: (1) ReLU activations: gradient is exactly 1 for positive inputs, breaking the exponential decay. (2) Residual connections (ResNet, Ch 4.5): add a "skip" path that carries gradients directly from the loss to early layers, bypassing the layer multiplications entirely. (3) Batch Normalisation (Ch 4.4): normalises activations to prevent saturation. (4) He initialisation: initialises weights to maintain gradient scale across layers.
The telltale sign: training loss stops improving very early, even with sufficient model capacity and data. To confirm, log the gradient norm per layer: for name, p in model.named_parameters(): print(name, p.grad.norm()). If early-layer norms are 10⁻⁶ or smaller while final-layer norms are ~1.0, you have a vanishing gradient problem. First fix: switch sigmoid → ReLU. Second fix: add residual connections.
Exploding Gradients Core
The opposite pathology occurs when gradient magnitudes grow exponentially as they propagate backward: if the weight matrices have large singular values, each multiplication amplifies rather than shrinks the gradient. This is especially common in Recurrent Neural Networks (RNNs) processing long sequences: for a 100-step sequence, the gradient reaching time step 1 is a product of roughly 100 Jacobian matrices, and if each has norm slightly above 1, the product explodes exponentially.
The symptom is unmistakable: the loss goes to NaN within the first few training steps, and weights become inf. The standard fix is gradient clipping: compute the global norm of all gradients, and if it exceeds a threshold, scale all gradients down proportionally. This preserves the direction of the gradient update but caps its magnitude. A clipping value of 1.0 is a widely used default.
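Global-norm clipping as described above can be sketched in a few lines of plain Python. This is an illustration of the idea behind PyTorch's clip_grad_norm_, not its implementation; the function name `clip_by_global_norm`, the list-of-lists gradient layout, and the `eps` value are assumptions made for the example.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one shared factor so their global L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradients, each a flat list of floats.
    Returns (clipped_grads, original_global_norm).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[30.0, -40.0], [0.0, 120.0]]  # global norm = sqrt(900 + 1600 + 14400) = 130
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

Because every entry is multiplied by the same factor, the update direction is unchanged; gradients whose global norm is already below the threshold pass through untouched.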
Symptoms of Explosion
- Loss jumps to NaN
- Weights become inf
- Gradient norm > 100
- Loss erratic, huge oscillations
Solutions
- Gradient clipping (max_norm=1.0)
- Lower learning rate
- LSTM/GRU gating (Ch 4.6)
- Layer normalisation
Monitoring Gradients
- Log gradient norm per step
- Use WandB/TensorBoard
- Check for inf/NaN in params
- Early layers vs late layers
∑ Chapter 4.3 Summary: Backpropagation & Gradient Flow
- Backprop answers: how does the loss change w.r.t. every single weight? All in one backward pass, at a cost comparable to one forward pass
- Computational graph: every operation is a node; the forward pass computes values; gradients flow backward through edges via the chain rule
- Chain rule: ∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ · (A⁽ˡ⁻¹⁾)ᵀ, i.e. upstream error signal × input activations transposed
- Vanishing: sigmoid derivatives multiply to near-zero in deep networks (0.25¹⁰ ≈ 10⁻⁶) → ReLU, residual connections, BatchNorm solve this
- Exploding: large weight matrices multiply gradients up to a NaN loss → clip_grad_norm_(max_norm=1.0) is the standard fix
- PyTorch autograd: dynamic computation graph → .backward() computes all gradients; always call zero_grad() before each backward pass
A neural network architecture is only half the story. The other half is the engineering that makes it trainable: how weights are initialised, how activations are kept stable, how overfitting is controlled, and how the optimiser navigates the loss landscape. These techniques are what separate a network that converges from one that never learns at all.
Weight Initialisation In-depth
Before a single training example is shown, every weight must be given a starting value. This choice has enormous consequences. If all weights start at zero, every neuron in a layer computes exactly the same function and receives exactly the same gradient: no matter how many epochs you train, all neurons in a layer remain identical forever. This is the symmetry-breaking problem: weights must differ to learn different features.
Initialising with random values breaks symmetry, but the variance of those values is critical. If weights are too small, activations shrink exponentially with depth: by layer 10, signals have collapsed to near zero and there is no gradient signal. If weights are too large, activations explode exponentially: inputs saturate sigmoid/tanh and gradients vanish, or the network numerically overflows. The goal is to choose a variance that keeps activation magnitudes approximately stable across all layers.
Xavier/Glorot initialisation (Glorot & Bengio, 2010) derives a suitable variance analytically for linear activations and symmetric non-linearities like Tanh. It sets the weight variance to 2/(nᵢₙ + nₒᵤₜ), balancing the signal variance across both forward and backward passes. He/Kaiming initialisation (He et al., 2015) adjusts for the fact that ReLU zeroes roughly half of all activations, which halves the signal passed to the next layer. He init compensates by scaling the standard deviation up by √2, using variance 2/nᵢₙ. For ReLU-based networks, He initialisation is the standard default.
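The effect of the variance choice can be seen in a small simulation. The sketch below is an assumption-laden toy (one random input, width 64, 10 ReLU layers, fresh random weights per layer, pure Python), and the helper names are hypothetical; it only illustrates how the activation scale evolves under the three choices.

```python
import math
import random


def final_activation_rms(weight_std_fn, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers; return the RMS of the last activations."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = weight_std_fn(width, width)
    for _ in range(depth):
        # Each output unit is a weighted sum of the previous layer with random weights.
        z = [sum(rng.gauss(0.0, std) * x for x in h) for _ in range(width)]
        h = [max(0.0, v) for v in z]  # ReLU
    return math.sqrt(sum(v * v for v in h) / width)


def he_std(n_in, n_out):
    return math.sqrt(2.0 / n_in)            # variance 2 / n_in


def xavier_std(n_in, n_out):
    return math.sqrt(2.0 / (n_in + n_out))  # variance 2 / (n_in + n_out)


def tiny_std(n_in, n_out):
    return 0.01                             # "too small" baseline


rms_he = final_activation_rms(he_std)
rms_xavier = final_activation_rms(xavier_std)
rms_tiny = final_activation_rms(tiny_std)
print(f"He: {rms_he:.3g}  Xavier: {rms_xavier:.3g}  tiny: {rms_tiny:.3g}")
```

With ReLU, the He-initialised stack should keep the activation scale near its starting value, Xavier loses roughly a factor of √2 per layer, and the tiny initialisation collapses toward zero.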
Using Xavier init with ReLU or He init with Tanh gives suboptimal results. The mismatch causes activation variance to drift across layers. Rule: He/Kaiming for ReLU/LeakyReLU/GELU networks, Xavier/Glorot for Tanh/Sigmoid networks. When in doubt, use He: most modern networks use ReLU-family activations.
Batch Normalisation In-depth
Ioffe and Szegedy (2015) diagnosed a key training instability they called internal covariate shift: as the parameters of layer l change during training, the distribution of inputs seen by layer l+1 shifts. The later layer must constantly readjust to its changing input distribution, slowing convergence. Their solution, Batch Normalisation, normalises each layer's pre-activation values across the mini-batch, forcing them to approximately zero mean and unit variance regardless of what the previous layer learned.
The normalisation has four steps. First, compute the mean μ_B of the current mini-batch. Second, compute its variance σ²_B. Third, subtract the mean and divide by the standard deviation (with a small ε added to the variance for numerical stability) to get x̂, a zero-mean, unit-variance vector. Fourth, and critically, apply a learnable scale γ and shift β: y = γx̂ + β. These learned parameters let the network undo the normalisation if that is optimal; without them, BatchNorm would permanently constrain every layer's activations to zero mean and unit variance, which is too restrictive.
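The training-time computation can be written out for a single feature. This is a minimal sketch under simplifying assumptions (one feature, a plain Python list as the mini-batch, scalar gamma and beta, no running statistics); the function name `batch_norm_train` and the `eps` default are illustrative choices.

```python
import math


def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch (list of floats), then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n                                        # batch mean
    var = sum((x - mean) ** 2 for x in batch) / n                # biased batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]   # zero mean, unit variance
    y = [gamma * xh + beta for xh in x_hat]                      # learnable scale and shift
    return y, mean, var


batch = [2.0, 4.0, 6.0, 8.0]
y, mu_b, var_b = batch_norm_train(batch)
y_scaled, _, _ = batch_norm_train(batch, gamma=2.0, beta=3.0)
print(mu_b, var_b, y)
```

With gamma = 1 and beta = 0 the output is simply the standardised batch; other values move the mean to beta and the standard deviation to roughly gamma.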
At inference time there is no mini-batch, so BatchNorm uses running statistics (exponential moving averages of μ_B and σ²_B accumulated during training) to normalise. This is why you must call model.eval() before inference: it switches BatchNorm from batch statistics to running statistics. Forgetting this is one of the most common and damaging bugs in deep learning practice.
Forgetting to call model.eval() before inference causes BatchNorm to use the statistics of the inference batch (which may be size 1) instead of the running statistics accumulated during training. With batch size 1 in a fully connected layer, the batch mean equals the input itself, so the normalised value is zero for every input and predictions are garbage. Always: model.train() during training, model.eval() during evaluation and inference.
Dropout In-depth
Srivastava et al. (2014) introduced dropout as a computationally cheap approximation to training an ensemble of exponentially many networks. During each forward pass, every neuron is independently deactivated with probability p. The remaining (1−p) fraction of neurons process the input and update normally. At inference, all neurons are active, but since the network was trained with only (1−p) of neurons active on average, the outputs are scaled down by (1−p) to keep the expected activation magnitude consistent. In practice, inverted dropout is used: scale activations up by 1/(1−p) during training so no adjustment is needed at inference.
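Inverted dropout is short enough to sketch directly. The version below is an illustrative pure-Python toy, not the nn.Dropout implementation; the function name, the explicit `training` flag, and the passed-in random generator are assumptions made for the example.

```python
import random


def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p) when training."""
    if not training or p == 0.0:
        return list(activations)  # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
x = [1.0] * 10000
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)  # stays close to 1.0 in expectation
print(train_mean)
```

Survivors are doubled at p = 0.5, so the average activation during training matches the unscaled activation seen at inference.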
The theoretical justification has three complementary perspectives. The ensemble view: with N neurons, there are 2^N possible sub-networks; dropout samples a different one each forward pass, and inference approximates their average. The co-adaptation view: neurons cannot rely on specific other neurons being present, so they learn more independent, redundant features. The noise injection view: randomly zeroing neurons adds multiplicative noise, acting like a data augmentation that prevents the network from memorising specific training patterns.
Practical guidance: dropout rates of 0.1–0.2 work well for earlier layers; 0.3–0.5 for large fully connected layers. Standard dropout is rarely applied to convolutional feature maps (DropBlock is preferred there). In Transformer models, dropout is applied after attention and after the feed-forward sublayer, with a rate of 0.1 being standard. For very large models, lower dropout rates (0.05–0.1) are preferred as the model already has strong regularisation from scale.
Do not apply standard Dropout after every layer indiscriminately. Applying it after BatchNorm can interfere with BN's running statistics. Applying it in convolutional layers often hurts performance (use DropBlock instead). Applying it at the output layer is always wrong. Standard rule: dropout in the fully connected classifier head only (or after transformer attention layers at p=0.1).
Optimisers for Deep Learning In-depth
Stochastic Gradient Descent (SGD) updates each weight by subtracting a fraction of its gradient: θ ← θ − α∇L. Plain SGD oscillates badly in directions with high curvature (the narrow valleys common in deep loss landscapes) and moves too slowly in flat directions. SGD with Momentum adds a velocity term that accumulates gradient history, smoothing oscillations and accelerating through flat regions. It remains the preferred optimiser for training ResNets and CNNs on image classification.
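A toy quadratic with one steep and one flat direction shows the difference. The sketch below assumes a hand-picked loss, learning rate and momentum value, and uses the velocity form v ← μv + g, θ ← θ − αv; all names are hypothetical.

```python
def grad(theta):
    """Gradient of the toy loss L = 0.5 * (100 * x**2 + y**2): a narrow valley."""
    x, y = theta
    return [100.0 * x, y]


def loss(theta):
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)


def sgd(theta, lr, steps, momentum=0.0):
    """Plain SGD when momentum == 0, otherwise SGD with a velocity term."""
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        velocity = [momentum * v + gi for v, gi in zip(velocity, g)]
        theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta


start = [1.0, 1.0]
plain = sgd(start, lr=0.01, steps=100)
with_momentum = sgd(start, lr=0.01, steps=100, momentum=0.9)
print(loss(plain), loss(with_momentum))
```

The learning rate is capped by the steep direction, so plain SGD crawls along the flat one; the accumulated velocity lets the momentum variant cover that distance in far fewer steps on this problem.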
Adam (Kingma & Ba, 2014) computes an adaptive learning rate per parameter: it tracks the first moment (mean of gradients) and second moment (mean of squared gradients) and divides the first by the square root of the second to scale each parameter's update independently. A parameter whose gradient has been consistently large gets a smaller effective step; a parameter with small, consistent gradients gets a larger step. This often makes Adam faster to converge than plain SGD and less sensitive to the global learning rate choice.
AdamW (Loshchilov & Hutter, 2019) fixes a subtle flaw in how weight decay is usually combined with Adam. In Adam, L2 regularisation (weight decay) is added to the gradient before the adaptive scaling, which means the actual weight penalty is scaled by the adaptive term and varies per parameter. AdamW decouples weight decay from the gradient update, applying it directly to the parameters: θ ← θ − αλθ (separately from the Adam gradient term). This is now the de facto default for training large language models and Transformers.
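The difference between the two decay styles shows up even in a single update. The following is a sketch for one scalar parameter with hypothetical names and default hyperparameters; it is meant to illustrate the coupling, not to reproduce any library's optimiser exactly.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update for a single scalar parameter.

    decoupled=True  -> AdamW style: decay is applied to theta directly.
    decoupled=False -> Adam + L2: decay is folded into the gradient first.
    """
    if not decoupled:
        grad = grad + weight_decay * theta          # L2 term passes through adaptive scaling
    m = beta1 * m + (1 - beta1) * grad              # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad       # second moment estimate
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        theta = theta - lr * weight_decay * theta   # decay independent of gradient scale
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Same weight and decay strength, two very different gradient scales.
for g in (0.001, 10.0):
    adamw, _, _ = adam_step(1.0, g, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=True)
    adam_l2, _, _ = adam_step(1.0, g, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=False)
    print(f"grad={g:<6} AdamW={adamw:.6f}  Adam+L2={adam_l2:.6f}")
```

On the first step the adaptive normalisation turns any gradient into a step of size lr, so the L2 term in the coupled version leaves no separate trace, while the decoupled version always shrinks the weight by lr × λ × θ on top of the gradient step.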
| Optimiser | Best For | Typical LR | Weight Decay | Notes |
|---|---|---|---|---|
| SGD | Legacy CNNs | 0.01–0.1 | via L2 penalty | Requires careful LR schedule |
| SGD + Momentum | CV fine-tuning, ResNets | 0.01–0.1 | via L2 | momentum=0.9 standard |
| Adam | Prototyping, NLP | 1e-4 to 3e-4 | Coupled to adaptive scaling; use AdamW | Default for quick experiments |
| AdamW | Transformers, LLMs | 1e-4 to 3e-4 | Decoupled (1e-2) | De facto standard for modern models |
Learning Rate Schedules Core
A fixed learning rate is rarely optimal throughout training. Early in training, large steps are desirable: the network is far from a good solution and can afford rough updates. Later in training, large steps overshoot the loss minimum and cause oscillation, so smaller steps are needed for fine-grained convergence. Learning rate schedules adjust the learning rate automatically over the course of training.
The warmup + cosine annealing schedule has become the dominant approach for Transformer training. For the first few percent of training steps (the "warmup"), the learning rate increases linearly from near-zero to the target learning rate. This protects the model from large, destabilising gradient updates at the start of training, when the parameters are random and gradients are noisy. After warmup, the learning rate follows a cosine curve from the peak down to a small minimum, providing a smooth, continuous decay that typically outperforms staircase schedules.
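The schedule described above reduces to a small function of the step index. This is one common formulation, written as a sketch; the function name, the 5% warmup, and the peak and minimum learning rates are illustrative assumptions.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [
    warmup_cosine_lr(s, total_steps=1000, peak_lr=3e-4, warmup_steps=50, min_lr=3e-6)
    for s in range(1001)
]
print(schedule[0], schedule[49], schedule[500], schedule[1000])
```

In a training loop the returned value would be written into the optimiser's learning rate before each step; framework schedulers implement the same shape with their own parameter names.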
The Complete Training Loop In-depth
Putting it all together: a production-quality PyTorch training loop that incorporates gradient clipping, the model.train()/eval() switch, a validation loop, and a cosine learning rate schedule. Every line is intentional: understanding why each piece is there is as important as knowing what it does.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader

    def train_epoch(model, loader, optimizer, criterion, device, clip_grad=1.0):
        model.train()                      # dropout ON, BN uses batch stats
        total_loss, correct = 0.0, 0
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()          # 1. clear grads from the previous step
            logits = model(X)              # 2. forward
            loss = criterion(logits, y)    # 3. loss
            loss.backward()                # 4. backprop
            nn.utils.clip_grad_norm_(model.parameters(), clip_grad)  # 5. clip
            optimizer.step()               # 6. update
            total_loss += loss.item()
            correct += (logits.argmax(1) == y).sum().item()
        return total_loss / len(loader), correct / len(loader.dataset)

    def evaluate(model, loader, criterion, device):
        model.eval()                       # dropout OFF, BN uses running stats
        total_loss, correct = 0.0, 0
        with torch.no_grad():              # disable grad tracking, saves memory
            for X, y in loader:
                X, y = X.to(device), y.to(device)
                logits = model(X)
                total_loss += criterion(logits, y).item()
                correct += (logits.argmax(1) == y).sum().item()
        return total_loss / len(loader), correct / len(loader.dataset)

    # Setup (MLP, train_loader and val_loader are assumed to be defined elsewhere)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = MLP(784, 256, 10).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    criterion = nn.CrossEntropyLoss()
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

    best_val_loss, best_epoch = float('inf'), 0
    for epoch in range(50):
        train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        scheduler.step()                   # one scheduler step per epoch
        if val_loss < best_val_loss:       # keep the best checkpoint so far
            best_val_loss, best_epoch = val_loss, epoch
            torch.save(model.state_dict(), 'best_model.pt')
        if epoch % 5 == 0:
            print(f"Epoch {epoch:3d}: train={train_loss:.4f} val_acc={val_acc:.4f}")

Other Regularisation Techniques Reference
Beyond dropout and BatchNorm, a range of regularisation techniques are routinely used in modern deep learning. The right combination depends on the task, architecture, and dataset size. The table below summarises the most important ones with their use cases and PyTorch APIs.
| Technique | Mechanism | When to Use | PyTorch API |
|---|---|---|---|
| L2 / Weight Decay | Add λ‖W‖² penalty to loss, which shrinks weights | Almost always, with small λ (1e-4 to 1e-2) | weight_decay= in optimizer |
| Dropout | Randomly zero neurons during training | FC layers, Transformers (p=0.1–0.5) | nn.Dropout(p=0.3) |
| Batch Normalisation | Normalise activations per mini-batch | After linear/conv, before activation | nn.BatchNorm1d/2d |
| Layer Normalisation | Normalise across features (not batch) | Transformers; no batch-size dependency | nn.LayerNorm |
| Data Augmentation | Random transforms of training inputs | Image tasks (flip, crop, colour jitter) | torchvision.transforms |
| Early Stopping | Stop when validation loss stops improving | Almost always; monitor val_loss with patience | Manual or PyTorch Lightning |
| Label Smoothing | Soften hard 0/1 targets (true class ≈ 1−ε, others ≈ ε/K) | Classification; prevents overconfidence | nn.CrossEntropyLoss(label_smoothing=0.1) |
∑ Chapter 4.4 Summary: Training Deep Networks
- He init for ReLU: W ~ N(0, σ²) with σ = √(2/nᵢₙ) → prevents vanishing/exploding activations before training even starts
- BatchNorm: normalise per mini-batch → stable training, higher LR tolerance, less sensitivity to initialisation; always call model.eval() at inference
- Dropout: randomly drop a fraction p of neurons each forward pass → ensemble effect, prevents co-adaptation; use in FC layers at p=0.1–0.5
- AdamW: Adam with decoupled weight decay; the de facto standard optimiser for Transformer and LLM training (β₁=0.9, β₂=0.999, λ=1e-2)
- Warmup + cosine annealing: guard against early training instability, then smoothly decay the LR; standard for large-scale training runs
- Training loop order: zero_grad → forward → loss → backward → clip → step; run evaluate() separately with model.eval() and torch.no_grad()
Convolutional Neural Networks are not a minor variation on the MLP. They encode a powerful prior about visual data (that meaningful patterns are local and translation-invariant) directly into the architecture. This inductive bias, combined with weight sharing, reduces parameters by orders of magnitude while improving generalisation. The result transformed computer vision from hand-crafted features to end-to-end learning.
Why Not MLP for Images? Core
A standard 224×224 RGB image contains 224 × 224 × 3 = 150,528 individual pixel values. A single hidden layer of 1,024 neurons in a fully connected MLP requires 150,528 × 1,024 ≈ 154 million weights for the first layer alone, before any useful representation has been learned. Scale this to the thousands of neurons in a real network and you have a parameter count that dwarfs the available training data, making the network very hard to train effectively.
The parameter count is only the first problem. A deeper issue is that the MLP treats every pixel as equally related to every other pixel: it has no concept of spatial locality. A cat's eye in the top-left corner and a cat's eye in the bottom-right corner are unrelated to the MLP; it must learn to recognise them as independent patterns, requiring separate learned features for every possible position. Images have three structural properties the MLP ignores: local structure (nearby pixels are more related than distant ones), translation invariance (a cat is a cat wherever it is), and compositionality (parts compose into objects).
CNNs address all three with a single idea: weight sharing via convolution. Instead of connecting each pixel to each neuron, a CNN applies a small learned filter (e.g., 3×3) across the entire image. The same 27 weights (3×3×3 channels) are reused at every spatial position. Even with 64 such filters, the first layer needs only 64 × 27 + 64 = 1,792 parameters instead of 154 million, while each filter learns to detect the same feature (an edge, a colour gradient) wherever it appears.
The Convolution Operation In-depth
A convolution applies a small learnable filter, called a kernel, to an input feature map by sliding it across every position and computing a dot product (element-wise multiply, then sum) at each location. At position (i, j), the output value is the sum of all products between the kernel weights and the corresponding input patch. If the kernel has learned to detect horizontal edges, positions where horizontal edges are present produce large activations; other positions produce small ones. With 64 different kernels, you get 64 different feature maps, each detecting a different pattern.
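The sliding-window computation can be sketched for a single channel. This is a naive illustration with no padding and stride 1; the function name is hypothetical, the kernel is hand-written rather than learned, and, as in deep learning libraries, the kernel is not flipped (strictly a cross-correlation).

```python
def conv2d_single(image, kernel):
    """Valid (no padding), stride-1 sliding-window filter on a 2D list; single channel."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


# 5x5 image: dark top rows (0), bright bottom rows (1), giving one horizontal edge.
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]]

# Hand-written horizontal-edge kernel; in a CNN these 9 numbers would be learned.
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

feature_map = conv2d_single(image, horizontal_edge)
print(feature_map)  # [[3, 3, 3], [3, 3, 3], [0, 0, 0]]
```

The response is large wherever the window straddles the dark-to-bright transition and zero in the uniform region, regardless of horizontal position.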
Three hyperparameters control the output size. Kernel size K (typically 3×3 in modern networks): larger kernels see more context but use more parameters. Stride S: how many pixels the kernel jumps between applications. Stride=2 halves the spatial resolution. Padding P: zeros added around the input. "Same" padding (P=(K−1)/2, at stride 1) preserves the input spatial size, which is standard for 3×3 convolutions.
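The standard relation between these three hyperparameters and the output size is ⌊(N + 2P − K)/S⌋ + 1 along each spatial dimension. A small sketch, with a hypothetical helper name:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


same = conv_output_size(224, kernel=3, stride=1, padding=1)    # 224: "same" padding
halved = conv_output_size(224, kernel=3, stride=2, padding=1)  # 112: stride 2 halves it
valid = conv_output_size(224, kernel=3)                        # 222: no padding shrinks by K - 1
print(same, halved, valid)
```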
The parameter count scales with kernel size and channel counts, not with image size; this is the core efficiency of CNNs. A 3×3 conv layer with 64 input channels and 128 output channels has 3 × 3 × 64 × 128 + 128 = 73,856 parameters regardless of whether the input is 32×32 or 512×512. This size-invariance is why the convolutional layers of a CNN trained on 224×224 images can be applied to other input resolutions at inference, provided the classification head is also size-agnostic (for example, via global average pooling).
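The counts quoted in this section can be reproduced with two one-line formulas. The helper names below are hypothetical, and the 64-filter first conv layer is an illustrative choice.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Parameters of a KxK convolution: independent of the input's height and width."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def dense_params(n_in, n_out, bias=True):
    """Parameters of a fully connected layer: grows with the flattened input size."""
    return n_in * n_out + (n_out if bias else 0)


conv_example = conv2d_params(3, 64, 128)                          # 73,856
mlp_first_layer = dense_params(224 * 224 * 3, 1024, bias=False)   # 154,140,672 weights
conv_first_layer = conv2d_params(3, 3, 64)                        # 1,792
print(conv_example, mlp_first_layer, conv_first_layer)
```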
Pooling Core
After convolution extracts local features, the spatial dimensions of the feature maps are often larger than necessary, and carrying large feature maps through many layers is expensive. Pooling layers reduce the spatial size while retaining the most important information. Max pooling (the dominant choice) partitions the feature map into non-overlapping windows and takes the maximum value in each. With a 2×2 window and stride 2, max pooling halves both height and width, reducing the number of activations by 4× while making features more invariant to small spatial shifts.
The intuition behind max pooling: each feature map cell contains an activation measuring "how strongly is this feature present at this position?" The maximum within a 2×2 region answers "was this feature present anywhere in this region?", a coarser, position-invariant question that is still useful for recognition. An eye is an eye whether it is 2 pixels to the left or right. Max pooling discards that 2-pixel difference.
Global Average Pooling (GAP) is a key modern innovation: instead of pooling 2×2 regions, it averages the entire spatial extent of each channel into a single scalar. Applied after the last convolutional block, GAP converts a [B, C, H, W] tensor into [B, C], replacing the large fully-connected layers that were responsible for most parameters in early CNN architectures. ResNet and subsequent models use GAP as the bridge between convolutional features and the classification head.
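Both pooling operations are simple enough to write out on nested lists. The sketch below assumes even height and width for max pooling and uses hypothetical function names; real frameworks operate on batched tensors instead.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]


def global_avg_pool(channels):
    """Average each channel (a 2D list) down to one scalar: [C, H, W] -> [C]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

pooled = max_pool_2x2(fmap)             # [[4, 2], [2, 8]]
gap = global_avg_pool([fmap, pooled])   # [2.6875, 4.0]
print(pooled, gap)
```

Max pooling keeps a 2×2 grid of the strongest responses, while global average pooling returns exactly one number per channel whatever the spatial size.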
CNN Architecture In-depth
A typical CNN follows a regular pattern: alternating convolution blocks and pooling layers, progressively reducing the spatial dimensions while increasing the number of feature channels. The spatial compression concentrates local features into increasingly compact representations. The channel expansion gives the network more "vocabulary" for describing what it sees. The final stage converts the 3D feature tensor into a class prediction via either a series of fully-connected layers or Global Average Pooling.
The hierarchical feature learning in CNNs is perhaps their most important property. Visualisation studies show that early layers (Layer 1-2) learn to detect simple patterns: oriented edges, colour gradients, and textures. Middle layers (Layer 3-4) detect parts: corners, curves, texture patches that resemble scales, fur, or brickwork. Late layers detect whole objects or object parts: faces, wheels, paws. This hierarchy emerges from training alone; it is not hand-crafted. The network discovers it through gradient descent on the classification loss.
ResNet & Skip Connections In-depth
By 2014, the expectation was clear: deeper networks should perform better, because more layers can represent more complex functions. Yet attempts to train plain networks with dozens of layers produced worse results than shallower ones, not just on validation but on the training set (He et al. report a plain 56-layer network with higher training error than a 20-layer one on CIFAR-10). This degradation problem was not overfitting. It meant the optimiser struggled to train very deep plain networks, even when additional capacity should have helped.
He et al.'s insight was deceptively simple. If a shallower network achieves some accuracy A, then a deeper network that copies the shallower network's layers and sets all additional layers to the identity should achieve at least accuracy A. But gradient descent cannot easily learn the identity mapping with a stack of non-linear layers: driving a layer's weights toward zero produces zero outputs, not the input x. The residual block makes this easy by reformulating the learning objective: instead of learning the desired mapping H(x) directly, the block learns the residual F(x) = H(x) − x. The shortcut connection adds the original input back: output = F(x) + x. Now learning the identity is trivial: just set F(x) = 0.
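The "identity for free" argument can be made concrete with a toy linear layer. This sketch uses hypothetical helper names and leaves out non-linearities and biases to keep the point visible: with all weights at zero, a plain block destroys its input while a residual block passes it through.

```python
def linear(x, weights):
    """y = W x for a square weight matrix given as a list of rows (no bias)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def plain_block(x, weights):
    return linear(x, weights)                    # must learn the full mapping itself


def residual_block(x, weights):
    fx = linear(x, weights)                      # learns only the residual
    return [f + xi for f, xi in zip(fx, x)]      # shortcut adds the input back


x = [0.5, -1.0, 2.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zero_weights)         # all zeros: the input is lost
residual_out = residual_block(x, zero_weights)   # equals x: identity at zero weights
print(plain_out, residual_out)
```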
The practical impact was enormous. An ensemble of ResNets with up to 152 layers (2015) achieved 3.57% Top-5 error on ImageNet, below the roughly 5% error usually quoted for human annotators. The skip connection also dramatically improves gradient flow: gradients can propagate directly from the loss to any earlier layer through the shortcut path, bypassing the multiplicative chain that causes vanishing gradients. This is why skip connections appear in virtually every modern architecture; Transformers include them as a core component.
The skip connection adds x directly to F(x). This requires x and F(x) to have the same shape. When a residual block changes the number of channels or uses stride > 1 (to downsample), the shortcut must include a 1×1 convolution (called a "projection shortcut") to match dimensions: self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride). Forgetting this causes a shape mismatch error at the addition step.
Modern CNNs Reference
The history of CNNs on ImageNet is a story of architectural innovations compounding: each milestone introduced one key idea that is now ubiquitous. AlexNet proved GPU-trained deep networks could work. VGG showed depth with only 3×3 convolutions was sufficient and more principled. Inception introduced parallel multi-scale filters. ResNet introduced skip connections enabling extreme depth. EfficientNet started from a baseline found by Neural Architecture Search and jointly scaled depth, width, and resolution. Vision Transformers (ViT) later replaced convolutions entirely with attention, showing that the convolutional inductive bias, while useful, is not necessary given enough data.
| Architecture | Year | Depth | Key Innovation | ImageNet Top-5 |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | First successful CNN for digits | n/a (MNIST era, pre-ImageNet) |
| AlexNet | 2012 | 8 | GPU training, ReLU, Dropout | 15.3% (about 11 points better than the runner-up) |
| VGG-16/19 | 2014 | 16–19 | Only 3×3 convolutions throughout | 7.3% |
| GoogLeNet/Inception | 2014 | 22 | Inception modules, global avg pool | 6.7% |
| ResNet-50/152 | 2015 | 50–152 | Residual skip connections | 3.57% (ensemble), below the ~5% human estimate |
| DenseNet-121 | 2017 | 121–264 | Dense connections (all-to-all) | about 5.3% (DenseNet-264, 10-crop) |
| EfficientNet-B7 | 2019 | Variable | NAS baseline + compound scaling | 2.9% |
| Vision Transformer (ViT) | 2020 | Variable | Pure self-attention, no convolution | 2.0%+ |
Receptive Field Core
The receptive field of a neuron in layer l is the region of the original input image that can influence that neuron's activation. A neuron in the first conv layer with a 3×3 kernel sees a 3×3 region. A neuron in the second conv layer sees a 5×5 region (each of its 9 input cells saw a 3×3 region, overlapping to cover 5×5). With each additional stride-1 3×3 conv layer, the receptive field grows by 2 in each dimension. After k layers of 3×3 convolutions: receptive field = 2k + 1 pixels.
Pooling layers and strided convolutions multiply the receptive field growth. After a 2× pooling layer, subsequent convolutional layers grow the receptive field twice as fast. This is why deep CNNs develop neurons in later layers that respond to large, complex objects: they have receptive fields spanning the entire image. The final convolutional layer in ResNet-50 has a theoretical receptive field of 483×483, larger than the 224×224 input, so every output cell can in principle see the full input context.
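The growth rule can be computed layer by layer with the usual recurrence: each layer adds (kernel − 1) × jump to the receptive field, where jump is the product of all earlier strides. A sketch with a hypothetical helper, treating pooling as a layer with kernel 2 and stride 2:

```python
def receptive_field(layers):
    """Theoretical receptive field (one dimension) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump: distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


three_convs = receptive_field([(3, 1)] * 3)                      # 7, i.e. 2k + 1 with k = 3
with_pool = receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)])    # 12
print(three_convs, with_pool)
```

In the second stack, each 3×3 convolution after the pooling layer adds 4 pixels instead of 2, which is the doubling of the growth rate described above.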
Receptive Field Growth
- 1 conv (3×3): RF = 3×3
- 2 convs (3×3): RF = 5×5
- 3 convs (3×3): RF = 7×7
- k convs: RF = (2k+1)×(2k+1)
- Pooling 2×: doubles growth rate
Why Large RF Matters
- Small RF → misses global context
- Object recognition needs full object
- Dilated conv: large RF without depth
- Attention (ViT): global RF from layer 1
- ResNet-50 RF > input size ✓
PyTorch CNN in 10 Lines
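
    import torch.nn as nn
    num_classes = 10                       # example value; set to your number of classes
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),    # 3 → 64 channels, "same" padding
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(2),                   # halve height and width
        nn.AdaptiveAvgPool2d(1),           # global average pooling → [B, 64, 1, 1]
        nn.Flatten(),                      # → [B, 64]
        nn.Linear(64, num_classes))
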
∑ Chapter 4.5 Summary: Convolutional Neural Networks
- Convolution: slide a learned filter across the input → weight sharing = same feature detector everywhere → 154M first-layer MLP weights vs 27 weights per 3×3 RGB filter
- K×K×C_in×C_out weights (plus C_out biases) per layer → size-invariant: same params for 32×32 or 512×512 inputs
- CNN hierarchy: edges (L1) → textures (L2) → parts (L3) → objects (L4+), all learned automatically
- Max pooling: keep maximum per 2×2 window → translation invariance + spatial compression; Global Average Pooling replaces FC layers
- ResNet skip connections: y = F(x) + x → reformulate as residual learning → solves degradation, enables 150+ layer networks, gradients flow freely
- ResNet (2015) → EfficientNet → ViT (2020): Transformers now rival CNNs on vision; attention replaces convolution with a global receptive field from layer 1
The RNN and LSTM are not obsolete; they are the conceptual bedrock of sequence modelling. Understanding why RNNs struggle with long-range dependencies, and how LSTMs solve this with gating, is the essential preparation for understanding why the Transformer replaced them. Every concept in attention mechanisms traces directly back to this chapter.
Why Sequences Need Memory Core
An MLP processes each input independently. Feed it the word "bank" and it produces a prediction, but it has no way to know whether the previous words were "river" or "money". CNNs add local spatial context via convolution windows, but they still process a fixed-size input with no persistent state across positions. Neither architecture is suited to data where order matters and length varies: text, speech, time series, video.
The semantic difference between "The dog bit the man" and "The man bit the dog" lies entirely in word order: same vocabulary, opposite meaning. Processing each word independently destroys this information. A model needs to carry a memory of what it has already seen as it processes each new token. This is the core motivation for the Recurrent Neural Network: maintain a hidden state hₜ that accumulates information from all previous time steps.
Basic RNN In-depth
The vanilla RNN cell takes the current input xₜ and previous hidden state hₜ₋₁, applies a weighted sum, and passes the result through tanh. The same weight matrices Wₕ and Wₓ are used at every single time step: weight sharing across time, analogous to how a CNN shares weights across space. Unrolling through T steps creates an effective computational graph T layers deep: a sequence of 100 words = 100 effective layers = severe vanishing gradient risk.
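A single recurrent step, and the way the hidden state carries order information, can be sketched without any framework. The weights below are arbitrary, untrained values chosen for illustration, and the function names are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """h_t = tanh(W_h h_prev + W_x x_t + b), with matrices given as lists of rows."""
    return [math.tanh(sum(wh * hp for wh, hp in zip(W_h[i], h_prev))
                      + sum(wx * xv for wx, xv in zip(W_x[i], x_t))
                      + b[i])
            for i in range(len(b))]


# Hypothetical tiny cell: 2 inputs, 2 hidden units, fixed (untrained) weights.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.9, 0.0], [0.2, 0.4]]
b = [0.0, 0.0]


def run(sequence):
    h = [0.0, 0.0]
    for x_t in sequence:  # the same W_x, W_h and b are reused at every time step
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


token_a, token_b = [1.0, 0.0], [0.0, 1.0]
h_ab = run([token_a, token_b])
h_ba = run([token_b, token_a])  # same tokens, opposite order
print(h_ab, h_ba)
```

The two final hidden states differ even though the inputs are the same set of tokens, which is exactly the order sensitivity an MLP applied per token cannot provide.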
Backpropagation Through Time (BPTT) In-depth
BPTT applies backprop to the unrolled RNN graph. To update Wₕ, gradients must flow from the loss at step T back through every time step, multiplying by the transposed Jacobian (∂hₜ/∂hₜ₋₁)ᵀ = Wₕᵀ · diag(tanh′(·)) at each step. Across T steps this is a product of T such matrices. If the largest singular value of Wₕ is below 1, the product shrinks exponentially (tanh′ ≤ 1 only shrinks it further): vanishing gradient. If it is well above 1, the product can grow exponentially: exploding gradient. Practical limit: vanilla RNNs cannot reliably learn dependencies beyond ~10–20 steps.
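For a one-unit RNN the product of Jacobians is just a product of scalars, which makes the decay easy to observe. The sketch below uses a hypothetical helper and hand-picked values; the final line is a purely linear illustration of growth, since in the scalar tanh case saturation tends to damp explosion.

```python
import math


def gradient_through_time(w, steps, x=0.0, h0=0.5):
    """|dh_T/dh_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # per-step factor: w * tanh'(a_t), with tanh' = 1 - tanh^2
    return abs(grad)


vanishing = gradient_through_time(w=0.5, steps=50)

# Without the tanh' damping, a per-step factor of 1.1 grows exponentially instead.
linear_growth = 1.1 ** 100
print(vanishing, linear_growth)
```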
Training on very long sequences with full BPTT is expensive (memory scales linearly with sequence length). Truncated BPTT splits sequences into chunks and backpropagates only within each chunk, carrying the hidden state forward without gradients. In PyTorch, detach the hidden state between chunks: h = h.detach() before each new chunk.
LSTM: Long Short-Term Memory In-depth
Hochreiter & Schmidhuber (1997) added a cell state cₜ: a horizontal "highway" running through all time steps with only element-wise operations. Gradients flowing through cₜ are multiplied element-wise by learned gate values (not by weight matrices), dramatically reducing vanishing. Three sigmoid gates ∈ (0,1) control information flow: forget gate fₜ decides what to erase from cₜ₋₁; input gate iₜ decides what new candidate c̃ₜ to write; output gate oₜ decides what to expose as hₜ.
The cell state update is the key: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, which uses only addition and element-wise multiply, no matrix multiplication. The forget gate can learn to stay near 1 for important long-range information, creating an almost unimpeded gradient path over hundreds of steps.
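The gate equations can be traced with a single-unit LSTM. This is a scalar sketch with hypothetical, hand-set parameters (a large forget-gate bias and a closed input gate), chosen only to show how the cell state survives many steps; real LSTMs use weight matrices and learned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar (single-unit) LSTM; p maps gate names to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x_t + w_h * h_prev + bias

    f_t = sigmoid(pre("forget"))            # what to erase from c_prev
    i_t = sigmoid(pre("input"))             # how much of the candidate to write
    o_t = sigmoid(pre("output"))            # what to expose as h_t
    c_tilde = math.tanh(pre("candidate"))   # new candidate content
    c_t = f_t * c_prev + i_t * c_tilde      # additive, element-wise cell update
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t, f_t


# Hypothetical parameters: forget gate held near 1, input gate held near 0.
params = {"forget": (0.0, 0.0, 6.0), "input": (0.0, 0.0, -6.0),
          "output": (1.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}

h, c = 0.0, 1.0
for _ in range(100):
    h, c, f = lstm_step(0.0, h, c, params)
print(f, c)
```

After 100 steps the stored value has decayed only modestly, because the path from the initial cell state to the final one is a product of forget-gate values close to 1 rather than a product of weight matrices.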
When processing long sequences in chunks, detach both h_n AND c_n: h = (h_n.detach(), c_n.detach()). Forgetting to detach c_n keeps the computation graph alive across chunks, causing unbounded memory growth until OOM. This is one of the most common LSTM training bugs.
GRU: Gated Recurrent Unit Core
Cho et al. (2014) introduced the GRU as a simplified LSTM with only 2 gates: an update gate zₜ (blend old vs new hidden state) and a reset gate rₜ (how much past state to use for candidate). No separate cell state, just one hidden vector. Fewer parameters, faster training, often competitive with LSTM.
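A single-unit GRU step makes the role of the two gates explicit. The sketch below uses hypothetical parameters and one of the two blending conventions found in the literature (here zₜ weights the new candidate; some formulations swap the roles of zₜ and 1 − zₜ).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x_t, h_prev, p):
    """One step of a scalar (single-unit) GRU; p maps names to (w_x, w_h, bias)."""
    z_t = sigmoid(p["update"][0] * x_t + p["update"][1] * h_prev + p["update"][2])
    r_t = sigmoid(p["reset"][0] * x_t + p["reset"][1] * h_prev + p["reset"][2])
    w_x, w_h, bias = p["candidate"]
    h_tilde = math.tanh(w_x * x_t + w_h * (r_t * h_prev) + bias)  # reset gate scales the past
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # update gate blends old and new


base = {"reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
keep_params = dict(base, update=(0.0, 0.0, -20.0))       # z near 0: keep the old state
overwrite_params = dict(base, update=(0.0, 0.0, 20.0))   # z near 1: take the candidate

kept = gru_step(1.0, 0.3, keep_params)
overwritten = gru_step(1.0, 0.3, overwrite_params)
print(kept, overwritten)
```

With the update gate closed the hidden state passes through unchanged, and with it open the state is replaced by the candidate, all using one hidden vector and no separate cell state.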
Seq2Seq & Attention Preview Core
Seq2Seq (Sutskever et al. 2014) combined two RNNs: an encoder compressing the input into a fixed context vector, and a decoder generating output conditioned on it. The fundamental flaw: the entire input, regardless of length, is squeezed into one fixed-size vector. Bahdanau et al. (2015) fixed this with attention: at each decoder step t, compute a weighted sum over ALL encoder states hᵢ: cₜ = ∑ᵢ αₜᵢ · hᵢ, where the attention weights αₜᵢ sum to 1 (cₜ here is the context vector, not the LSTM cell state). This is the direct predecessor of Transformer self-attention in Ch 4.7.
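The weighted sum over encoder states can be sketched in a few lines. Note the simplification: Bahdanau et al. score each encoder state with a small learned network, whereas this example uses a plain dot product as an assumed stand-in; the function names and toy vectors are also hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Context vector: softmax-weighted sum of encoder states (dot-product scores)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return context, alphas


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, alphas = attention_context([0.0, 5.0], encoder_states)
print(alphas, context)
```

The decoder state here aligns with the second dimension, so almost all of the weight goes to the two encoder states that are active in that dimension, and the context vector is rebuilt at every decoder step rather than fixed once.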
The attention mechanism did not replace the RNN in 2015; it made the RNN dramatically better. It took Vaswani et al. (2017) to ask: "What if we remove the RNN and use attention exclusively?" The answer was the Transformer. Chapter 4.7 completes this story.
∑ Chapter 4.6 Summary โ RNNs & LSTMs
- RNN: hidden state hₜ = tanh(Wₕhₜ₋₁ + Wₓxₜ + b) โ same weights each step, memory flows forward through hₜ
- BPTT: gradients multiply through T Jacobians → vanishing/exploding for long sequences (practical limit ~10–20 steps vanilla RNN)
- LSTM: cell state cₜ = gradient highway; 3 gates (forget, input, output) control what to erase, write, expose
- LSTM key: cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ — only element-wise ops on gradient path → no vanishing through cell state
- GRU: 2 gates (update, reset), single hidden state, fewer parameters — competitive with LSTM; use first
- Seq2Seq + Bahdanau attention: decoder attends to all encoder states — solved the bottleneck; direct ancestor of Transformer self-attention (Ch 4.7)
“Attention Is All You Need” (Vaswani et al., 2017) is arguably the most consequential paper in modern AI. It removed recurrence entirely and showed that attention alone, applied in parallel across all tokens, could outperform the best recurrent models on machine translation while training far faster. Nearly every major AI system since 2018 builds on this architecture, so understanding it fully is not optional.
Why Transformers Replaced RNNs Core
The RNN’s fundamental flaw is its sequential nature. To process token t, you must first finish token t−1. Training on a sequence of 10,000 tokens therefore requires 10,000 sequential steps, and hardware parallelism cannot be applied across time steps. Training a GPT-3-scale model (which processes sequences of 2,048 tokens) with an RNN would be impractical. The Transformer removes this constraint: all tokens are processed simultaneously, turning a sequential problem into parallel matrix multiplications that GPUs excel at.
The second flaw is information decay. Even with LSTM gating, information from 500 tokens ago is weakly represented in the current hidden state. In contrast, the Transformer’s attention mechanism creates a direct path between any two tokens regardless of distance. The word “it” 300 tokens after “the cat” can attend directly to “cat” with no intermediate steps — the path length is always 1. This is why Transformers handle long documents, code files, and entire books in ways that RNNs fundamentally cannot.
Scaled Dot-Product Self-Attention In-depth
Self-attention is the mechanism that allows each token to gather information from all other tokens in the sequence. For each token, three vectors are computed by applying learned linear projections: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I carry?”). The attention weight between token i and token j is computed as the dot product of token i’s Query with token j’s Key, scaled by √dₖ (to prevent the dot products from growing large and saturating the softmax), then normalised via softmax across all tokens. The output for token i is the weighted sum of all Value vectors.
The scaling by √dₖ is critical. If the components of a query and a key are independent with zero mean and unit variance, their dot product has mean 0 and variance dₖ, so for dₖ=64 its typical magnitude (standard deviation) is √64 = 8. Without scaling, scores of this size push the softmax into saturated regions where gradients are near zero. Dividing by √dₖ brings the variance back to 1 and keeps gradients healthy, which is why the formula includes the √dₖ denominator.
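The effect of the scaling can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealisation of real projected embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_pairs = 64, 20_000

# Assumption: query/key components are i.i.d. standard normal.
q = rng.standard_normal((n_pairs, d_k))
k = rng.standard_normal((n_pairs, d_k))

raw = np.sum(q * k, axis=1)        # unscaled dot products
scaled = raw / np.sqrt(d_k)        # what the attention formula actually uses

raw_std = raw.std()
scaled_std = scaled.std()
print(f"std of q.k: {raw_std:.2f}, after scaling: {scaled_std:.2f}")
```

Under this assumption the unscaled spread comes out close to √64 = 8 and the scaled spread close to 1.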
The decoder must not attend to future tokens during training (it would “cheat” by reading the answer). Apply a causal mask: a lower-triangular matrix where position i can only attend to positions ≤i. In PyTorch: mask = torch.tril(torch.ones(n,n)). Forgetting this mask means the model sees the target during training but not during inference — causing a catastrophic train/eval mismatch where generated text is gibberish.
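Scaled dot-product attention with a causal mask can be sketched in NumPy as below. It uses a single head and random projection matrices; the function name and sizes are assumptions for illustration, and the final check confirms that changing a later token leaves earlier outputs untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention where position i sees only positions <= i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) similarity matrix
    n = X.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))      # lower-triangular causal mask
    scores = np.where(mask, scores, -np.inf)         # future positions get weight 0
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, weights = causal_self_attention(X, W_q, W_k, W_v)

# Changing the last token must not change any earlier output.
X_changed = X.copy()
X_changed[-1] += 10.0
out_changed, _ = causal_self_attention(X_changed, W_q, W_k, W_v)
```

Setting masked scores to negative infinity before the softmax, rather than zeroing weights afterwards, keeps each row a proper probability distribution.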
Multi-Head Attention In-depth
A single attention head can only attend to information from one representational subspace at a time. Multi-head attention runs h independent attention functions in parallel, each with its own learned projection matrices Wᵢ^Q, Wᵢ^K, Wᵢ^V. Different heads can learn to attend to different types of relationship: one head might track syntactic subject-verb agreement, another resolve coreferences (“he” → “John”), another focus on positional proximity, and yet another capture semantic similarity. All h head outputs are concatenated and projected back to d_model via W^O, a learned combination of what each head discovered.
The dimension of each head is dₖ = d_model / h, so the total computation is roughly the same as a single attention head with d_model dimensions. GPT-3 uses 96 attention heads with d_model=12,288, giving each head a 128-dimensional subspace. This is one of the key scaling choices: more heads means more types of relationship the model can track simultaneously.
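The head-splitting arithmetic can be sketched as follows. The implementation below packs all heads into full-width projection matrices and splits by reshaping, which is one common layout; the names and toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_k = d_model // n_heads                         # per-head dimension

    def split(M):                                    # (n, d_model) -> (heads, n, d_k)
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k) # (heads, n, n)
    heads = softmax(scores) @ V                      # (heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                              # project back to d_model

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)

gpt3_head_dim = 12288 // 96   # the GPT-3 figures quoted in the text
```

Because each head works in a d_model / h slice, adding heads changes how the width is partitioned rather than how much total computation is done.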
Positional Encoding Core
Self-attention without positional information is permutation equivariant: reorder the input tokens and the outputs are simply reordered in the same way, with identical attention weights between each pair of tokens. The Transformer therefore has no inherent concept of token order: “cat sat mat” and “mat sat cat” would produce the same pairwise attention weights if not corrected. To inject positional information, the original paper adds a fixed sinusoidal positional encoding to each token embedding before the first attention layer.
The sinusoidal encoding uses different frequencies for different embedding dimensions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Each position gets a unique vector. The model can learn to extract relative position from these encodings because PE(pos+k) can be expressed as a linear function of PE(pos) — the Transformer can learn to do relative position arithmetic via attention.
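The sinusoidal formulas and the linear-shift property can be checked directly. In the sketch below, the function name and the toy sizes (50 positions, 16 dimensions, shift k=7) are assumptions; the shift is verified with the angle-addition identities for sine and cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)

# PE(pos + k) is a fixed rotation of PE(pos) within each (sin, cos) pair.
k = 7
freqs = 1.0 / (10000 ** (2 * np.arange(8) / 16))
sin_k, cos_k = np.sin(freqs * k), np.cos(freqs * k)
shifted_sin = pe[:-k, 0::2] * cos_k + pe[:-k, 1::2] * sin_k
shifted_cos = pe[:-k, 1::2] * cos_k - pe[:-k, 0::2] * sin_k
```

The rotation coefficients depend only on the offset k, not on the absolute position, which is what lets a linear layer express relative offsets.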
Modern LLMs use Rotary Positional Embeddings (RoPE) instead — used in LLaMA, GPT-NeoX, and most recent architectures. RoPE encodes position by rotating the Q and K vectors by an angle proportional to position before computing dot products. The key advantage: attention scores naturally depend on the relative position between tokens (pos_i − pos_j), not absolute positions — enabling better generalisation to longer sequences than seen during training.
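The relative-position property of RoPE can be sketched as below. The `rope` helper rotates consecutive feature pairs, which is one common layout and is an assumption here (some implementations pair the first and second halves of the vector instead); the base of 10000 and the toy dimension are also assumptions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angle = pos * freqs
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(angle) - x_odd * np.sin(angle)
    out[1::2] = x_even * np.sin(angle) + x_odd * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

score_near_start = rope(q, 5) @ rope(k, 2)       # positions 5 and 2, offset 3
score_far_along = rope(q, 105) @ rope(k, 102)    # positions 105 and 102, offset 3
```

Both scores use the same offset of 3 and come out equal, even though the absolute positions differ by 100.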
Full Transformer Architecture In-depth
The complete Transformer stacks the same block N times: Multi-Head Attention → Add&Norm → Feed-Forward Network → Add&Norm. The Feed-Forward Network is a two-layer MLP applied independently to each token position: FFN(x) = max(0, xW₁+b₁)W₂+b₂, expanding from d_model to 4·d_model and then back. This 4× expansion and contraction lets the model compute non-linear transformations per token. The Add&Norm step adds the sub-layer's input as a residual connection and applies Layer Normalisation, which stabilises training at depth and mitigates the vanishing gradient problem (Ch 4.3).
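One block of this structure can be sketched in NumPy as below. To stay short, the sketch makes several simplifying assumptions: single-head attention stands in for multi-head attention, Layer Normalisation omits its learned gain and bias, and all weights are random.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token's features; learned gain and bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, p):
    # Single-head self-attention stands in for multi-head attention here.
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["W_o"]

def feed_forward(x, p):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.
    return np.maximum(0.0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def transformer_block(x, p):
    x = layer_norm(x + attention(x, p))      # Add & Norm around attention
    x = layer_norm(x + feed_forward(x, p))   # Add & Norm around the FFN
    return x

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
d_ff = 4 * d_model                           # the 4x expansion from the text
params = {name: rng.standard_normal((d_model, d_model)) * 0.1
          for name in ("W_q", "W_k", "W_v", "W_o")}
params.update({
    "W1": rng.standard_normal((d_model, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.standard_normal((d_ff, d_model)) * 0.1, "b2": np.zeros(d_model),
})

x = rng.standard_normal((n_tokens, d_model))
y = transformer_block(x, params)
```

The output has the same shape as the input, which is what allows N such blocks to be stacked without any glue layers.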
The original Transformer had two components. The encoder is bidirectional: every token attends to all other tokens in both directions. It produces contextualised representations of the input sequence. The decoder is causal: each output token attends only to previously generated tokens (enforced by the causal mask), plus cross-attention to all encoder outputs. This asymmetry is fundamental: the encoder understands the full input at once, while the decoder generates output token-by-token, attending to what it has already produced.
BERT vs GPT vs T5 In-depth
The original Transformer (2017) had both an encoder and decoder. Within a year, two research groups realised that you could use just one half and pretrain it on massive text corpora to create a general-purpose language model. Google AI Language introduced BERT (Bidirectional Encoder Representations from Transformers, 2018) using the encoder only, pretrained with Masked Language Modelling: randomly mask 15% of tokens and train the model to predict them using bidirectional context. OpenAI introduced GPT (Generative Pre-trained Transformer, 2018) using the decoder only, pretrained with standard next-token prediction. These two approaches define the landscape of modern NLP.
| Model | Architecture | Attention | Pre-training Task | Best For |
|---|---|---|---|---|
| BERT / RoBERTa | Encoder only | Bidirectional | Masked Language Model (MLM) | Classification, NER, QA |
| GPT-2/3/4, LLaMA | Decoder only | Causal (L→R) | Next token prediction | Generation, chatbots, LLMs |
| T5 / BART | Encoder-Decoder | Enc: bidirectional, Dec: causal | Text-to-text / denoising | Translation, summarisation |
“Attention Is All You Need” (2017) is arguably the most consequential paper in modern AI. BERT and GPT were both published in 2018. In 2020, GPT-3 demonstrated few-shot learning at a scale few had anticipated. Nearly every major AI system since 2018, including GPT-4, Claude, Gemini, LLaMA, Stable Diffusion, AlphaFold 2, and Whisper, relies on Transformer components.
∑ Chapter 4.7 Summary — The Transformer
- Self-attention: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)·V — each token attends to all others in one parallel operation
- Parallel processing: all tokens computed simultaneously → massive GPU efficiency gain over RNNs — enabled training at GPT-3/4 scale
- Multi-head: h parallel attention heads → each learns different relationship patterns (syntax, coreference, position, semantics)
- Positional encoding: must be added — attention alone is position-blind; modern LLMs use RoPE for relative position
- Transformer block: Multi-Head Attention → Add&Norm → FFN → Add&Norm — residual connections at every step
- BERT=encoder (bidirectional, understanding tasks), GPT=decoder (causal, generation), T5=encoder-decoder (seq2seq tasks)
Pre-training a large model once on vast data, then adapting it cheaply to specific tasks, is the defining paradigm of modern AI. Without transfer learning there would be no GPT-4, no BERT, no Stable Diffusion — the compute required to train each from scratch would be prohibitive. Understanding when to freeze, when to fine-tune, and when to use LoRA separates practitioners from theorists.
What is Transfer Learning? Core
Training a large neural network from scratch requires two things that most practitioners do not have: millions of labelled examples and a very large compute budget. Pre-training GPT-3 has been estimated to cost around $4.6M in cloud compute; BERT-large took 4 days on 64 TPU chips. Transfer learning solves this by splitting the problem into two phases. Pre-training: train on a large, general dataset (web-scale text, the 1.2M-image ImageNet training set, all of Wikipedia) until the model learns rich, reusable representations. Fine-tuning (or adaptation): start from those learned weights and update them toward a specific downstream task, with far less data and compute.
The foundational insight is that early layers of neural networks learn general features that transfer across tasks. In CNNs, layer 1 universally detects oriented edges regardless of whether the network was trained for cats, cars, or faces. In language models, early layers build syntactic representations applicable to any NLP task. This hierarchy of generality — general features at the bottom, task-specific at the top — is what makes transfer learning work. The analogy: a radiologist who spent years in medical school (pre-training on general anatomy) can specialise in chest X-ray reading (fine-tuning on specific task data) far faster than someone starting from scratch.
Feature Extraction Core
Feature extraction freezes every parameter of the pre-trained backbone and trains only a small task-specific head attached to the top. Because no gradients need to flow through the frozen backbone, forward passes do not require gradient tracking — making this dramatically faster and less memory-intensive than fine-tuning. For image tasks, the head is typically a linear classifier or small MLP on top of the backbone’s pooled output. For text tasks with BERT, the [CLS] token embedding (position 0 of the last hidden state) serves as a fixed-size sentence representation, since BERT is trained to pack sentence-level information into it during pre-training (via the Next Sentence Prediction objective).
Feature extraction works best when your task is similar to the pre-training distribution and you have limited labelled data. If your task is quite different from pre-training (e.g., medical imaging from a model pre-trained on natural images), the frozen features may not be informative enough — full fine-tuning is needed. The key question: do the features the backbone learned happen to be useful for your task?
    from transformers import BertModel, BertTokenizer
    import torch
    import torch.nn as nn

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    # FREEZE all BERT parameters: no gradients flow through the backbone
    for param in model.parameters():
        param.requires_grad = False

    # Only the task head is trainable
    classifier = nn.Linear(768, 2)  # binary sentiment: 768-dim BERT output → 2 classes

    text = "The model extracts semantic features."
    inputs = tokenizer(text, return_tensors='pt')

    with torch.no_grad():  # no gradient tracking needed for the frozen backbone
        outputs = model(**inputs)

    # [CLS] token embedding = sentence representation
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
    logits = classifier(cls_embedding)                  # shape: (1, 2)

    print(f"Embedding: {cls_embedding.shape}")  # torch.Size([1, 768])
    print(f"Logits: {logits.shape}")            # torch.Size([1, 2])

    # Trainable params: only the 768×2 + 2 = 1,538 classifier params
    trainable = sum(p.numel() for p in classifier.parameters())
    frozen = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / Total: {frozen + trainable:,}")  # 1,538 of roughly 110M

Full Fine-Tuning In-depth
Full fine-tuning updates all weights of the pre-trained model on the downstream task. The critical constraint is the learning rate. Pre-trained weights encode the result of a long and expensive optimisation over massive datasets; a large learning rate can destroy this knowledge within a few gradient steps, a process called catastrophic forgetting, in which new task gradients overwrite what pre-training learned. The standard remedy is a learning rate roughly an order of magnitude smaller than typical pre-training values (for example 2e-5 to 5e-5 for BERT/GPT-sized models, versus values on the order of 1e-4 to 3e-4 for pre-training).
Layer-wise learning rate decay (LLRD) refines this further: assign progressively smaller learning rates to earlier layers. The final layer gets the full (small) LR; each preceding layer gets the LR multiplied by a decay factor (typically 0.9 per layer). This preserves the most general representations in early layers while allowing later layers to adapt more aggressively. ULMFiT (Howard & Ruder, 2018) pioneered this idea under the name discriminative fine-tuning, and it remains common practice for fine-tuning large Transformers.
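The per-layer schedule is simple to compute. The sketch below assumes a 12-layer model, a top learning rate of 2e-5, and a decay of 0.9; the helper name and the `encoder_layers` variable in the closing comment are hypothetical.

```python
def llrd_learning_rates(n_layers, top_lr=2e-5, decay=0.9):
    """Per-layer learning rates; index 0 is the earliest layer, -1 the last."""
    return [top_lr * decay ** (n_layers - 1 - layer) for layer in range(n_layers)]

rates = llrd_learning_rates(n_layers=12)
for layer, lr in enumerate(rates):
    print(f"layer {layer:2d}: lr = {lr:.2e}")

# Hypothetical use with PyTorch (not run here): one parameter group per layer, e.g.
#   groups = [{"params": layer.parameters(), "lr": lr}
#             for layer, lr in zip(encoder_layers, rates)]
#   optimizer = torch.optim.AdamW(groups)
```

With these settings the earliest layer trains at roughly a third of the top layer's rate, since 0.9¹¹ ≈ 0.31.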
Using a learning rate ≥ 1e-4 for fine-tuning a pre-trained Transformer will typically cause catastrophic forgetting within 1–2 epochs. The model learns your task quickly but destroys its general language understanding. Signs: training loss drops fast but evaluation on any other task collapses. Fix: use LR ≤ 5e-5, add warmup (100–500 steps), and monitor performance on a held-out validation set from the original task distribution.
Parameter-Efficient Fine-Tuning (PEFT) In-depth
Full fine-tuning a 7B-parameter model requires the same GPU memory as pre-training it (weights, gradients, and optimizer state), typically 80GB+ even in fp16. For 175B (GPT-3) or 540B (PaLM) models, full fine-tuning is impractical without a large multi-GPU cluster. Parameter-Efficient Fine-Tuning (PEFT) methods address this by updating only a tiny fraction of parameters, usually 0.1–5%, while keeping the rest frozen, achieving near-full-fine-tuning quality at a fraction of the cost.
Adapter layers (Houlsby et al., 2019) insert small bottleneck networks between Transformer layers — project down to a small dimension, apply non-linearity, project back up. Only adapter parameters are trained. Prefix tuning (Li & Liang, 2021) prepends learnable virtual tokens to the Key and Value matrices at every layer — the model sees these as additional context but they are just learned parameter vectors. Prompt tuning simply prepends soft tokens to the input embedding — the simplest form, effective only for large models (>10B parameters).
| Method | Trainable Params | Storage per Task | Quality | Inference Overhead |
|---|---|---|---|---|
| Full fine-tuning | 100% | Full model copy | Best | None |
| Adapter layers | 0.5–5% | Small adapter | Good | Small (extra adapter layers) |
| Prefix tuning | 0.1–1% | Prefix vectors | Moderate | Small (extra KV) |
| Prompt tuning | <0.01% | Just prompts | Good (>10B only) | None |
| LoRA | 0.1–1% | Low-rank matrices | Very good | None (merge at inference) |
| QLoRA | 0.1–1% | Even smaller | Good | None (4-bit base model) |
LoRA — Low-Rank Adaptation In-depth
Hu et al. (2021) built on the observation that large pre-trained language models have a low intrinsic dimension: they can be adapted effectively by optimising within a subspace far smaller than the full parameter space. Their hypothesis is that the weight update ΔW induced by fine-tuning also has low intrinsic rank: task adaptation does not require updating all d² values of a d×d matrix independently, because the update lies in a lower-dimensional subspace.
LoRA exploits this by decomposing ΔW = A·B, where A is d×r and B is r×d with r ≪ d (typically r=4, 8, or 16). The original weight W is frozen; only A and B are trained. After training, the adaptation can be merged: W’ = W + (α/r)·A·B, where α is a scaling hyperparameter. This is a simple matrix addition, so inference has zero added latency. For a typical d=4096 matrix: full ΔW = 4096² ≈ 16.7M parameters; LoRA with r=8: A+B = 2×4096×8 = 65,536 parameters (256× smaller).
Rank r is the main hyperparameter. Too low (r=1–2): insufficient expressivity, and the model underfits the task. Too high (r=64+): parameter count and memory grow toward those of full fine-tuning, usually with diminishing quality gains. Standard starting point: r=8 for most tasks, try r=4 if memory is tight, r=16–32 if quality is insufficient. Also important: applying LoRA only to Q and V (not K) is a common default, but applying it to all projection matrices (Q, K, V, O) often improves quality with minimal extra cost.
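The decomposition, the merge, and the parameter arithmetic can be sketched in NumPy as below. The toy sizes, the value of α, and the choice to zero-initialise B (so that ΔW starts at zero) are assumptions for illustration; the random "trained" B simply stands in for the result of optimisation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8                     # toy sizes; the text's example uses d=4096, r=8

W = rng.normal(size=(d, d))                # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(d, r))    # trainable low-rank factor
B = np.zeros((r, d))                       # trainable; zero-init so the update starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus scaled low-rank path, without forming the d x d update."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(3, d))
out_start = lora_forward(x, W, A, B, alpha, r)    # identical to x @ W at initialisation

B = rng.normal(scale=0.01, size=(r, d))           # stand-in for a trained B
out_trained = lora_forward(x, W, A, B, alpha, r)

# Merge once after training: a single dense matrix, no extra inference cost.
W_merged = W + (alpha / r) * A @ B

def lora_param_counts(d, r):
    return d * d, 2 * d * r                       # full update vs. A plus B

full_params, lora_params = lora_param_counts(4096, 8)
```

Multiplying by the merged matrix gives the same outputs as the two-path forward pass, which is why a merged LoRA model runs at the speed of the base model.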
RLHF — Reinforcement Learning from Human Feedback Core
The final alignment step that transforms a raw language model into a helpful assistant is RLHF — the technique behind ChatGPT, Claude, and Gemini. A pre-trained LLM knows how to predict text distributions but has no concept of “helpful” or “harmful”. RLHF shapes the model’s outputs toward human preferences through a three-stage pipeline: supervised fine-tuning on demonstrations, training a reward model from pairwise human judgements, and then using reinforcement learning to optimise the LLM to maximise the learned reward.
The RL stage uses Proximal Policy Optimisation (PPO) with a KL divergence constraint: the policy (the LLM being trained) cannot stray too far from the SFT baseline. Without this constraint, the model learns to “game” the reward model, generating outputs that score high on the RM but are not actually helpful (reward hacking). The KL term penalises outputs that diverge greatly from the pre-RLHF distribution, maintaining general capability while nudging toward preferred behaviour. Anthropic’s Constitutional AI (CAI) replaces human harmlessness labels in Stage 2 with AI-generated feedback guided by a written set of principles, which makes preference data much cheaper to scale.
SFT Data Requirements
- High-quality human demos
- InstructGPT: ~13k examples
- Diverse tasks and formats
- Quality >> quantity
Reward Model
- Same architecture as LLM
- Final layer: scalar score
- InstructGPT: ~33k comparisons
- Bradley-Terry ranking model
Modern Alternatives
- DPO: skip RL, direct preference
- Constitutional AI (Claude)
- RLAIF: AI feedback at scale
- RM-free: RLVR (verifiable rewards, reasoning)
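The two objectives in the pipeline above can be written down compactly. The sketch below shows the Bradley-Terry pairwise loss for the reward model and a KL-penalised reward for the RL stage; the function names, the coefficient β=0.1, and the use of a single-sample log-probability difference as the KL estimate are assumptions for illustration.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected) for one preference pair (stable form)."""
    z = -(r_chosen - r_rejected)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def kl_penalised_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return rm_score - beta * (logp_policy - logp_reference)

# The reward model is pushed to score the human-preferred response higher.
loss_agree = bradley_terry_loss(r_chosen=2.0, r_rejected=0.0)      # RM agrees with the label
loss_disagree = bradley_terry_loss(r_chosen=0.0, r_rejected=2.0)   # RM disagrees

# A response the policy likes far more than the reference does gets its reward reduced.
reward_no_drift = kl_penalised_reward(1.0, logp_policy=-5.0, logp_reference=-5.0)
reward_drifted = kl_penalised_reward(1.0, logp_policy=-2.0, logp_reference=-5.0)
```

The penalty term is what discourages reward hacking: a high reward-model score is worth less if it was reached by moving far from the SFT distribution.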
∑ Chapter 4.8 Summary — Transfer Learning & Fine-Tuning
- Transfer learning: pre-train on large data, adapt cheaply — reuse expensive knowledge; early layers learn universal features
- Feature extraction: freeze backbone, train head only — minutes not days, works with hundreds of labelled examples
- Fine-tuning: update all weights with small LR (2e-5 to 5e-5) — catastrophic forgetting risk; use warmup + LLRD
- LoRA: ΔW = A·B with rank r ≪ d; trains well under 1% of parameters, approaches full fine-tuning quality, zero inference overhead once merged
- RLHF: SFT → Reward Model → PPO with KL constraint — the recipe behind ChatGPT, Claude, and Gemini
- QLoRA = LoRA on a 4-bit quantised base model; the QLoRA paper fine-tunes a 65B model on a single 48GB GPU
Generative models learn the distribution of data well enough to create new data indistinguishable from real examples. Every AI-generated image, synthesised voice, and hallucinated protein structure is the output of a generative model. VAEs introduced structured latent spaces, GANs introduced adversarial training, and diffusion models combined the best of both — producing the current state of the art.
Generative vs Discriminative Models Core
Most models covered so far are discriminative: given an input x, predict a label y. They learn P(y|x) — the conditional distribution of outputs given inputs. A discriminative model draws a decision boundary in input space but has no model of what the input data actually looks like.
Generative models instead learn P(x) — the distribution of the data itself. Once you have a good model of P(x), you can sample from it: draw a new x that was never in the training set but looks like it could have been. This is qualitatively different from classification: you are not deciding which bucket an input belongs to, you are learning what valid inputs look like and manufacturing new ones. Applications span every domain: faces, voices, molecules, code, music, 3D shapes.
Variational Autoencoders (VAEs) In-depth
Kingma & Welling (2013) introduced the VAE, one of the first deep generative models trainable end-to-end with backpropagation. A regular autoencoder compresses input x to a latent code z, then reconstructs x. That is useful for compression but not generation, because the latent space has unpredictable gaps: points between training examples decode to garbage. The VAE fixes this by making the encoder probabilistic: instead of outputting a point z, it outputs a Gaussian distribution N(μ, σ²). The network is then trained to keep this distribution close to a standard normal N(0, I) (via a KL divergence penalty), forcing all latent representations to occupy a continuous, organised neighbourhood around the origin.
The reparameterisation trick is the key engineering insight that makes training possible. You cannot backpropagate through a sampling operation z ~ N(μ, σ²) directly, because sampling is stochastic. The trick: write z = μ + σ·ε where ε ~ N(0,1) is sampled externally. Now μ and σ are deterministic outputs of the encoder, gradients can flow through them, and the stochasticity is isolated in ε which has no parameters to update. This simple algebraic rearrangement is what makes VAE training feasible.
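The trick and the KL penalty can be illustrated numerically. The sketch below uses NumPy rather than an autograd framework, so the gradients dz/dμ = 1 and dz/dσ = ε are checked with finite differences; the two-dimensional μ and σ values are arbitrary assumptions.

```python
import numpy as np

def reparameterise(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

# Many samples: z has the mean and spread the encoder asked for.
eps = rng.standard_normal((100_000, 2))
z = reparameterise(mu, sigma, eps)

# With eps held fixed, dz/dmu = 1 and dz/dsigma = eps (finite-difference check).
eps_one = eps[0]
h = 1e-6
dz_dmu = (reparameterise(mu + h, sigma, eps_one) - reparameterise(mu, sigma, eps_one)) / h
dz_dsigma = (reparameterise(mu, sigma + h, eps_one) - reparameterise(mu, sigma, eps_one)) / h

kl_prior = kl_to_standard_normal(np.zeros(2), np.ones(2))   # zero exactly at the prior
kl_encoder = kl_to_standard_normal(mu, sigma)                # positive away from it
```

Because ε is drawn outside the computation that depends on μ and σ, an autograd framework can differentiate z with respect to both, which is all the trick requires.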
Latent Space Properties Core
The VAE’s KL penalty pulls every encoded distribution toward the origin, so neighbouring clusters in latent space overlap and the gaps between them largely disappear. This creates a structured, continuous latent space where most points decode to a plausible output. You can smoothly interpolate between any two encoded examples: a straight line from the latent code of a young face to the code of an old face passes through intermediate codes that decode to faces of intermediate age. Regular autoencoders generally cannot do this, because they leave large empty voids between training examples.
Generative Adversarial Networks (GANs) In-depth
Goodfellow et al. (2014) proposed a radically different generative approach: instead of maximising a likelihood, frame generation as a two-player game. A Generator G takes random noise z as input and outputs synthetic data G(z). A Discriminator D receives either a real data point x or a fake G(z) and must decide which is which. G is trained to fool D; D is trained to detect fakes. G never sees real data directly: its only training signal is the gradient passed back through D’s judgement of its samples. The result of this adversarial dynamic, when it works, is a Generator that produces data indistinguishable from real training examples, because any distinguishable fake will be caught by D and penalised.
The theoretical optimum is a Nash equilibrium: G generates data exactly matching the true distribution P(x), and D can do no better than random guessing (P(real) = 0.5 for all inputs). In practice, reaching this equilibrium is notoriously difficult. The training is unstable, sensitive to hyperparameters, and prone to mode collapse — covered in the next section.
Notable GAN Variants
- DCGAN (2015) — CNN-based GAN
- cGAN — conditional generation
- CycleGAN — image-to-image
- StyleGAN (2018–22) — faces
Training Challenges
- Mode collapse (next section)
- Vanishing G gradient (D too strong)
- D/G balance is fragile
- No convergence guarantee
Stabilisation Techniques
- WGAN — Wasserstein distance
- Spectral normalisation
- Progressive growing
- Gradient penalty (WGAN-GP)
GAN Training Challenges & Mode Collapse Core
Mode collapse is the most notorious GAN failure mode. If G discovers one type of output that consistently fools D, it stops exploring the rest of the distribution and produces the same output (or a small set) regardless of the input noise. The discriminator adapts, G shifts to another single mode, and training cycles without covering the full data distribution. The GAN loss gives no signal that this is happening: the loss values look normal while G has abandoned most of the training distribution.
Training instability arises from the balance requirement: D and G must improve at similar rates. If D becomes too powerful early in training, G receives near-zero gradients and cannot improve (D correctly labels everything with high confidence, so the loss for G becomes flat). If G outpaces D early, D provides no meaningful feedback. The WGAN (Wasserstein GAN) addresses both problems by replacing the original loss with the Wasserstein distance, which provides smoother gradients and is more robust to D/G imbalance.
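The near-zero generator gradient can be made concrete. In the sketch below, s is the discriminator's logit on a fake sample, so D(G(z)) = sigmoid(s). The first function is the gradient of the original minimax generator loss log(1 − D(G(z))); the second, included only for contrast and not discussed in the text above, is the gradient of the non-saturating alternative −log D(G(z)) also proposed by Goodfellow et al.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def saturating_g_grad(s):
    """d/ds of log(1 - sigmoid(s)), the original minimax generator loss."""
    return -sigmoid(s)

def non_saturating_g_grad(s):
    """d/ds of -log(sigmoid(s)), the non-saturating alternative."""
    return -(1.0 - sigmoid(s))

# s very negative means D is confident the sample is fake.
for s in (-10.0, -2.0, 0.0):
    print(f"s={s:6.1f}  saturating={saturating_g_grad(s):.6f}  "
          f"non-saturating={non_saturating_g_grad(s):.6f}")

# At the Nash equilibrium D outputs 0.5 everywhere, so its loss is log 4.
d_loss_at_equilibrium = -(math.log(0.5) + math.log(1.0 - 0.5))
```

When D confidently rejects a fake, the original loss gives G a gradient close to zero, which is the flat-loss situation described above.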
Diffusion Models In-depth
Ho et al. (2020) introduced Denoising Diffusion Probabilistic Models (DDPM), which now power Stable Diffusion, DALL-E 3, Midjourney, and Sora. The key idea is elegant: define a forward process that gradually corrupts data by adding Gaussian noise over T steps until the image becomes pure noise, then train a neural network to learn the reverse process — predicting what noise was added at each step, and thus denoising one step at a time. Generation is simply running the reverse process starting from random noise.
Unlike GANs, diffusion training is stable: the objective is a simple regression (predict the noise added at step t), there is no adversarial game to balance, and the model sees all noise levels during training. Unlike VAEs, there is no explicit latent space to constrain, so the generative quality is not limited by the bottleneck. The tradeoff is sampling speed: generating one image requires T=100–1000 denoising steps, each requiring a full forward pass through the network. Faster samplers such as DDIM (deterministic sampling) cut this to a few dozen steps, and distilled models such as SDXL-Turbo reduce it to 1–4 steps, removing much of the speed disadvantage.
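The forward process and the regression target can be sketched in NumPy as below. The linear noise schedule from 1e-4 to 0.02 over T=1000 steps follows Ho et al. (2020); the 4×4 array standing in for an image and the always-zero "model" are assumptions used only to exercise the loss.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule as in Ho et al. (2020)
alpha_bar = np.cumprod(1.0 - betas)      # fraction of signal variance left at step t

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def training_loss(eps_model, x0, rng):
    """One Monte Carlo sample of E[ ||eps - eps_model(x_t, t)||^2 ]."""
    t = rng.integers(0, T)                       # random noise level
    eps = rng.standard_normal(x0.shape)          # the noise the model must predict
    x_t = q_sample(x0, t, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))                 # stand-in for an image

def zero_model(x_t, t):
    # Placeholder for a neural network: always predicts "no noise".
    return np.zeros_like(x_t)

loss_untrained = training_loss(zero_model, x0, rng)
```

By the final step almost none of the original signal remains, so sampling can start from pure Gaussian noise and apply the learned denoiser step by step.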
Generative Model Comparison Core
| Property | VAE | GAN | Diffusion |
|---|---|---|---|
| Training stability | Stable | Unstable (adversarial) | Stable |
| Sample quality | Blurry (over-smooth) | Sharp (when works) | State-of-the-art |
| Latent space | Structured, continuous | Less structured | Not explicit |
| Sampling speed | Fast (1 pass) | Fast (1 pass) | Slow (T=100-1000) |
| Controllability | Good (interpolation) | Moderate | Excellent (conditioning) |
| Mode coverage | Good | Mode collapse risk | Good |
| Best use today | Compression, anomaly detection | Video gen, GAN editing | Image/video synthesis SOTA |
| Examples | VQ-VAE, VQ-VAE-2 | StyleGAN-3, BigGAN | Stable Diffusion, DALL-E 3, Sora |
∑ Chapter 4.9 Summary — Generative Models
- Generative models learn P(x) — the data distribution itself — enabling new data synthesis; discriminative models learn P(y|x) (labels from inputs)
- VAE: encoder → (μ, σ) → z = μ + σ·ε → decoder; KL penalty forces continuous structured latent space — enables interpolation and generation
- Reparameterisation trick: z = μ + σ·ε, ε~N(0,1) — makes sampling differentiable for backpropagation
- GAN: Generator G(z) fools Discriminator D(x) — minimax game; Nash eq: D(x)=0.5 everywhere; notorious for mode collapse and training instability
- WGAN, spectral normalisation, gradient penalty: stabilisation techniques that substantially improved GAN training (e.g. StyleGAN-3, BigGAN)
- Diffusion: forward process adds Gaussian noise over T steps; reverse process (εθ) learns to denoise — training objective: E[||ε − εθ(xₜ,t)||²]
- Diffusion = current SOTA for image/video generation — Stable Diffusion, DALL-E 3, Sora — stable training, no mode collapse, excellent conditioning
🎓 Domain 4 Complete — Deep Learning & Neural Networks
- Ch 4.1 Perceptron to MLP: weighted sum + step function. XOR killed neural nets for 15 years; hidden layers + non-linearity solved it by learning hierarchical representations.
- Ch 4.2 Activation Functions: ReLU for CNNs, GELU for Transformers — sigmoid only for binary output. Non-linearity is what makes stacked layers more powerful than one.
- Ch 4.3 Backpropagation: chain rule through the computational graph. Vanishing gradients: the sigmoid derivative is at most 0.25, so gradients shrink by up to 0.25ⁿ across n layers; ReLU avoids this with gradient = 1 for positive inputs.
- Ch 4.4 Training Deep Networks: He init, BatchNorm, Dropout, AdamW + warmup–cosine LR — the engineering stack that makes 100+ layer networks trainable in practice.
- Ch 4.5 CNNs: local receptive fields + weight sharing. ResNet y=F(x)+x solved depth degradation — enabled going from 16 to 152+ layers without degradation.
- Ch 4.6 RNNs & LSTMs: hidden state = sequential memory. LSTM gating (forget/input/output) solves vanishing gradients; attention preview leads directly to the Transformer.
- Ch 4.7 Transformer: Attention(Q,K,V)=softmax(QKᵀ/√dₖ)V — parallel, direct long-range dependencies. GPT=decoder, BERT=encoder, T5=both.
- Ch 4.8 Transfer Learning: pre-train then adapt. LoRA trains roughly 0.1–1% of parameters with zero inference overhead. RLHF (SFT→RM→PPO) creates aligned helpful LLMs.
- Ch 4.9 Generative Models: VAE = structured latent space. GAN = adversarial game. Diffusion = learn to reverse Gaussian noise — Stable Diffusion, DALL-E 3, Sora.
Domain 4 is the mathematical engine behind every frontier AI system. The Transformer (Ch 4.7) is the single most important architecture in AI today — GPT-4, Claude, Gemini, DALL-E, AlphaFold, Whisper, and virtually every LLM runs on it. Domain 5 (NLP & LLMs) explores what happens when you scale the Transformer to trillions of tokens. Domain 8 (Agentic AI) shows what happens when you give it tools, memory, and the ability to act in the world.