AI Foundation · Domain 06 · Chapter 6.1

Image Fundamentals & Classical Computer Vision

How images become numbers, how classical algorithms extract meaning from pixels, and why deep learning ultimately replaced them all.

6.1

Chapter 6.1

A digital image is a 3D array of numbers — width × height × channels. Everything in computer vision, from a simple edge detector to Stable Diffusion, is ultimately operations on this array. Understanding what those numbers represent is where computer vision begins.

Image Representation In-depth

Every digital image is, at its core, a grid of numbers. Each cell in this grid is a pixel — the atomic unit of a digital image, representing a single colour sample at a specific grid position. Before any algorithm can process an image, you must understand how those pixels are encoded and what they represent.

Grayscale images use a single channel: with the usual 8 bits per channel, each pixel is an integer from 0 (black) to 255 (white), giving 256 possible intensity levels. RGB images use three channels (Red, Green, Blue), each ranging 0–255, producing 256³ ≈ 16.7 million possible colours per pixel. Every pixel in an RGB image is defined by exactly three numbers.

In deep learning, the tensor layout matters enormously. PyTorch uses channels-first ordering, (C, H, W), so a 224×224 RGB image becomes shape (3, 224, 224). NumPy image arrays, including those produced by OpenCV and PIL, are channels-last: (H, W, C). Confusing these axes is one of the most common bugs in computer vision code.
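As a minimal sketch of the two layouts, the snippet below builds a small dummy array (the 4×6 size is an arbitrary choice for illustration) and moves the channel axis with NumPy's `transpose`.

```python
import numpy as np

# A dummy 4x6 RGB image in channels-last (H, W, C) layout.
hwc = np.zeros((4, 6, 3), dtype=np.uint8)
hwc[..., 0] = 255  # fill the first (red) channel

# Move the channel axis to the front to get the (C, H, W) layout PyTorch expects.
chw = np.transpose(hwc, (2, 0, 1))

# And back again, e.g. before displaying the array as an image.
hwc_again = np.transpose(chw, (1, 2, 0))

print(hwc.shape, chw.shape)
```

After the transpose, `chw[0]` is the whole red plane, whereas `hwc[0]` is the first row of pixels. Mixing up the two layouts changes what an index means without raising an error.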

🔢

MNIST

28 × 28 × 1

784 values

🖼️

CIFAR-10

32 × 32 × 3

3,072 values

📸

ImageNet

224 × 224 × 3

150,528 values

🎨

Stable Diffusion

512 × 512 × 3

786,432 values

Data types matter. Raw images are stored as uint8 (0–255), which is compact but unsuitable for gradient computation. Neural networks are typically fed float32 values, either scaled to [0.0, 1.0] or additionally normalised with ImageNet statistics: subtract the per-channel mean [0.485, 0.456, 0.406] and divide by the std [0.229, 0.224, 0.225] (both defined on the [0, 1] scale). This normalisation centres values around zero, helping gradient-based optimisation converge faster.

RGB Image as 3D Tensor — 3 channels × H × W array of pixel values

import torch
import torchvision.transforms as T
from PIL import Image
import numpy as np

# Load image
img = Image.open("cat.jpg")
print(f"PIL size: {img.size}, mode: {img.mode}")  # (width, height), 'RGB'

# Convert to NumPy: shape (H, W, 3), dtype uint8
img_np = np.array(img)
print(f"NumPy shape: {img_np.shape}")  # (480, 640, 3)
print(f"Pixel [0,0]: R={img_np[0,0,0]} G={img_np[0,0,1]} B={img_np[0,0,2]}")

# Convert to PyTorch tensor: shape (3, H, W), float32
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                            # HWC uint8 → CHW float32, scales to [0,1]
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean
                std=[0.229, 0.224, 0.225])   # ImageNet std
])
tensor = transform(img)
print(f"Tensor shape: {tensor.shape}")  # torch.Size([3, 224, 224])
print(f"Value range: [{tensor.min():.2f}, {tensor.max():.2f}]")  # roughly [-2, 2]

⚠️ Common Pitfall

OpenCV loads images as BGR, not RGB. When you use cv2.imread(), the channels are Blue-Green-Red. Feeding BGR to a model trained on RGB will silently produce wrong results. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
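For an array already in memory, the same swap amounts to reversing the last axis. The sketch below uses plain NumPy so that it runs without OpenCV installed; `bgr_to_rgb` is a hypothetical helper name, not an OpenCV function.

```python
import numpy as np

def bgr_to_rgb(img_bgr: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an (H, W, 3) array: BGR -> RGB."""
    return img_bgr[..., ::-1].copy()

# One pure-blue pixel as OpenCV would store it: (B, G, R) = (255, 0, 0).
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)
print(rgb[0, 0])  # blue is now in the last position
```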

Colour Spaces Core

RGB is the natural colour space for displays — your screen uses red, green, and blue LEDs. But RGB mixes luminance (brightness) and chrominance (colour) together, making it poor for many computer vision tasks. Different colour spaces separate these properties, each suited to specific algorithms.

🔴🟢🔵

RGB

Red, Green, Blue — each 0–255

Natural for display hardware
Mixes brightness with colour
Default for most image libraries

🌈

HSV

Hue (0°–360°), Saturation, Value

Separates colour from brightness
Hue: red=0°, green=120°, blue=240°
Ideal for colour-based segmentation

🎨

LAB

L=lightness, A=green↔red, B=blue↔yellow

Perceptually uniform
Equal ΔE = equal visual difference
Best for colour similarity & style transfer

Grayscale conversion is not a simple average of R, G, B. The human eye is most sensitive to green, so the standard formula is: Luminance = 0.299R + 0.587G + 0.114B. Green contributes nearly 60% of perceived brightness. This is why most classical CV algorithms operate on grayscale — it reduces data by 3× while preserving the structural information humans rely on.
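The weighted sum can be sketched in a few lines of NumPy. The helper name `rgb_to_gray` and the four test pixels are illustrative choices.

```python
import numpy as np

# Luminance weights from the formula above (R, G, B).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(img_rgb: np.ndarray) -> np.ndarray:
    """Weighted sum over the channel axis of an (H, W, 3) RGB array."""
    return img_rgb.astype(np.float64) @ LUMA_WEIGHTS

# Pure red, pure green, pure blue and white, as a 1x4 image.
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]]],
                  dtype=np.uint8)
gray = rgb_to_gray(pixels)
print(gray.round(1))
```

Pure green comes out roughly twice as bright as pure red and about five times as bright as pure blue, which a plain average of the three channels would not capture.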

Colour Spaces — RGB, HSV, LAB for different CV tasks

🏗️ Real-World Deployment

In production, HSV colour filtering is still used for fast pre-processing — e.g., isolating red traffic lights before running a neural network detector. It's computationally cheap and works reliably when lighting is controlled. LAB is used in image quality assessment and style transfer where perceptual accuracy matters.
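A rough per-pixel sketch of hue-based filtering, using Python's standard `colorsys` module. The `is_red` helper and its thresholds are illustrative assumptions, not tuned values. Note also that OpenCV's 8-bit HSV images store hue as 0–179 rather than 0–360, so thresholds need rescaling there.

```python
import colorsys

def hue_degrees(r: int, g: int, b: int) -> float:
    """Hue of an 8-bit RGB colour, in degrees [0, 360)."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360.0

def is_red(r: int, g: int, b: int,
           hue_tol: float = 15.0, min_sat: float = 0.5, min_val: float = 0.3) -> bool:
    """True if the pixel is a reasonably saturated, reasonably bright red."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hue = h * 360.0
    # Red sits at 0 degrees, so the accepted hue range wraps around 360.
    near_red = hue <= hue_tol or hue >= 360.0 - hue_tol
    return near_red and s >= min_sat and v >= min_val

print(round(hue_degrees(255, 0, 0)), round(hue_degrees(0, 255, 0)), round(hue_degrees(0, 0, 255)))
print(is_red(200, 30, 30), is_red(30, 200, 30))
```

Because the test depends mostly on hue, a dim red and a bright red both pass, which is the property that makes HSV convenient for colour segmentation.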

Image Filters & Convolutions In-depth

A filter (or kernel) is a small matrix, typically 3×3 or 5×5, that slides across an image computing a weighted sum at each position. This operation is called convolution, and it is arguably the most important operation in computer vision. (Strictly, deep learning libraries compute cross-correlation, which skips the kernel flip of textbook convolution; the distinction stops mattering once the weights are learned.) Classical filters are hand-designed for specific effects. Deep learning CNNs learn their filters from data, but the underlying mathematical mechanism is the same.
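The sliding weighted sum can be written out directly. The following is a deliberately naive sketch (explicit loops, no padding, single channel); `filter2d` is a hypothetical helper name, and real code would use an optimised library routine instead.

```python
import numpy as np

def filter2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (valid region only) and return the weighted sums.

    The kernel is not flipped, so strictly this is cross-correlation; for a
    symmetric kernel such as the Gaussian below the result is the same.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# 3x3 Gaussian kernel; the weights sum to 1 so overall brightness is preserved.
gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16

# A single bright pixel in the middle of a dark 5x5 image.
image = np.zeros((5, 5))
image[2, 2] = 16.0

blurred = filter2d(image, gaussian)
print(blurred)
```

The single bright pixel is spread over its 3×3 neighbourhood in the shape of the kernel, and the total intensity (16) is unchanged.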

🌫️

Gaussian Blur

Smooths out noise by averaging each pixel with its neighbours using bell-shaped weights. The centre pixel has the highest weight; further pixels contribute less.

Reduces high-frequency noise
Pre-processing step before edge detection
Kernel: [[1,2,1],[2,4,2],[1,2,1]] / 16

🔪

Sharpening

Enhances edges by amplifying the centre pixel and subtracting neighbours — effectively adding the difference between the pixel and its surroundings.

Large centre weight (5), negative neighbours (-1)
Kernel: [[0,-1,0],[-1,5,-1],[0,-1,0]]
Increases contrast at edges

📐

Sobel Filter

Detects edges by computing the intensity gradient. Separate kernels for horizontal and vertical edges.

Horizontal: [[-1,-2,-1],[0,0,0],[1,2,1]]
Vertical: [[-1,0,1],[-2,0,2],[-1,0,1]]
Foundation for Canny edge detection

🧹

Median Filter

Replaces each pixel with the median of its neighbourhood. Non-linear — not technically a convolution.

Excellent for salt-and-pepper noise
Preserves edges better than Gaussian
Cannot be learned by a standard CNN layer

Classical Image Filters — hand-designed convolution kernels
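To make the kernels listed above concrete, here is a small sketch that applies the two Sobel kernels to a synthetic vertical step edge and a 3×3 median filter to a single noisy pixel. The helper names and the toy images are assumptions made for illustration.

```python
import numpy as np

def filter2d(image, kernel):
    """Naive valid-region sliding weighted sum (no kernel flip)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def median3x3(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    out = image.copy()
    for y in range(1, image.shape[0] - 1):
        for x in range(1, image.shape[1] - 1):
            out[y, x] = np.median(image[y - 1:y + 2, x - 1:x + 2])
    return out

# Sobel kernels: one responds to left-right changes, the other to top-bottom changes.
SOBEL_VERTICAL_EDGES = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_HORIZONTAL_EDGES = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

# Vertical step edge: dark left half, bright right half.
step = np.zeros((5, 6))
step[:, 3:] = 1.0
gx = filter2d(step, SOBEL_VERTICAL_EDGES)
gy = filter2d(step, SOBEL_HORIZONTAL_EDGES)
print(gx[0], gy[0])

# Salt noise: one outlier pixel in an otherwise flat patch.
noisy = np.full((5, 5), 10.0)
noisy[2, 2] = 255.0
cleaned = median3x3(noisy)
print(cleaned[2, 2])
```

Only the kernel for vertical edges responds to the step, and the median filter removes the outlier entirely, whereas a blur would smear it over the neighbouring pixels.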

The convolution operation is identical in classical CV and deep learning. The only difference: classical engineers design the kernel weights by hand, while CNNs learn them via backpropagation. This single insight connects 40 years of computer vision history.

Edge Detection Core

Edges are boundaries between regions of different intensity — they mark where objects begin and end, where surfaces change orientation, and where textures shift. Edges are the most information-dense locations in an image: they encode shape, structure, and object boundaries while discarding uniform regions that carry little useful signal.

The Canny Edge Detector (John Canny, 1986) remains the gold standard for classical edge detection. It's a five-step pipeline, each step carefully designed to balance noise rejection against edge localisation. Nearly every classical CV system used Canny as a preprocessing step — from document scanning to lane detection to augmented reality registration.

Canny Edge Detector — 5-step classical edge detection pipeline
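The final two Canny stages, double thresholding and hysteresis, are the least intuitive, so here is a minimal sketch of just that part. It assumes the gradient magnitude has already been computed and thinned by the earlier stages; the function name, the iterative growth strategy and the toy values are illustrative choices rather than the reference implementation.

```python
import numpy as np

def hysteresis_threshold(magnitude: np.ndarray, low: float, high: float) -> np.ndarray:
    """Keep strong pixels (>= high) plus weak pixels (>= low) connected to them."""
    strong = magnitude >= high
    weak = magnitude >= low
    edges = strong.copy()
    h, w = edges.shape
    while True:
        # Grow the current edge set by one pixel in all 8 directions.
        padded = np.pad(edges, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(edges)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + h, dx:dx + w]
        # Only weak pixels are allowed to join the edge set.
        updated = edges | (grown & weak)
        if np.array_equal(updated, edges):
            return edges
        edges = updated

magnitude = np.array([
    [0, 0, 0, 0, 0],
    [9, 5, 5, 0, 5],   # one strong pixel, a weak chain, and an isolated weak pixel
    [0, 0, 0, 0, 0],
], dtype=float)

edges = hysteresis_threshold(magnitude, low=4, high=8)
print(edges[1])
```

The weak pixels touching the strong one survive, while the isolated weak pixel at the end of the row is discarded as probable noise.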

The Harris Corner Detector extends edge detection to find corners — points where intensity changes significantly in two directions simultaneously. Corners are more distinctive than edges (an edge looks the same along its length), making them better landmarks for matching between images. Harris corners are still used in camera calibration and simple tracking systems.

Classical Feature Descriptors Core

Before deep learning, the central challenge of computer vision was: how do you represent an image patch as a compact, distinctive vector that's robust to scale, rotation, and illumination changes? Researchers hand-crafted feature descriptors that encode local image structure — these dominated from roughly 2000 to 2012.

🔑

SIFT (2004)

Scale-Invariant Feature Transform — David Lowe

128-dimensional descriptor per keypoint
Invariant to scale, rotation, minor illumination
Used in: panorama stitching, SLAM, image matching

📊

HOG (2005)

Histogram of Oriented Gradients — Dalal & Triggs

Divide image into cells, count gradient orientations
Concatenate histograms → feature vector
Used in: pedestrian detection, object recognition

⚡

ORB (2011)

Oriented FAST + Rotated BRIEF — patent-free

Fast, rotation-invariant binary descriptor
10–100× faster than SIFT
Used in: real-time mobile feature matching

HOG Descriptor — gradient orientation histograms capture local shape
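The HOG recipe (gradients, per-cell orientation histograms, concatenation) can be sketched as follows. This simplified version assumes 8×8 cells and 9 unsigned orientation bins, and it omits the block normalisation step of the full method; the function names are hypothetical.

```python
import numpy as np

def cell_histogram(cell: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    gy, gx = np.gradient(cell.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # fold into [0, 180)
    bins = np.minimum((orientation / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

def hog_features(image: np.ndarray, cell_size: int = 8, n_bins: int = 9) -> np.ndarray:
    """Concatenate the cell histograms of a grayscale image into one vector."""
    h, w = image.shape
    hists = [
        cell_histogram(image[y:y + cell_size, x:x + cell_size], n_bins)
        for y in range(0, h - cell_size + 1, cell_size)
        for x in range(0, w - cell_size + 1, cell_size)
    ]
    return np.concatenate(hists)

# A 16x16 horizontal ramp: intensity grows left to right, so every gradient points along x.
ramp = np.tile(np.arange(16.0), (16, 1))
features = hog_features(ramp)
print(features.shape, features[:9])
```

For the ramp, all of the gradient energy falls in the first orientation bin of every cell, which is how the descriptor summarises local shape.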

Classical CV Pipeline Core

The classical computer vision pipeline dominated the field from approximately 1980 to 2012. Every step was designed and tuned by hand — a brittle, domain-specific process that required deep expertise and did not generalise well across tasks or visual domains.

Raw Image (camera input) → Preprocessing (resize, blur, normalise) → Feature Extraction (SIFT, HOG, Haar) → Feature Selection (PCA, filter) → Classification (SVM, AdaBoost) → Post-processing (NMS, smoothing) → Output (label, bbox)

⚠️ Why this pipeline failed

Each stage was optimised independently — features that were good for one task (pedestrians) performed poorly on another (faces, cars, medical images). Every new domain required starting over: new features, new tuning, new expertise. This is why end-to-end learning was so revolutionary.

Why Deep Learning Won In-depth

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large-Scale Visual Recognition Challenge with AlexNet — a deep convolutional neural network trained on two GPUs. It achieved a top-5 error rate of 15.3%, compared to 26.2% for the best classical method. That 41% relative improvement in a single year was the most dramatic result in the competition's history, and it permanently changed computer vision.

Three fundamental advantages explain why deep learning dominates:

🧠

Learned Features

No manual feature engineering. Optimal representations emerge automatically from data — the network discovers what matters.

🔗

End-to-End

Directly maps pixels to outputs. No intermediate representation bottleneck — the entire pipeline is optimised jointly.

📈

Scale

Performance improves with more data and compute. Classical methods plateau — deep learning keeps getting better.

Classical CV	Deep Learning CV
Hand-crafted features (HOG, SIFT)	Learned features from data
Domain expertise required	No domain expertise needed
Features optimised separately from classifier	End-to-end optimisation
Plateaus with more data	Improves with more data
Brittle to domain shift	Generalises across domains
Fast inference, tiny models	Slower inference, large models

ImageNet Competition — Deep Learning Dominates from 2012

🏗️ Real-World Deployment

Despite deep learning's dominance, classical CV is not dead. Edge devices with limited compute (microcontrollers, drones) still use Canny, HOG, and ORB. Classical methods also serve as fast pre-filters before running expensive neural networks — e.g., using simple motion detection to trigger a deep learning classifier only when something moves in frame.

Chapter 6.1 — Summary

In PyTorch, images are (C, H, W) tensors; convert uint8 pixels to float32 in [0,1] and normalise before feeding them to neural networks
RGB mixes colour and brightness — HSV separates hue from value for colour-based algorithms
Classical filters are hand-designed convolution kernels — CNNs learn these automatically from data
Canny edge detector: 5 steps from blur → gradient → NMS → threshold → hysteresis
HOG and SIFT: hand-crafted local feature descriptors that dominated pre-2012 computer vision
AlexNet (2012): 41% error reduction proved end-to-end learned features beat manual engineering

6.2

Chapter 6.2

CNN Architectures for Vision

From AlexNet's breakthrough in 2012 to EfficientNet's compound scaling in 2019, each generation of CNN architecture solved a specific problem — depth, efficiency, scale. Understanding why each design was invented is more important than memorising layer counts.

CNN Architecture Recap Core

Before diving into specific architectures, recall the three inductive biases that make CNNs ideal for images (covered in Domain 4, Ch 4.5):

🔍

Local Connectivity

Each neuron sees only a small spatial patch. This exploits spatial locality — nearby pixels are more related than distant ones.

🔄

Weight Sharing

The same filter slides across every position. This gives translation equivariance — a cat is detected regardless of where it appears.

🏗️

Hierarchical Representation

Stacked conv layers build edges → textures → parts → objects. Each layer composes features from the layer below.

The core building block pattern used in almost every CNN: Conv → BatchNorm → ReLU → Pooling. As you go deeper, spatial dimensions shrink (via pooling/stride) while channel depth grows (more filters).
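The shrinking spatial size follows from the standard output-size formula, out = floor((in + 2·padding − kernel) / stride) + 1. The sketch below applies it to a hypothetical four-stage stack (3×3 convolution with padding 1, then 2×2 pooling with stride 2); the channel counts are illustrative.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
shapes = []
for out_channels in (64, 128, 256, 512):
    size = conv_out_size(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_out_size(size, kernel=2, stride=2)             # 2x2 pooling halves it
    shapes.append((out_channels, size, size))

for shape in shapes:
    print(shape)
```

Each stage halves the height and width while the number of channels doubles.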

Feature Map Evolution — spatial dimensions shrink, channel depth grows

AlexNet & VGG In-depth

AlexNet (Krizhevsky et al., 2012) was the CNN that launched the deep learning revolution. It didn't just beat the competition on ImageNet: it reached a top-5 error of 15.3%, against 26.2% for the runner-up. Many of the ideas it popularised became standard practice.

🏆

AlexNet — Key Innovations

GPU training: trained on 2 GTX 580 GPUs, which made a network of this size practical to train
ReLU activation — 6× faster training than tanh/sigmoid
Dropout (p=0.5) in FC layers — regularisation against overfitting
Data augmentation — random crops, horizontal flips, colour jitter
Architecture: 5 conv + 3 FC layers, 60M parameters
11×11 first-layer filters — large receptive field to capture coarse features

🎨

VGG — The "Beautiful" Architecture

3×3 convolutions only — architectural simplicity as a virtue
Two 3×3 convs = same receptive field as one 5×5, but fewer parameters and more non-linearity
VGG-16: 16 weight layers, 138M parameters
VGG-19: 19 layers, 144M parameters
Still widely used as a feature extractor backbone
Simple, understandable — the "ImageNet of architectures"
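The claim that two 3×3 convolutions match one 5×5 convolution with fewer parameters can be checked with a little arithmetic. The channel count of 256 is an arbitrary illustrative value, and biases are ignored.

```python
def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

def stacked_receptive_field(kernel: int, n_layers: int) -> int:
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return 1 + n_layers * (kernel - 1)

C = 256
two_3x3 = 2 * conv_weights(3, C, C)
one_5x5 = conv_weights(5, C, C)

print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))
print(two_3x3, one_5x5, two_3x3 / one_5x5)
```

Both options see a 5×5 input window, but the stacked version needs 18C² weights instead of 25C² (28% fewer) and applies a non-linearity twice.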

VGG-16 — 5 blocks of 3×3 convolutions, 138M parameters

Architecture	Year	Layers	Params	Top-5 Error	Kernel Sizes	Key Innovation
LeNet-5	1998	5	60K	n/a (not an ImageNet model; under 1% error on MNIST)	5×5	First practical CNN
AlexNet	2012	8	60M	15.3%	11×11, 5×5, 3×3	GPU, ReLU, Dropout
VGG-16	2014	16	138M	7.3%	3×3 only	Pure 3×3, deep simple
VGG-19	2014	19	144M	7.1%	3×3 only	Even deeper VGG

Why VGG Still Matters

Despite being "outdated" in accuracy, VGG is still the default feature extractor for perceptual loss (style transfer, super-resolution) and neural texture synthesis. Its intermediate features are remarkably good at capturing visual similarity.

Inception & GoogLeNet In-depth

Szegedy et al. (Google, 2014) asked: how do you go deeper more efficiently? VGG's approach of stacking 3×3 convolutions worked, but parameters and computation grew together. The Inception module solved this with a radically different idea.

The Inception module applies multiple filter sizes (1×1, 3×3, 5×5) in parallel at each layer, plus a max-pooling branch. Outputs are concatenated along the channel dimension. The network learns which spatial scale to attend to at each layer.

The critical insight: 1×1 convolutions as dimensionality reduction. Before the expensive 3×3 and 5×5 convolutions, a 1×1 "bottleneck" conv reduces channel count dramatically. This made GoogLeNet just 5M parameters — 12× fewer than AlexNet with better accuracy.
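A quick weight count shows why the bottleneck helps. The channel numbers below (192 input channels, reduced to 16 before a 5×5 convolution with 32 outputs) are chosen for illustration and should be treated as an assumption rather than a statement about any specific layer.

```python
def conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

c_in, c_reduced, c_out = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, c_in, c_out)

# 1x1 bottleneck first, then the 5x5 convolution on the reduced channels.
bottleneck = conv_weights(1, c_in, c_reduced) + conv_weights(5, c_reduced, c_out)

print(direct, bottleneck, round(direct / bottleneck, 1))
```

With these numbers the bottleneck version uses nearly ten times fewer weights, and the same ratio applies to multiply-adds per output position.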

Inception Module — parallel multi-scale convolutions with 1×1 bottlenecks

GoogLeNet Efficiency: 22 layers · 5M parameters · 6.7% top-5 error. That is 12× fewer parameters than AlexNet (60M) with less than half its top-5 error (6.7% vs 15.3%). The 1×1 bottleneck is the key.

ResNet & DenseNet In-depth

ResNet (He et al., 2015) solved the most puzzling problem in deep learning at the time: deeper networks had higher training error. Not overfitting — the networks simply couldn't learn. The solution was deceptively simple.

The residual block learns the change rather than the full mapping: y = F(x) + x. The skip connection allows gradients to flow directly through the identity path, enabling 50, 101, and even 152-layer networks.
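The idea can be sketched in NumPy. This toy version uses fully connected layers instead of convolutions and leaves out batch normalisation and the activation after the addition, so it is a simplification of the real block; the names and sizes are illustrative.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = F(x) + x, where F(x) = relu(x @ w1) @ w2 is the learned residual."""
    return relu(x @ w1) @ w2 + x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))  # batch of 2 feature vectors with 4 features each

# With all-zero weights the residual F(x) vanishes and the block is the identity.
zeros = np.zeros((4, 4))
y_identity = residual_block(x, zeros, zeros)

# With small random weights the output is a small correction on top of x.
w1 = rng.normal(scale=0.01, size=(4, 4))
w2 = rng.normal(scale=0.01, size=(4, 4))
y = residual_block(x, w1, w2)
print(np.abs(y - x).max())
```

Because doing nothing corresponds to zero weights, extra layers can fall back to the identity mapping instead of having to learn it, which is the intuition behind why very deep residual networks remain trainable.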

DenseNet (Huang et al., 2017) extended this idea: connect every layer to all subsequent layers within a block. With L layers in a dense block, there are L(L+1)/2 direct connections. Each layer receives feature maps from ALL preceding layers, enabling maximum feature reuse with fewer parameters.
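The connection count and the channel growth inside a dense block are easy to tabulate. The sketch assumes a block with 6 layers, 64 input channels and a growth rate of 32; the first two values are assumptions made for illustration.

```python
def dense_connections(num_layers: int) -> int:
    """Direct connections in a dense block with L layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2

def dense_block_channels(k0: int, growth_rate: int, num_layers: int):
    """Input channels seen by each layer, and the block's output channels."""
    inputs = [k0 + i * growth_rate for i in range(num_layers)]
    output = k0 + num_layers * growth_rate
    return inputs, output

inputs, output = dense_block_channels(k0=64, growth_rate=32, num_layers=6)
print(dense_connections(6))
print(inputs, output)
```

Each layer only produces 32 new channels, but it reads every feature map produced before it, which is where the feature reuse comes from.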

ResNet Skip vs DenseNet Dense Connectivity

ResNet Variants

ResNet-50: 25.6M params, 76.1% top-1 — the workhorse
ResNet-101: 44.5M params, 77.4% top-1
ResNet-152: 60.2M params, 78.3% top-1
Bottleneck block: 1×1→3×3→1×1 (reduces computation)

DenseNet Advantages

Feature reuse: fewer parameters than ResNet for same accuracy
Better gradient flow: direct paths to every layer
Implicit deep supervision: early layers get strong gradients
Growth rate k=32: each layer adds 32 new channels

MobileNet & Efficient CNNs In-depth

ResNet and VGG are too heavy for mobile phones, embedded systems, and real-time applications. MobileNet (Howard et al., 2017) introduced depthwise separable convolutions — splitting one expensive operation into two cheap ones.

Standard Convolution

One operation: K×K×C_in×C_out filter
Cost: K²·C_in·C_out·H·W
For 3×3, 256→512: 1,179,648 ops/pixel

Depthwise Separable

Step 1, depthwise: K×K filter per channel (C_in filters)
Step 2, pointwise: 1×1 conv mixing channels (C_in→C_out)
Cost: (K²·C_in + C_in·C_out)·H·W
For 3×3, 256→512: 133,376 ops/pixel (~9× cheaper)
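The two cost formulas can be evaluated directly to reproduce the figures above. The function names are illustrative; the counts are multiply-accumulate operations per output position.

```python
def standard_conv_ops(kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates per output position for a standard convolution."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_ops(kernel: int, c_in: int, c_out: int) -> int:
    """Depthwise (one K x K filter per input channel) plus pointwise (1x1) cost."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

standard = standard_conv_ops(3, 256, 512)
separable = depthwise_separable_ops(3, 256, 512)
print(standard, separable, round(standard / separable, 2))
```

The ratio works out to 1 / (1/C_out + 1/K²), so for 3×3 kernels it approaches 9 as the number of output channels grows.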

Depthwise Separable Convolution: MobileNet's efficiency trick (~9× fewer operations)

MobileNetV2 — Inverted Residuals

MobileNetV2 (2018) added inverted residual blocks: expand channels with 1×1 → depthwise 3×3 → compress back with 1×1. The skip connection goes between the narrow layers (inverted compared to ResNet). Also: SqueezeNet, ShuffleNet, and GhostNet target edge deployment.

EfficientNet & NAS Core

Tan & Le (Google, 2019) observed that previous architectures scaled only one dimension at a time — depth (more layers), width (more channels), or resolution (larger images). EfficientNet scales all three simultaneously with a fixed compound coefficient.

Compound Scaling

depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ
subject to α · β² · γ² ≈ 2 (the FLOP budget roughly doubles per step)

φ controls how much to scale.
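A short numeric check of the constraint, using the base coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15). These constants are not stated in the text above and are included here as an outside reference; the function names are illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (EfficientNet paper)

def compound_scale(phi: float):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi: float) -> float:
    """FLOPs grow linearly with depth and quadratically with width and resolution."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

for phi in (1, 2, 3):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(flops_multiplier(phi), 2))
```

With these coefficients each unit increase in φ multiplies the FLOPs by about 1.92, close to the target of 2.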

EfficientNet Compound Scaling — width, depth, and resolution scaled jointly

Neural Architecture Search (NAS) was used to find the optimal base architecture (EfficientNet-B0). NAS automates architecture design by searching over a space of possible layers, connections, and hyperparameters. The EfficientNet family (B0–B7) uses the same base architecture at different compound scales.

Model	Top-1 Acc	Params	FLOPs	Key Feature
ResNet-50	76.0%	25M	4.1B	Skip connections
EfficientNet-B0	77.1%	5.3M	0.39B	NAS + compound scaling
EfficientNet-B3	81.6%	12M	1.8B	Medium scale
EfficientNet-B7	84.4%	66M	37B	Largest, SOTA 2019

Data Augmentation In-depth

Training a CNN from scratch typically requires a very large labelled dataset, and labelled data is expensive. Data augmentation creates diverse training samples by applying label-preserving transformations. A flipped cat is still a cat. A slightly rotated stop sign is still a stop sign.

Critical Rule

Augmentations must be label-preserving. Rotating a "6" by 180° turns it into a "9", and mirroring most digits produces shapes that are no longer valid digits, so avoid large rotations and flips for digit recognition. Always think about whether the transform preserves meaning for your specific task.

Data Augmentation — 8 transforms to increase training diversity

Standard Augmentations (Always Use)

RandomResizedCrop: random area crop + resize to target
RandomHorizontalFlip: 50% chance mirror
ColorJitter: brightness, contrast, saturation, hue
Normalize: ImageNet mean/std — technically preprocessing, always required

Advanced Augmentations

RandAugment: randomly pick N of 14 transforms — simple, effective
Cutout / RandomErasing: mask random rectangle — forces robustness
CutMix: paste patch from one image onto another, blend labels
Mixup: linear blend of two images + their labels

import torch
import torchvision.transforms as T
import torchvision.transforms.v2 as T2

# Standard training augmentation (ImageNet-style)
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),  # random crop, resize to 224
    T.RandomHorizontalFlip(p=0.5),                 # 50% chance flip
    T.ColorJitter(brightness=0.4, contrast=0.4,    # colour distortion
                  saturation=0.4, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])

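# Strong augmentation (RandAugment, used in EfficientNet training)
# For PIL inputs, ToImage must come before ToDtype/Normalize, which operate on tensors
strong_transform = T2.Compose([
    T2.RandomResizedCrop(224),
    T2.RandomHorizontalFlip(),
    T2.RandAugment(num_ops=2, magnitude=9),        # random 2 of 14 augmentations
    T2.ToImage(),                                  # PIL → tensor image (uint8, CHW)
    T2.ToDtype(torch.float32, scale=True),         # uint8 [0,255] → float32 [0,1]
    T2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),
    T2.RandomErasing(p=0.25),                      # cutout, applied to the tensor
])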

# Validation — no augmentation, just resize + normalise
val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])
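Mixup, listed among the advanced augmentations above, is not a per-image transform, so it is usually applied to a batch after loading. A minimal NumPy sketch follows; the function name is hypothetical, and drawing the mixing weight from Beta(0.2, 0.2) is one common choice rather than a requirement.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam: float):
    """Blend two images and their one-hot labels with the same weight lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two toy "images" (all ones vs all zeros) with one-hot labels for 2 classes.
img_a, label_a = np.ones((3, 4, 4)), np.array([1.0, 0.0])
img_b, label_b = np.zeros((3, 4, 4)), np.array([0.0, 1.0])

mixed_img, mixed_label = mixup(img_a, label_a, img_b, label_b, lam=0.25)
print(mixed_img[0, 0, 0], mixed_label)

# During training the weight is typically sampled per batch (assumed alpha = 0.2).
rng = np.random.default_rng(0)
lam = rng.beta(0.2, 0.2)
```

The blended label is a soft target, so the loss function must accept class probabilities rather than a single class index.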

Test-Time Augmentation (TTA)

At inference, augment the test image multiple times (flip, multi-crop), run each version through the model, and average the predictions. TTA often improves accuracy by around 1–2 percentage points with no retraining, at the cost of one forward pass per augmented view. It is common in competitions and medical imaging.
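A minimal sketch of flip-based TTA in NumPy. `toy_model` is a made-up stand-in for a real classifier, deliberately sensitive to left/right orientation so that the effect of averaging is visible.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())
    return e / e.sum()

def toy_model(image_chw: np.ndarray) -> np.ndarray:
    """Hypothetical classifier: compares brightness of the left and right halves."""
    half = image_chw.shape[2] // 2
    left = image_chw[:, :, :half].mean()
    right = image_chw[:, :, half:].mean()
    return softmax(np.array([left, right]))

def predict_with_tta(model_fn, image_chw: np.ndarray) -> np.ndarray:
    """Average class probabilities over the original and horizontally flipped view."""
    views = [image_chw, image_chw[:, :, ::-1]]  # flip along the width axis
    probs = np.stack([model_fn(view) for view in views])
    return probs.mean(axis=0)

image = np.zeros((3, 4, 4))
image[:, :, :2] = 1.0  # bright left half

single = toy_model(image)
averaged = predict_with_tta(toy_model, image)
print(single.round(3), averaged.round(3))
```

Averaging removes the model's dependence on orientation in this toy case. With a real model the gain is smaller but comes from the same mechanism of cancelling view-specific errors.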

Transfer Learning in CV In-depth

Never train from scratch. This is the single most important practical rule in computer vision. ImageNet pre-trained models have already learned universal visual features — edges, textures, shapes — that transfer remarkably well to nearly any image task.

🧊

Feature Extraction

Freeze the entire backbone. Train only a new classification head.

Best when: small dataset (<1K images)
Training: very fast (few params)
Risk: underfitting if task is very different

🔓

Partial Fine-Tuning

Freeze early layers, fine-tune later layers + head.

Best when: moderate dataset (1K–10K)
Rationale: early layers = universal edges; later layers = task-specific
Common: freeze first 2–3 blocks

🔥

Full Fine-Tuning

Unfreeze all layers with a small learning rate.

Best when: large dataset (>10K images)
Key: use differential LR — backbone 10–100× smaller than head
Risk: catastrophic forgetting if LR too high

Learning Rate Rule of Thumb Pre-trained backbone: 1e-5 to 1e-4 · New head: 1e-3 to 1e-2 The backbone has already learned good features — large updates would destroy them. The new head needs to learn from scratch.

import torch
import torch.nn as nn
import torchvision.models as models

# ── Option 1: Feature Extraction (freeze backbone) ──
model = models.resnet50(weights='IMAGENET1K_V2')  # pretrained
for param in model.parameters():
    param.requires_grad = False                     # freeze everything

# Replace final classification layer
num_classes = 10    # your task: 10 classes (not 1000)
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc has requires_grad=True — very few parameters to train

# ── Option 2: Full Fine-tuning with differential LR ──
model2 = models.resnet50(weights='IMAGENET1K_V2')
model2.fc = nn.Linear(model2.fc.in_features, num_classes)

# Differential learning rates: backbone gets a 100× smaller LR than the head
backbone_params = [p for n, p in model2.named_parameters() if not n.startswith('fc.')]
head_params     = list(model2.fc.parameters())

optimizer = torch.optim.AdamW([
    {'params': backbone_params, 'lr': 1e-5},  # small LR for pre-trained layers
    {'params': head_params,     'lr': 1e-3},  # larger LR for new head
], weight_decay=1e-4)

Common Mistake

Forgetting to match the pre-processing. If the pre-trained model was trained with ImageNet normalisation (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), you must use the same normalisation at inference. Mismatched preprocessing is the #1 silent bug in transfer learning.

Chapter 6.2 — Summary

AlexNet (2012): ReLU + dropout + GPU training — kicked off the deep learning era in CV
VGG: 3×3 convolutions only — architectural simplicity with depth; still the go-to feature extractor for perceptual loss
Inception module: parallel multi-scale convolutions (1×1, 3×3, 5×5) + 1×1 bottlenecks; GoogLeNet = 22 layers, only 5M params
ResNet: y = F(x) + x skip connections solve the degradation problem and enable 100+ layer networks
DenseNet: every layer connects to all subsequent — maximum feature reuse with fewer parameters
MobileNet: depthwise separable convolutions need roughly 9× fewer operations than standard 3×3 convolutions, which makes them well suited to mobile and edge deployment
EfficientNet: compound scaling (depth × width × resolution) + NAS = best accuracy/efficiency trade-off
Data augmentation: always use random crop + flip + colour jitter; advanced: RandAugment, CutMix, Cutout
Transfer learning: always start with ImageNet pre-trained weights; use differential learning rates (backbone: 1e-5, head: 1e-3)

6.3

Chapter 6.3

Object Detection

Classification tells you what. Detection tells you what AND where — outputting a variable number of bounding boxes, each with a class label and confidence score. The evolution from R-CNN's 47 seconds per image to YOLO's 90 FPS is one of the most dramatic speedups in deep learning history.

The Detection Task Core

Image classification assigns ONE label to an entire image: "cat." Object detection finds MULTIPLE objects, drawing a bounding box around each and labelling it: "cat at (x,y,w,h) with 97% confidence, dog at (x',y',w',h') with 89% confidence."

The output format is a list of detections, each containing: (class_id, confidence, x_centre, y_centre, width, height). Coordinates are typically normalised to 0–1 relative to image dimensions.
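Drawing or evaluating a box usually needs pixel corner coordinates, so a conversion from the normalised centre format is a common first step. A small sketch, with a hypothetical helper name:

```python
def centre_to_corners(xc: float, yc: float, w: float, h: float,
                      img_w: int, img_h: int):
    """Convert a normalised (x_centre, y_centre, width, height) box to pixel (x1, y1, x2, y2)."""
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return x1, y1, x2, y2

# A box centred in a 640x480 image, covering half its width and a quarter of its height.
box = centre_to_corners(0.5, 0.5, 0.5, 0.25, img_w=640, img_h=480)
print(box)
```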

🏷️

Variable Object Count

An image may contain 0, 1, or 100 objects. The model must handle all cases — unlike classification which always outputs one label.

📏

Multi-Scale Objects

A pedestrian 20px tall and a bus 400px wide must both be detected. Scale variation is one of the hardest challenges.

🫣

Occlusion & Overlap

Objects hide behind other objects. The detector must still find partially visible objects and avoid merging overlapping ones.

Vision Task Hierarchy — Classification → Detection → Segmentation

Sliding Window & Anchor Boxes Core

The naive approach: slide a window across the image at multiple scales and aspect ratios, run a classifier on each crop. This is conceptually simple but catastrophically slow — hundreds of thousands of crops per image, each requiring a full forward pass.

Anchor boxes solved this elegantly. Instead of sliding a window, divide the feature map into a grid and place pre-defined bounding box shapes (anchors) at each cell. The model predicts offsets from these anchors: (δx, δy, δw, δh) plus an objectness score and class probabilities. Anchors are designed to cover common shapes — wide rectangles for cars, tall rectangles for people, squares for faces.
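To show how predicted offsets turn an anchor into a final box, here is a sketch of one common parameterisation, in which centre shifts are scaled by the anchor size and width and height are scaled exponentially. This is the R-CNN-style encoding and is an assumption here, since detectors differ in the exact form they use.

```python
import math

def decode_anchor(anchor, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor (x_centre, y_centre, w, h)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa          # shift the centre by a fraction of the anchor width
    y = ya + dy * ha          # shift the centre by a fraction of the anchor height
    w = wa * math.exp(dw)     # scale width; exp keeps it positive
    h = ha * math.exp(dh)     # scale height
    return x, y, w, h

anchor = (50.0, 50.0, 20.0, 40.0)  # a tall anchor, e.g. for a standing person

unchanged = decode_anchor(anchor, (0.0, 0.0, 0.0, 0.0))
adjusted = decode_anchor(anchor, (0.1, -0.2, math.log(2), 0.0))
print(unchanged, adjusted)
```

Zero offsets return the anchor itself, so the network only has to learn small corrections around sensible default shapes.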

Anchor Boxes — 3 pre-defined shapes per grid cell, model predicts offsets

R-CNN Family In-depth

The R-CNN family represents the two-stage approach to detection: first propose regions that might contain objects, then classify each region. Three papers in roughly two years took the approach from painfully slow to near real-time.

🐢

R-CNN (2014)

Selective search: ~2000 region proposals
Warp each to 227×227
CNN feature extraction per region
SVM classifier + bbox regression
Speed: 47 sec/image

🐇

Fast R-CNN (2015)

CNN on entire image once → shared feature map
Project proposals onto feature map (RoI Pooling)
Classify + regress from RoI features
Bottleneck: selective search still external
Speed: 2 sec/image

🚀

Faster R-CNN (2015)

Replace selective search with Region Proposal Network (RPN)
RPN shares CNN backbone with detection head
End-to-end trainable
73.2% mAP on PASCAL VOC
Speed: 0.2 sec/image (5 FPS)

R-CNN Family Evolution — from 2000 crops to shared feature map + RPN

Two-Stage vs One-Stage

Faster R-CNN is a two-stage detector: stage 1 proposes regions (RPN), stage 2 classifies them. Two-stage detectors are generally more accurate but slower. One-stage detectors (YOLO, SSD) skip the proposal step entirely — they predict boxes and classes in a single pass.

YOLO: You Only Look Once In-depth

Redmon et al. (2015) made a radical move: frame detection as a single regression problem. Divide the image into an S×S grid, and in a single forward pass, predict all bounding boxes and class probabilities simultaneously. No proposals, no second stage — just one neural network, one pass, done.

The result: 45 FPS on 2015 hardware, roughly 2,000× faster than R-CNN's 47 seconds per image. The trade-off was accuracy: YOLO struggled with small objects and with nearby objects falling in the same grid cell. But the speed was revolutionary for real-time applications like autonomous driving and robotics.

YOLOv3 — grid division, multi-anchor predictions, multi-scale detection

Version	Year	Speed	mAP (VOC/COCO)	Key Feature
YOLOv1	2015	45 FPS	63.4% VOC	Single-pass regression
YOLOv2	2016	40 FPS	78.6% VOC	Anchor boxes, batch norm, multi-scale
YOLOv3	2018	30 FPS	33.0% COCO	Multi-scale detection, Darknet-53
YOLOv5	2020	30+ FPS CPU	50.7% COCO	PyTorch, Ultralytics, easy API
YOLOv8	2023	80+ FPS	53.9% COCO	Anchor-free, SOTA single-stage

from ultralytics import YOLO
import cv2

# Load pre-trained model (downloads automatically)
model = YOLO('yolov8n.pt')   # 'n'=nano, 's'=small, 'm'=medium, 'l'=large, 'x'=xlarge

# Run inference on image
results = model('street.jpg', conf=0.5, iou=0.45)

# Parse results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls  = int(box.cls[0])        # class index
        conf = float(box.conf[0])     # confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box coords
        print(f"{model.names[cls]}: {conf:.2f} at [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Visualise
result_image = results[0].plot()
cv2.imwrite('detected.jpg', result_image)

# Fine-tune on custom dataset
model.train(
    data='custom_dataset.yaml',   # YAML with train/val paths and class names
    epochs=100,
    imgsz=640,
    batch=16,
    device=0                       # GPU device
)

SSD & Single-Stage Detectors Core

SSD (Liu et al., 2016) combined YOLO's single-pass speed with an elegant multi-scale approach. Instead of predicting from just one feature map, SSD extracts predictions from 6 different feature map scales within the CNN. Early (larger) feature maps detect small objects; later (smaller) feature maps detect large objects.

This mitigated YOLO's weakness with small objects while keeping single-pass speed: SSD reaches 74.3% mAP on PASCAL VOC at 59 FPS with 300×300 input.

SSD — predictions from multiple feature map scales in a single pass

Modern Single-Stage Detectors

RetinaNet (2017): focal loss — solves class imbalance (background vs objects)
FCOS (2019): fully convolutional, anchor-free — predicts directly per pixel
DETR (2020): transformer-based detection — no anchors, no NMS needed

Two-Stage vs One-Stage Summary

Two-stage (Faster R-CNN): more accurate, slower (~5 FPS)
One-stage (YOLO, SSD): faster (30–90 FPS), slightly less accurate
Modern one-stage detectors have nearly closed the accuracy gap

NMS & Detection Metrics In-depth

Three concepts underpin detection evaluation: IoU measures box overlap quality, NMS removes duplicate detections, and mAP quantifies overall detector performance.

📐

IoU (Intersection over Union)

Measures overlap between predicted and ground-truth box.

IoU > 0.5: correct (PASCAL VOC)
IoU > 0.75: stricter standard
COCO: average over 0.5:0.05:0.95

🧹

NMS (Non-Max Suppression)

Removes duplicate overlapping boxes:

Sort boxes by confidence ↓
Keep highest-scoring box
Remove remaining boxes whose IoU with the kept box exceeds a threshold (e.g. 0.45)
Repeat for remaining boxes

📊

mAP (Mean Average Precision)

The standard detection metric:

Per class: precision-recall curve
AP = area under PR curve
mAP = mean AP across all classes
COCO mAP averages over 10 IoU thresholds

Detection Metrics
IoU = |A ∩ B| / |A ∪ B|
AP = ∫₀¹ p(r) dr (area under the precision-recall curve)
mAP = (1/C) Σ_c AP_c (mean over C classes)
PASCAL VOC uses IoU@0.5. COCO averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which makes it a much harder benchmark.
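
As a rough sketch of how AP is computed for one class, assume the detections have already been sorted by confidence and matched to ground truth at a fixed IoU threshold, so each one is flagged TP or FP. The function name and the all-point interpolation used here are illustrative choices; benchmark toolkits differ in their exact interpolation rules.

```python
def average_precision(is_tp, num_gt):
    """AP for one class.

    is_tp  : TP/FP flags for detections sorted by descending confidence
    num_gt : number of ground-truth objects of this class
    """
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_tp:
        tp += int(flag)
        fp += int(not flag)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Precision envelope: make p(r) non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the stepwise precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


print(round(average_precision([True, False, True], num_gt=2), 3))  # 0.833
```

mAP is then the plain mean of this value over all classes, and the COCO variant additionally averages over the ten IoU thresholds.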

IoU and Non-Maximum Suppression — removing duplicate detections

NMS Limitations

Standard NMS struggles when objects genuinely overlap (e.g., a crowd of people). It may suppress correct detections. Solutions: Soft-NMS (reduces confidence instead of removing), DETR (transformer-based, no NMS needed), or learnable NMS modules.

Chapter 6.3 — Summary

Detection output: list of (class, confidence, x, y, w, h) per object in image
Anchor boxes: pre-defined shapes at each grid cell; model predicts offsets + class + objectness
R-CNN family: propose → extract → classify; Faster R-CNN adds RPN for end-to-end training (0.2s/img)
YOLO: single forward pass predicts all objects — real-time at 30–90 FPS; YOLOv8 is SOTA single-stage
SSD: YOLO-like but uses multiple feature map scales → better small object detection
IoU: intersection / union measures predicted vs ground truth box overlap; threshold 0.5 (VOC) or 0.5:0.95 (COCO)
NMS: remove duplicate overlapping boxes keeping highest-confidence detections; Soft-NMS for crowded scenes
Modern trend: anchor-free (FCOS) and transformer-based (DETR) detectors eliminating hand-designed components

6.4

Chapter 6.4

Image Segmentation

Detection draws boxes. Segmentation colours every pixel. From U-Net's elegant encoder-decoder to Meta's Segment Anything Model, segmentation has evolved from a niche medical imaging task to a foundation capability — segment any object in any image with a single click.

Segmentation Types Core

All three segmentation types assign labels at the pixel level — but they differ fundamentally in what they distinguish.

🎨

Semantic Segmentation

Every pixel labelled with a class: road, car, sky, person.

All instances of same class get same colour
Two cats = one merged mask
Output: H×W label map (one int per pixel)

🧩

Instance Segmentation

Every pixel labelled with class AND instance ID.

Two cats = two separate masks (Cat₁, Cat₂)
Only "things" (countable objects)
Output: list of (mask, class, conf) per object

🌐

Panoptic Segmentation

Unifies semantic (stuff) + instance (things).

Every pixel: class + optional instance ID
"Stuff" (sky, road) + "things" (car₁, car₂)
Output: complete scene understanding

Semantic vs Instance vs Panoptic Segmentation

Semantic Segmentation In-depth

The core challenge: classification networks reduce spatial resolution (pooling, striding) to build semantic understanding. Segmentation needs to restore it — predict a class for every single pixel. The solution: encoder-decoder architectures.

Encoder (Downsampling)

Standard CNN backbone (ResNet, VGG)
Loses spatial resolution via pooling/stride
Builds semantic understanding: "what" is here
224×224 → 7×7 feature map

Decoder (Upsampling)

Reverse the downsampling
Restores spatial resolution to original size
Upsampling methods:
• Bilinear interpolation: simple, no learned params
• Transposed conv: learnable, can cause checkerboard artefacts

Skip connections are critical: they connect encoder layers to decoder layers at matching resolutions. Without them, the decoder must reconstruct fine spatial detail (edges, boundaries) from the bottleneck alone — a lossy process. With skips, fine spatial detail flows directly from encoder to decoder.

Loss Functions

Pixel-wise cross-entropy: standard classification loss per pixel
Dice loss: 1 − 2·|A∩B| / (|A|+|B|), i.e. one minus the Dice coefficient; better for class imbalance
Combined: CE + Dice often used together in practice
Weighted CE: higher weight for rare classes (e.g., tumour vs background)
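
To illustrate why Dice helps with class imbalance, here is a small NumPy sketch for the binary case. The helper names, the smoothing constant `eps`, and the toy 64×64 mask are assumptions made for this example.

```python
import numpy as np


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss, 1 - 2|A∩B| / (|A| + |B|), on predicted probabilities."""
    inter = np.sum(probs * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(target) + eps))


def bce_loss(probs, target, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over all pixels."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


# Toy mask: a 4×4 "tumour" (16 pixels) in a 64×64 image (4096 pixels)
target = np.zeros((64, 64))
target[30:34, 30:34] = 1.0

# A lazy prediction that calls everything background
all_background = np.full((64, 64), 0.01)

print("BCE :", round(bce_loss(all_background, target), 3))
print("Dice:", round(dice_loss(all_background, target), 3))
```

Cross-entropy stays small here because the 4080 background pixels dominate the average, while the Dice loss stays close to 1 because the foreground overlap is almost zero. That difference is one reason the two are often summed in practice.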

Key Architectures

FCN (2014): first fully convolutional — no FC layers, any input size
U-Net (2015): symmetric encoder-decoder + skip connections
DeepLab v3+ (2018): atrous convolutions + ASPP for multi-scale
SegFormer (2021): transformer-based encoder, MLP decoder

U-Net Architecture In-depth

Ronneberger et al. (2015) designed U-Net for biomedical image segmentation — but it became the universal segmentation architecture. The U-shaped design features a symmetric encoder (contracting path) and decoder (expanding path) connected by skip connections at every level.

The critical innovation: skip connections don't just add features (like ResNet) — they concatenate entire feature maps from encoder to decoder. This preserves the fine spatial detail (textures, edges) lost during downsampling. The decoder gets both the upsampled abstract features AND the original high-resolution details.

U-Net — Symmetric Encoder-Decoder with Skip Connections at Every Level

import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU()
        )
    def forward(self, x): return self.conv(x)

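class UNet(nn.Module):
    """Three-level U-Net. Input height and width must be divisible by 8 (three 2× poolings)."""
    def __init__(self, n_classes=2, in_channels=1):
        super().__init__()
        # Encoder (contracting path): channel count doubles at each level
        self.enc1 = DoubleConv(in_channels, 64)   # in_channels=1 for grayscale, 3 for RGB
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.pool  = nn.MaxPool2d(2)               # halves H and W
        # Bottleneck
        self.bottleneck = DoubleConv(256, 512)
        # Decoder (expanding path): each transposed conv doubles H and W
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)   # 512 = 256 (up) + 256 (skip)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)   # 256 = 128 (up) + 128 (skip)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)    # 128 = 64 (up) + 64 (skip)
        self.out_conv = nn.Conv2d(64, n_classes, 1)  # 1×1 conv: per-pixel class logits

    def forward(self, x):                                   # x: (N, in_channels, H, W)
        e1 = self.enc1(x)                                   # skip 1, full resolution
        e2 = self.enc2(self.pool(e1))                       # skip 2, 1/2 resolution
        e3 = self.enc3(self.pool(e2))                       # skip 3, 1/4 resolution
        b  = self.bottleneck(self.pool(e3))                 # 1/8 resolution
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1)) # concat skip along channels
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out_conv(d1)  # logits of shape (N, n_classes, H, W)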

U-Net Beyond Medical Imaging

U-Net's encoder-decoder + skip connection pattern became foundational far beyond segmentation. Stable Diffusion uses a U-Net (with attention layers) as its denoising backbone. The same architecture that segments tumours also generates images from text prompts.

Mask R-CNN & Instance Segmentation In-depth

Mask R-CNN (He et al., Facebook AI, 2017) extends Faster R-CNN with a simple but powerful addition: a third prediction head that outputs a pixel-level mask for each detected object. Three heads run in parallel after RoI features are extracted:

🏷️

Classification Head

FC layers → Softmax

Output: class label for this region

📦

Box Regression Head

FC layers → (Δx, Δy, Δw, Δh)

Output: refined bounding box

🎭

Mask Head (NEW)

FCN → 28×28 binary mask per class

Output: pixel-level mask for the object

A critical improvement: RoI Align replaced RoI Pooling. RoI Pooling quantises coordinates (snapping RoI boundaries and bins to the integer feature-map grid), causing spatial misalignment that classification tolerates but that badly degrades pixel-accurate masks. RoI Align samples the feature map with bilinear interpolation at exact floating-point locations: no quantisation, precise alignment.

Mask R-CNN — three prediction heads: class, box, and pixel mask

Why RoI Align Matters

RoI Pooling rounds RoI coordinates to the nearest feature-map cell, creating up to half a cell of misalignment. On a 7×7 feature map computed from a 224px image (stride 32), half a cell corresponds to 16px in the input: invisible for classification, catastrophic for pixel masks. RoI Align uses bilinear interpolation at exact floating-point positions, eliminating this quantisation error.
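
The difference between rounding and interpolating can be seen on a toy feature map. This sketch shows only the bilinear sampling primitive; a full RoI Align layer also averages several such samples per output bin. The function name and the 4×4 map are assumptions for illustration.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0                      # fractional offsets in [0, 1)
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


# Toy feature map where feat[y, x] = 4*y + x, so the exact value is known
feat = np.arange(16, dtype=float).reshape(4, 4)
y, x = 1.25, 2.25

print(bilinear_sample(feat, y, x))      # 7.25 (RoI Align style: exact position)
print(feat[round(y), round(x)])         # 6.0  (RoI Pooling style: snapped to the grid)
```

Because the toy map is linear in y and x, the interpolated value matches the true value 4·1.25 + 2.25 exactly, while the rounded lookup is off by more than one unit.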

Panoptic Segmentation In-depth

Panoptic segmentation (Kirillov et al., 2019) unifies semantic and instance segmentation into a single coherent output. Every pixel gets a class label. "Things" (countable: cars, people) also get unique instance IDs. "Stuff" (uncountable: sky, road, grass) gets class labels only.

"Things" (Instance)

Countable objects: person, car, dog, chair
Each instance gets a unique ID
car₁ ≠ car₂ even though both are "car"
Predicted by instance segmentation branch

"Stuff" (Semantic)

Amorphous regions: sky, road, grass, water
No instance distinction, just a class label
All sky pixels = "sky" (no sky₁, sky₂)
Predicted by semantic segmentation branch

Panoptic FPN (2019) shares one FPN backbone between an instance branch (Mask R-CNN style) and a semantic branch, then merges their outputs. Mask2Former (2022) goes further with a single decoder: it treats all segments (things + stuff) as mask queries processed by a transformer decoder, and the same architecture reached SOTA on semantic, instance, and panoptic benchmarks at the time of publication.

Panoptic Quality (PQ)
PQ = SQ × RQ = (Σ_{(p,g)∈TP} IoU(p,g) / |TP|) × (|TP| / (|TP| + ½|FP| + ½|FN|))
SQ = Segmentation Quality (average IoU of matched segments). RQ = Recognition Quality (F1 of matching). PQ combines both into a single metric.
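
A small worked example of the formula, assuming the matching between predicted and ground-truth segments has already been done (in the PQ definition a pair counts as a match when its IoU exceeds 0.5). The function name and the numbers are illustrative.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from the IoUs of matched segments and the FP/FN counts."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                       # average IoU of the matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)      # F1-style recognition term
    return sq * rq, sq, rq


# Three matched segments, one spurious prediction, one missed ground-truth segment
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))  # 0.6 0.8 0.75
```

Here the masks that were found are good (SQ = 0.8), but one false positive and one false negative pull RQ down to 0.75, so PQ lands at 0.6.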

SAM: Segment Anything Model In-depth

Kirillov et al. (Meta AI, 2023) built a foundation model for segmentation. Trained on SA-1B (11 million images with 1.1 billion segmentation masks), SAM can segment any object in any image from a simple prompt: click a point, draw a box, or supply a rough mask. Free-text prompts were explored in the paper only as a proof of concept.

🖼️

Image Encoder

MAE pre-trained ViT-H (Vision Transformer Huge)

Encodes image into embedding
256×64×64 feature map
Heavy: ~100ms (done once)

👆

Prompt Encoder

Encodes user prompts as tokens:

Point click: segment object at that point
Bounding box: segment within the box
Mask: refine an existing mask
Text: explored in the paper as a proof of concept, not part of the released model

🎭

Mask Decoder

Lightweight transformer decoder

Outputs 3 candidate masks
Whole object / part / subpart
Fast: ~50ms per prompt
Interactive — prompt → mask instantly

SAM — Image Encoder (ViT) + Prompt Encoder + Lightweight Mask Decoder

SA-1B — The Largest Segmentation Dataset

SAM was trained on 1.1 billion masks across 11 million images, about 400× more masks than any previous segmentation dataset. The data engine used SAM itself in a loop: model assists human annotators → annotators correct → model improves → repeat. This "model-in-the-loop" approach is now common for building large-scale datasets.

SAM Limitations

SAM excels at segmenting arbitrary objects but does NOT classify them. It tells you "here's an object boundary" but not "this is a cat." For applications needing both segmentation and classification, combine SAM with a classifier or use specialised models like Mask R-CNN or Mask2Former.

Chapter 6.4 — Summary

Semantic segmentation: every pixel gets a class label — same mask for all instances of a class
Instance segmentation: each object gets a separate mask + class + confidence — distinguishes individual objects
Panoptic segmentation: unifies semantic (stuff) + instance (things) — complete scene understanding
U-Net: symmetric encoder-decoder with skip connections (copy + concat) — preserves spatial detail; also used in Stable Diffusion
Mask R-CNN: Faster R-CNN + 28×28 mask head + RoI Align (no quantisation) for pixel-accurate masks; the standard instance-segmentation baseline for years after 2017
Panoptic Quality: PQ = SQ × RQ — single metric combining segmentation accuracy and recognition accuracy
SAM: foundation model that segments anything from a point click, trained on 1.1B masks, interactive at ~50ms per prompt once the image is encoded
SAM2 (2024): extends to video — track and segment objects across frames with interactive prompts

6.5

Chapter 6.5

Generative Adversarial Networks

Two neural networks locked in an adversarial game — one forges, the other detects. From blurry 64×64 bedrooms to photorealistic 1024×1024 faces, GANs evolved from an elegant mathematical idea into one of the most impactful generative frameworks in computer vision.

GAN Fundamentals In-depth

Goodfellow et al. (2014) introduced one of the most cited frameworks in deep learning: two networks in competition. The Generator G takes random noise z ~ N(0,I) and produces fake images G(z). The Discriminator D receives an image (real or fake) and outputs the probability that it's real.

Training alternates: update D for k steps (improve its detection ability), then update G for 1 step (improve its forgery). At Nash equilibrium, G produces perfect fakes and D outputs 0.5 for everything — it literally cannot tell real from fake.

GAN Minimax Objective
min_G max_D V(D,G) = 𝔼_x[log D(x)] + 𝔼_z[log(1 − D(G(z)))]
Generator wants: D(G(z)) → 1 (fool discriminator).
Discriminator wants: D(x) → 1, D(G(z)) → 0 (classify correctly).

GAN Training Loop — adversarial updates between Generator and Discriminator

GAN Loss Functions Core

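The original GAN loss has two critical failure modes. Vanishing gradients: when D becomes too good, D(G(z)) ≈ 0, so the generator's term log(1 − D(G(z))) sits near log(1) = 0, where the curve is flat and G receives almost no gradient. In practice G is therefore usually trained to maximise log D(G(z)) instead (the non-saturating loss). Mode collapse: G finds a few outputs that fool D and keeps generating only those, ignoring the rest of the distribution.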
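
The vanishing-gradient effect can be checked numerically. This sketch assumes D ends in a sigmoid, so D(G(z)) = sigmoid(a) for some logit a, and compares the gradient of the minimax generator loss with that of the commonly used non-saturating alternative. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# a = -10 means D is almost certain the sample is fake: D(G(z)) ≈ 0.00005
for logit in (-10.0, -2.0, 0.0):
    print(f"logit={logit:6.1f}  "
          f"|saturating grad|={abs(grad_saturating(logit)):.5f}  "
          f"|non-saturating grad|={abs(grad_non_saturating(logit)):.5f}")
```

When the discriminator is confident (very negative logit), the minimax gradient shrinks towards zero while the non-saturating one stays close to 1, which is exactly when G most needs a learning signal.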

📉

Original GAN Loss

JS divergence-based.

Vanishes when distributions don't overlap
Unstable training dynamics
Loss curves not meaningful

🌊

WGAN (2017)

Wasserstein (Earth Mover's) distance.

Gradient never vanishes — even with no distribution overlap
D must be 1-Lipschitz
WGAN-GP: gradient penalty for stability

📐

LSGAN

Least-squares loss for D.

MSE instead of log-likelihood
Penalises samples far from boundary
More stable, less mode collapse

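WGAN Loss
L_D = 𝔼[D(G(z))] − 𝔼[D(x)] (minimise for D, i.e. maximise 𝔼[D(x)] − 𝔼[D(G(z))]; no log!)
L_G = −𝔼[D(G(z))] (minimise for G)
Constraint: D must be 1-Lipschitz (‖∇D‖ ≤ 1 everywhere). WGAN-GP adds a gradient penalty to L_D: λ · 𝔼_x̂[(‖∇_x̂ D(x̂)‖₂ − 1)²], where x̂ is sampled on straight lines between real and generated samples.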
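
A toy NumPy sketch of these losses, using the sign convention in which the critic minimises L_D = 𝔼[D(G(z))] − 𝔼[D(x)]. To avoid needing automatic differentiation, the critic here is assumed to be linear, D(x) = x·w, so its gradient with respect to x is simply w everywhere. A real WGAN-GP evaluates that gradient with autograd at points interpolated between real and fake samples. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(2.0, 1.0, size=(256, 2))   # stand-in for real samples
fake = rng.normal(0.0, 1.0, size=(256, 2))   # stand-in for generator output G(z)


def critic(x, w):
    """Linear critic D(x) = x · w (unbounded score, no sigmoid)."""
    return x @ w


def wgan_gp_losses(w, lam=10.0):
    d_real, d_fake = critic(real, w), critic(fake, w)
    grad_norm = np.linalg.norm(w)                 # ||∇_x D|| is constant for a linear critic
    penalty = lam * (grad_norm - 1.0) ** 2        # gradient penalty term
    loss_d = d_fake.mean() - d_real.mean() + penalty   # critic minimises this
    loss_g = -d_fake.mean()                             # generator minimises this
    return float(loss_d), float(loss_g), float(penalty)


w_unit = np.array([1.0, 1.0]) / np.sqrt(2.0)      # unit-norm critic: penalty is ~0
print(wgan_gp_losses(w_unit))
print(wgan_gp_losses(3.0 * w_unit))               # gradient norm 3: penalty is ~40
```

Scaling the critic's weights makes the raw score gap larger, but the penalty grows quadratically with the gradient norm, which is how WGAN-GP softly enforces the 1-Lipschitz constraint.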

DCGAN Core

Radford et al. (2015) established the first stable recipe for CNN-based GANs. Before DCGAN, most GAN experiments produced noise or collapsed. DCGAN's design rules became gospel for all subsequent work:

DCGAN Design Rules

Replace pooling with strided convolutions (D) and transposed convolutions (G)
BatchNorm in both G and D (except G output and D input)
Remove fully connected hidden layers
ReLU in G (except output: Tanh), LeakyReLU in D

DCGAN Results

Generated 64×64 bedroom images — first realistic CNN-generated images
Meaningful latent space: interpolating between z vectors = smooth visual transitions
"Smiling woman" − "neutral woman" + "neutral man" = "smiling man"
Proved CNNs could learn rich image priors unsupervised

DCGAN Generator — noise vector upsampled through transposed convolutions to image

Conditional GAN Core

Vanilla GANs generate random images — you have no control over what comes out. Mirza & Osindero (2014) fixed this by feeding a condition label y to both G and D. Now G(z, y) generates an image of class y, and D(x, y) verifies it's a real image of that class.

Conditional GAN — class label y conditions both generator and discriminator

pix2pix — Image-to-Image Translation

Isola et al. (2016) used conditional GANs for paired image-to-image translation: sketch → photo, edges → handbag, day → night, satellite → map. The condition is the input image itself. pix2pix showed that conditional GANs are a universal framework for image transformation tasks.

StyleGAN In-depth

Karras et al. (NVIDIA, 2019) produced the first truly photorealistic synthetic faces. The key insight: instead of feeding the latent code z directly into the first layer, map it to an intermediate style vector w that controls appearance at each resolution level.

StyleGAN Innovations

Mapping network: z → w (8-layer FC) — less entangled intermediate latent space
AdaIN: inject style w at each resolution level via Adaptive Instance Normalisation
Progressive growing (inherited from ProGAN): train at 4×4, gradually grow to 1024×1024
Stochastic variation: per-pixel noise at each layer for fine details (hair, freckles)
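
The AdaIN operation in the list above can be sketched in a few lines of NumPy: normalise each channel of a feature map, then rescale and shift it with per-channel style parameters. In StyleGAN those parameters come from a learned affine transform of w; here they are hand-picked numbers, and the function name is an assumption for illustration.

```python
import numpy as np


def adain(x, style_scale, style_bias, eps=1e-5):
    """Adaptive Instance Normalisation.

    x           : (C, H, W) feature maps of one image
    style_scale : (C,) per-channel scale derived from the style vector
    style_bias  : (C,) per-channel bias derived from the style vector
    """
    mean = x.mean(axis=(1, 2), keepdims=True)     # per-channel statistics
    std = x.std(axis=(1, 2), keepdims=True)
    normalised = (x - mean) / (std + eps)
    return style_scale[:, None, None] * normalised + style_bias[:, None, None]


rng = np.random.default_rng(0)
features = rng.normal(5.0, 3.0, size=(4, 8, 8))   # arbitrary content statistics
scale = np.array([1.0, 2.0, 0.5, 1.5])
bias = np.array([0.0, -1.0, 3.0, 0.5])

out = adain(features, scale, bias)
print(out.mean(axis=(1, 2)).round(3))   # approximately equal to bias
print(out.std(axis=(1, 2)).round(3))    # approximately equal to scale
```

After AdaIN, each channel's mean and standard deviation are dictated by the style, not by the incoming features, which is what lets a different w be injected at every resolution.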

StyleGAN Evolution

StyleGAN (2019): 1024×1024 photorealistic faces, FFHQ dataset
StyleGAN2 (2020): removes AdaIN artefacts, path length regularisation
StyleGAN3 (2021): alias-free — proper translation/rotation equivariance
StyleGAN-XL (2022): scaled to ImageNet-level diversity

StyleGAN — Mapping network + per-resolution style injection via AdaIN

CycleGAN Core

pix2pix requires paired training data — the exact same scene in both domains (e.g., the same street in day and night). This is often impossible to collect. Zhu et al. (2017) solved this with CycleGAN: learn translation between domains using only unpaired examples.

The trick: cycle consistency. Two generators (G_AB: A→B, G_BA: B→A) must satisfy G_BA(G_AB(a)) ≈ a. If you translate a horse to a zebra and back, you should get the original horse. This constraint prevents the generators from hallucinating arbitrary outputs.
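
A toy illustration of the cycle-consistency term, using an L1 penalty on both round trips (A→B→A and B→A→B). The "generators" here are simple invertible functions standing in for neural networks; all names and data are hypothetical.

```python
import numpy as np


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """Mean L1 error of the two round trips A→B→A and B→A→B."""
    forward = np.mean(np.abs(g_ba(g_ab(batch_a)) - batch_a))
    backward = np.mean(np.abs(g_ab(g_ba(batch_b)) - batch_b))
    return float(forward + backward)


rng = np.random.default_rng(0)
batch_a = rng.random((8, 3, 4, 4))      # stand-in for images from domain A
batch_b = rng.random((8, 3, 4, 4))      # stand-in for images from domain B


def g_ab(a):
    return 2.0 * a + 1.0                # toy "A→B translation"


def g_ba_good(b):
    return (b - 1.0) / 2.0              # exact inverse of g_ab


def g_ba_bad(b):
    return np.zeros_like(b)             # ignores its input entirely


print(cycle_consistency_loss(g_ab, g_ba_good, batch_a, batch_b))  # close to 0
print(cycle_consistency_loss(g_ab, g_ba_bad, batch_a, batch_b))   # clearly positive
```

A generator pair that ignores its input is penalised heavily, while a pair that can undo each other's translation is not. In the full CycleGAN objective this term is added to the usual adversarial losses for both domains.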

CycleGAN — cycle consistency ensures translation is invertible (no paired data)

CycleGAN Applications

Horse ↔ zebra, summer ↔ winter, photo ↔ Monet painting, day ↔ night, apple ↔ orange. CycleGAN works on any two unpaired image domains. The cycle consistency loss is also used in unsupervised machine translation (text) and audio style transfer.

GAN Training Challenges In-depth

🔄

Mode Collapse

G produces limited variety — finds a few modes that fool D, ignores the rest.

"Same face no matter what z is"
Fix: mini-batch discrimination
Fix: unrolled GANs, WGAN

⚖️

Training Instability

D and G must stay balanced — if one dominates, the other gets no gradient.

Loss curves are NOT meaningful
Fix: spectral normalisation
Fix: separate learning rates for D and G (two time-scale update rule, TTUR)

📏

Evaluation Challenge

How to measure "realistic and diverse"?

FID: Fréchet Inception Distance
Lower FID = better quality + diversity
Needs many samples for a stable estimate (50,000 is a common choice)
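
The Fréchet distance behind FID compares two Gaussians fitted to feature vectors: ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The sketch below applies that formula to random stand-in features; the real metric uses Inception-v3 activations of real and generated images, which are not computed here. The function name is an assumption.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(2000, 4))           # stand-in for features of real images
shifted_feats = real_feats + 3.0                  # same spread, different mean

print(round(frechet_distance(real_feats, real_feats), 6))     # ~0: identical distributions
print(round(frechet_distance(real_feats, shifted_feats), 3))  # ~36 = 4 dims × 3²
```

Both the mean and the covariance enter the score, so a generator is penalised for low quality (shifted features) as well as for low diversity (collapsed covariance).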

Technique	Problem	How It Works	Used In
WGAN / WGAN-GP	Vanishing gradients	Wasserstein distance + gradient penalty	Most modern GANs
Spectral Normalisation	Training instability	Constrain weight matrix spectral norm	SN-GAN, BigGAN
Progressive Growing	High-res training	Start 4×4, gradually increase resolution	ProGAN, StyleGAN
Mini-batch Discrimination	Mode collapse	Pass statistics across batch to D	Original GAN improvements
Label Smoothing	Overconfident D	D target = 0.9 not 1.0	Standard practice

GANs vs Diffusion Models

By 2022, diffusion models had largely replaced GANs for image generation. Diffusion models are more stable to train, far less prone to mode collapse, and achieve better FID scores. GANs remain relevant for real-time generation (single forward pass vs diffusion's iterative denoising) and image-to-image translation (CycleGAN, pix2pix).

Chapter 6.5 — Summary

GAN: Generator fools Discriminator; D trains to detect fakes — minimax game with Nash equilibrium at D(G(z)) = 0.5
WGAN: replaces JS divergence with Wasserstein distance — solves vanishing gradients; WGAN-GP adds gradient penalty
DCGAN: strided convolutions + BatchNorm + LeakyReLU = first stable image GAN; established CNN-GAN design rules
Conditional GAN: label y conditions both G and D — enables class-conditional generation and pix2pix image translation
StyleGAN: mapping network z → w + AdaIN style injection per resolution = first photorealistic 1024² face generation
CycleGAN: unpaired image translation via cycle consistency loss — G_BA(G_AB(a)) ≈ a; no paired data needed
Mode collapse: G generates limited variety; training instability: D/G balance is fragile; FID: standard evaluation metric
Modern trend: diffusion models replacing GANs for generation quality; GANs remain relevant for real-time generation and image-to-image translation

6.6

Chapter 6.6

Vision Transformers (ViT) & Modern Architectures

In 2020, a single idea flipped computer vision upside-down: what if we treated an image as a sequence of patches — just like tokens in a sentence? The Vision Transformer proved that convolutions are not necessary. Pure self-attention, given enough data and compute, learns to see.

CNN Limitations That Motivated ViT Core

CNNs have two hard-wired inductive biases: locality (convolutions see only a small neighbourhood) and translation equivariance (same filter applied everywhere). These are powerful priors — but they also limit expressiveness.

A 3×3 conv at layer 1 sees only 9 pixels. To see the whole image, information must pass through many layers of pooling — getting progressively diluted. Transformers have no such constraints: self-attention connects every patch to every other patch in a single operation. Global context from layer 1.

CNN Strengths

→ Data-efficient (inductive biases help)
→ Translation equivariant by design
→ Computationally efficient: O(HW), not O((HW)²)
→ Works well on small datasets
→ Fast inference on edge devices

CNN Limitations

→ Local receptive field by default
→ Long-range dependencies require many layers
→ Fixed spatial hierarchy (downsampling loses info)
→ Inductive biases may hurt on novel domains
→ Hard to model global context at early layers

ViT: An Image is Worth 16×16 Words In-depth

Dosovitskiy et al. (Google Brain, 2020) applied the standard Transformer encoder — unchanged from NLP — directly to images. The trick: divide the image into fixed-size non-overlapping patches, flatten each patch, linearly project it, and treat the resulting sequence exactly like word tokens.

1. Patch
224² → 196 patches (16×16)

2. Flatten
16×16×3 = 768 values each

3. Embed
Linear: 768→D + pos embed

4. [CLS] + Transformer
12 layers, global attention

5. Classify
MLP on [CLS] → class

ViT Patch Embedding
Patches: x_p ∈ ℝ^{N×(P²·C)}, where N = HW/P² = 196 for H = W = 224 and P = 16
z₀ = [x_cls ; x_p¹E ; x_p²E ; ... ; x_p^N E] + E_pos
E ∈ ℝ^{(P²·C)×D} is the patch projection; E_pos ∈ ℝ^{(N+1)×D} holds learnable 1D position embeddings (2D-aware embeddings were not needed).
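
The patch-embedding formula can be traced shape by shape in NumPy. This is a sketch with random weights standing in for the learned E, x_cls and E_pos; only the tensor shapes and the patch ordering are the point here.

```python
import numpy as np

H = W = 224   # image size
P = 16        # patch size
C = 3         # channels
D = 768       # embedding dimension

rng = np.random.default_rng(0)
image = rng.random((H, W, C), dtype=np.float32)            # stand-in image, (H, W, C)

# (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
patches = (image.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))

E = rng.normal(0.0, 0.02, size=(P * P * C, D)).astype(np.float32)      # patch projection
x_cls = np.zeros((1, D), dtype=np.float32)                             # [CLS] token
E_pos = rng.normal(0.0, 0.02, size=(patches.shape[0] + 1, D)).astype(np.float32)

# z0 = [x_cls ; x_p E] + E_pos
z0 = np.concatenate([x_cls, patches @ E], axis=0) + E_pos

print(patches.shape)  # (196, 768)
print(z0.shape)       # (197, 768)
```

With P = 16 and C = 3 the flattened patch length happens to equal 768, the same as the ViT-Base embedding dimension, but the projection E is still a learned matrix and not an identity.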

ViT Pipeline — 16×16 patches → 196 tokens → Transformer → class label

ViT Attention Maps — self-attention captures global structure immediately

import torch
import timm

# Load pre-trained ViT-Base/16 (ImageNet-21k → ImageNet-1k)
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # 86,567,656
print(f"Patch size: {model.patch_embed.patch_size}")                   # (16, 16)
print(f"Num patches: {model.patch_embed.num_patches}")                 # 196
print(f"Embedding dim: {model.embed_dim}")                             # 768
print(f"Num heads: {model.blocks[0].attn.num_heads}")                  # 12
print(f"Num layers: {len(model.blocks)}")                              # 12

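# Inference
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Build preprocessing from the checkpoint's own config: resize, crop and, importantly,
# the mean/std the weights were trained with. Hard-coding ImageNet statistics can
# silently hurt accuracy when a checkpoint was trained with different ones.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = transform(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)                               # (1, 1000)
    probs = logits.softmax(-1)
    top5 = probs.topk(5)
    print("Top-5 class indices:", top5.indices[0].tolist())
    print("Top-5 probabilities:", top5.values[0].tolist())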

ViT Training Requirements Core

The original paper's most critical finding: trained on ImageNet-1k alone, ViT-Base/16 reached only 77.9% and the larger ViT-Large/16 did worse still (76.5%), behind comparably sized ResNets trained the same way. But pre-trained on JFT-300M (Google's internal 300M-image dataset), ViT-Huge/14 reached 88.55%, edging past the best CNNs of the time. The conclusion: ViT needs massive data to overcome its lack of inductive biases.

ViT vs CNN — ViT overtakes CNNs at ~14M pre-training images

Why ViT Needs More Data

CNNs have built-in priors: locality and translation equivariance guide learning even with limited data. ViT must learn these properties from data. With enough examples, ViT discovers even better representations — but on small datasets, it overfits to training patterns without discovering general visual structure.

DeiT: Data-Efficient Image Transformers In-depth

Touvron et al. (Facebook, 2021) asked: can we train ViT on ImageNet-1k without Google's proprietary JFT-300M? The answer: yes, with three key tricks — knowledge distillation from a CNN teacher, aggressive data augmentation, and a careful training recipe.

DeiT adds a second special token: [DIST] (distillation token) alongside the standard [CLS]. [DIST] is trained to match the CNN teacher's predictions, while [CLS] learns from ground-truth hard labels. At inference, the two heads' outputs are averaged. Result: DeiT-Base reaches 81.8% on ImageNet-1k even without distillation and 83.4% with it, versus 77.9% for the original ViT-Base trained on the same data.

DeiT Distillation — [DIST] token learns from CNN teacher alongside [CLS]

Swin Transformer In-depth

Liu et al. (Microsoft, 2021) solved ViT's two biggest problems for dense prediction tasks: fixed single-scale output and quadratic attention cost. Swin Transformer (ICCV 2021 Best Paper) introduced window attention and hierarchical feature maps — making transformers practical for detection and segmentation.

Window Attention

Compute self-attention within local 7×7 windows instead of globally.

Cost: O(M²·HW) in total for M×M windows, versus O((HW)²) for global attention
Linear complexity in image size
Shifted windows alternate each layer
Cross-boundary info flow restored
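
A quick count of query-key pairs shows why window attention scales linearly. The sketch assumes a square token grid whose side is divisible by the window size M = 7; the function names are illustrative, and constant factors such as channel width and head count are ignored.

```python
def attention_pairs_global(h, w):
    """Every token attends to every token: (HW)² pairs."""
    n = h * w
    return n * n


def attention_pairs_windowed(h, w, m=7):
    """Tokens attend only inside their M×M window: (HW / M²) windows × (M²)² pairs."""
    assert h % m == 0 and w % m == 0, "grid must be divisible by the window size"
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2


for side in (56, 112):
    g = attention_pairs_global(side, side)
    win = attention_pairs_windowed(side, side)
    print(f"{side}×{side} tokens: global={g:,}  windowed={win:,}  ratio={g // win}×")
```

For a 56×56 grid this gives 9,834,496 pairs globally versus 153,664 with windows. Doubling the side length multiplies the global count by 16 but the windowed count by only 4, the same factor by which the number of tokens grows.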

Hierarchical Features

Multi-scale feature maps like ResNet's pyramid.

Stage 1: 56×56 (fine, small objects)
Stage 2: 28×28 → Stage 3: 14×14
Stage 4: 7×7 (coarse, large objects)
Plug directly into FPN for detection

Swin Transformer — window attention (Layer ℓ) and shifted window (Layer ℓ+1)

ConvNeXt: A ConvNet for the 2020s Core

Liu et al. (Facebook, 2022) asked: "What if we took ResNet and applied every Transformer design decision?" Starting from ResNet-50, they applied 7 systematic changes. The result: ConvNeXt-Base: 83.8% — matching Swin-Base (83.5%) without any attention mechanism.

ConvNeXt Block — ResNet + Transformer design principles

Architecture Showdown In-depth

Model	Inductive Bias	Attention	Hierarchy	Best For	Params	Top-1
ResNet-50	Strong (conv)	None	Yes	Small data, fast inference	25M	76.1%
EfficientNetV2-M	Strong (conv+NAS)	None	Yes	Efficient production	54M	85.1%
ViT-Base/16	None	Global	No	Large-scale pre-training	86M	77.9% (ImageNet-1k only)
DeiT-Base/16	Weak (distill)	Global	No	ImageNet-scale tasks	86M	81.8%
Swin-Base	Weak (window)	Local+shift	Yes	Detection, segmentation	88M	83.5%
ConvNeXt-Base	Moderate (conv)	None	Yes	All-around modern CNN	89M	83.8%
ViT-L/16 (MAE)	None	Global	No	Large-scale SOTA	307M	85.9%
SAM (ViT-H)	None	Global	No	Promptable zero-shot segmentation	641M	—

Accuracy vs Throughput — finding the best architecture for your constraints

There is no single "best" architecture in 2024. For classification accuracy: ViT-Large with MAE pre-training. For efficiency: ConvNeXt or EfficientNetV2. For detection/segmentation: Swin or ConvNeXt backbone. For multimodal tasks: ViT dominates, because its patch tokens can be fed to a language model in the same way as word tokens.

Chapter 6.6 — Summary

ViT splits image into 196 non-overlapping 16×16 patches, treats them as a token sequence for a standard Transformer encoder
Self-attention is global from layer 1 — no locality constraint unlike CNNs; every patch sees every other patch directly
ViT needs massive pre-training data: outperforms CNNs only above ~14M images; on small data, CNNs still win
DeiT: trains ViT on ImageNet-1k via knowledge distillation from a CNN teacher + [DIST] token — 81.8% vs 77.9%
Swin Transformer: window attention (linear cost) + shifted windows + hierarchical feature maps = best backbone for detection/segmentation
ConvNeXt: ResNet updated with 7 Transformer design choices — matches Swin without any attention (83.8% vs 83.5%)
No single best architecture: ViT for accuracy, ConvNeXt for efficiency, Swin for dense prediction, ViT for multimodal

6.7

Chapter 6.7

Multimodal AI — CLIP, DALL-E & Vision-Language Models

For decades, vision and language models lived in separate worlds. CLIP changed that in 2021 by learning a shared embedding space where "a photo of a dog" and an actual photo of a dog sit close together. One model. Two modalities. Zero task-specific training.

What Is Multimodal AI? Core

Unimodal models process only one type of data — a text-only LLM or an image-only CNN. Multimodal models process and relate multiple data types simultaneously: text + image, image + audio, video + text. Real-world tasks are inherently multimodal — "describe what's in this photo" requires both vision and language.

🖼️→📝

Vision → Language

Image captioning (BLIP, CoCa)
Visual QA — "What colour is the car?"
Document parsing (GPT-4V, DocVQA)
Medical image report generation
Chart and figure understanding

📝→🖼️

Language → Vision

Text-to-image (DALL-E 3, SDXL, Midjourney)
Image editing by text instruction
Text-guided inpainting
Text-to-video (Sora, Runway)
Text-to-3D generation

💬🖼️→📝

Text + Image → Text

Multimodal chat (GPT-4V, LLaVA, Gemini)
"Explain this chart step by step"
Visual code debugging
Document Q&A with screenshots

🔍↔️

Text ↔ Image Retrieval

Find images matching a text query (CLIP)
Find text matching an image
Pinterest visual search, Google Lens
Open-vocabulary detection (GLIP, OWL-ViT)

CLIP: Contrastive Language-Image Pre-training In-depth

Radford et al. (OpenAI, 2021) trained two encoders — one for images, one for text — jointly on 400 million (image, text) pairs scraped from the internet. No manual labels: web captions ARE the supervision. The goal: learn a shared embedding space where matching image-text pairs are close and non-matching pairs are far apart.

Image Encoder

ResNet-50 or ViT (ViT-B/32, ViT-L/14)
Input: 224×224 RGB image
Output: d-dim embedding (e.g., 512 or 1024)
Projected to shared embedding space via linear layer

Text Encoder

Transformer with masked self-attention (GPT-style)
Input: text caption (up to 77 tokens)
Output: d-dim embedding
[EOS] token representation projected to shared space

CLIP InfoNCE Contrastive Loss
L = −(1/2N) Σᵢ [ log( exp(sᵢᵢ/τ) / Σⱼ exp(sᵢⱼ/τ) ) + log( exp(sᵢᵢ/τ) / Σⱼ exp(sⱼᵢ/τ) ) ], with sᵢⱼ = sim(Iᵢ, Tⱼ)
sim(I, T) = (I·T) / (‖I‖·‖T‖) (cosine similarity)
τ = learnable temperature. Maximise similarity of the N matching pairs, minimise it for the N²−N non-matching pairs. Actual batch size: 32,768 pairs.
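
The symmetric contrastive loss can be sketched in NumPy as two cross-entropies over the rows and columns of the similarity matrix. This is an illustrative implementation with a fixed temperature (0.07 is used here as an assumed starting value; in CLIP it is learned), and the function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)           # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for N paired (image, text) embeddings of shape (N, d)."""
    # L2-normalise so that dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature     # (N, N), entry [i, j] = sim(I_i, T_j) / τ
    diag = np.arange(logits.shape[0])                 # matching pairs sit on the diagonal

    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # each text picks its image
    return float(0.5 * (loss_i2t + loss_t2i))


aligned = np.eye(8)                                   # 8 perfectly matched toy pairs
mismatched = np.roll(np.eye(8), 1, axis=0)            # every text paired with the wrong image

print(clip_loss(aligned, aligned))      # close to 0
print(clip_loss(aligned, mismatched))   # large
```

With N pairs in a batch, each image has one positive and N − 1 negatives, which is why very large batches make the task harder and the learned embeddings sharper.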

CLIP Contrastive Training — align matching image-text pairs, separate non-matching

What CLIP Enables In-depth

🎯

Zero-Shot Image Classification

Write text prompts: "a photo of a {class}" for each class. Embed all prompts. Compare with image embedding via cosine similarity. Nearest = predicted class.

76.2% top-1 on ImageNet (best model, ViT-L/14 at 336px) with zero task-specific training
Competitive with supervised ResNet-50 (76.1%)

🔍

Visual Similarity Search

Embed a text query, find images by cosine similarity in shared space.

Pinterest visual search, Google Lens
Stock photo search by description
Medical image retrieval

🌐

Open-Vocabulary Detection

Combine CLIP with detection models → detect ANY category described in text.

OWL-ViT, GLIP, Grounding DINO
No retraining for new classes
"Find all red objects in this image"

🗂️

Data Filtering & Curation

Filter web-crawled images by semantic content using text queries.

Used to build LAION-5B (5B image-text pairs)
LAION subsets were the training data for Stable Diffusion
CLIP score filters low-quality pairs

CLIP Zero-Shot Classification — no retraining, just text prompts for each class

import torch
import open_clip
from PIL import Image

# Load pre-trained CLIP model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

# Load and preprocess image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # (1, 3, 224, 224)

# Zero-shot classification: define class prompts
classes = ["cat", "dog", "car", "bird", "elephant"]
text_prompts = [f"a photo of a {c}" for c in classes]
text_tokens = tokenizer(text_prompts)                   # (5, 77) token sequences

with torch.no_grad():
    image_features = model.encode_image(image)           # (1, 512)
    text_features  = model.encode_text(text_tokens)      # (5, 512)

    # Normalise to unit vectors before cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features  /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity scaled by 100 (temperature), then softmax over classes:
    # (1,512) @ (512,5) → (1,5) class probabilities
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f"{cls:12s}: {prob.item():.3f}")
# Illustrative output (exact values depend on the image):
# cat         : 0.823 ← highest
# dog         : 0.091
# car         : 0.031
# bird        : 0.034
# elephant    : 0.021

DALL-E 1, 2, and 3 In-depth

🔢

DALL-E 1 (2021)

Text → BPE tokens (256) + VQ-VAE image tokens (1024)
12B-param autoregressive Transformer
Predicts image tokens sequentially
Creative combinations but limited fidelity
Slow: sequential generation

🌊

DALL-E 2 (2022)

Text → CLIP text embedding → Prior → CLIP image embedding
Diffusion decoder (unCLIP) → high-res image
Much higher fidelity than DALL-E 1
Supports image variations (encode → re-decode)
Uses CLIP embeddings as bridge

📝

DALL-E 3 (2023)

Key innovation: highly descriptive synthetic captions
Re-captioned all training data with detailed descriptions
Much better instruction following
Integrated into ChatGPT
Handles complex prompts faithfully

DALL-E Architecture Evolution — autoregressive tokens → CLIP-guided diffusion

Vision-Language Models (VLMs) In-depth

VLMs accept both images and text as input and generate text as output. The core challenge: how do you connect a vision encoder to a language model? Three main approaches have emerged:

1. Feature Concatenation

Vision tokens prepended to text tokens before LLM.

LLM processes visual + text tokens together
Requires LLM pre-training on multimodal data
Related: Flamingo (injects visual features through gated cross-attention layers rather than concatenation)
Limitation: LLM must learn vision interpretation

2. Projector / Adapter

MLP projector bridges vision encoder → LLM input space.

Most common approach
Freeze LLM (or LoRA), train only projector
Examples: LLaVA-1.5, InternVL, Qwen-VL
Efficient: minimal new parameters

3. Native Multimodal

Vision and language trained together from scratch.

Best cross-modal reasoning
Most expensive to train
Examples: Gemini, GPT-4o (internals not publicly disclosed)
Unified architecture — no connector needed

VLM Architecture — ViT + MLP Projector + LLM (e.g., LLaVA-1.5 design)
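The projector approach can be sketched in a few lines of PyTorch. This is a minimal illustration, not the LLaVA implementation: `VisionProjector` is a hypothetical class name, the two-layer MLP with GELU is an assumption modelled on the LLaVA-1.5 design, and the dimensions are shrunk so the example runs quickly (a real setup would map something like 1024-d vision features into a 4096-d LLM embedding space).

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)       # (B, num_patches, llm_dim)

torch.manual_seed(0)
projector = VisionProjector(vision_dim=64, llm_dim=128)
patch_features = torch.randn(2, 576, 64)      # 576 = (336 / 14)^2 patches per image
text_embeds = torch.randn(2, 16, 128)         # stand-in for embedded prompt tokens

visual_tokens = projector(patch_features)                     # (2, 576, 128)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # (2, 592, 128)
print(llm_inputs.shape)
```

Only the projector's parameters need gradients when the vision encoder and LLM are frozen, which is what makes this approach cheap to train.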

GPT-4V, Gemini & Open VLMs Core

🌐

GPT-4V / GPT-4o (OpenAI)

First frontier VLM (2023)
Read documents, analyse charts
Solve visual math, describe scenes
GPT-4o: native voice + vision + text
128K token context

💎

Gemini 1.5 Pro (Google)

Natively multimodal from pre-training
1M token context window
Process 1 hour of video or 1000 images
Images, video, audio, code, text
Best for long-document & video tasks

🦙

Open-Source VLMs

LLaVA-1.5: CLIP + MLP + Vicuna — strong baseline
InternVL-2: competitive with GPT-4V on benchmarks
Qwen-VL: multilingual, multi-image
PaliGemma: SigLIP + Gemma, efficient open model

Model	Architecture	Vision Encoder	LLM Base	Context	Notable
GPT-4V / 4o	Proprietary	Undisclosed	GPT-4	128K	Best overall, native voice+vision
Gemini 1.5 Pro	Native multimodal	Proprietary	Gemini	1M tokens	Long video, multi-image
Claude 3.5 Sonnet	Proprietary	Undisclosed	Claude 3.5	200K	Document analysis, charts
LLaVA-1.5	ViT+Projector	CLIP ViT-L/336	Vicuna-13B	4K	Strong open baseline
InternVL-2	ViT+MLP	InternViT-6B	InternLM2-20B	8K	Near-frontier open
Qwen-VL	ViT+Adapter	Qwen ViT	Qwen-7B	8K	Multilingual, multi-image
PaliGemma	ViT+Linear	SigLIP-So400M	Gemma-2B/9B	8K	Open, small, efficient

Multimodal Benchmarks Core

Multimodal Benchmark Tasks — chart QA, document QA, spatial reasoning, scene

Benchmark	Tests	Format	GPT-4V	Best Open	Human
VQAv2	General visual QA	Open-ended	77.2%	~75%	80.9%
TextVQA	Text in images	Open-ended	78.0%	76.1%	~85%
DocVQA	Document understanding	Open-ended	87.2%	82.6%	96%
ChartQA	Chart comprehension	Open-ended	78.5%	74.8%	80.5%
MMMU	University-level multimodal	MCQ	56.8%	49.3%	76.2–88.6% (experts)
MMBench	Comprehensive multimodal	MCQ	75.8%	72.4%	—

Benchmark Saturation

Many multimodal benchmarks are rapidly saturating: in the table above GPT-4V is within a few points of human performance on VQAv2 and ChartQA, and newer models keep closing the remaining gap on DocVQA. MMMU (college-level questions spanning 30 subjects across six disciplines) remains the most challenging, with GPT-4V's 56.8% still well below human expert performance. New, harder benchmarks (MMMU-Pro, MathVista) continue to appear.

Chapter 6.7 — Summary

CLIP: jointly trained image + text encoders via contrastive loss on 400M image-text pairs — shared embedding space
Shared embedding: "a photo of a dog" and a dog photo have similar vectors — the key to zero-shot capabilities
CLIP zero-shot: 76.2% on ImageNet with no task-specific training — just text prompts per class
DALL-E 1: autoregressive over VQ-VAE tokens; DALL-E 2: CLIP embeddings + diffusion decoder for higher fidelity; DALL-E 3: re-captioned training data for better instruction following
VLMs: vision encoder → MLP projector → LLM — visual tokens treated exactly like text tokens inside the language model
Frontier models: GPT-4V, Gemini 1.5 Pro (1M context), Claude 3.5 Sonnet lead on benchmarks
Open alternatives: LLaVA-1.5, InternVL-2, PaliGemma; the best open models trail GPT-4V by roughly 2 to 8 points on the benchmarks above
Key benchmarks: VQAv2, DocVQA, ChartQA, MMMU — MMMU remains hardest; most others nearly saturated

6.8

Chapter 6.8

Video Understanding & 3D Vision

A video is not just a sequence of images — it is time, motion, causality, and physics. 3D vision goes further: understanding the world as volumetric space, not flat projections. These are the hardest problems in computer vision, and also the most important for autonomous systems that must act in the real world.

The Video Challenge Core

Video adds the temporal dimension to images: motion, change, causality, events. Processing frames independently with image models misses all temporal information: you can't tell if a person is walking left or right. Adjacent frames are also highly redundant (often around 95% of pixels are near-identical), so naive per-frame processing wastes most of its compute.

⏱️

Temporal Modelling

Must capture short-term motion (running, gestures) AND long-term events (scoring a goal over 10 seconds). Both time scales matter.

💻

Computational Cost

30fps × 1 min = 1,800 frames. Can't process all at full resolution. Solutions: sampling, temporal pooling, sparse attention.

🔀

Temporal Alignment

Two videos of "making coffee" have the same steps in different order and timing. Models must be robust to temporal variation.

Video as Spatial-Temporal Volume — (T, C, H, W) tensor
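One of the cost-reduction strategies listed above, frame sampling, can be sketched as follows. This is a minimal illustration assuming uniform segment-centre sampling; `sample_frame_indices` is a hypothetical helper and the tiny spatial size is only there to keep the example light.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Split the video into num_samples equal segments and take each segment's centre frame."""
    edges = np.linspace(0, num_frames, num_samples + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

# 1 minute at 30fps = 1,800 frames, stored as a (T, C, H, W) array
video = np.zeros((1800, 3, 8, 8), dtype=np.uint8)
idx = sample_frame_indices(video.shape[0], num_samples=16)
clip = video[idx]                      # (16, 3, 8, 8): the frames a model would actually see
print(idx, clip.shape)
```

Uniform sampling preserves long-range structure but can miss fast motion between sampled frames, which is why it is often combined with short dense clips or temporal pooling.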

Optical Flow In-depth

Optical flow computes a dense per-pixel motion vector field between two consecutive frames. For each pixel (x,y) in frame t, it asks: where does this pixel move to in frame t+1? The result is an H×W×2 flow field (Δx, Δy per pixel). Used in action recognition, video stabilisation, compression, and motion segmentation.

Classical Methods

Lucas-Kanade (1981): sparse flow on corners/keypoints — fast, robust
Horn-Schunck (1981): dense regularised flow — smooth but slow
Farneback (2003): dense flow via polynomial expansion
All assume: brightness constancy + small motion

Deep Learning Methods

FlowNet (2015): first end-to-end CNN for optical flow
PWC-Net (2018): coarse-to-fine with cost volume
RAFT (2020): iterative refinement on a 4D correlation volume; SOTA at publication
RAFT generalises well across datasets and copes with large motion better than classical methods

Optical Flow — per-pixel motion vectors between consecutive frames
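The brightness-constancy and small-motion assumptions shared by the classical methods lead to the linear constraint Ix·u + Iy·v + It = 0 at every pixel. The sketch below solves that constraint in the least-squares sense over a single window, which is the core of Lucas-Kanade. It is a simplified illustration on a synthetic pattern: `make_frame` and `lucas_kanade_window` are hypothetical helpers, and a practical implementation would add image pyramids and per-keypoint windows.

```python
import numpy as np

def make_frame(shift_x=0.0, shift_y=0.0, size=64):
    """Smooth synthetic pattern, translated by (shift_x, shift_y) pixels."""
    y, x = np.mgrid[0:size, 0:size].astype(np.float64)
    return np.sin(0.2 * (x - shift_x)) + np.cos(0.15 * (y - shift_y))

def lucas_kanade_window(frame0, frame1):
    """Estimate one (dx, dy) motion vector for the whole window via least squares."""
    Iy, Ix = np.gradient(frame0)          # spatial gradients (axis 0 = y, axis 1 = x)
    It = frame1 - frame0                  # temporal gradient
    inner = (slice(1, -1), slice(1, -1))  # skip borders, where gradients are one-sided
    A = np.stack([Ix[inner].ravel(), Iy[inner].ravel()], axis=1)
    b = -It[inner].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)   # solves Ix*u + Iy*v = -It
    return flow

frame0 = make_frame()
frame1 = make_frame(shift_x=0.5, shift_y=-0.3)     # known sub-pixel motion
flow = lucas_kanade_window(frame0, frame1)
print(flow)                                        # should be close to (0.5, -0.3)
```

The linearisation only holds for small displacements, which is why the classical methods struggle with fast motion unless they work coarse-to-fine.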

Video Understanding Models In-depth

The history of video understanding mirrors the history of image understanding: hand-crafted features → CNNs → Transformers. Each generation solved the temporal modelling problem differently.

Video Architecture Evolution — Two-Stream → 3D Conv → Video Transformer

Model	Year	Architecture	Temporal Modelling	Accuracy (dataset)	Speed
Two-Stream	2014	Dual CNN	Optical flow	88.0% (UCF-101)	Slow (flow)
C3D	2015	3D CNN	3D convolution	79.9% (UCF-101)	Moderate
I3D	2017	Inflated 3D	3D conv (ImageNet init)	95.6% (UCF-101)	Moderate
R(2+1)D	2018	Factorised 3D	2D spatial + 1D temporal	96.8% (UCF-101)	Moderate
SlowFast	2019	Dual-speed CNN	Slow (semantics) + Fast (motion)	79.0% (Kinetics-400 val)	Fast
TimeSformer	2021	ViT + Attn	Factorised spatial+temporal attn	80.7% (Kinetics-400)	Moderate
VideoMAE-H	2022	ViT-H MAE	Masked video pre-training	86.6% (Kinetics-400)	Moderate

Video Generation Core

Video generation is dramatically harder than image generation: objects must stay consistent across hundreds of frames, motion must follow physics, and coherent storylines span many seconds. The field progressed rapidly between 2022 and 2024.

2022: Imagen Video (Google), cascaded diffusion

2022: Make-A-Video (Meta), spacetime attention

2023: Gen-2 (Runway), commercial

2023: Pika 1.0, accessible short clips

2024 Feb ⭐: Sora (OpenAI), 1 min coherent

Sora Deep Dive — Spacetime Patches

Sora's key innovation: treat video as a sequence of spacetime patches rather than frames. A spacetime patch spans Δt × Δh × Δw — capturing motion intrinsically. These patches become tokens for a Diffusion Transformer (DiT), replacing the U-Net with a scalable Transformer architecture. This allows Sora to generate variable duration, resolution, and aspect ratio from a single model.

Sora's Spacetime Patches — video as 3D spatiotemporal token sequence
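Cutting a video into spacetime patches is essentially a reshape. The sketch below shows the idea on raw pixels, under stated assumptions: `to_spacetime_patches` is a hypothetical helper, the patch size (2 frames × 8 × 8 pixels) is arbitrary, and it is not Sora's actual tokeniser.

```python
import numpy as np

def to_spacetime_patches(video, pt, ph, pw):
    """Turn a (T, C, H, W) video into a (num_patches, pt*ph*pw*C) token matrix."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)      # (nT, nH, nW, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)    # one row per spacetime patch

video = np.arange(8 * 3 * 32 * 32, dtype=np.float32).reshape(8, 3, 32, 32)
tokens = to_spacetime_patches(video, pt=2, ph=8, pw=8)
print(tokens.shape)    # (4 * 4 * 4 patches, 2 * 8 * 8 * 3 values each)
```

Because the number of tokens simply scales with T, H, and W, the same Transformer can consume clips of different durations and aspect ratios, which is the property the paragraph above highlights.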

Depth Estimation In-depth

Depth estimation predicts the distance from camera to each pixel, producing an H×W depth map. Monocular depth (single camera) is inherently ambiguous: a nearby toy car looks like a distant real car. Deep networks resolve much of this ambiguity by learning cues such as perspective and familiar object size from stereo pairs or synthetic data, although absolute scale often remains uncertain.

Monocular Depth

One RGB image → depth map. Scale ambiguous — requires learned priors.

MiDaS (Intel 2020): relative depth, robust
DPT (2021): ViT backbone for accuracy
Depth Anything v2 (2024): 62M images, SOTA foundation model

Stereo Depth

Two cameras with known baseline → triangulate from disparity (pixel shift).

Absolute scale available (unlike monocular)
Standard in autonomous driving hardware
IGEV-Stereo, RAFT-Stereo: learned stereo
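Triangulating from disparity reduces to the standard pinhole relation depth = focal length × baseline / disparity. A minimal sketch, assuming a rectified stereo pair; the focal length and baseline values are illustrative, and `disparity_to_depth` is a hypothetical helper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert disparity (pixels) to metric depth (metres) for a rectified stereo pair."""
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)   # zero disparity means a point at infinity
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

depth = disparity_to_depth([70.0, 35.0, 7.0, 0.0], focal_px=700.0, baseline_m=0.54)
print(depth)   # larger disparity means a closer point
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points than for nearby ones.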

RGB-D Sensors

Direct depth measurement hardware.

Structured light: Kinect, Intel RealSense
Time-of-Flight: iPhone LiDAR, Velodyne
Used in: AR, robotics, autonomous vehicles

Depth Map — per-pixel distance from camera (red=near, blue=far)

Point Clouds & LiDAR Core

A point cloud is an unordered set of 3D points, each an (x, y, z) coordinate. LiDAR sensors emit laser pulses and measure return time to build 3D point clouds, with high-end units producing around 1.3M points/second. Unlike images, point clouds have no grid structure, have variable density, and capture only visible surfaces.

Image Grid vs Point Cloud — ordered pixels vs unordered 3D points

PointNet Key Insight

PointNet (Qi et al., 2017): process each point independently with a shared MLP, then aggregate via max pooling. Max pooling over all points is permutation-invariant — doesn't matter what order you feed the points in, you get the same result. PointNet++ extends this with hierarchical local feature extraction (like CNN for point clouds).
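The permutation-invariance argument can be checked directly. The sketch below is not the full PointNet architecture: it uses a two-layer shared MLP with random, untrained weights, and `pointnet_global_feature` is a hypothetical helper name.

```python
import numpy as np

def pointnet_global_feature(points, w1, w2):
    """Shared per-point MLP followed by max pooling over the point dimension."""
    h = np.maximum(points @ w1, 0.0)   # (N, 64): same weights applied to every point
    h = np.maximum(h @ w2, 0.0)        # (N, 128)
    return h.max(axis=0)               # (128,): order-independent global descriptor

rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 64))
w2 = rng.normal(size=(64, 128))

cloud = rng.normal(size=(1024, 3))             # stand-in for a LiDAR or mesh-sampled cloud
shuffled = cloud[rng.permutation(len(cloud))]  # same points, different order

feat_a = pointnet_global_feature(cloud, w1, w2)
feat_b = pointnet_global_feature(shuffled, w1, w2)
print(np.allclose(feat_a, feat_b))
```

Any symmetric aggregation (max, sum, mean) would give the same invariance; max pooling is the choice described above.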

NeRF & 3D Gaussian Splatting In-depth

Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) learn a 3D scene representation from 2D photos — given any new camera viewpoint, the model synthesises a photorealistic image. A small MLP maps (x, y, z, direction) → (colour, density). Volume rendering integrates these values along camera rays. Given 20–100 posed photos → synthesise any novel angle.
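The volume rendering step can be sketched for a single ray. In a real NeRF the densities and colours come from the MLP evaluated at sampled points; here they are hand-picked values, and `render_ray` is a hypothetical helper implementing the usual discrete approximation (per-segment opacity weighted by accumulated transmittance).

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite S samples along one ray into a single RGB value.

    sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) spacing between samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity of each ray segment
    # Transmittance: fraction of light that reaches sample i without being absorbed earlier
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = transmittance * alphas                 # contribution of each sample
    return weights @ colors, weights

# Empty space, then a dense red surface at the third sample, then empty space again
sigmas = np.array([0.0, 0.0, 50.0, 0.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 0, 1]], dtype=np.float64)
deltas = np.full(4, 0.1)

rgb, weights = render_ray(sigmas, colors, deltas)
print(rgb, weights)
```

Every operation here is differentiable, so the photometric error between rendered and real pixels can be backpropagated to the MLP that produced the densities and colours.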

3D Gaussian Splatting (Kerbl et al., 2023) replaces NeRF's implicit MLP with explicit 3D Gaussians, each with a centre, covariance (shape), colour, and opacity. Rendering projects the Gaussians to 2D and rasterises them directly on the GPU. Result: real-time novel view synthesis (30fps or more), quality on par with or better than strong NeRF baselines, and roughly 30-minute training versus hours for the original NeRF.

🧠

NeRF (2020)

MLP: (x,y,z,dir) → (colour, density)
Volume rendering along camera rays
Training: hours, Rendering: seconds/frame
Many variants: Instant-NGP (fast), Mip-NeRF (anti-aliased)
Foundation for much of the later novel-view synthesis work

🌟

3D Gaussian Splatting (2023)

Explicit 3D Gaussians: centre + shape + colour + opacity
Rasterise to screen — GPU-native, differentiable
Training: ~30 min, Rendering: 30fps real-time
Better quality, sharper edges than NeRF
Most impactful 3D vision paper of 2023

NeRF vs 3D Gaussian Splatting — implicit neural vs explicit Gaussian representation

# gsplat: fast differentiable Gaussian splatting renderer
import torch
from gsplat import rasterization

def render_gaussians(
    means:     torch.Tensor,  # (N, 3): Gaussian centres in 3D
    quats:     torch.Tensor,  # (N, 4): quaternion rotations
    scales:    torch.Tensor,  # (N, 3): per-axis scale, positive (e.g. exp of the raw parameter)
    opacities: torch.Tensor,  # (N,)  : opacity in [0, 1] (e.g. sigmoid of the raw parameter)
    colors:    torch.Tensor,  # (N, 3): RGB colours
    viewmat:   torch.Tensor,  # (C, 4, 4): camera extrinsics
    K:         torch.Tensor,  # (C, 3, 3): camera intrinsics
    width: int, height: int
) -> torch.Tensor:
    # Differentiable rasterisation via tile-based splatting
    renders, alphas, meta = rasterization(
        means=means, quats=quats, scales=scales,
        opacities=opacities, colors=colors,
        viewmats=viewmat, Ks=K,
        width=width, height=height
    )
    return renders  # (C, H, W, 3) rendered images

# Training: optimise Gaussian parameters to minimise L1 + SSIM vs training images
# Init from SfM point cloud → optimise ~30 min on RTX 4090 → real-time 30fps render

🎓 Domain 6 Complete — Computer Vision & Multimodal AI

Ch 6.1: Images = 3D tensors (C,H,W), batched as (N,C,H,W); always normalise with ImageNet mean/std for pre-trained models. Canny, HOG, and SIFT dominated before 2012.
Ch 6.2: AlexNet 2012 = the inflection point. ResNet skip connections F(x)+x solved depth degradation; EfficientNet compound-scales depth, width, and resolution jointly.
Ch 6.3: Detection = localise + classify all objects. YOLO: single forward pass predicts all boxes at 30–160fps. IoU and mAP are the standard metrics.
Ch 6.4: Segmentation = per-pixel labelling. U-Net: symmetric encoder-decoder with concatenation skip connections. SAM: promptable zero-shot segmentation — click any point, get a mask.
Ch 6.5: GAN: Generator fools Discriminator via minimax game. StyleGAN: mapping network + AdaIN style injection per resolution = photorealistic faces. CycleGAN: unpaired domain translation.
Ch 6.6: ViT: image as 16×16 patch tokens fed into a Transformer. Needs large pre-training data (~14M+); Swin adds hierarchy and shifted-window attention for detection/segmentation.
Ch 6.7: CLIP: shared image-text embedding via contrastive learning on 400M pairs → 76.2% zero-shot ImageNet. VLMs: ViT + MLP projector + LLM = visual question answering at scale.
Ch 6.8: Video = (T,C,H,W) tensor; Sora treats it as spacetime patches in a Diffusion Transformer. 3D Gaussian Splatting: real-time novel-view synthesis from photos. NeRF, depth maps, and LiDAR power autonomous systems.

Domain 6 traces how AI learned to see, understand, and recreate the visual world. The key progression: raw pixels → hand-crafted features (HOG, SIFT) → learned features (CNNs) → global attention (ViT) → language-grounded vision (CLIP, VLMs). Multimodal AI bridges Domain 6 and Domain 5 — vision and language are converging into unified foundation models that process any modality through shared embedding spaces and transformer architectures.
