Image Fundamentals & Classical Computer Vision
How images become numbers, how classical algorithms extract meaning from pixels, and why deep learning largely replaced them.
A digital image is a 3D array of numbers: width × height × channels. Everything in computer vision, from a simple edge detector to Stable Diffusion, is ultimately operations on this array. Understanding what those numbers represent is where computer vision begins.
Image Representation In-depth
Every digital image is, at its core, a grid of numbers. Each cell in this grid is a pixel, the atomic unit of a digital image, representing a single colour sample at a specific grid position. Before any algorithm can process an image, you must understand how those pixels are encoded and what they represent.
Grayscale images use a single channel: in the common 8-bit encoding each pixel is an integer from 0 (black) to 255 (white), giving 256 possible intensity levels. RGB images use three channels (Red, Green, Blue), each ranging from 0 to 255, producing 256³ ≈ 16.7 million possible colours per pixel. Every pixel in an RGB image is defined by exactly three numbers.
In deep learning, the tensor representation matters enormously. PyTorch uses channels-first ordering, (C, H, W), so a 224×224 RGB image becomes shape (3, 224, 224). NumPy and OpenCV use height-first: (H, W, C). Confusing these axes is one of the most common bugs in computer vision code.
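The axis reordering described above can be sketched with NumPy alone. This is a minimal illustration, assuming a small made-up 4×6 RGB array; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 4x6 RGB image in the (H, W, C) layout used by NumPy and OpenCV.
hwc = np.zeros((4, 6, 3), dtype=np.uint8)
hwc[..., 0] = 255  # fill the first channel so the axis move can be checked

# Move the channel axis to the front to get the (C, H, W) layout PyTorch expects.
chw = np.transpose(hwc, (2, 0, 1))

# And back again, e.g. before displaying or saving the image.
hwc_again = np.transpose(chw, (1, 2, 0))

print(hwc.shape, chw.shape)  # (4, 6, 3) (3, 4, 6)
```

Printing the shape at each stage of a pipeline is a cheap way to catch a swapped axis before it reaches a model.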
| Dataset | Typical input shape (H × W × C) | Values per image |
|---|---|---|
| MNIST | 28 × 28 × 1 | 784 |
| CIFAR-10 | 32 × 32 × 3 | 3,072 |
| ImageNet | 224 × 224 × 3 | 150,528 |
| Stable Diffusion | 512 × 512 × 3 | 786,432 |
Data types matter for performance. Raw images are stored as uint8 (0-255), which is compact but unsuitable for gradients. Neural networks typically take float32 input, either scaled to [0.0, 1.0] or further normalised with ImageNet statistics: after scaling to [0, 1], subtract the per-channel mean [0.485, 0.456, 0.406] and divide by the per-channel std [0.229, 0.224, 0.225]. This normalisation centres values around zero, helping gradient-based optimisation converge faster.
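As a rough sketch of that conversion, assuming an (H, W, 3) RGB array in uint8 (the function name `normalise` and the test image are illustrative, not from any library):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalise(image_u8):
    """Convert an (H, W, 3) uint8 RGB image to normalised float32."""
    scaled = image_u8.astype(np.float32) / 255.0     # [0, 255] -> [0.0, 1.0]
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD   # broadcasts over the channel axis

white = np.full((2, 2, 3), 255, dtype=np.uint8)      # hypothetical all-white image
out = normalise(white)
print(out.dtype, out[0, 0])
```

The mean and std are defined on the [0, 1] scale, so the division by 255 has to come first.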
OpenCV loads images as BGR, not RGB. When you use cv2.imread(), the channels are Blue-Green-Red. Feeding BGR to a model trained on RGB will silently produce wrong results. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
Colour Spaces Core
RGB is the natural colour space for displays: your screen builds every colour from red, green, and blue subpixels. But RGB mixes luminance (brightness) and chrominance (colour) together, making it poor for many computer vision tasks. Different colour spaces separate these properties, each suited to specific algorithms.
RGB
Red, Green, Blue, each 0-255
- Natural for display hardware
- Mixes brightness with colour
- Default for most image libraries
HSV
Hue (0°-360°), Saturation, Value
- Separates colour from brightness
- Hue: red=0°, green=120°, blue=240°
- Ideal for colour-based segmentation
LAB
L=lightness, A=green to red, B=blue to yellow
- Approximately perceptually uniform
- Equal ΔE = equal visual difference
- Best for colour similarity & style transfer
Grayscale conversion is not a simple average of R, G, B. The human eye is most sensitive to green, so the standard formula is: Luminance = 0.299R + 0.587G + 0.114B. Green contributes nearly 60% of perceived brightness. This is why most classical CV algorithms operate on grayscale: it reduces data by 3× while preserving the structural information humans rely on.
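The weighted sum can be checked on three pure-colour pixels. This is a small sketch with NumPy; `to_grayscale` is an illustrative helper, not a library function.

```python
import numpy as np

LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_grayscale(rgb):
    """Weighted sum over the channel axis of an (H, W, 3) RGB array."""
    return rgb.astype(np.float64) @ LUMA_WEIGHTS

# One row of three pixels: pure red, pure green, pure blue.
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
gray = to_grayscale(pixels)
print(gray)
```

Pure green comes out brightest and pure blue darkest, which a plain average of the three channels would not reproduce.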
In production, HSV colour filtering is still used for fast pre-processing, e.g., isolating red traffic lights before running a neural network detector. It's computationally cheap and works reliably when lighting is controlled. LAB is used in image quality assessment and style transfer where perceptual accuracy matters.
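A hue-based colour test of the kind described above might look like the sketch below. It uses Python's standard `colorsys` module on single pixels; the helper `is_reddish` and its tolerance and saturation thresholds are hypothetical values chosen for illustration.

```python
import colorsys

def hue_degrees(r, g, b):
    """Hue in degrees for an 8-bit RGB triple."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360

def is_reddish(r, g, b, tolerance=20, min_saturation=0.5):
    """Hypothetical colour test: hue within `tolerance` degrees of 0.

    Red sits at the wrap-around point of the hue circle, so both ends
    of the range are checked. Low-saturation (greyish) pixels are rejected.
    """
    h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hue = h * 360
    near_red = hue <= tolerance or hue >= 360 - tolerance
    return near_red and s >= min_saturation

print(round(hue_degrees(255, 0, 0)), round(hue_degrees(0, 255, 0)), round(hue_degrees(0, 0, 255)))
```

Note that OpenCV stores hue for 8-bit images in the range 0-179 (degrees divided by two), so thresholds written in degrees need halving when used with cv2.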
Image Filters & Convolutions In-depth
A filter (or kernel) is a small matrix, typically 3×3 or 5×5, that slides across an image computing a weighted sum at each position. This operation is called convolution, and it is arguably the most important operation in computer vision. Classical filters are hand-designed for specific effects. Deep learning CNNs learn their filters from data, but the underlying mathematical mechanism is the same.
Gaussian Blur
Smooths out noise by averaging each pixel with its neighbours using bell-shaped weights. The centre pixel has the highest weight; further pixels contribute less.
- Reduces high-frequency noise
- Pre-processing step before edge detection
- Kernel: [[1,2,1],[2,4,2],[1,2,1]] / 16
Sharpening
Enhances edges by amplifying the centre pixel and subtracting neighbours, effectively adding the difference between the pixel and its surroundings.
- Large centre weight (5), negative neighbours (-1)
- Kernel: [[0,-1,0],[-1,5,-1],[0,-1,0]]
- Increases contrast at edges
Sobel Filter
Detects edges by computing the intensity gradient. Separate kernels for horizontal and vertical edges.
- Horizontal: [[-1,-2,-1],[0,0,0],[1,2,1]]
- Vertical: [[-1,0,1],[-2,0,2],[-1,0,1]]
- Foundation for Canny edge detection
Median Filter
Replaces each pixel with the median of its neighbourhood. Non-linear, so not technically a convolution.
- Excellent for salt-and-pepper noise
- Preserves edges better than Gaussian
- Cannot be learned by a standard CNN layer
The convolution operation is identical in classical CV and deep learning. The only difference: classical engineers design the kernel weights by hand, while CNNs learn them via backpropagation. This single insight connects 40 years of computer vision history.
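To make the sliding weighted sum concrete, here is a minimal NumPy sketch that applies two of the kernels listed above to a tiny synthetic image. The helper `filter2d` is illustrative and unoptimised; like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and take a weighted sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x6 test image: dark left half, bright right half (a vertical edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

sobel_vertical = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16

edges = filter2d(image, sobel_vertical)   # strong response only around the edge
blurred = filter2d(image, gaussian)       # the hard step becomes a gradual ramp
print(edges[0], blurred[0])
```

The Sobel output is zero in the flat regions and peaks where the intensity jumps, while the Gaussian output smooths the same jump into intermediate values.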
Edge Detection Core
Edges are boundaries between regions of different intensity: they mark where objects begin and end, where surfaces change orientation, and where textures shift. Edges are the most information-dense locations in an image: they encode shape, structure, and object boundaries while discarding uniform regions that carry little useful signal.
The Canny Edge Detector (John Canny, 1986) remains the gold standard for classical edge detection. It's a five-step pipeline (Gaussian blur, gradient computation, non-maximum suppression, double thresholding, and hysteresis edge tracking), each step designed to balance noise rejection against edge localisation. Many classical CV systems used Canny as a preprocessing step, from document scanning to lane detection to augmented reality registration.
The Harris Corner Detector extends edge detection to find corners: points where intensity changes significantly in two directions simultaneously. Corners are more distinctive than edges (an edge looks the same along its length), making them better landmarks for matching between images. Harris corners are still used in camera calibration and simple tracking systems.
Classical Feature Descriptors Core
Before deep learning, the central challenge of computer vision was: how do you represent an image patch as a compact, distinctive vector that's robust to scale, rotation, and illumination changes? Researchers hand-crafted feature descriptors that encode local image structure, and these dominated from roughly 2000 to 2012.
SIFT (2004)
Scale-Invariant Feature Transform (David Lowe)
- 128-dimensional descriptor per keypoint
- Invariant to scale and rotation, robust to minor illumination changes
- Used in: panorama stitching, SLAM, image matching
HOG (2005)
Histogram of Oriented Gradients (Dalal & Triggs)
- Divide image into cells, count gradient orientations
- Concatenate histograms → feature vector
- Used in: pedestrian detection, object recognition
ORB (2011)
Oriented FAST + Rotated BRIEF (patent-free)
- Fast, rotation-invariant binary descriptor
- 10-100× faster than SIFT
- Used in: real-time mobile feature matching
Classical CV Pipeline Core
The classical computer vision pipeline dominated the field from approximately 1980 to 2012. Every step was designed and tuned by hand: a brittle, domain-specific process that required deep expertise and did not generalise well across tasks or visual domains.
Each stage was optimised independently, and features that were good for one task (pedestrians) performed poorly on another (faces, cars, medical images). Every new domain required starting over: new features, new tuning, new expertise. This is why end-to-end learning was so revolutionary.
Why Deep Learning Won In-depth
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large-Scale Visual Recognition Challenge with AlexNet, a deep convolutional neural network trained on two GPUs. It achieved a top-5 error rate of 15.3%, compared to 26.2% for the best classical method, the second-place entry. That 41% relative reduction in error was the most dramatic result in the competition's history, and it permanently changed computer vision.
Three fundamental advantages explain why deep learning dominates:
Learned Features
No manual feature engineering. Useful representations emerge automatically from data: the network discovers what matters.
End-to-End
Directly maps pixels to outputs. No intermediate representation bottleneck: the entire pipeline is optimised jointly.
Scale
Performance improves with more data and compute. Classical methods tend to plateau, while deep learning keeps improving.
Despite deep learning's dominance, classical CV is not dead. Edge devices with limited compute (microcontrollers, drones) still use Canny, HOG, and ORB. Classical methods also serve as fast pre-filters before running expensive neural networks, e.g., using simple motion detection to trigger a deep learning classifier only when something moves in frame.
Chapter 6.1 - Summary
- In PyTorch, images are (C, H, W) tensors; convert to float32 in [0, 1] (and normalise) before feeding to neural networks
- RGB mixes colour and brightness; HSV separates hue from value for colour-based algorithms
- Classical filters are hand-designed convolution kernels; CNNs learn these automatically from data
- Canny edge detector: 5 steps, blur → gradient → NMS → threshold → hysteresis
- HOG and SIFT: hand-crafted local feature descriptors that dominated pre-2012 computer vision
- AlexNet (2012): 41% error reduction proved end-to-end learned features beat manual engineering
From AlexNet's breakthrough in 2012 to EfficientNet's compound scaling in 2019, each generation of CNN architecture solved a specific problem: depth, efficiency, scale. Understanding why each design was invented is more important than memorising layer counts.
CNN Architecture Recap Core
Before diving into specific architectures, recall the three inductive biases that make CNNs ideal for images (covered in Domain 4, Ch 4.5):
Local Connectivity
Each neuron sees only a small spatial patch. This exploits spatial locality: nearby pixels are more related than distant ones.
Weight Sharing
The same filter slides across every position. This gives translation equivariance: a cat is detected regardless of where it appears.
Hierarchical Representation
Stacked conv layers build edges → textures → parts → objects. Each layer composes features from the layer below.
The core building block pattern used in almost every CNN: Conv → BatchNorm → ReLU → Pooling. As you go deeper, spatial dimensions shrink (via pooling/stride) while channel depth grows (more filters).
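How quickly the spatial size shrinks can be traced with the standard output-size formula, floor((size + 2·padding - kernel) / stride) + 1. The sketch below assumes a hypothetical five-stage, VGG-style layout (3×3 conv with padding 1, then 2×2 max-pool with stride 2, channels doubling up to a cap of 512); these layer choices are illustrative assumptions.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size, channels = 224, 64
trace = []
for stage in range(5):
    size = conv_output_size(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_output_size(size, kernel=2, stride=2)             # 2x2 pool halves it
    trace.append((size, channels))
    channels = min(channels * 2, 512)  # widen as resolution drops (assumed cap)

for spatial, ch in trace:
    print(f"{spatial}x{spatial} spatial, {ch} channels")
```

Five halvings take a 224×224 input down to a 7×7 feature map, while the channel count grows from 64 to 512.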
AlexNet & VGG In-depth
AlexNet (Krizhevsky et al., 2012) was the CNN that launched the deep learning revolution. It didn't just beat the competition on ImageNet: it cut top-5 error to 15.3%, against 26.2% for the 2012 runner-up and 25.8% for the previous year's winner. Nearly every key idea it used became standard practice.
AlexNet: Key Innovations
- GPU training: trained on 2 GTX 580 GPUs (an early demonstration of practical large-scale GPU training)
- ReLU activation: roughly 6× faster training than tanh in the paper's experiments
- Dropout (p=0.5) in FC layers: regularisation against overfitting
- Data augmentation: random crops, horizontal flips, colour jitter
- Architecture: 5 conv + 3 FC layers, 60M parameters
- 11×11 first-layer filters: large receptive field to capture coarse features
VGG: The "Beautiful" Architecture
- 3×3 convolutions only: architectural simplicity as a virtue
- Two 3×3 convs = same receptive field as one 5×5, but fewer parameters and more non-linearity
- VGG-16: 16 weight layers, 138M parameters
- VGG-19: 19 layers, 144M parameters
- Still widely used as a feature extractor backbone
- Simple, understandable: the "ImageNet of architectures"
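The claim that two stacked 3×3 convolutions match a 5×5 receptive field with fewer parameters can be verified with a few lines of arithmetic. This is a sketch that ignores biases and assumes a hypothetical layer with 256 input and 256 output channels.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Weight count of one conv layer, ignoring biases."""
    return kernel * kernel * in_ch * out_ch

def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 convs with the same kernel size."""
    return 1 + n_layers * (kernel - 1)

C = 256  # hypothetical channel count, same in and out
two_3x3 = 2 * conv_weights(3, C, C)   # 18 * C^2 weights
one_5x5 = conv_weights(5, C, C)       # 25 * C^2 weights

print(stacked_receptive_field(3, 2), two_3x3, one_5x5, two_3x3 / one_5x5)
```

The stacked version needs 18/25 = 72% of the weights and also applies a non-linearity twice instead of once.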
| Architecture | Year | Layers | Params | Top-5 Error | Kernel Sizes | Key Innovation |
|---|---|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | 60K | ~1% top-1 (MNIST) | 5×5 | First practical CNN |
| AlexNet | 2012 | 8 | 60M | 15.3% | 11×11, 5×5, 3×3 | GPU, ReLU, Dropout |
| VGG-16 | 2014 | 16 | 138M | 7.3% | 3×3 only | Pure 3×3, deep simple |
| VGG-19 | 2014 | 19 | 144M | 7.1% | 3×3 only | Even deeper VGG |
Despite being "outdated" in accuracy, VGG is still the default feature extractor for perceptual loss (style transfer, super-resolution) and neural texture synthesis. Its intermediate features are remarkably good at capturing visual similarity.
Inception & GoogLeNet In-depth
Szegedy et al. (Google, 2014) asked: how do you go deeper more efficiently? VGG's approach of stacking 3×3 convolutions worked, but parameters and computation grew together. The Inception module solved this with a radically different idea.
The Inception module applies multiple filter sizes (1×1, 3×3, 5×5) in parallel at each layer, plus a max-pooling branch. Outputs are concatenated along the channel dimension. The network learns which spatial scale to attend to at each layer.
The critical insight: 1×1 convolutions as dimensionality reduction. Before the expensive 3×3 and 5×5 convolutions, a 1×1 "bottleneck" conv reduces channel count dramatically. This kept GoogLeNet to roughly 5M parameters, about 12× fewer than AlexNet, with better accuracy.
ResNet & DenseNet In-depth
ResNet (He et al., 2015) solved the most puzzling problem in deep learning at the time: deeper networks had higher training error. This was not overfitting; the networks simply could not be optimised well. The solution was deceptively simple.
The residual block learns the change rather than the full mapping: y = F(x) + x. The skip connection allows gradients to flow directly through the identity path, enabling 50, 101, and even 152-layer networks.
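A toy version of y = F(x) + x shows why the identity path helps: if F contributes nothing, the block passes its input through unchanged, so adding blocks cannot make the best achievable fit worse. The sketch below stands in two dense layers for the convolutions; all shapes and names are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, with F a small two-layer transform (a stand-in for two conv layers)."""
    fx = relu(x @ w1) @ w2
    return fx + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # hypothetical batch of 4 feature vectors
zero = np.zeros((8, 8))

# With F's weights at zero, the block is exactly the identity mapping.
identity_out = residual_block(x, zero, zero)

# With non-zero weights, the block outputs the input plus a learned correction.
w1, w2 = rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)
print(np.array_equal(identity_out, x), y.shape)
```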
DenseNet (Huang et al., 2017) extended this idea: connect every layer to all subsequent layers within a block. With L layers in a dense block, there are L(L+1)/2 direct connections. Each layer receives feature maps from ALL preceding layers, enabling maximum feature reuse with fewer parameters.
ResNet Variants
- ResNet-50: 25.6M params, 76.1% top-1 (the workhorse)
- ResNet-101: 44.5M params, 77.4% top-1
- ResNet-152: 60.2M params, 78.3% top-1
- Bottleneck block: 1×1 → 3×3 → 1×1 (reduces computation)
DenseNet Advantages
- Feature reuse: fewer parameters than ResNet for same accuracy
- Better gradient flow: direct paths to every layer
- Implicit deep supervision: early layers get strong gradients
- Growth rate k=32: each layer adds 32 new channels
MobileNet & Efficient CNNs In-depth
ResNet and VGG are too heavy for mobile phones, embedded systems, and real-time applications. MobileNet (Howard et al., 2017) introduced depthwise separable convolutions, splitting one expensive operation into two cheap ones.
Standard convolution
Cost: K²·Cin·Cout·H·W
For 3×3, 256→512: 1,179,648 ops/pixel
Depthwise separable convolution
Step 1 - Depthwise: K×K conv applied to each input channel separately (no channel mixing)
Step 2 - Pointwise: 1×1 conv mixing channels (Cin→Cout)
Cost: (K²·Cin + Cin·Cout)·H·W
For 3×3, 256→512: 133,376 ops/pixel, ~9× cheaper
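The two cost formulas can be evaluated directly to reproduce the figures above. This is a small sketch counting multiply-accumulate operations per output pixel; the function names are illustrative.

```python
def standard_conv_cost(k, c_in, c_out):
    """Multiply-accumulates per output pixel for a standard KxK convolution."""
    return k * k * c_in * c_out

def depthwise_separable_cost(k, c_in, c_out):
    """Depthwise KxK (one filter per input channel) plus pointwise 1x1 mixing."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

standard = standard_conv_cost(3, 256, 512)
separable = depthwise_separable_cost(3, 256, 512)
print(standard, separable, round(standard / separable, 2))  # 1179648 133376 8.84
```

The saving approaches K² (9× for a 3×3 kernel) as the number of output channels grows, because the pointwise term dominates the separable cost.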
MobileNetV2 (2018) added inverted residual blocks: expand channels with 1×1 → depthwise 3×3 → compress back with 1×1. The skip connection goes between the narrow layers (inverted compared to ResNet). Other architectures targeting edge deployment include SqueezeNet, ShuffleNet, and GhostNet.
EfficientNet & NAS Core
Tan & Le (Google, 2019) observed that previous architectures scaled only one dimension at a time: depth (more layers), width (more channels), or resolution (larger images). EfficientNet scales all three simultaneously with a fixed compound coefficient.
Neural Architecture Search (NAS) was used to find the optimal base architecture (EfficientNet-B0). NAS automates architecture design by searching over a space of possible layers, connections, and hyperparameters. The EfficientNet family (B0 to B7) uses the same base architecture at different compound scales.
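Compound scaling can be sketched numerically. The constants below (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution) are the values reported in the EfficientNet paper and are not stated in this chapter, so treat them as an assumption of this example; the released B1 to B7 models also round the resulting sizes.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (assumed from the paper)

def compound_scale(phi):
    """Multipliers for depth, width, resolution, and approximate FLOPs at coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Because α·β²·γ² is close to 2, each unit increase in the coefficient roughly doubles the compute budget while keeping the three dimensions in balance.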
| Model | Top-1 Acc | Params | FLOPs | Key Feature |
|---|---|---|---|---|
| ResNet-50 | 76.0% | 25M | 4.1B | Skip connections |
| EfficientNet-B0 | 77.1% | 5.3M | 0.39B | NAS + compound scaling |
| EfficientNet-B3 | 81.6% | 12M | 1.8B | Medium scale |
| EfficientNet-B7 | 84.4% | 66M | 37B | Largest, SOTA 2019 |
Data Augmentation In-depth
Training a CNN from scratch typically calls for very large labelled datasets, and labelled data is expensive. Data augmentation creates diverse training samples by applying label-preserving transformations. A flipped cat is still a cat. A slightly rotated stop sign is still a stop sign.
Augmentations must be label-preserving. Rotating a "6" by 180° turns it into a "9" and changes the label, and mirroring a digit produces a shape that is no longer a valid digit, so avoid flips and large rotations for digit recognition. Always think about whether the transform preserves meaning for your specific task.
Standard Augmentations (Always Use)
- RandomResizedCrop: random area crop + resize to target
- RandomHorizontalFlip: 50% chance mirror
- ColorJitter: brightness, contrast, saturation, hue
- Normalize: ImageNet mean/std (technically preprocessing, always required)
Advanced Augmentations
- RandAugment: randomly pick N of 14 transforms; simple and effective
- Cutout / RandomErasing: mask a random rectangle, which forces robustness
- CutMix: paste patch from one image onto another, blend labels
- Mixup: linear blend of two images + their labels
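Of the advanced augmentations above, Mixup is the simplest to write out. The sketch below blends two images and their one-hot labels with the same weight; drawing that weight from Beta(0.2, 0.2) is an assumed setting for illustration, and the random arrays stand in for real images.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Blend two images and their one-hot labels with the same weight `lam`."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

rng = np.random.default_rng(0)
cat_img, dog_img = rng.random((3, 8, 8)), rng.random((3, 8, 8))  # stand-in (C, H, W) images
cat_lbl, dog_lbl = np.array([1.0, 0.0]), np.array([0.0, 1.0])    # one-hot labels

lam = rng.beta(0.2, 0.2)   # mixing weight in [0, 1]; the Beta parameters are assumed
mixed_img, mixed_lbl = mixup(cat_img, cat_lbl, dog_img, dog_lbl, lam)
print(mixed_img.shape, mixed_lbl)
```

The blended label is a soft target, so the loss has to accept class probabilities rather than a single class index.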
import torch
import torchvision.transforms as T
import torchvision.transforms.v2 as T2

# Standard training augmentation (ImageNet-style)
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),   # random crop, resize to 224
    T.RandomHorizontalFlip(p=0.5),                 # 50% chance flip
    T.ColorJitter(brightness=0.4, contrast=0.4,    # colour distortion
                  saturation=0.4, hue=0.1),
    T.ToTensor(),                                  # PIL -> float32 (C, H, W) in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Strong augmentation (RandAugment)
strong_transform = T2.Compose([
    T2.RandomResizedCrop(224),
    T2.RandomHorizontalFlip(),
    T2.RandAugment(num_ops=2, magnitude=9),        # random 2 of 14 augmentations
    T2.ToImage(),                                  # PIL -> tensor image (needed before ToDtype/Normalize)
    T2.ToDtype(torch.float32, scale=True),         # uint8 [0, 255] -> float32 [0, 1]
    T2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),
    T2.RandomErasing(p=0.25),                      # cutout, applied to the normalised tensor
])

# Validation: no augmentation, just resize + normalise
val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
Test-time augmentation (TTA): at inference, augment the test image multiple times (flip, multi-crop), run each through the model, and average predictions. TTA typically boosts accuracy by 1-2% with no retraining. Standard in competitions and medical imaging.
Transfer Learning in CV In-depth
Avoid training from scratch whenever a suitable pre-trained model exists. This is the single most important practical rule in computer vision. ImageNet pre-trained models have already learned general visual features (edges, textures, shapes) that transfer remarkably well to nearly any image task.
Feature Extraction
Freeze the entire backbone. Train only a new classification head.
- Best when: small dataset (<1K images)
- Training: very fast (few params)
- Risk: underfitting if task is very different
Partial Fine-Tuning
Freeze early layers, fine-tune later layers + head.
- Best when: moderate dataset (1K-10K)
- Rationale: early layers = universal edges; later layers = task-specific
- Common: freeze first 2-3 blocks
Full Fine-Tuning
Unfreeze all layers with a small learning rate.
- Best when: large dataset (>10K images)
- Key: use differential LR, with the backbone 10-100× smaller than the head
- Risk: catastrophic forgetting if LR too high
import torch
import torch.nn as nn
import torchvision.models as models

# --- Option 1: Feature Extraction (freeze backbone) ---
model = models.resnet50(weights='IMAGENET1K_V2')   # pretrained
for param in model.parameters():
    param.requires_grad = False                    # freeze everything

# Replace final classification layer (new layers are trainable by default)
num_classes = 10                                   # your task: 10 classes (not 1000)
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc has requires_grad=True: very few parameters to train

# --- Option 2: Full Fine-tuning with differential LR ---
model2 = models.resnet50(weights='IMAGENET1K_V2')
model2.fc = nn.Linear(model2.fc.in_features, num_classes)

# Differential learning rates: backbone gets a 100× smaller LR than the head
backbone_params = [p for n, p in model2.named_parameters() if not n.startswith('fc.')]
head_params = list(model2.fc.parameters())
optimizer = torch.optim.AdamW([
    {'params': backbone_params, 'lr': 1e-5},       # small LR for pre-trained layers
    {'params': head_params, 'lr': 1e-3},           # larger LR for new head
], weight_decay=1e-4)
Common mistake: forgetting to match the pre-processing. If the pre-trained model was trained with ImageNet normalisation (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), you must use the same normalisation at inference. Mismatched preprocessing is one of the most common silent bugs in transfer learning.
Chapter 6.2 - Summary
- AlexNet (2012): ReLU + dropout + GPU training, which kicked off the deep learning era in CV
- VGG: 3×3 convolutions only, architectural simplicity with depth; still the go-to feature extractor for perceptual loss
- Inception module: parallel multi-scale convolutions (1×1, 3×3, 5×5) + 1×1 bottlenecks; GoogLeNet = 22 layers, only 5M params
- ResNet: y = F(x) + x skip connections solve the degradation problem and enable 100+ layer networks
- DenseNet: every layer connects to all subsequent layers, giving maximum feature reuse with fewer parameters
- MobileNet: depthwise separable convolutions cut convolution cost by ~9×, essential for mobile and edge deployment
- EfficientNet: compound scaling (depth × width × resolution) + NAS = best accuracy/efficiency trade-off
- Data augmentation: always use random crop + flip + colour jitter; advanced: RandAugment, CutMix, Cutout
- Transfer learning: always start with ImageNet pre-trained weights; use differential learning rates (backbone: 1e-5, head: 1e-3)
Classification tells you what. Detection tells you what AND where, outputting a variable number of bounding boxes, each with a class label and confidence score. The evolution from R-CNN's 47 seconds per image to YOLO's 90 FPS is one of the most dramatic speedups in deep learning history.
The Detection Task Core
Image classification assigns ONE label to an entire image: "cat." Object detection finds MULTIPLE objects, drawing a bounding box around each and labelling it: "cat at (x,y,w,h) with 97% confidence, dog at (x',y',w',h') with 89% confidence."
The output format is a list of detections, each containing: (class_id, confidence, x_centre, y_centre, width, height). Coordinates are typically normalised to the range 0 to 1 relative to image dimensions.
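Drawing or cropping a detection requires converting that normalised centre format into pixel corner coordinates. A minimal sketch, assuming a made-up detection on a 640×480 image; the helper name is illustrative.

```python
def to_pixel_corners(x_c, y_c, w, h, img_w, img_h):
    """Convert a normalised (x_centre, y_centre, width, height) box to pixel (x1, y1, x2, y2)."""
    x1 = (x_c - w / 2) * img_w
    y1 = (y_c - h / 2) * img_h
    x2 = (x_c + w / 2) * img_w
    y2 = (y_c + h / 2) * img_h
    return x1, y1, x2, y2

# Hypothetical detection: class 0, 97% confidence, box centred in the image.
detection = (0, 0.97, 0.5, 0.5, 0.25, 0.5)
class_id, confidence, *box = detection
corners = to_pixel_corners(*box, img_w=640, img_h=480)
print(class_id, confidence, corners)  # 0 0.97 (240.0, 120.0, 400.0, 360.0)
```

Mixing up the centre format with the corner format is a frequent source of boxes that appear shifted or doubled in size.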
Variable Object Count
An image may contain 0, 1, or 100 objects. The model must handle all cases, unlike classification, which always outputs one label.
Multi-Scale Objects
A pedestrian 20px tall and a bus 400px wide must both be detected. Scale variation is the hardest challenge.
Occlusion & Overlap
Objects hide behind other objects. The detector must still find partially visible objects and avoid merging overlapping ones.
Sliding Window & Anchor Boxes Core
The naive approach: slide a window across the image at multiple scales and aspect ratios, run a classifier on each crop. This is conceptually simple but catastrophically slow: hundreds of thousands of crops per image, each requiring a full forward pass.
Anchor boxes solved this elegantly. Instead of sliding a window, divide the feature map into a grid and place pre-defined bounding box shapes (anchors) at each cell. The model predicts offsets from these anchors, (δx, δy, δw, δh), plus an objectness score and class probabilities. Anchors are designed to cover common shapes: wide rectangles for cars, tall rectangles for people, squares for faces.
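How predicted offsets turn an anchor into a final box depends on the detector. The sketch below uses one common parameterisation (centre shifts scaled by the anchor size, log-scale width and height, as in the R-CNN family); this specific form is an assumption for illustration, since the text above does not fix one.

```python
import math

def decode(anchor, deltas):
    """Apply predicted offsets to an anchor given as (x_centre, y_centre, width, height)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa           # shift the centre by a fraction of the anchor width
    y = ya + dy * ha           # shift the centre by a fraction of the anchor height
    w = wa * math.exp(dw)      # scale width; exp keeps it positive
    h = ha * math.exp(dh)      # scale height
    return x, y, w, h

anchor = (100.0, 100.0, 50.0, 80.0)                 # hypothetical tall anchor
unchanged = decode(anchor, (0.0, 0.0, 0.0, 0.0))    # zero offsets return the anchor itself
shifted = decode(anchor, (0.1, -0.2, math.log(2), 0.0))
print(unchanged, shifted)
```

Predicting small corrections relative to a reasonable prior shape is an easier regression target than predicting absolute coordinates from scratch.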
R-CNN Family In-depth
The R-CNN family represents the two-stage approach to detection: first propose regions that might contain objects, then classify each region. Three papers over two years took detection from painfully slow to near real-time.
R-CNN (2014)
- Selective search: ~2000 region proposals
- Warp each to 227×227
- CNN feature extraction per region
- SVM classifier + bbox regression
- Speed: 47 sec/image
Fast R-CNN (2015)
- CNN on entire image once → shared feature map
- Project proposals onto feature map (RoI Pooling)
- Classify + regress from RoI features
- Bottleneck: selective search still external
- Speed: 2 sec/image
Faster R-CNN (2015)
- Replace selective search with Region Proposal Network (RPN)
- RPN shares CNN backbone with detection head
- End-to-end trainable
- 73.2% mAP on PASCAL VOC
- Speed: 0.2 sec/image (5 FPS)
Faster R-CNN is a two-stage detector: stage 1 proposes regions (RPN), stage 2 classifies them. Two-stage detectors are generally more accurate but slower. One-stage detectors (YOLO, SSD) skip the proposal step entirely and predict boxes and classes in a single pass.
YOLO: You Only Look Once In-depth
Redmon et al. (2015) made a radical move: frame detection as a single regression problem. Divide the image into an S×S grid, and in a single forward pass, predict all bounding boxes and class probabilities simultaneously. No proposals, no second stage: just one neural network, one pass, done.
The result: 45 FPS on 2015 hardware, roughly 2,000× faster than R-CNN's 47 seconds per image. The trade-off was accuracy: YOLO struggled with small objects and nearby objects in the same grid cell. But the speed was revolutionary for real-time applications like autonomous driving and robotics.
| Version | Year | Speed | mAP (VOC/COCO) | Key Feature |
|---|---|---|---|---|
| YOLOv1 | 2015 | 45 FPS | 63.4% VOC | Single-pass regression |
| YOLOv2 | 2016 | 40 FPS | 78.6% VOC | Anchor boxes, batch norm, multi-scale |
| YOLOv3 | 2018 | 30 FPS | 33.0% COCO | Multi-scale detection, Darknet-53 |
| YOLOv5 | 2020 | 30+ FPS CPU | 50.7% COCO | PyTorch, Ultralytics, easy API |
| YOLOv8 | 2023 | 80+ FPS | 53.9% COCO | Anchor-free, SOTA single-stage |
from ultralytics import YOLO
import cv2

# Load pre-trained model (downloads automatically)
model = YOLO('yolov8n.pt')  # 'n'=nano, 's'=small, 'm'=medium, 'l'=large, 'x'=xlarge

# Run inference on image
results = model('street.jpg', conf=0.5, iou=0.45)

# Parse results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])                    # class index
        conf = float(box.conf[0])                # confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()    # bounding box coords
        print(f"{model.names[cls]}: {conf:.2f} at [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Visualise
result_image = results[0].plot()
cv2.imwrite('detected.jpg', result_image)

# Fine-tune on custom dataset
model.train(
    data='custom_dataset.yaml',  # YAML with train/val paths and class names
    epochs=100,
    imgsz=640,
    batch=16,
    device=0                     # GPU device
)
SSD & Single-Stage Detectors Core
SSD (Liu et al., 2016) combined YOLO's single-pass speed with an elegant multi-scale approach. Instead of predicting from just one feature map, SSD extracts predictions from 6 different feature map scales within the CNN. Early (larger) feature maps detect small objects; later (smaller) feature maps detect large objects.
This addressed YOLO's weakness with small objects: SSD runs at 59 FPS on 300×300 input with 74.3% mAP on PASCAL VOC, giving better small-object detection at comparable speed.
Modern Single-Stage Detectors
- RetinaNet (2017): focal loss, which addresses class imbalance (background vs objects)
- FCOS (2019): fully convolutional, anchor-free; predicts directly per pixel
- DETR (2020): transformer-based detection with no anchors and no NMS needed
Two-Stage vs One-Stage Summary
- Two-stage (Faster R-CNN): more accurate, slower (~5 FPS)
- One-stage (YOLO, SSD): faster (30-90 FPS), slightly less accurate
- Modern one-stage detectors have nearly closed the accuracy gap
NMS & Detection Metrics In-depth
Three concepts underpin detection evaluation: IoU measures box overlap quality, NMS removes duplicate detections, and mAP quantifies overall detector performance.
IoU (Intersection over Union)
Measures overlap between predicted and ground-truth box.
- IoU > 0.5: correct (PASCAL VOC)
- IoU > 0.75: stricter standard
- COCO: average over 0.5:0.05:0.95
NMS (Non-Max Suppression)
Removes duplicate overlapping boxes:
- Sort boxes by confidence (descending)
- Keep highest-scoring box
- Remove remaining boxes whose IoU with the kept box exceeds a threshold (e.g. 0.45)
- Repeat for remaining boxes
mAP (Mean Average Precision)
The standard detection metric:
- Per class: precision-recall curve
- AP = area under PR curve
- mAP = mean AP across all classes
- COCO mAP averages over 10 IoU thresholds
Standard NMS struggles when objects genuinely overlap (e.g., a crowd of people). It may suppress correct detections. Solutions: Soft-NMS (reduces confidence instead of removing), DETR (transformer-based, no NMS needed), or learnable NMS modules.
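IoU and the greedy NMS loop described above fit in a few lines of plain Python. This is an unoptimised sketch for a single class, with boxes given as (x1, y1, x2, y2) corners; the example boxes and scores are made up, and production code would normally use a vectorised library routine and run NMS per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed as a duplicate, while the distant third box survives.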
Chapter 6.3 - Summary
- Detection output: list of (class, confidence, x, y, w, h) per object in image
- Anchor boxes: pre-defined shapes at each grid cell; model predicts offsets + class + objectness
- R-CNN family: propose → extract → classify; Faster R-CNN adds RPN for end-to-end training (0.2s/img)
- YOLO: single forward pass predicts all objects, real-time at 30-90 FPS; YOLOv8 is SOTA single-stage
- SSD: YOLO-like but uses multiple feature map scales for better small object detection
- IoU: intersection / union measures predicted vs ground truth box overlap; threshold 0.5 (VOC) or 0.5:0.95 (COCO)
- NMS: remove duplicate overlapping boxes keeping highest-confidence detections; Soft-NMS for crowded scenes
- Modern trend: anchor-free (FCOS) and transformer-based (DETR) detectors eliminating hand-designed components
Detection draws boxes. Segmentation colours every pixel. From U-Net's elegant encoder-decoder to Meta's Segment Anything Model, segmentation has evolved from a niche medical imaging task to a foundation capability: segment any object in any image with a single click.
Segmentation Types Core
All three segmentation types assign labels at the pixel level, but they differ fundamentally in what they distinguish.
Semantic Segmentation
Every pixel labelled with a class: road, car, sky, person.
- All instances of same class get same colour
- Two cats = one merged mask
- Output: H×W label map (one int per pixel)
Instance Segmentation
Every pixel labelled with class AND instance ID.
- Two cats = two separate masks (Cat₁, Cat₂)
- Only "things" (countable objects)
- Output: list of (mask, class, conf) per object
Panoptic Segmentation
Unifies semantic (stuff) + instance (things).
- Every pixel: class + optional instance ID
- "Stuff" (sky, road) + "things" (car₁, car₂)
- Output: complete scene understanding
Semantic Segmentation In-depth
The core challenge: classification networks reduce spatial resolution (pooling, striding) to build semantic understanding. Segmentation needs to restore it and predict a class for every single pixel. The solution: encoder-decoder architectures.
Encoder
Loses spatial resolution via pooling/stride
Builds semantic understanding: "what" is here
224×224 → 7×7 feature map
Decoder
Restores spatial resolution to original size
Upsampling methods:
- Bilinear interpolation: simple, no learned params
- Transposed conv: learnable, can cause checkerboard artefacts
Skip connections are critical: they connect encoder layers to decoder layers at matching resolutions. Without them, the decoder must reconstruct fine spatial detail (edges, boundaries) from the bottleneck alone, which is a lossy process. With skips, fine spatial detail flows directly from encoder to decoder.
Loss Functions
- Pixel-wise cross-entropy: standard classification loss per pixel
- Dice loss: 1 - 2·|A∩B| / (|A|+|B|), i.e. one minus the Dice coefficient; better for class imbalance
- Combined: CE + Dice often used together in practice
- Weighted CE: higher weight for rare classes (e.g., tumour vs background)
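The reason Dice helps with class imbalance shows up on a mask with very little foreground. The sketch below is a minimal NumPy version for a single binary mask; the small `eps` term for numerical stability and the 8×8 example are assumptions of this illustration.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for a soft prediction in [0, 1] and a binary target mask."""
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice

# Hypothetical 8x8 mask with a small foreground region (4 of 64 pixels).
target = np.zeros((8, 8))
target[2:4, 2:4] = 1.0

perfect = dice_loss(target, target)                   # prediction equals the mask
all_background = dice_loss(np.zeros((8, 8)), target)  # predicts "nothing here" everywhere
pixel_accuracy = np.mean(np.zeros((8, 8)) == target)  # accuracy of that empty prediction

print(perfect, all_background, pixel_accuracy)
```

An all-background prediction is right on 93.75% of pixels yet has a Dice loss close to 1, so the loss cannot be driven down by ignoring the rare class.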
Key Architectures
- FCN (2014): first fully convolutional network, with no FC layers, so it accepts any input size
- U-Net (2015): symmetric encoder-decoder + skip connections
- DeepLab v3+ (2018): atrous convolutions + ASPP for multi-scale
- SegFormer (2021): transformer-based encoder, MLP decoder
U-Net Architecture In-depth
Ronneberger et al. (2015) designed U-Net for biomedical image segmentation, but it became the universal segmentation architecture. The U-shaped design features a symmetric encoder (contracting path) and decoder (expanding path) connected by skip connections at every level.
The critical innovation: skip connections don't just add features (like ResNet); they concatenate entire feature maps from encoder to decoder. This preserves the fine spatial detail (textures, edges) lost during downsampling. The decoder gets both the upsampled abstract features AND the original high-resolution details.
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """Two (3x3 conv -> BatchNorm -> ReLU) stages; padding=1 keeps H and W unchanged."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        return self.conv(x)


class UNet(nn.Module):
    """Three-level U-Net for single-channel input; H and W must be divisible by 8."""

    def __init__(self, n_classes=2):
        super().__init__()
        # Encoder
        self.enc1 = DoubleConv(1, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bottleneck = DoubleConv(256, 512)
        # Decoder
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)  # 512 = 256 (up) + 256 (skip)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)  # 256 = 128 (up) + 128 (skip)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)   # 128 = 64 (up) + 64 (skip)
        self.out_conv = nn.Conv2d(64, n_classes, 1)  # 1×1 final

    def forward(self, x):
        e1 = self.enc1(x)              # skip 1 (full resolution)
        e2 = self.enc2(self.pool(e1))  # skip 2 (1/2 resolution)
        e3 = self.enc3(self.pool(e2))  # skip 3 (1/4 resolution)
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))  # concat skip on the channel axis
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out_conv(d1)       # logits of shape (N, n_classes, H, W)

U-Net's encoder-decoder + skip connection pattern became foundational far beyond segmentation. Stable Diffusion uses a U-Net (with attention layers) as its denoising backbone. The same architectural pattern that segments tumours also generates images from text prompts.
Mask R-CNN & Instance Segmentation In-depth
Mask R-CNN (He et al., Facebook AI, 2017) extends Faster R-CNN with a simple but powerful addition: a third prediction head that outputs a pixel-level mask for each detected object. Three heads run in parallel after RoI features are extracted:
Classification Head
FC layers → Softmax
Output: class label for this region
Box Regression Head
FC layers → (Δx, Δy, Δw, Δh)
Output: refined bounding box
Mask Head (NEW)
FCN → 28×28 binary mask per class
Output: pixel-level mask for the object
A critical improvement: RoI Align replaced RoI Pooling. RoI Pooling quantises coordinates (rounding to the feature-map grid), causing spatial misalignment that matters little for bounding boxes but badly hurts pixel-accurate masks. RoI Align samples with bilinear interpolation instead: no quantisation, precise alignment.
RoI Pooling rounds region coordinates to the nearest feature-map cell, so each rounding step can be off by up to 0.5 cells. On a stride-32 feature map (a 224px image reduced to 7×7), half a cell corresponds to up to 16px in the input image: harmless for classification, severely damaging for pixel masks. RoI Align samples with bilinear interpolation at exact floating-point positions, removing this quantisation error.
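The difference between snapping to the grid and interpolating can be seen on a tiny feature map. This is a minimal sketch of the sampling step only: `bilinear_sample` is a hypothetical helper, and real RoI Align averages several such samples per output bin.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a floating-point (y, x) inside its bounds."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0                      # fractional offsets in [0, 1)
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

# Feature map whose value equals its column index, so the "true" value at x is x
feat = np.tile(np.arange(8, dtype=np.float64), (8, 1))
x = 2.4
quantised = feat[3, int(round(x))]        # RoI Pooling style: snap to the grid
aligned = bilinear_sample(feat, 3.0, x)   # RoI Align style: interpolate
print(quantised, aligned)
```

The quantised lookup returns the value at column 2, while the interpolated sample recovers the value at 2.4.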
Panoptic Segmentation In-depth
Panoptic segmentation (Kirillov et al., 2019) unifies semantic and instance segmentation into a single coherent output. Every pixel gets a class label. "Things" (countable: cars, people) also get unique instance IDs. "Stuff" (uncountable: sky, road, grass) gets class labels only.
Things
- Each instance gets a unique ID
- car₁ ≠ car₂ even though both are "car"
- Predicted by instance segmentation branch
Stuff
- No instance distinction, just a class label
- All sky pixels = "sky" (no sky₁, sky₂)
- Predicted by semantic segmentation branch
Panoptic FPN (2019) attaches a semantic branch and an instance branch to one shared FPN backbone. Mask2Former (2022) goes further and handles both things and stuff with a single decoder: it treats all segments (things + stuff) as mask queries processed by a transformer decoder, and reported state-of-the-art results on semantic, instance, and panoptic segmentation at publication.
SAM: Segment Anything Model In-depth
Kirillov et al. (Meta AI, 2023) built a foundation model for segmentation. Trained on SA-1B (11 million images with 1.1 billion segmentation masks), SAM can segment any object in any image from a simple prompt: click a point, draw a box, or supply a rough mask. Free-text prompts were explored in the paper only as a proof of concept.
Image Encoder
MAE pre-trained ViT-H (Vision Transformer Huge)
- Encodes image into embedding
- 256×64×64 feature map
- Heavy: ~100ms (done once)
Prompt Encoder
Encodes user prompts as tokens:
- Point click: segment object at that point
- Bounding box: segment within the box
- Mask: refine an existing mask
- Text: explored in the SAM paper as a proof of concept; not a prompt type in the released SAM or SAM2 models
Mask Decoder
Lightweight transformer decoder
- Outputs 3 candidate masks
- Whole object / part / subpart
- Fast: ~50ms per prompt
- Interactive: prompt → mask in near real time
SAM was trained on 1.1 billion masks across 11 million images, over 100× larger than any previous segmentation dataset. The data engine used SAM itself in a loop: model assists human annotators → annotators correct → model improves → repeat. This "model-in-the-loop" approach is now a common way to build large-scale datasets.
SAM excels at segmenting arbitrary objects but does NOT classify them. It tells you "here's an object boundary" but not "this is a cat." For applications needing both segmentation and classification, combine SAM with a classifier or use specialised models like Mask R-CNN or Mask2Former.
Chapter 6.4: Summary
- Semantic segmentation: every pixel gets a class label; all instances of a class share one mask
- Instance segmentation: each object gets a separate mask + class + confidence, distinguishing individual objects
- Panoptic segmentation: unifies semantic (stuff) + instance (things) for complete scene understanding
- U-Net: symmetric encoder-decoder with skip connections (copy + concat) that preserve spatial detail; also used in Stable Diffusion
- Mask R-CNN: Faster R-CNN + 28×28 mask head + RoI Align (no quantisation, pixel-accurate masks); the standard instance-segmentation baseline for years after 2017
- Panoptic Quality: PQ = SQ × RQ, a single metric combining segmentation quality and recognition quality
- SAM: foundation model that segments anything from a point click; trained on 1.1B masks, interactive at ~50ms per prompt
- SAM2 (2024): extends to video, tracking and segmenting objects across frames with interactive prompts
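The PQ = SQ × RQ factorisation from the summary can be worked through with a few numbers. This sketch assumes segments have already been matched (a prediction and a ground-truth segment count as a true positive when their IoU exceeds 0.5); the function name and the sample IoUs are illustrative.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """Return (PQ, SQ, RQ) given IoUs of matched segments and unmatched counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality: mean IoU of matches
    rq = tp / denom                              # recognition quality: an F1-style score
    return sq * rq, sq, rq

# Three matched segments, one spurious prediction, one missed ground-truth segment
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_false_pos=1, num_false_neg=1)
print(f"SQ={sq:.2f}  RQ={rq:.2f}  PQ={pq:.2f}")  # SQ=0.80  RQ=0.75  PQ=0.60
```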
Two neural networks locked in an adversarial game: one forges, the other detects. From blurry 64×64 bedrooms to photorealistic 1024×1024 faces, GANs evolved from an elegant mathematical idea into one of the most impactful generative frameworks in computer vision.
GAN Fundamentals In-depth
Goodfellow et al. (2014) introduced one of the most cited frameworks in deep learning: two networks in competition. The Generator G takes random noise z ~ N(0,I) and produces fake images G(z). The Discriminator D receives an image (real or fake) and outputs the probability that it's real.
Training alternates: update D for k steps (improve its detection ability), then update G for 1 step (improve its forgery). At Nash equilibrium, G produces perfect fakes and D outputs 0.5 for everything: it literally cannot tell real from fake.
GAN Loss Functions Core
The original GAN loss has two critical failure modes. Vanishing gradients: when D becomes too good, D(G(z)) ≈ 0, so the generator's term log(1 − D(G(z))) saturates near log(1) = 0 and passes almost no gradient back to G. Mode collapse: G finds a few outputs that fool D and keeps generating only those, ignoring the rest of the distribution.
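The vanishing-gradient failure mode can be checked numerically. The sketch below assumes D ends in a sigmoid over a logit `a`, and compares the gradient (with respect to that logit) of the minimax generator loss log(1 − D) with the non-saturating alternative −log D that Goodfellow et al. suggested using in practice. Function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the minimax generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the non-saturating generator loss."""
    return sigmoid(a) - 1.0

# More negative logit = D is more confident the generated sample is fake
for logit in (0.0, -4.0, -8.0):
    print(f"D(G(z))={sigmoid(logit):.4f}  "
          f"minimax grad={grad_saturating(logit):+.4f}  "
          f"non-saturating grad={grad_non_saturating(logit):+.4f}")
```

As D grows confident, the minimax gradient shrinks towards zero while the non-saturating gradient stays close to −1, which is the signal G needs early in training.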
Original GAN Loss
JS divergence-based.
- Vanishes when distributions don't overlap
- Unstable training dynamics
- Loss curves not meaningful
WGAN (2017)
Wasserstein (Earth Mover's) distance.
- Gradients stay informative even when the distributions do not overlap
- D must be 1-Lipschitz
- WGAN-GP: gradient penalty for stability
LSGAN
Least-squares loss for D.
- MSE instead of log-likelihood
- Penalises samples far from boundary
- More stable, less mode collapse
DCGAN Core
Radford et al. (2015) established the first stable recipe for CNN-based GANs. Before DCGAN, most GAN experiments produced noise or collapsed. DCGAN's design rules became gospel for all subsequent work:
DCGAN Design Rules
- Replace pooling with strided convolutions (D) and transposed convolutions (G)
- BatchNorm in both G and D (except G output and D input)
- Remove fully connected hidden layers
- ReLU in G (except output: Tanh), LeakyReLU in D
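To see how transposed convolutions replace pooling in G, it helps to trace the spatial size through the generator. The sketch uses the standard output-size formula for a transposed convolution (no dilation, no output padding); the specific kernel/stride/padding values follow the commonly used DCGAN layout and are an assumption, not something stated above.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) per generator layer: project z, then four upsampling layers
layers = [(4, 1, 0)] + [(4, 2, 1)] * 4

size = 1           # the latent vector z is treated as a 1x1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

Each stride-2 layer doubles the resolution, taking a 1×1 latent to a 64×64 image in five layers.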
DCGAN Results
- Generated 64×64 bedroom images, among the first realistic CNN-generated images
- Meaningful latent space: interpolating between z vectors = smooth visual transitions
- "Smiling woman" − "neutral woman" + "neutral man" = "smiling man"
- Proved CNNs could learn rich image priors unsupervised
Conditional GAN Core
Vanilla GANs generate random images: you have no control over what comes out. Mirza & Osindero (2014) fixed this by feeding a condition label y to both G and D. Now G(z, y) generates an image of class y, and D(x, y) verifies it's a real image of that class.
Isola et al. (2016) used conditional GANs for paired image-to-image translation in pix2pix: sketch → photo, edges → handbag, day → night, satellite → map. The condition is the input image itself. pix2pix showed that conditional GANs are a general-purpose framework for image transformation tasks.
StyleGAN In-depth
Karras et al. (NVIDIA, 2019) produced the first truly photorealistic synthetic faces. The key insight: separate the latent code into a style that controls appearance at each resolution level, rather than feeding noise directly into the first layer.
StyleGAN Innovations
- Mapping network: z → w (8-layer FC), giving a less entangled intermediate latent space
- AdaIN: inject style w at each resolution level via Adaptive Instance Normalisation
- Progressive growing: train at 4×4, gradually grow to 1024×1024
- Stochastic variation: per-pixel noise at each layer for fine details (hair, freckles)
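The AdaIN step in the list above can be sketched in a few lines of NumPy. In StyleGAN the per-channel scale and bias come from a learned affine transform of w; here they are fixed hypothetical values so that the effect is easy to verify.

```python
import numpy as np

def adain(x, scale, bias, eps=1e-5):
    """Adaptive Instance Normalisation for one sample.

    x:     feature maps of shape (C, H, W)
    scale: per-channel style scale, shape (C,)
    bias:  per-channel style bias, shape (C,)
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalised = (x - mean) / (std + eps)        # wipe the content's own statistics
    return scale[:, None, None] * normalised + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
scale = np.array([1.0, 0.5, 2.0, 1.5])           # hypothetical style parameters
bias = np.array([0.0, 1.0, -1.0, 0.5])
out = adain(x, scale, bias)

print(out.mean(axis=(1, 2)).round(3))   # per-channel means now follow `bias`
print(out.std(axis=(1, 2)).round(3))    # per-channel stds now follow `scale`
```

After AdaIN, each channel's statistics are dictated by the style rather than by the incoming features, which is how one style vector can control appearance at a given resolution.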
StyleGAN Evolution
- StyleGAN (2019): 1024×1024 photorealistic faces, FFHQ dataset
- StyleGAN2 (2020): removes AdaIN artefacts, path length regularisation
- StyleGAN3 (2021): alias-free, with proper translation/rotation equivariance
- StyleGAN-XL (2022): scaled to ImageNet-level diversity
CycleGAN Core
pix2pix requires paired training data: the exact same scene in both domains (e.g., the same street in day and night). This is often impossible to collect. Zhu et al. (2017) solved this with CycleGAN: learn translation between domains using only unpaired examples.
The trick: cycle consistency. Two generators (G_AB: A→B, G_BA: B→A) must satisfy G_BA(G_AB(a)) ≈ a. If you translate a horse to a zebra and back, you should get the original horse. This constraint prevents the generators from hallucinating arbitrary outputs.
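The cycle-consistency constraint can be written as an L1 penalty on the round trip in both directions. The sketch below replaces the two generator networks with toy functions on 1D data (a pure assumption for illustration) so that a consistent and an inconsistent pair can be compared.

```python
import numpy as np

def cycle_consistency_loss(a, b, g_ab, g_ba):
    """L1 cycle loss: a -> B -> A should return a, and b -> A -> B should return b."""
    forward = np.abs(g_ba(g_ab(a)) - a).mean()
    backward = np.abs(g_ab(g_ba(b)) - b).mean()
    return float(forward + backward)

def g_ab(a):
    """Toy 'generator' from domain A to domain B."""
    return 2.0 * a + 1.0

def g_ba_good(b):
    """Exact inverse of g_ab, so round trips are lossless."""
    return (b - 1.0) / 2.0

def g_ba_bad(b):
    """Ignores its input, so the round trip cannot recover the original."""
    return np.zeros_like(b)

a = np.linspace(-1.0, 1.0, 5)
b = g_ab(a)
loss_good = cycle_consistency_loss(a, b, g_ab, g_ba_good)
loss_bad = cycle_consistency_loss(a, b, g_ab, g_ba_bad)
print(loss_good, loss_bad)
```

A generator that discards its input may still fool a discriminator, but it pays a large cycle loss, which is what rules out arbitrary outputs.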
Horse ↔ zebra, summer ↔ winter, photo ↔ Monet painting, day ↔ night, apple ↔ orange. CycleGAN can be applied to any two unpaired image domains. The cycle consistency loss is also used in unsupervised machine translation (text) and audio style transfer.
GAN Training Challenges In-depth
Mode Collapse
G produces limited variety: it finds a few modes that fool D and ignores the rest.
- "Same face no matter what z is"
- Fix: mini-batch discrimination
- Fix: unrolled GANs, WGAN
Training Instability
D and G must stay balanced: if one dominates, the other gets no gradient.
- Loss curves are NOT meaningful
- Fix: spectral normalisation
- Fix: separate learning rates for D and G (two time-scale update rule, TTUR)
Evaluation Challenge
How to measure "realistic and diverse"?
- FID: Fréchet Inception Distance
- Lower FID = better quality + diversity
- Needs 1000s of samples to estimate
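FID is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images. The sketch below computes that distance with NumPy on random toy features; the real metric uses Inception-v3 activations, and the function names here are illustrative.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr(sqrt(S1 S2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative when S1 and S2 are covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def feature_stats(features):
    """Mean vector and covariance matrix of an (N, D) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))                     # stand-in for Inception features
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])       # same spread, mean moved by 1

d_same = frechet_distance(*feature_stats(real), *feature_stats(real))
d_shift = frechet_distance(*feature_stats(real), *feature_stats(shifted))
print(f"{d_same:.4f} {d_shift:.4f}")
```

Identical statistics give a distance of zero, and a pure mean shift contributes its squared length; differences in covariance (for example, reduced diversity after mode collapse) raise the trace term.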
| Technique | Problem | How It Works | Used In |
|---|---|---|---|
| WGAN / WGAN-GP | Vanishing gradients | Wasserstein distance + gradient penalty | Most modern GANs |
| Spectral Normalisation | Training instability | Constrain weight matrix spectral norm | SN-GAN, BigGAN |
| Progressive Growing | High-res training | Start 4×4, gradually increase resolution | StyleGAN |
| Mini-batch Discrimination | Mode collapse | Pass statistics across batch to D | Original GAN improvements |
| Label Smoothing | Overconfident D | D target = 0.9 not 1.0 | Standard practice |
By 2022, diffusion models (Ch 6.6) largely replaced GANs for image generation. Diffusion models are more stable to train, don't suffer mode collapse, and achieve better FID scores. GANs remain relevant for real-time generation (single forward pass vs diffusion's iterative denoising) and image-to-image translation (CycleGAN, pix2pix).
Chapter 6.5: Summary
- GAN: Generator fools Discriminator; D trains to detect fakes; a minimax game with Nash equilibrium at D(G(z)) = 0.5
- WGAN: replaces JS divergence with Wasserstein distance to address vanishing gradients; WGAN-GP adds a gradient penalty
- DCGAN: strided convolutions + BatchNorm + LeakyReLU = first stable image GAN; established CNN-GAN design rules
- Conditional GAN: label y conditions both G and D, enabling class-conditional generation and pix2pix image translation
- StyleGAN: mapping network z → w + AdaIN style injection per resolution = first photorealistic 1024² face generation
- CycleGAN: unpaired image translation via cycle consistency loss, G_BA(G_AB(a)) ≈ a; no paired data needed
- Mode collapse: G generates limited variety; training instability: D/G balance is fragile; FID: standard evaluation metric
- Modern trend: diffusion models replacing GANs for generation quality; GANs still best for real-time and image-to-image tasks
In 2020, a single idea flipped computer vision upside-down: what if we treated an image as a sequence of patches, just like tokens in a sentence? The Vision Transformer showed that convolutions are not necessary. Pure self-attention, given enough data and compute, learns to see.
CNN Limitations That Motivated ViT Core
CNNs have two hard-wired inductive biases: locality (convolutions see only a small neighbourhood) and translation equivariance (same filter applied everywhere). These are powerful priors, but they also limit expressiveness.
A 3×3 conv at layer 1 sees only 9 pixels. To see the whole image, information must pass through many convolution and pooling layers, getting progressively diluted. Transformers have no such constraint: self-attention connects every patch to every other patch in a single operation. Global context from layer 1.
✓ Translation equivariant by design
✓ Computationally efficient: O(HW) rather than O((HW)²)
✓ Works well on small datasets
✓ Fast inference on edge devices
✗ Long-range dependencies require many layers
✗ Fixed spatial hierarchy (downsampling loses info)
✗ Inductive biases may hurt on novel domains
✗ Hard to model global context at early layers
ViT: An Image is Worth 16×16 Words In-depth
Dosovitskiy et al. (Google Brain, 2020) applied the standard Transformer encoder, essentially unchanged from NLP, directly to images. The trick: divide the image into fixed-size non-overlapping patches, flatten each patch, linearly project it, and treat the resulting sequence exactly like word tokens.
224² → 196 patches (16×16)
16×16×3 = 768 values each
Linear: 768 → D + pos embed
12 layers, global attention
MLP on [CLS] → class
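The patch-to-token steps above can be reproduced with plain array reshapes. This is a NumPy sketch of the data flow only: the projection matrix, [CLS] token, and position embeddings are random or zero placeholders rather than learned weights.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)                            # (196, 768): 14 x 14 patches of 16*16*3 values

rng = np.random.default_rng(0)
D = 768                                             # embedding width of ViT-Base
W = rng.normal(scale=0.02, size=(tokens.shape[1], D))      # placeholder for the learned projection
cls_token = np.zeros((1, D))                               # placeholder for the learned [CLS] token
pos_embed = rng.normal(scale=0.02, size=(tokens.shape[0] + 1, D))

sequence = np.concatenate([cls_token, tokens @ W], axis=0) + pos_embed
print(tokens.shape, sequence.shape)                 # (196, 768) (197, 768)
```

The resulting (197, D) sequence is what the Transformer encoder consumes; from this point on, nothing in the model is image-specific.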
import timm
import torch
from PIL import Image
from torchvision import transforms

# Load pre-trained ViT-Base/16 (ImageNet-21k → ImageNet-1k)
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # 86,567,656
print(f"Patch size: {model.patch_embed.patch_size}")  # (16, 16)
print(f"Num patches: {model.patch_embed.num_patches}")  # 196
print(f"Embedding dim: {model.embed_dim}")  # 768
print(f"Num heads: {model.blocks[0].attn.num_heads}")  # 12
print(f"Num layers: {len(model.blocks)}")  # 12

# Inference
# Note: this uses ImageNet mean/std; some timm checkpoints ship a different
# preprocessing config, so check the model's pretrained config before relying on it.
transform = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img = transform(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)  # (1, 1000)
probs = logits.softmax(-1)
top5 = probs.topk(5)
print("Top-5 predictions:", top5.indices)

ViT Training Requirements Core
The original paper's most critical finding: trained on ImageNet-1k alone, ViT-Base/16 reached only 77.9% top-1, barely ahead of a standard ResNet-50 (76.1%) despite having more than three times the parameters, and the larger ViT-Large did no better. But pre-trained on JFT-300M (Google's internal 300M-image dataset), the largest model (ViT-H/14) reached 88.55%, ahead of the best CNNs of the time. The conclusion: ViT needs massive data to overcome its lack of inductive biases.
CNNs have built-in priors: locality and translation equivariance guide learning even with limited data. ViT must learn these properties from data. With enough examples, ViT discovers even better representations, but on small datasets it overfits to training patterns without discovering general visual structure.
DeiT: Data-Efficient Image Transformers In-depth
Touvron et al. (Facebook, 2021) asked: can we train ViT on ImageNet-1k without Google's proprietary JFT-300M? The answer: yes, with three key ingredients: knowledge distillation from a CNN teacher, aggressive data augmentation, and a careful training recipe.
DeiT adds a second special token: [DIST] (distillation token) alongside the standard [CLS]. [DIST] is trained to match the CNN teacher's soft labels, while [CLS] learns from ground-truth hard labels. At inference, both outputs are averaged. Result: DeiT-Base: 81.8% vs original ViT-Base on same data: 77.9%.
Swin Transformer In-depth
Liu et al. (Microsoft, 2021) solved ViT's two biggest problems for dense prediction tasks: fixed single-scale output and quadratic attention cost. Swin Transformer (ICCV 2021 Best Paper) introduced window attention and hierarchical feature maps, making transformers practical for detection and segmentation.
Window Attention
Compute self-attention within local 7×7 windows instead of globally.
- Cost: O(M²·HW) in total for M×M windows, not O((HW)²)
- Linear complexity in image size
- Shifted windows alternate each layer
- Cross-boundary info flow restored
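The complexity claim for window attention is easy to verify by counting query-key pairs. The sketch below counts pairs only (it ignores the channel dimension and the projection layers), with function names chosen for illustration.

```python
def global_attention_pairs(h, w):
    """Every token attends to every token: (HW)^2 pairs."""
    n = h * w
    return n * n

def window_attention_pairs(h, w, m):
    """Tokens attend only within their own m x m window."""
    assert h % m == 0 and w % m == 0, "feature map must tile evenly into windows"
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2

for side in (56, 112, 224):
    g = global_attention_pairs(side, side)
    win = window_attention_pairs(side, side, 7)
    print(f"{side}x{side} tokens: global={g:,}  window={win:,}  ratio={g // win}x")
# 56x56 tokens: global=9,834,496  window=153,664  ratio=64x
```

Doubling the side length multiplies the window cost by 4 (linear in the number of tokens) but the global cost by 16 (quadratic).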
Hierarchical Features
Multi-scale feature maps like ResNet's pyramid.
- Stage 1: 56×56 (fine, small objects)
- Stage 2: 28×28 → Stage 3: 14×14
- Stage 4: 7×7 (coarse, large objects)
- Plug directly into FPN for detection
ConvNeXt: A ConvNet for the 2020s Core
Liu et al. (Facebook, 2022) asked: "What if we took ResNet and applied every Transformer design decision?" Starting from ResNet-50, they applied 7 systematic changes. The result: ConvNeXt-Base reaches 83.8%, matching Swin-Base (83.5%) without any attention mechanism.
Architecture Showdown In-depth
| Model | Inductive Bias | Attention | Hierarchy | Best For | Params | Top-1 |
|---|---|---|---|---|---|---|
| ResNet-50 | Strong (conv) | None | Yes | Small data, fast inference | 25M | 76.1% |
| EfficientNetV2-M | Strong (conv+NAS) | None | Yes | Efficient production | 54M | 85.1% |
| ViT-Base/16 | None | Global | No | Large-scale pre-training | 86M | 81.8% |
| DeiT-Base/16 | Weak (distill) | Global | No | ImageNet-scale tasks | 86M | 81.8% |
| Swin-Base | Weak (window) | Local+shift | Yes | Detection, segmentation | 88M | 83.5% |
| ConvNeXt-Base | Moderate (conv) | None | Yes | All-around modern CNN | 89M | 83.8% |
| ViT-L/16 (MAE) | None | Global | No | Large-scale SOTA | 307M | 87.8% |
| SAM (ViT-H) | None | Global | No | Zero-shot segmentation | 641M | n/a |
There is no single "best" architecture in 2024. For classification accuracy: ViT-Large with MAE pre-training. For efficiency: ConvNeXt or EfficientNetV2. For detection/segmentation: Swin or ConvNeXt backbone. For multimodal tasks: ViT dominates, because it connects naturally to language models via shared attention.
Chapter 6.6: Summary
- ViT splits image into 196 non-overlapping 16×16 patches, treats them as a token sequence for a standard Transformer encoder
- Self-attention is global from layer 1, with no locality constraint unlike CNNs; every patch sees every other patch directly
- ViT needs massive pre-training data: outperforms CNNs only above ~14M images; on small data, CNNs still win
- DeiT: trains ViT on ImageNet-1k via knowledge distillation from a CNN teacher + [DIST] token: 81.8% vs 77.9%
- Swin Transformer: window attention (linear cost) + shifted windows + hierarchical feature maps = best backbone for detection/segmentation
- ConvNeXt: ResNet updated with 7 Transformer design choices; matches Swin without any attention (83.8% vs 83.5%)
- No single best architecture: ViT for accuracy, ConvNeXt for efficiency, Swin for dense prediction, ViT for multimodal
For decades, vision and language models lived in separate worlds. CLIP changed that in 2021 by learning a shared embedding space where "a photo of a dog" and an actual photo of a dog sit close together. One model. Two modalities. Zero task-specific training.
What Is Multimodal AI? Core
Unimodal models process only one type of data, such as a text-only LLM or an image-only CNN. Multimodal models process and relate multiple data types simultaneously: text + image, image + audio, video + text. Real-world tasks are inherently multimodal: "describe what's in this photo" requires both vision and language.
Vision → Language
- Image captioning (BLIP, CoCa)
- Visual QA: "What colour is the car?"
- Document parsing (GPT-4V, DocVQA)
- Medical image report generation
- Chart and figure understanding
Language → Vision
- Text-to-image (DALL-E 3, SDXL, Midjourney)
- Image editing by text instruction
- Text-guided inpainting
- Text-to-video (Sora, Runway)
- Text-to-3D generation
Text + Image → Text
- Multimodal chat (GPT-4V, LLaVA, Gemini)
- "Explain this chart step by step"
- Visual code debugging
- Document Q&A with screenshots
Text ↔ Image Retrieval
- Find images matching a text query (CLIP)
- Find text matching an image
- Pinterest visual search, Google Lens
- Open-vocabulary detection (GLIP, OWL-ViT)
CLIP: Contrastive Language-Image Pre-training In-depth
Radford et al. (OpenAI, 2021) trained two encoders (one for images, one for text) jointly on 400 million (image, text) pairs scraped from the internet. No manual labels: web captions ARE the supervision. The goal: learn a shared embedding space where matching image-text pairs are close and non-matching pairs are far apart.
Image Encoder
- Input: 224×224 RGB image
- Output: d-dim embedding (e.g., 512 or 1024)
- Projected to shared embedding space via linear layer
Text Encoder
- Input: text caption (up to 77 tokens)
- Output: d-dim embedding
- [EOS] token representation projected to shared space
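The training objective that pulls matching pairs together is a symmetric cross-entropy over the batch's image-text similarity matrix. The NumPy sketch below follows the structure of the pseudocode in the CLIP paper; the random embeddings stand in for encoder outputs, and the fixed temperature of 0.07 is an assumption (CLIP learns this value during training).

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each array is assumed to be a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature    # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))                     # the correct match sits on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # text -> image
    return float((loss_i2t + loss_t2i) / 2)

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 32))                          # stand-in text embeddings
images_aligned = text + 0.05 * rng.normal(size=(8, 32))  # each image close to its own caption
images_shuffled = np.roll(images_aligned, 1, axis=0)     # pairs deliberately mismatched

loss_aligned = clip_contrastive_loss(images_aligned, text)
loss_shuffled = clip_contrastive_loss(images_shuffled, text)
print(loss_aligned < loss_shuffled)   # True
```

Every other item in the batch acts as a negative example, which is why CLIP benefits from very large batch sizes.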
What CLIP Enables In-depth
Zero-Shot Image Classification
Write text prompts: "a photo of a {class}" for each class. Embed all prompts. Compare with image embedding via cosine similarity. Nearest = predicted class.
- 76.2% Top-1 on ImageNet with zero task-specific training
- Competitive with supervised ResNet-50 (76.1%)
Visual Similarity Search
Embed a text query, find images by cosine similarity in shared space.
- Pinterest visual search, Google Lens
- Stock photo search by description
- Medical image retrieval
Open-Vocabulary Detection
Combine CLIP with detection models to detect ANY category described in text.
- OWL-ViT, GLIP, Grounding DINO
- No retraining for new classes
- "Find all red objects in this image"
Data Filtering & Curation
Filter web-crawled images by semantic content using text queries.
- Built LAION-5B (5B image-text pairs)
- LAION = training data for Stable Diffusion
- CLIP score filters low-quality pairs
import torch
import open_clip
from PIL import Image

# Load pre-trained CLIP model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

# Load and preprocess image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # (1, 3, 224, 224)

# Zero-shot classification: define class prompts
classes = ["cat", "dog", "car", "bird", "elephant"]
text_prompts = [f"a photo of a {c}" for c in classes]
text_tokens = tokenizer(text_prompts)  # (5, 77) token sequences

with torch.no_grad():
    image_features = model.encode_image(image)      # (1, 512)
    text_features = model.encode_text(text_tokens)  # (5, 512)
    # Normalise to unit vectors before cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity: (1,512) @ (512,5) → (1,5); 100.0 scales the logits before softmax
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f"{cls:12s}: {prob.item():.3f}")
# Example output (illustrative; exact values depend on the image):
# cat         : 0.823  ← highest
# dog         : 0.091
# car         : 0.031
# bird        : 0.034
# elephant    : 0.021

DALL-E 1, 2, and 3 In-depth
DALL-E 1 (2021)
- Text → BPE tokens (256) + VQ-VAE image tokens (1024)
- 12B-param autoregressive Transformer
- Predicts image tokens sequentially
- Creative combinations but limited fidelity
- Slow: sequential generation
DALL-E 2 (2022)
- Text → CLIP text embedding → Prior → CLIP image embedding
- Diffusion decoder (unCLIP) → high-res image
- Much higher fidelity than DALL-E 1
- Supports image variations (encode → re-decode)
- Uses CLIP embeddings as bridge
DALL-E 3 (2023)
- Key innovation: highly descriptive synthetic captions
- Re-captioned all training data with detailed descriptions
- Much better instruction following
- Integrated into ChatGPT
- Handles complex prompts faithfully
Vision-Language Models (VLMs) In-depth
VLMs accept both images and text as input and generate text as output. The core challenge: how do you connect a vision encoder to a language model? Three main approaches have emerged:
1. Feature Concatenation
Vision tokens prepended to text tokens before LLM.
- LLM processes visual + text tokens together
- Requires LLM pre-training on multimodal data
- Example: Flamingo (cross-attention layers)
- Limitation: LLM must learn vision interpretation
2. Projector / Adapter
MLP projector bridges vision encoder → LLM input space.
- Most common approach
- Freeze LLM (or LoRA), train only projector
- Examples: LLaVA-1.5, InternVL, Qwen-VL
- Efficient: minimal new parameters
3. Native Multimodal
Vision and language trained together from scratch.
- Best cross-modal reasoning
- Most expensive to train
- Examples: GPT-4V, Gemini
- Unified architecture: no separate connector needed
GPT-4V, Gemini & Open VLMs Core
GPT-4V / GPT-4o (OpenAI)
- First frontier VLM (2023)
- Read documents, analyse charts
- Solve visual math, describe scenes
- GPT-4o: native voice + vision + text
- 128K token context
Gemini 1.5 Pro (Google)
- Natively multimodal from pre-training
- 1M token context window
- Process 1 hour of video or 1000 images
- Images, video, audio, code, text
- Best for long-document & video tasks
Open-Source VLMs
- LLaVA-1.5: CLIP + MLP + Vicuna, a strong baseline
- InternVL-2: competitive with GPT-4V on benchmarks
- Qwen-VL: multilingual, multi-image
- PaliGemma: SigLIP + Gemma, efficient open model
| Model | Architecture | Vision Encoder | LLM Base | Context | Notable |
|---|---|---|---|---|---|
| GPT-4V / 4o | Proprietary | Undisclosed | GPT-4 | 128K | Best overall, native voice+vision |
| Gemini 1.5 Pro | Native multimodal | Proprietary | Gemini | 1M tokens | Long video, multi-image |
| Claude 3.5 Sonnet | Proprietary | Undisclosed | Claude 3.5 | 200K | Document analysis, charts |
| LLaVA-1.5 | ViT+Projector | CLIP ViT-L/336 | Vicuna-13B | 4K | Strong open baseline |
| InternVL-2 | ViT+MLP | InternViT-6B | InternLM2-20B | 8K | Near-frontier open |
| Qwen-VL-Plus | ViT+Adapter | Qwen ViT | Qwen-7B | 8K | Multilingual, multi-image |
| PaliGemma | ViT+Linear | SigLIP-So400M | Gemma-2B/9B | 8K | Open, small, efficient |
Multimodal Benchmarks Core
| Benchmark | Tests | Format | GPT-4V | Best Open | Human |
|---|---|---|---|---|---|
| VQAv2 | General visual QA | Open-ended | 77.2% | ~75% | 80.9% |
| TextVQA | Text in images | Open-ended | 78.0% | 76.1% | ~85% |
| DocVQA | Document understanding | Open-ended | 87.2% | 82.6% | 96% |
| ChartQA | Chart comprehension | Open-ended | 78.5% | 74.8% | 80.5% |
| MMMU | University-level multimodal | MCQ | 56.8% | 49.3% | 56.2% |
| MMBench | Comprehensive multimodal | MCQ | 75.8% | 72.4% | n/a |
Many multimodal benchmarks are saturating quickly: in the table above, GPT-4V is within a few points of human performance on VQAv2 and ChartQA, although DocVQA still shows a clear gap (87.2% vs 96%). MMMU (university-level expert questions spanning 30 subjects) remains the most challenging, with GPT-4V scoring only 56.8%. Harder benchmarks (MMMU-Pro, MathVista) continue to appear.
Chapter 6.7: Summary
- CLIP: jointly trained image + text encoders via contrastive loss on 400M image-text pairs, producing a shared embedding space
- Shared embedding: "a photo of a dog" and a dog photo have similar vectors, which is the key to zero-shot capabilities
- CLIP zero-shot: 76.2% on ImageNet with no task-specific training, just text prompts per class
- DALL-E 1: autoregressive over VQ-VAE tokens; DALL-E 2/3: CLIP + diffusion = higher fidelity + better instruction following
- VLMs: vision encoder → MLP projector → LLM; visual tokens are treated like text tokens inside the language model
- Frontier models: GPT-4V, Gemini 1.5 Pro (1M context), Claude 3.5 Sonnet lead on benchmarks
- Open alternatives: LLaVA-1.5, InternVL-2, PaliGemma; competitive with proprietary models on most benchmarks
- Key benchmarks: VQAv2, DocVQA, ChartQA, MMMU; MMMU remains hardest, most others nearly saturated
A video is not just a sequence of images: it is time, motion, causality, and physics. 3D vision goes further: understanding the world as volumetric space, not flat projections. These are the hardest problems in computer vision, and also the most important for autonomous systems that must act in the real world.
The Video Challenge Core
Video adds the temporal dimension to images: motion, change, causality, events. Processing frames independently with image models misses all temporal information: you can't tell if a person is walking left or right. Adjacent frames are also ~95% identical pixels, making naive processing extremely redundant.
Temporal Modelling
Must capture short-term motion (running, gestures) AND long-term events (scoring a goal over 10 seconds). Both time scales matter.
Computational Cost
30fps × 1 min = 1,800 frames. Can't process all at full resolution. Solutions: sampling, temporal pooling, sparse attention.
Temporal Alignment
Two videos of "making coffee" have the same steps in different order and timing. Models must be robust to temporal variation.
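The sampling strategy mentioned under Computational Cost can be as simple as picking one frame from each equal-length segment of the clip. The helper below is a hypothetical sketch of that idea (centre frame per segment), not the sampler of any particular model.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick the centre frame of each of num_samples equal-length segments."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

# One minute at 30 fps, reduced to a 16-frame clip
indices = uniform_frame_indices(1800, 16)
print(len(indices), indices[:3], indices[-1])  # 16 [56, 168, 281] 1743
```

Sixteen frames still span the whole minute, so long-term events stay visible, although fast motion between sampled frames is lost.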
Optical Flow In-depth
Optical flow computes a dense per-pixel motion vector field between two consecutive frames. For each pixel (x,y) in frame t, it asks: where does this pixel move to in frame t+1? The result is an H×W×2 flow field (Δx, Δy per pixel). Used in action recognition, video stabilisation, compression, and motion segmentation.
Classical Methods
- Lucas-Kanade (1981): sparse flow on corners/keypoints; fast, robust
- Horn-Schunck (1981): dense regularised flow; smooth but slow
- Farneback (2003): dense flow via polynomial expansion
- All assume: brightness constancy + small motion
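The brightness-constancy and small-motion assumptions lead directly to the Lucas-Kanade least-squares solve: linearising I(x+u, y+v, t+1) = I(x, y, t) gives Ix·u + Iy·v + It = 0 for every pixel in a window. The sketch below solves this for a single window on a synthetic pattern with a known sub-pixel shift; the pattern and function names are illustrative assumptions.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Estimate one (u, v) motion vector for a whole window by least squares."""
    iy, ix = np.gradient(frame0)          # spatial derivatives along rows (y) and columns (x)
    it = frame1 - frame0                  # temporal derivative
    a = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])   # structure tensor
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    u, v = np.linalg.solve(a, b)
    return u, v

def pattern(x, y):
    """Smooth synthetic texture with gradients in both directions."""
    return np.sin(0.2 * x) + np.cos(0.15 * y)

ys, xs = np.mgrid[0:64, 0:64].astype(np.float64)
true_u, true_v = 0.3, -0.2                          # small sub-pixel motion
frame0 = pattern(xs, ys)
frame1 = pattern(xs - true_u, ys - true_v)          # frame0's content moved by (u, v)

u, v = lucas_kanade_window(frame0, frame1)
print(f"estimated flow: u={u:.2f}, v={v:.2f}")      # close to the true (0.3, -0.2)
```

The 2×2 system is only solvable when the window has gradients in both directions, which is why Lucas-Kanade is usually applied at corners and keypoints.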
Deep Learning Methods
- FlowNet (2015): first end-to-end CNN for optical flow
- PWC-Net (2018): coarse-to-fine with cost volume
- RAFT (2020): iterative refinement on a 4D correlation volume; SOTA at release
- RAFT generalises well across domains without hand-coded motion assumptions
Video Understanding Models In-depth
The history of video understanding mirrors the history of image understanding: hand-crafted features → CNNs → Transformers. Each generation solved the temporal modelling problem differently.
| Model | Year | Architecture | Temporal Modelling | Top-1 Acc (Kinetics-400 unless noted) | Speed |
|---|---|---|---|---|---|
| Two-Stream | 2014 | Dual CNN | Optical flow | 88.0% (UCF-101) | Slow (flow) |
| C3D | 2015 | 3D CNN | 3D convolution | 79.9% | Moderate |
| I3D | 2017 | Inflated 3D | 3D conv (ImageNet init) | 95.6% (UCF-101) | Moderate |
| R(2+1)D | 2018 | Factorised 3D | 2D spatial + 1D temporal | 96.8% (UCF-101) | Moderate |
| SlowFast | 2019 | Dual-speed CNN | Slow (semantics) + Fast (motion) | 79.0% (val) | Fast |
| TimeSformer | 2021 | ViT + Attn | Factorised spatial+temporal attn | 80.7% | Moderate |
| VideoMAE-H | 2022 | ViT-H MAE | Masked video pre-training | 86.6% | Moderate |
Video Generation Core
Video generation is dramatically harder than image generation: objects must stay consistent across hundreds of frames, motion must follow physics, and coherent storylines span many seconds. The field progressed rapidly from 2022β2024.
Sora's key innovation: treat video as a sequence of spacetime patches rather than frames. A spacetime patch spans Δt × Δh × Δw, so each patch captures motion intrinsically. These patches become tokens for a Diffusion Transformer (DiT), which replaces the U-Net with a scalable Transformer architecture. This allows Sora to generate videos of variable duration, resolution, and aspect ratio from a single model.
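To illustrate what a spacetime patch is, the sketch below cuts a raw video array into non-overlapping Δt × Δh × Δw blocks and flattens each block into one token. It is only a shape-level illustration: the function name `spacetime_patchify`, the patch sizes, and the choice to patch raw pixels are assumptions made here, and Sora's actual tokenisation pipeline is not specified in this chapter.

```python
import numpy as np


def spacetime_patchify(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), where each
    row is one token covering pt frames and a ph x pw spatial window.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Expose the patch grid and the within-patch offsets as separate axes.
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Reorder to (grid_t, grid_h, grid_w, pt, ph, pw, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)


# Toy example: 8 frames of 32x32 RGB, patches of 2 frames x 16 x 16 pixels.
video = np.arange(8 * 32 * 32 * 3, dtype=np.float32).reshape(8, 32, 32, 3)
tokens = spacetime_patchify(video, pt=2, ph=16, pw=16)
```

Here the video becomes 4 × 2 × 2 = 16 tokens of length 1,536. Because the token count simply follows from the input size, the same patching works for any duration, resolution, or aspect ratio that divides evenly.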
Depth Estimation In-depth
Depth estimation predicts the distance from the camera to the scene point behind each pixel, producing an H×W depth map. Monocular depth (single camera) is inherently ambiguous: a nearby toy car looks like a distant real car. Deep networks now resolve much of this ambiguity by training on stereo pairs or synthetic data, learning scale cues such as perspective and familiar object size.
Monocular Depth
One RGB image → depth map. Scale is ambiguous, so learned priors are required.
- MiDaS (Intel 2020): relative depth, robust
- DPT (2021): ViT backbone for accuracy
- Depth Anything v2 (2024): 62M images, SOTA foundation model
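The scale ambiguity follows directly from pinhole projection, where image size is proportional to object size divided by distance. The sketch below works through the toy-car example under that model; the function name `projected_size_px`, the focal length, and the object sizes are made-up illustrative values.

```python
def projected_size_px(object_size_m, distance_m, focal_px):
    """Pinhole projection: size in pixels = focal length * size / distance."""
    return focal_px * object_size_m / distance_m


# Hypothetical numbers: a 4.5 m car at 45 m and a 0.45 m toy at 4.5 m.
real_car = projected_size_px(object_size_m=4.5, distance_m=45.0, focal_px=1000.0)
toy_car = projected_size_px(object_size_m=0.45, distance_m=4.5, focal_px=1000.0)
```

Both objects cover about 100 pixels, so geometry alone cannot separate them from one image. A monocular model has to rely on learned context to choose between the two interpretations.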
Stereo Depth
Two cameras with known baseline → triangulate from disparity (pixel shift).
- Absolute scale available (unlike monocular)
- Standard in autonomous driving hardware
- IGEV-Stereo, RAFT-Stereo: learned stereo
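For a rectified stereo pair, the triangulation reduces to Z = f · B / d, where f is the focal length in pixels, B the baseline in metres, and d the disparity in pixels. The sketch below applies that relation to a small disparity map. The function name `depth_from_disparity` and the camera values are illustrative assumptions, and zero disparity is treated as infinitely far.

```python
import numpy as np


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to metric depth (metres): Z = f * B / d."""
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0  # zero disparity means no measurable shift
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


# Hypothetical rig: 700 px focal length, 0.54 m baseline.
disparity = np.array([[35.0, 70.0],
                      [0.0, 7.0]])
depth = depth_from_disparity(disparity, focal_px=700.0, baseline_m=0.54)
```

Depth is inversely proportional to disparity: doubling the pixel shift from 35 to 70 halves the depth from 10.8 m to 5.4 m. Distant objects produce small disparities, which makes their depth estimates more sensitive to matching errors.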
RGB-D Sensors
Direct depth measurement hardware.
- Structured light: Kinect, Intel RealSense
- Time-of-Flight: iPhone LiDAR, Velodyne
- Used in: AR, robotics, autonomous vehicles
Point Clouds & LiDAR Core
A point cloud is an unordered set of 3D points, each an (x,y,z) coordinate. LiDAR sensors emit laser pulses and measure the return time to build 3D point clouds, at rates on the order of 1.3M points/second. Unlike images, point clouds have no grid structure, vary in density, and capture only visible surfaces.
PointNet (Qi et al., 2017) processes each point independently with a shared MLP, then aggregates via max pooling. Max pooling over all points is permutation-invariant: the order in which you feed the points does not change the result. PointNet++ extends this with hierarchical local feature extraction (much as a CNN builds features hierarchically on images).
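The permutation-invariance argument is easy to check numerically. The sketch below is a stripped-down stand-in for the PointNet idea (a shared two-layer MLP with random, untrained weights followed by a max over points), not the published architecture. The layer sizes and the name `pointnet_global_feature` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random weights for a shared per-point MLP: 3 -> 64 -> 128 (untrained).
W1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 128)), np.zeros(128)


def pointnet_global_feature(points):
    """Map an (N, 3) point cloud to a single 128-d global feature."""
    h = np.maximum(points @ W1 + b1, 0.0)  # same weights applied to every point
    h = np.maximum(h @ W2 + b2, 0.0)
    return h.max(axis=0)                   # symmetric aggregation over points


points = rng.normal(size=(1024, 3))
shuffled = points[rng.permutation(len(points))]

feat_a = pointnet_global_feature(points)
feat_b = pointnet_global_feature(shuffled)
```

`feat_a` and `feat_b` match because each row is transformed independently and the max ignores row order. Replacing the max with an order-dependent operation, such as flattening and a dense layer, would break this property.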
NeRF & 3D Gaussian Splatting In-depth
Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) learn a 3D scene representation from 2D photos: given any new camera viewpoint, the model synthesises a photorealistic image. A small MLP maps (x, y, z, direction) → (colour, density). Volume rendering integrates these values along camera rays. Given 20–100 posed photos, the model can synthesise views from novel angles.
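The volume rendering step can be sketched for a single ray. Each sample contributes its colour weighted by its own opacity, α = 1 − exp(−σ·δ), multiplied by the transmittance of everything in front of it. The code below implements that discrete quadrature in NumPy. The MLP is left out, so the densities and colours are hand-picked assumptions, and `render_ray` is an illustrative name.

```python
import numpy as np


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered front to back.

    densities: (S,) non-negative volume density sigma at each sample
    colors:    (S, 3) RGB colour at each sample
    deltas:    (S,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)  # opacity of each segment
    # Transmittance: probability the ray reaches sample i unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights


# Empty space, then a nearly opaque red surface that hides a blue one behind it.
densities = np.array([0.0, 0.0, 1e3, 5.0])
colors = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(4, 0.1)
rgb, weights = render_ray(densities, colors, deltas)
```

The rendered pixel comes out red: the empty samples get zero weight and the dense red sample absorbs almost all remaining transmittance. Every operation is differentiable, which lets the photo reconstruction loss train the MLP.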
3D Gaussian Splatting (Kerbl et al., 2023) replaces NeRF's implicit MLP with explicit 3D Gaussians, each with a centre, covariance shape, colour, and opacity. Rendering projects the Gaussians to 2D and rasterises them directly on the GPU. Result: real-time novel view synthesis at 30fps or more, quality competitive with or better than NeRF, and roughly 30 minutes of training versus NeRF's hours.
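The "covariance shape" of each Gaussian is usually not stored as a raw 3×3 matrix. In the 3D Gaussian Splatting formulation it is built from a rotation R and per-axis scales S as Σ = R S Sᵀ Rᵀ, which keeps it symmetric and positive semi-definite throughout optimisation. The sketch below builds Σ from a quaternion and a scale vector. The (w, x, y, z) quaternion order and both function names are assumptions made for this example.

```python
import numpy as np


def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def gaussian_covariance(quat, scales):
    """Build Sigma = R S S^T R^T from a rotation quaternion and axis scales."""
    R = quat_to_rotmat(quat)
    S = np.diag(np.asarray(scales, dtype=np.float64))
    M = R @ S
    return M @ M.T


# Identity rotation: the covariance is diagonal with the squared scales.
cov_axis_aligned = gaussian_covariance([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])

# Arbitrary rotation: same ellipsoid, different orientation.
cov_rotated = gaussian_covariance([0.5, 0.5, -0.5, 0.5], [1.0, 2.0, 3.0])
```

Rotating the Gaussian changes the entries of Σ but not its eigenvalues, which remain the squared scales. Optimising the quaternion and scales rather than Σ directly means gradient steps cannot produce an invalid covariance.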
NeRF (2020)
- MLP: (x,y,z,dir) → (colour, density)
- Volume rendering along camera rays
- Training: hours, Rendering: seconds/frame
- Many variants: Instant-NGP (fast), Mip-NeRF (anti-aliased)
- Foundation for much of the novel-view synthesis work that followed
3D Gaussian Splatting (2023)
- Explicit 3D Gaussians: centre + shape + colour + opacity
- Rasterise to screen: GPU-native, differentiable
- Training: ~30 min, Rendering: 30fps real-time
- Better quality, sharper edges than NeRF
- Arguably among the most impactful 3D vision papers of 2023
# gsplat: fast differentiable Gaussian splatting renderer
import torch
from gsplat import rasterization


def render_gaussians(
    means: torch.Tensor,      # (N, 3): Gaussian centres in 3D
    quats: torch.Tensor,      # (N, 4): quaternion rotations
    scales: torch.Tensor,     # (N, 3): scale along each axis
    opacities: torch.Tensor,  # (N,)  : opacity values
    colors: torch.Tensor,     # (N, 3): RGB colours
    viewmat: torch.Tensor,    # (C, 4, 4): camera extrinsics
    K: torch.Tensor,          # (C, 3, 3): camera intrinsics
    width: int,
    height: int,
) -> torch.Tensor:
    # Differentiable rasterisation via tile-based splatting
    renders, alphas, meta = rasterization(
        means=means, quats=quats, scales=scales,
        opacities=opacities, colors=colors,
        viewmats=viewmat, Ks=K,
        width=width, height=height,
    )
    return renders  # (C, H, W, 3) rendered images


# Training: optimise Gaussian parameters to minimise L1 + SSIM vs training images
# Init from SfM point cloud → optimise ~30 min on RTX 4090 → real-time 30fps render
Domain 6 Complete: Computer Vision & Multimodal AI
- Ch 6.1: Images = 3D tensors (C,H,W), batched as (N,C,H,W); always normalise with ImageNet mean/std for pre-trained models. Canny, HOG, and SIFT dominated before 2012.
- Ch 6.2: AlexNet 2012 = the inflection point. ResNet skip connections F(x)+x solved depth degradation; EfficientNet compound-scales depth, width, and resolution jointly.
- Ch 6.3: Detection = localise + classify all objects. YOLO: single forward pass predicts all boxes at 30–160fps. IoU and mAP are the standard metrics.
- Ch 6.4: Segmentation = per-pixel labelling. U-Net: symmetric encoder-decoder with concatenation skip connections. SAM: promptable zero-shot segmentation; click any point, get a mask.
- Ch 6.5: GAN: Generator fools Discriminator via minimax game. StyleGAN: mapping network + AdaIN style injection per resolution = photorealistic faces. CycleGAN: unpaired domain translation.
- Ch 6.6: ViT: image as 16×16 patch tokens fed into a Transformer. Needs large pre-training data (~14M+); Swin adds hierarchy and shifted-window attention for detection/segmentation.
- Ch 6.7: CLIP: shared image-text embedding via contrastive learning on 400M pairs → 76.2% zero-shot ImageNet. VLMs: ViT + MLP projector + LLM = visual question answering at scale.
- Ch 6.8: Video = (T,C,H,W) tensor; Sora treats it as spacetime patches in a Diffusion Transformer. 3D Gaussian Splatting: real-time novel-view synthesis from photos. NeRF, depth maps, and LiDAR power autonomous systems.
Domain 6 traces how AI learned to see, understand, and recreate the visual world. The key progression: raw pixels → hand-crafted features (HOG, SIFT) → learned features (CNNs) → global attention (ViT) → language-grounded vision (CLIP, VLMs). Multimodal AI bridges Domain 6 and Domain 5: vision and language are converging into unified foundation models that process any modality through shared embedding spaces and transformer architectures.