AI Foundation · Domain 06 · Chapter 6.1

Image Fundamentals & Classical Computer Vision

How images become numbers, how classical algorithms extract meaning from pixels, and why deep learning largely replaced them.


A digital image is a 3D array of numbers: height × width × channels. Everything in computer vision, from a simple edge detector to Stable Diffusion, is ultimately operations on this array. Understanding what those numbers represent is where computer vision begins.

Every digital image is, at its core, a grid of numbers. Each cell in this grid is a pixel, the atomic unit of a digital image, representing a single colour sample at a specific grid position. Before any algorithm can process an image, you must understand how those pixels are encoded and what they represent.

Grayscale images use a single channel: each pixel is an integer from 0 (black) to 255 (white), giving 256 possible intensity levels. RGB images use three channels (Red, Green, Blue), each ranging 0–255, producing 256³ ≈ 16.7 million possible colours per pixel. Every pixel in an RGB image is defined by exactly three numbers.

In deep learning, the tensor representation matters enormously. PyTorch uses channels-first ordering, (C, H, W), so a 224×224 RGB image becomes shape (3, 224, 224). NumPy and OpenCV use height-first ordering: (H, W, C). Confusing these axes is one of the most common bugs in computer vision code.
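The axis reordering described above can be sketched with NumPy alone. This is a minimal illustration, assuming a small made-up array in place of a real photo; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 4x6 RGB image in the (H, W, C) layout used by NumPy and OpenCV.
hwc = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Move the channel axis to the front to get the channels-first (C, H, W) layout.
chw = np.transpose(hwc, (2, 0, 1))

# Add a leading batch axis to get (N, C, H, W).
batch = chw[np.newaxis, ...]

print(hwc.shape, chw.shape, batch.shape)
```

The same pixel is addressed as `hwc[row, col, channel]` in one layout and `chw[channel, row, col]` in the other, which is why mixing them up usually fails silently rather than raising an error.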

Typical input sizes (height × width × channels):

  • MNIST: 28 × 28 × 1 = 784 values
  • CIFAR-10: 32 × 32 × 3 = 3,072 values
  • ImageNet: 224 × 224 × 3 = 150,528 values
  • Stable Diffusion: 512 × 512 × 3 = 786,432 values

Data types matter. Raw images are stored as uint8 (0–255), which is compact but unsuitable for gradient computation. Neural networks typically expect float32 input, either scaled to [0.0, 1.0] or additionally normalised with ImageNet statistics: after scaling to [0, 1], subtract the per-channel mean [0.485, 0.456, 0.406] and divide by the std [0.229, 0.224, 0.225]. This normalisation centres values around zero, which helps gradient-based optimisation converge faster.

Figure: RGB image as a 3D tensor. Three stacked channel planes (R, G, B) form a (3, H, W) array in PyTorch (CHW) or an (H, W, 3) array in NumPy (HWC). Each pixel is three numbers, e.g. R = 128, G = 64, B = 200. A 224×224 image holds 150,528 values; the full batch tensor has shape (batch=1, channels=3, height=224, width=224).
import torch
import torchvision.transforms as T
from PIL import Image
import numpy as np

# Load image (convert to RGB so grayscale or RGBA files also yield 3 channels)
img = Image.open("cat.jpg").convert("RGB")
print(f"PIL size: {img.size}, mode: {img.mode}")  # (width, height), 'RGB'

# Convert to NumPy: shape (H, W, 3), dtype uint8
img_np = np.array(img)
print(f"NumPy shape: {img_np.shape}")  # e.g. (480, 640, 3)
print(f"Pixel [0,0]: R={img_np[0,0,0]} G={img_np[0,0,1]} B={img_np[0,0,2]}")

# Convert to PyTorch tensor: shape (3, H, W), float32
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                                # HWC uint8 → CHW float32, scales to [0,1]
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet mean
                std=[0.229, 0.224, 0.225]),      # ImageNet std
])
tensor = transform(img)
print(f"Tensor shape: {tensor.shape}")  # torch.Size([3, 224, 224])
print(f"Value range: [{tensor.min():.2f}, {tensor.max():.2f}]")  # roughly within [-2.1, 2.6]
⚠️ Common Pitfall

OpenCV loads images as BGR, not RGB. When you use cv2.imread(), the channels are Blue-Green-Red. Feeding BGR to a model trained on RGB will silently produce wrong results. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB).
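For a plain 3-channel array, the BGR to RGB conversion amounts to reversing the last axis. The sketch below uses only NumPy and a made-up one-pixel image, so it illustrates the channel swap itself rather than OpenCV's API.

```python
import numpy as np

# Hypothetical 1x1 image holding a pure-blue pixel in BGR order (B=255, G=0, R=0).
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reverse the channel axis: BGR -> RGB. copy() makes the result contiguous.
rgb = bgr[..., ::-1].copy()

print(rgb[0, 0])  # blue is now the last channel
```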

RGB is the natural colour space for displays: your screen uses red, green, and blue LEDs. But RGB mixes luminance (brightness) and chrominance (colour) together, making it poor for many computer vision tasks. Different colour spaces separate these properties, each suited to specific algorithms.

RGB

Red, Green, Blue: each 0–255

  • Natural for display hardware
  • Mixes brightness with colour
  • Default for most image libraries

HSV

Hue (0°–360°), Saturation, Value

  • Separates colour from brightness
  • Hue: red=0°, green=120°, blue=240°
  • Ideal for colour-based segmentation

LAB

L=lightness, A=green↔red, B=blue↔yellow

  • Perceptually uniform
  • Equal ΔE = equal visual difference
  • Best for colour similarity & style transfer

Grayscale conversion is not a simple average of R, G, B. The human eye is most sensitive to green, so the standard formula is: Luminance = 0.299R + 0.587G + 0.114B. Green contributes nearly 60% of perceived brightness. Most classical CV algorithms operate on grayscale because it reduces data by 3× while preserving the structural information humans rely on.
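The luminance formula above is a weighted sum over the channel axis. A minimal NumPy sketch, assuming an (H, W, 3) RGB array and a made-up three-pixel test image:

```python
import numpy as np

# Luma weights from the formula above, in R, G, B order.
WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to an (H, W) luminance array."""
    return rgb.astype(np.float64) @ WEIGHTS

# Hypothetical 1x3 image: one pure red, one pure green, one pure blue pixel.
rgb = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
gray = to_grayscale(rgb)
print(gray.round(1))
```

Pure green comes out brightest and pure blue darkest, which a simple average of the three channels would not reproduce.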

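Figure: Colour spaces (RGB, HSV, LAB) for different CV tasks. RGB cube with black and white at opposite corners and R, G, B axes (intuitive for display). HSV cylinder with hue as the angle (0°, 120°, 240° marked), saturation as the radius and value along the height (separates colour from brightness). LAB space with L running from dark to light, a from green (-a) to red (+a), and b from blue (-b) to yellow (+b) (perceptually uniform).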
🏗️ Real-World Deployment

In production, HSV colour filtering is still used for fast pre-processing, e.g. isolating red traffic lights before running a neural network detector. It's computationally cheap and works reliably when lighting is controlled. LAB is used in image quality assessment and style transfer where perceptual accuracy matters.

A filter (or kernel) is a small matrix, typically 3×3 or 5×5, that slides across an image computing a weighted sum at each position. This operation is called convolution, and it's the single most important operation in all of computer vision. Classical filters are hand-designed for specific effects. Deep learning CNNs learn their filters from data, but the underlying mathematical mechanism is identical.

Gaussian Blur

Smooths out noise by averaging each pixel with its neighbours using bell-shaped weights. The centre pixel has the highest weight; further pixels contribute less.

  • Reduces high-frequency noise
  • Pre-processing step before edge detection
  • Kernel: [[1,2,1],[2,4,2],[1,2,1]] / 16

Sharpening

Enhances edges by amplifying the centre pixel and subtracting neighbours, effectively adding the difference between the pixel and its surroundings.

  • Large centre weight (5), negative neighbours (-1)
  • Kernel: [[0,-1,0],[-1,5,-1],[0,-1,0]]
  • Increases contrast at edges

Sobel Filter

Detects edges by computing the intensity gradient. Separate kernels for horizontal and vertical edges.

  • Horizontal: [[-1,-2,-1],[0,0,0],[1,2,1]]
  • Vertical: [[-1,0,1],[-2,0,2],[-1,0,1]]
  • Foundation for Canny edge detection

Median Filter

Replaces each pixel with the median of its neighbourhood. Non-linear, so not technically a convolution.

  • Excellent for salt-and-pepper noise
  • Preserves edges better than Gaussian
  • Cannot be learned by a standard CNN layer
Figure: Classical image filters as hand-designed convolution kernels. Panels: the original image (a light-to-dark edge plus noise); Gaussian blur ([[1,2,1],[2,4,2],[1,2,1]] ÷ 16); Sobel edge ([[-1,0,1],[-2,0,2],[-1,0,1]], vertical edges); sharpen ([[0,-1,0],[-1,5,-1],[0,-1,0]], edges enhanced). Classical CV hand-designs kernels; deep learning learns them from data. Same convolution operation, different source of weights.

The convolution operation is identical in classical CV and deep learning. The only difference: classical engineers design the kernel weights by hand, while CNNs learn them via backpropagation. This single insight connects 40 years of computer vision history.
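The sliding weighted sum can be written out directly. The sketch below is a deliberately slow, loop-based NumPy version applied to a made-up test image, using the Gaussian and vertical Sobel kernels quoted above; the function name is hypothetical. Like most CV and deep learning libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding), taking a weighted sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Kernels quoted in this chapter.
gaussian = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0
sobel_vertical = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

# Hypothetical 5x6 test image: dark left half (0), bright right half (10).
image = np.zeros((5, 6))
image[:, 3:] = 10.0

blurred = filter2d(image, gaussian)
edges = filter2d(image, sobel_vertical)
print(edges)
```

The Sobel response is non-zero only in the columns that straddle the dark-to-bright boundary. A CNN layer performs the same loop, except that the kernel values are parameters updated by backpropagation.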

Edges are boundaries between regions of different intensity: they mark where objects begin and end, where surfaces change orientation, and where textures shift. Edges are the most information-dense locations in an image: they encode shape, structure, and object boundaries while discarding uniform regions that carry little useful signal.

The Canny Edge Detector (John Canny, 1986) remains the gold standard for classical edge detection. It's a five-step pipeline, each step carefully designed to balance noise rejection against edge localisation. Many classical CV systems used Canny as a preprocessing step, from document scanning to lane detection to augmented reality registration.

Canny Edge Detector: the 5-step classical edge detection pipeline

  1. Gaussian blur: reduce noise (slightly blurred image)
  2. Sobel gradient: find intensity changes (gradient magnitudes)
  3. Non-max suppression: thin edges to 1 px (thin lines)
  4. Double threshold: separate strong from weak edges (classified edges)
  5. Hysteresis: connect weak edges to strong ones (clean edges)

Input: noisy image → Output: clean 1-pixel-wide edge map. Still used today in document scanning, AR, and lane detection.
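The last two steps of the pipeline (double threshold, then hysteresis) are the least obvious, so here is a minimal NumPy sketch of just those two. It assumes a gradient-magnitude array is already available; the function name, thresholds, and test data are hypothetical, and production implementations are more efficient.

```python
import numpy as np

def hysteresis_threshold(magnitude, low, high):
    """Keep strong pixels, plus weak pixels connected (8-neighbourhood) to a kept pixel."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    edges = strong.copy()
    h, w = edges.shape
    while True:
        padded = np.pad(edges, 1)  # pad with False so border pixels have 8 neighbours
        near_edge = np.zeros_like(edges)
        for di in range(3):
            for dj in range(3):
                near_edge |= padded[di:di + h, dj:dj + w]
        grown = edges | (weak & near_edge)
        if np.array_equal(grown, edges):
            return edges
        edges = grown

# Hypothetical 1x6 gradient magnitudes: one strong pixel with weak pixels on both sides.
magnitude = np.array([[0.3, 0.3, 0.9, 0.0, 0.3, 0.3]])
edges = hysteresis_threshold(magnitude, low=0.2, high=0.8)
print(edges.astype(int))
```

The two weak pixels on the left survive because they chain back to the strong pixel; the two on the right are dropped because a below-threshold gap separates them from it.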

The Harris Corner Detector extends edge detection to find corners: points where intensity changes significantly in two directions simultaneously. Corners are more distinctive than edges (an edge looks the same along its length), making them better landmarks for matching between images. Harris corners are still used in camera calibration and simple tracking systems.

Before deep learning, the central challenge of computer vision was: how do you represent an image patch as a compact, distinctive vector that's robust to scale, rotation, and illumination changes? Researchers hand-crafted feature descriptors that encode local image structure; these dominated from roughly 2000 to 2012.

SIFT (2004)

Scale-Invariant Feature Transform (David Lowe)

  • 128-dimensional descriptor per keypoint
  • Invariant to scale, rotation, minor illumination
  • Used in: panorama stitching, SLAM, image matching

HOG (2005)

Histogram of Oriented Gradients (Dalal & Triggs)

  • Divide image into cells, count gradient orientations
  • Concatenate histograms → feature vector
  • Used in: pedestrian detection, object recognition

ORB (2011)

Oriented FAST + Rotated BRIEF, patent-free

  • Fast, rotation-invariant binary descriptor
  • 10–100× faster than SIFT
  • Used in: real-time mobile feature matching
Figure: HOG descriptor, where gradient orientation histograms capture local shape. Pipeline: 64 × 128 px input region → divide into 8×8 cells and compute gradients → per-cell orientation histograms (HOG visualisation) → feature vector of 3,780 dimensions for a 64×128 window → feed to an SVM for classification. HOG counts gradient orientations per cell to build a distinctive descriptor of shape; it powered pedestrian detection from 2005 until deep learning replaced it around 2014.
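The 3,780 figure can be reproduced with a little arithmetic. The sketch below assumes the commonly used Dalal and Triggs settings, which this chapter does not spell out: 9 orientation bins per cell and 2×2-cell blocks that slide one cell at a time. The function name is hypothetical.

```python
def hog_descriptor_length(width=64, height=128, cell=8, block=2, bins=9):
    """Length of a HOG vector built from overlapping, block-normalised cells."""
    cells_x, cells_y = width // cell, height // cell   # 8 x 16 cells
    blocks_x = cells_x - block + 1                     # blocks slide by one cell
    blocks_y = cells_y - block + 1
    values_per_block = block * block * bins            # 4 cells x 9 bins
    return blocks_x * blocks_y * values_per_block

print(hog_descriptor_length())  # 7 * 15 * 36 = 3780
```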

The classical computer vision pipeline dominated the field from approximately 1980 to 2012. Every step was designed and tuned by hand: a brittle, domain-specific process that required deep expertise and did not generalise well across tasks or visual domains.

  1. Raw Image: camera input
  2. Preprocessing: resize, blur, normalise
  3. Feature Extraction: SIFT, HOG, Haar
  4. Feature Selection: PCA, filter
  5. Classification: SVM, AdaBoost
  6. Post-processing: NMS, smoothing
  7. Output: label, bbox
⚠️ Why this pipeline failed

Each stage was optimised independently, and features that were good for one task (pedestrians) performed poorly on another (faces, cars, medical images). Every new domain required starting over: new features, new tuning, new expertise. This is why end-to-end learning was so revolutionary.

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large-Scale Visual Recognition Challenge with AlexNet, a deep convolutional neural network trained on two GPUs. It achieved a top-5 error rate of 15.3%, compared to 26.2% for the best classical method. That 41% relative reduction in error ((26.2 − 15.3) / 26.2 ≈ 0.41) was the most dramatic result in the competition's history, and it permanently changed computer vision.

Three fundamental advantages explain why deep learning dominates:

🧠

Learned Features

No manual feature engineering. Optimal representations emerge automatically from data β€” the network discovers what matters.

πŸ”—

End-to-End

Directly maps pixels to outputs. No intermediate representation bottleneck β€” the entire pipeline is optimised jointly.

πŸ“ˆ

Scale

Performance improves with more data and compute. Classical methods plateau β€” deep learning keeps getting better.

Classical CV | Deep Learning CV
Hand-crafted features (HOG, SIFT) | Learned features from data
Domain expertise required | No domain expertise needed
Features optimised separately from classifier | End-to-end optimisation
Plateaus with more data | Improves with more data
Brittle to domain shift | Generalises across domains
Fast inference, tiny models | Slower inference, large models
Figure: ImageNet competition top-5 error by year; deep learning dominates from 2012. 2010: 28%; 2011: 26%; 2012: 15.3% (AlexNet, the deep learning breakthrough, a 41% drop in one year); 2013: 11.2%; 2014: 6.7%; 2015: 3.57% (ResNet, below the ~5% human level); 2016: 2.99%; 2017: 2.25%. Before 2012, classical CV with hand-designed features; from 2012, deep learning with features learned from data.
🏗️ Real-World Deployment

Despite deep learning's dominance, classical CV is not dead. Edge devices with limited compute (microcontrollers, drones) still use Canny, HOG, and ORB. Classical methods also serve as fast pre-filters before running expensive neural networks, e.g. using simple motion detection to trigger a deep learning classifier only when something moves in frame.

Chapter 6.1: Summary

  • Images are (C, H, W) tensors in PyTorch: convert to float32 in [0,1] (and normalise) before feeding to neural networks
  • RGB mixes colour and brightness; HSV separates hue from value for colour-based algorithms
  • Classical filters are hand-designed convolution kernels; CNNs learn these automatically from data
  • Canny edge detector: 5 steps from blur → gradient → NMS → threshold → hysteresis
  • HOG and SIFT: hand-crafted local feature descriptors that dominated pre-2012 computer vision
  • AlexNet (2012): 41% error reduction proved end-to-end learned features beat manual engineering
Chapter 6.2
CNN Architectures for Vision

From AlexNet's breakthrough in 2012 to EfficientNet's compound scaling in 2019, each generation of CNN architecture solved a specific problem: depth, efficiency, scale. Understanding why each design was invented is more important than memorising layer counts.

Before diving into specific architectures, recall the three inductive biases that make CNNs ideal for images (covered in Domain 4, Ch 4.5):

Local Connectivity

Each neuron sees only a small spatial patch. This exploits spatial locality: nearby pixels are more related than distant ones.

Weight Sharing

The same filter slides across every position. This gives translation equivariance: a cat is detected regardless of where it appears.

Hierarchical Representation

Stacked conv layers build edges → textures → parts → objects. Each layer composes features from the layer below.

The core building block pattern used in almost every CNN: Conv → BatchNorm → ReLU → Pooling. As you go deeper, spatial dimensions shrink (via pooling/stride) while channel depth grows (more filters).

Figure: Feature map evolution, where spatial dimensions shrink (pooling) and channel depth grows (more filters). 224×224×3 input → 56×56×64 (Conv1+Pool) → 28×28×128 (Conv2+Pool) → 14×14×256 (Conv3+Pool) → 7×7×512 (Conv4+Pool) → FC 4096 → FC 4096 → 1000-way softmax.
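The shrinking spatial sizes follow from the standard output-size formula for convolution and pooling layers, floor((n + 2p − k) / s) + 1. The sketch below reproduces the 224 → 56 → 28 → 14 → 7 sequence from the figure under assumed layer settings: a ResNet-style stem (7×7 stride-2 conv followed by a 3×3 stride-2 max-pool) and three further stride-2 stages. Those settings are an illustration, not something the figure specifies.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

size = 224
sizes = [size]

# Assumed stem: 7x7 conv with stride 2, then 3x3 max-pool with stride 2.
size = conv_output_size(size, kernel=7, stride=2, padding=3)   # 224 -> 112
size = conv_output_size(size, kernel=3, stride=2, padding=1)   # 112 -> 56
sizes.append(size)

# Three more stages, each halving resolution with a stride-2 3x3 operation.
for _ in range(3):
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [224, 56, 28, 14, 7]
```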

AlexNet (Krizhevsky et al., 2012) was the CNN that launched the deep learning revolution. It didn't just beat the competition on ImageNet, it obliterated it, reducing top-5 error from the previous year's best of 25.8% to 15.3%. Every key idea it introduced became standard practice.

AlexNet: Key Innovations

  • GPU training: trained on 2 GTX 580 GPUs (an early large-scale demonstration of practical GPU training)
  • ReLU activation: reported to train about 6× faster than tanh in the paper's CIFAR-10 experiment
  • Dropout (p=0.5) in FC layers: regularisation against overfitting
  • Data augmentation: random crops, horizontal flips, colour jitter
  • Architecture: 5 conv + 3 FC layers, 60M parameters
  • 11×11 first-layer filters: large receptive field to capture coarse features

VGG: The "Beautiful" Architecture

  • 3×3 convolutions only: architectural simplicity as a virtue
  • Two 3×3 convs = same receptive field as one 5×5, but fewer parameters and more non-linearity
  • VGG-16: 16 weight layers, 138M parameters
  • VGG-19: 19 layers, 144M parameters
  • Still widely used as a feature extractor backbone
  • Simple, understandable: the "ImageNet of architectures"
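The claim that two 3×3 convolutions match one 5×5 in receptive field while using fewer parameters can be checked with a short calculation. The channel count of 256 is an arbitrary illustrative choice, biases are ignored, and the helper names are hypothetical.

```python
def conv_weights(kernel, channels):
    """Weights in a kernel x kernel conv with `channels` inputs and outputs (no bias)."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernel, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convs of the same kernel size."""
    return 1 + num_layers * (kernel - 1)

C = 256  # illustrative channel count
two_3x3 = 2 * conv_weights(3, C)
one_5x5 = conv_weights(5, C)

print(stacked_receptive_field(3, 2))   # 5, same as a single 5x5
print(two_3x3, one_5x5)                # 1179648 vs 1638400
print(round(two_3x3 / one_5x5, 2))     # 0.72
```

The 18:25 ratio holds for any channel count, and the stacked version also gets a second non-linearity between the two layers.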
Figure: VGG-16, with 5 blocks of 3×3 convolutions and 138M parameters in total. Block 1: 64 channels, 224→112; Block 2: 128 channels, 112→56; Block 3: 256 channels, 56→28; Block 4: 512 channels, 28→14; Block 5: 512 channels, 14→7; then FC 4096 → FC 4096 → 1000. Each block is 3×3 convolutions followed by max-pooling. Only 3×3 filters throughout: architectural simplicity as a virtue.
Architecture | Year | Layers | Params | Top-5 Error | Kernel Sizes | Key Innovation
LeNet-5 | 1998 | 5 | 60K | n/a (about 1% top-1 error on MNIST) | 5×5 | First practical CNN
AlexNet | 2012 | 8 | 60M | 15.3% | 11×11, 5×5, 3×3 | GPU, ReLU, Dropout
VGG-16 | 2014 | 16 | 138M | 7.3% | 3×3 only | Pure 3×3, deep simple
VGG-19 | 2014 | 19 | 144M | 7.1% | 3×3 only | Even deeper VGG
Why VGG Still Matters

Despite being "outdated" in accuracy, VGG is still the default feature extractor for perceptual loss (style transfer, super-resolution) and neural texture synthesis. Its intermediate features are remarkably good at capturing visual similarity.

Szegedy et al. (Google, 2014) asked: how do you go deeper more efficiently? VGG's approach of stacking 3×3 convolutions worked, but parameters and computation grew together. The Inception module solved this with a radically different idea.

The Inception module applies multiple filter sizes (1×1, 3×3, 5×5) in parallel at each layer, plus a max-pooling branch. Outputs are concatenated along the channel dimension. The network learns which spatial scale to attend to at each layer.

The critical insight: 1×1 convolutions as dimensionality reduction. Before the expensive 3×3 and 5×5 convolutions, a 1×1 "bottleneck" conv reduces channel count dramatically. This let GoogLeNet get by with just 5M parameters, 12× fewer than AlexNet, with better accuracy.
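To see why the 1×1 bottleneck matters, compare the multiplications per output position for a 5×5 branch with and without it. The channel counts below (192 in, 16 after the bottleneck, 32 out) are illustrative assumptions rather than figures from this chapter.

```python
def conv_mults(kernel, c_in, c_out):
    """Multiplications per output position for a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out

C_IN, C_MID, C_OUT = 192, 16, 32  # illustrative channel counts

direct = conv_mults(5, C_IN, C_OUT)                                      # 5x5 applied directly
bottleneck = conv_mults(1, C_IN, C_MID) + conv_mults(5, C_MID, C_OUT)    # 1x1 reduce, then 5x5

print(direct, bottleneck)              # 153600 vs 15872
print(round(direct / bottleneck, 1))   # 9.7
```

With these numbers, almost all the saving comes from running the expensive 5×5 kernel over 16 channels instead of 192.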

Figure: Inception module, with parallel multi-scale convolutions and 1×1 bottlenecks. Four parallel paths from the previous layer's output: (1) 1×1 conv; (2) 1×1 reduction → 3×3 conv; (3) 1×1 reduction → 5×5 conv; (4) 3×3 pool → 1×1 conv. All outputs are concatenated along channels, so the network learns which scale to use while the 1×1 bottlenecks reduce cost.

GoogLeNet Efficiency: 22 layers · 5M parameters · 6.7% top-5 error. That is 12× fewer parameters than AlexNet (60M) with less than half its top-5 error (6.7% vs 15.3%). The 1×1 bottleneck is the key.

ResNet (He et al., 2015) solved the most puzzling problem in deep learning at the time: deeper networks had higher training error. This was not overfitting; the networks simply couldn't learn. The solution was deceptively simple.

The residual block learns the change rather than the full mapping: y = F(x) + x. The skip connection allows gradients to flow directly through the identity path, enabling 50, 101, and even 152-layer networks.
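A toy version of y = F(x) + x makes the identity path concrete. This NumPy sketch uses a hypothetical two-layer fully connected F instead of convolutions, purely to keep it short; real ResNet blocks use conv, batch norm, and ReLU layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is a small two-layer transform (illustrative only)."""
    f = relu(x @ w1) @ w2
    return f + x  # skip connection: add the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # hypothetical batch of 4 feature vectors

# With all-zero weights F(x) = 0, so the block is exactly the identity.
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)

# With small random weights the block outputs x plus a learned correction.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)

print(np.allclose(y_identity, x))  # True
```

Because the block can fall back to the identity by driving F towards zero, stacking more blocks should not make the training error worse, which is the intuition behind the degradation fix.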

DenseNet (Huang et al., 2017) extended this idea: connect every layer to all subsequent layers within a block. With L layers in a dense block, there are L(L+1)/2 direct connections. Each layer receives feature maps from ALL preceding layers, enabling maximum feature reuse with fewer parameters.

Figure: ResNet skip connections vs DenseNet dense connectivity. ResNet block (Layers 1–4): y = F(x) + x, with a skip every 2 layers. DenseNet block (Layers 1–4): Layer 4 receives the concatenation [x₁; x₂; x₃], and every layer feeds all subsequent layers, giving L(L+1)/2 connections for maximum feature reuse.

ResNet Variants

  • ResNet-50: 25.6M params, 76.1% top-1 (the workhorse)
  • ResNet-101: 44.5M params, 77.4% top-1
  • ResNet-152: 60.2M params, 78.3% top-1
  • Bottleneck block: 1×1→3×3→1×1 (reduces computation)

DenseNet Advantages

  • Feature reuse: fewer parameters than ResNet for same accuracy
  • Better gradient flow: direct paths to every layer
  • Implicit deep supervision: early layers get strong gradients
  • Growth rate k=32: each layer adds 32 new channels
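The connection count and the effect of the growth rate can be sketched numerically. The L(L+1)/2 formula and k=32 come from the text above; the 64 input channels to the block are an assumed value for illustration, and the helper names are hypothetical.

```python
def dense_connections(num_layers):
    """Direct connections in a dense block with `num_layers` layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2

def input_channels(layer_index, block_in_channels, growth_rate):
    """Channels seen by layer `layer_index` (1-based): block input plus all earlier outputs."""
    return block_in_channels + growth_rate * (layer_index - 1)

K0, K = 64, 32  # K0 is an assumed block input width; K is the growth rate from the text

print(dense_connections(4))                            # 10
print([input_channels(i, K0, K) for i in range(1, 5)]) # [64, 96, 128, 160]
```

Each layer only adds k new channels, but its input grows linearly with depth because it sees every earlier feature map.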

ResNet and VGG are too heavy for mobile phones, embedded systems, and real-time applications. MobileNet (Howard et al., 2017) introduced depthwise separable convolutions, splitting one expensive operation into two cheap ones.

Standard Convolution

  • One operation: K×K×Cin×Cout filter
  • Cost: K²·Cin·Cout·H·W
  • For 3×3, 256→512: 1,179,648 ops/pixel

Depthwise Separable

  • Step 1, Depthwise: K×K filter per channel (Cin filters)
  • Step 2, Pointwise: 1×1 conv mixing channels (Cin→Cout)
  • Cost: (K²·Cin + Cin·Cout)·H·W
  • For 3×3, 256→512: 133,376 ops/pixel, ~9× cheaper
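The two cost formulas are easy to check numerically. This sketch evaluates them per output pixel for the 3×3, 256→512 case quoted above; the function names are hypothetical.

```python
def standard_cost(k, c_in, c_out):
    """Multiply-accumulates per output pixel for a standard KxK convolution."""
    return k * k * c_in * c_out

def separable_cost(k, c_in, c_out):
    """Depthwise (KxK per channel) plus pointwise (1x1 channel mixing) cost per pixel."""
    return k * k * c_in + c_in * c_out

std = standard_cost(3, 256, 512)
sep = separable_cost(3, 256, 512)

print(std, sep)             # 1179648 133376
print(round(std / sep, 2))  # 8.84
```

The ratio approaches K² = 9 as the output channel count grows, which is where the "~9×" figure for 3×3 kernels comes from. Note that this counts operations; measured speedups on real hardware are usually smaller.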
Figure: Depthwise separable convolution, MobileNet's efficiency trick (~9× fewer operations). Standard convolution: H×W×Cin input → one K×K×Cin×Cout filter bank → H×W×Cout output, cost K²·Cin·Cout·H·W. Depthwise separable: H×W×Cin input → K×K filter per channel (depthwise) → 1×1 channel mixing (pointwise) → H×W×Cout output, cost (K²·Cin + Cin·Cout)·H·W. For K=3: ~9× fewer ops.
MobileNetV2: Inverted Residuals

MobileNetV2 (2018) added inverted residual blocks: expand channels with 1×1 → depthwise 3×3 → compress back with 1×1. The skip connection goes between the narrow layers (inverted compared to ResNet). Also: SqueezeNet, ShuffleNet, and GhostNet target edge deployment.

Tan & Le (Google, 2019) observed that previous architectures scaled only one dimension at a time: depth (more layers), width (more channels), or resolution (larger images). EfficientNet scales all three simultaneously with a fixed compound coefficient.

Compound Scaling: depth $d = \alpha^{\phi}$, width $w = \beta^{\phi}$, resolution $r = \gamma^{\phi}$, subject to $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ (the FLOP budget roughly doubles per step). $\phi$ controls how much to scale.
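A short numerical sketch of the compound scaling rule follows. The coefficients α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper for the B0 baseline, not something stated in this chapter, so treat them as an assumption; the function name is hypothetical.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed coefficients (depth, width, resolution)

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs scale roughly with depth * width^2 * resolution^2, so each step costs about 2x.
flops_factor_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor_per_step, 2))  # 1.92

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2))
```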
Figure: EfficientNet compound scaling, with width, depth, and resolution scaled jointly. Panels and the accuracy shown for each: baseline small model (76%); width scaling, more channels (78%); depth scaling, more layers (78%); resolution scaling, bigger images (78%); compound scaling of all three, i.e. EfficientNet (84.4%).

Neural Architecture Search (NAS) was used to find the optimal base architecture (EfficientNet-B0). NAS automates architecture design by searching over a space of possible layers, connections, and hyperparameters. The EfficientNet family (B0–B7) uses the same base architecture at different compound scales.

Model | Top-1 Acc | Params | FLOPs | Key Feature
ResNet-50 | 76.0% | 25M | 4.1B | Skip connections
EfficientNet-B0 | 77.1% | 5.3M | 0.39B | NAS + compound scaling
EfficientNet-B3 | 81.6% | 12M | 1.8B | Medium scale
EfficientNet-B7 | 84.4% | 66M | 37B | Largest, SOTA 2019

Training a CNN requires millions of images, but labelled data is expensive. Data augmentation creates diverse training samples by applying label-preserving transformations. A flipped cat is still a cat. A slightly rotated stop sign is still a stop sign.

Critical Rule

Augmentations must be label-preserving. Rotating a "6" by 180° turns it into a "9", and a mirrored digit is not a valid digit at all, so avoid flips and large rotations for digit recognition. Always think about whether the transform preserves meaning for your specific task.

Figure: Data augmentation, showing 8 transforms to increase training diversity. Panels: Original, H-Flip, Crop, Color Jitter, Rotation, Blur, Cutout, CutMix, grouped into standard (always use) and advanced (for stronger training). The standard augmentations preserve the label (a flipped cat is still a cat); CutMix blends the labels of the two source images.

Standard Augmentations (Always Use)

  • RandomResizedCrop: random area crop + resize to target
  • RandomHorizontalFlip: 50% chance mirror
  • ColorJitter: brightness, contrast, saturation, hue
  • Normalize: ImageNet mean/std (technically preprocessing, always required)

Advanced Augmentations

  • RandAugment: randomly pick N of 14 transforms; simple and effective
  • Cutout / RandomErasing: mask a random rectangle, which forces robustness
  • CutMix: paste patch from one image onto another, blend labels
  • Mixup: linear blend of two images + their labels
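Mixup is the easiest of the advanced augmentations to write out by hand. The sketch below blends two made-up images and their one-hot labels with a fixed mixing weight so the result is easy to inspect; in practice the weight is usually drawn at random per batch. All names and values are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Linearly blend two images and their (one-hot) labels with weight lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Hypothetical 3x2x2 images and one-hot labels for a two-class problem.
cat_image, cat_label = np.zeros((3, 2, 2)), np.array([1.0, 0.0])
dog_image, dog_label = np.ones((3, 2, 2)), np.array([0.0, 1.0])

mixed_image, mixed_label = mixup(cat_image, cat_label, dog_image, dog_label, lam=0.7)
print(mixed_label)  # [0.7 0.3]
```

The model is then trained against the soft label, which is why Mixup and CutMix are not label-preserving in the strict sense: they change the image and the target together.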
import torch
import torchvision.transforms as T
import torchvision.transforms.v2 as T2

# Standard training augmentation (ImageNet-style)
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),   # random crop, resize to 224
    T.RandomHorizontalFlip(p=0.5),                 # 50% chance flip
    T.ColorJitter(brightness=0.4, contrast=0.4,    # colour distortion
                  saturation=0.4, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Strong augmentation (RandAugment + random erasing)
strong_transform = T2.Compose([
    T2.RandomResizedCrop(224),
    T2.RandomHorizontalFlip(),
    T2.RandAugment(num_ops=2, magnitude=9),        # 2 randomly chosen ops per image
    T2.ToImage(),                                  # PIL → tensor image (needed before ToDtype/Normalize)
    T2.ToDtype(torch.float32, scale=True),         # uint8 [0,255] → float32 [0,1]
    T2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),
    T2.RandomErasing(p=0.25),                      # cutout; operates on tensors, so it comes last
])

# Validation: no augmentation, just resize + normalise
val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
Test-Time Augmentation (TTA)

At inference, augment the test image multiple times (flip, multi-crop), run each version through the model, and average the predictions. TTA often boosts accuracy by around 1–2 percentage points with no retraining, at the cost of one extra forward pass per augmented view. Standard in competitions and medical imaging.
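The averaging step can be sketched without any deep learning framework. Below, `toy_model` is a hypothetical stand-in for a trained classifier: it is deliberately sensitive to left/right position, so the effect of averaging over a flip is visible.

```python
import numpy as np

def predict_with_tta(model_fn, image):
    """Average class probabilities over the original and horizontally flipped image."""
    views = [image, image[:, :, ::-1]]  # (C, H, W): reverse the width axis
    probs = np.stack([model_fn(view) for view in views])
    return probs.mean(axis=0)

def toy_model(image):
    """Hypothetical two-class 'model': softmax over mean brightness of each image half."""
    half = image.shape[2] // 2
    logits = np.array([image[:, :, :half].mean(), image[:, :, half:].mean()])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical 3x4x4 image that is bright on the left and dark on the right.
image = np.zeros((3, 4, 4))
image[:, :, :2] = 1.0

single = toy_model(image)
averaged = predict_with_tta(toy_model, image)
print(single.round(3), averaged.round(3))
```

Here the flip exactly cancels the toy model's positional bias. With a real model the views disagree less dramatically, and averaging mainly smooths out small prediction noise.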

Almost never train from scratch. This is the single most important practical rule in computer vision. ImageNet pre-trained models have already learned general visual features (edges, textures, shapes) that transfer remarkably well to nearly any image task.

🧊

Feature Extraction

Freeze the entire backbone. Train only a new classification head.

  • Best when: small dataset (<1K images)
  • Training: very fast (few params)
  • Risk: underfitting if task is very different
πŸ”“

Partial Fine-Tuning

Freeze early layers, fine-tune later layers + head.

  • Best when: moderate dataset (1K–10K)
  • Rationale: early layers = universal edges; later layers = task-specific
  • Common: freeze first 2–3 blocks
πŸ”₯

Full Fine-Tuning

Unfreeze all layers with a small learning rate.

  • Best when: large dataset (>10K images)
  • Key: use differential LR β€” backbone 10–100Γ— smaller than head
  • Risk: catastrophic forgetting if LR too high
Learning Rate Rule of Thumb: pre-trained backbone 1e-5 to 1e-4 · new head 1e-3 to 1e-2. The backbone has already learned good features, and large updates would destroy them. The new head needs to learn from scratch.
import torch
import torch.nn as nn
import torchvision.models as models

# --- Option 1: Feature Extraction (freeze backbone) ---
model = models.resnet50(weights='IMAGENET1K_V2')   # pretrained
for param in model.parameters():
    param.requires_grad = False                    # freeze everything

# Replace final classification layer
num_classes = 10                                   # your task: 10 classes (not 1000)
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only model.fc has requires_grad=True, so there are very few parameters to train

# --- Option 2: Full Fine-tuning with differential LR ---
model2 = models.resnet50(weights='IMAGENET1K_V2')
model2.fc = nn.Linear(model2.fc.in_features, num_classes)

# Differential learning rates: the backbone gets a 100× smaller LR than the head
backbone_params = [p for n, p in model2.named_parameters() if not n.startswith('fc.')]
head_params = list(model2.fc.parameters())

optimizer = torch.optim.AdamW([
    {'params': backbone_params, 'lr': 1e-5},       # small LR for pre-trained layers
    {'params': head_params, 'lr': 1e-3},           # larger LR for new head
], weight_decay=1e-4)
Common Mistake

Forgetting to match the pre-processing. If the pre-trained model was trained with ImageNet normalisation (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), you must use the same normalisation at inference. Mismatched preprocessing is the #1 silent bug in transfer learning.

Chapter 6.2: Summary

  • AlexNet (2012): ReLU + dropout + GPU training kicked off the deep learning era in CV
  • VGG: 3×3 convolutions only, architectural simplicity with depth; still the go-to feature extractor for perceptual loss
  • Inception module: parallel multi-scale convolutions (1×1, 3×3, 5×5) + 1×1 bottlenecks; GoogLeNet = 22 layers, only 5M params
  • ResNet: y = F(x) + x skip connections solve the degradation problem and enable 100+ layer networks
  • DenseNet: every layer connects to all subsequent layers, giving maximum feature reuse with fewer parameters
  • MobileNet: depthwise separable convolutions need ~9× fewer operations, which is essential for mobile and edge deployment
  • EfficientNet: compound scaling (depth × width × resolution) + NAS = best accuracy/efficiency trade-off
  • Data augmentation: default to random crop + flip + colour jitter wherever they preserve the label; advanced: RandAugment, CutMix, Cutout
  • Transfer learning: start with ImageNet pre-trained weights; use differential learning rates (backbone: 1e-5, head: 1e-3)
Chapter 6.3
Object Detection

Classification tells you what. Detection tells you what AND where, outputting a variable number of bounding boxes, each with a class label and confidence score. The evolution from R-CNN's 47 seconds per image to YOLO's 90 FPS is one of the most dramatic speedups in deep learning history.

Image classification assigns ONE label to an entire image: "cat." Object detection finds MULTIPLE objects, drawing a bounding box around each and labelling it: "cat at (x,y,w,h) with 97% confidence, dog at (x',y',w',h') with 89% confidence."

The output format is a list of detections, each containing: (class_id, confidence, x_centre, y_centre, width, height). Coordinates are typically normalised to 0–1 relative to image dimensions.
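Converting the normalised centre format into pixel corner coordinates is a common first step when drawing or evaluating boxes. A minimal sketch, assuming a hypothetical 640×480 image and a single made-up detection:

```python
def to_pixel_corners(x_centre, y_centre, width, height, img_w, img_h):
    """Convert a normalised (x_centre, y_centre, w, h) box to pixel (x1, y1, x2, y2)."""
    x1 = (x_centre - width / 2) * img_w
    y1 = (y_centre - height / 2) * img_h
    x2 = (x_centre + width / 2) * img_w
    y2 = (y_centre + height / 2) * img_h
    return x1, y1, x2, y2

# Hypothetical detection: (class_id, confidence, x_centre, y_centre, width, height).
detection = (0, 0.97, 0.5, 0.5, 0.5, 0.5)
class_id, confidence, *box = detection

print(to_pixel_corners(*box, img_w=640, img_h=480))  # (160.0, 120.0, 480.0, 360.0)
```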

Variable Object Count

An image may contain 0, 1, or 100 objects. The model must handle all cases, unlike classification, which always outputs one label.

Multi-Scale Objects

A pedestrian 20px tall and a bus 400px wide must both be detected. Scale variation is the hardest challenge.

Occlusion & Overlap

Objects hide behind other objects. The detector must still find partially visible objects and avoid merging overlapping ones.

Figure: Vision task hierarchy, Classification → Detection → Segmentation. Classification ("CAT"): what is in the image? Detection (boxes labelled Cat 0.97, Dog 0.89): what AND where? Segmentation (Cat and Dog masks): what, where, and exact shape?

The naive approach: slide a window across the image at multiple scales and aspect ratios, and run a classifier on each crop. This is conceptually simple but catastrophically slow: hundreds of thousands of crops per image, each requiring a full forward pass.

Anchor boxes solved this elegantly. Instead of sliding a window, divide the feature map into a grid and place pre-defined bounding box shapes (anchors) at each cell. The model predicts offsets from these anchors, (δx, δy, δw, δh), plus an objectness score and class probabilities. Anchors are designed to cover common shapes: wide rectangles for cars, tall rectangles for people, squares for faces.
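How the predicted offsets turn an anchor into a final box depends on the detector. The sketch below assumes the parameterisation used by Faster R-CNN (centre shifts scaled by the anchor size, log-space width and height); the chapter does not specify one, and YOLO variants use slightly different formulas. The anchor values are made up.

```python
import math

def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) offsets to an anchor given as (x_centre, y_centre, w, h)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa          # shift the centre by a fraction of the anchor width
    y = ya + dy * ha          # shift the centre by a fraction of the anchor height
    w = wa * math.exp(dw)     # scale width (log-space offset keeps it positive)
    h = ha * math.exp(dh)     # scale height
    return x, y, w, h

tall_anchor = (100.0, 100.0, 50.0, 80.0)  # hypothetical "person"-shaped anchor

unchanged = decode_box(tall_anchor, (0.0, 0.0, 0.0, 0.0))
adjusted = decode_box(tall_anchor, (0.1, 0.0, math.log(2.0), 0.0))
print(unchanged, adjusted)
```

Zero offsets return the anchor itself, so the network only has to learn small corrections to a shape that is already roughly right.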

Figure: Anchor boxes, with 3 pre-defined shapes per grid cell for which the model predicts offsets. The highlighted cell shows three anchor shapes: wide (car), tall (person), square (face). Per-anchor prediction: δx, δy, δw, δh (box offsets from the anchor); objectness (P(any object here)); class_1 ... class_C (class probabilities). Each cell predicts B anchors × (5 + C) values, e.g. 3 anchors × (5 + 80) = 255 for COCO.

The R-CNN family represents the two-stage approach to detection: first propose regions that might contain objects, then classify each region. Three papers over two years went from painfully slow to near real-time (about 5 FPS).

R-CNN (2014)

  • Selective search: ~2000 region proposals
  • Warp each to 227×227
  • CNN feature extraction per region
  • SVM classifier + bbox regression
  • Speed: 47 sec/image

Fast R-CNN (2015)

  • CNN on entire image once → shared feature map
  • Project proposals onto feature map (RoI Pooling)
  • Classify + regress from RoI features
  • Bottleneck: selective search still external
  • Speed: 2 sec/image

Faster R-CNN (2015)

  • Replace selective search with Region Proposal Network (RPN)
  • RPN shares CNN backbone with detection head
  • End-to-end trainable
  • 73.2% mAP on PASCAL VOC
  • Speed: 0.2 sec/image (5 FPS)
Figure: R-CNN family evolution, from 2000 crops to a shared feature map plus RPN. R-CNN: image → selective search (~2000 proposals) → CNN ×2000 (the bottleneck) → SVM; 47 sec/img. Fast R-CNN: image → CNN ×1 (shared) → feature map + RoI Pool → FC + head, with selective search still external; 2 sec/img. Faster R-CNN: image → CNN → RPN on the shared feature map → FC + head; 0.2 sec/img, end-to-end trainable, RPN replaces selective search.
Two-Stage vs One-Stage

Faster R-CNN is a two-stage detector: stage 1 proposes regions (RPN), stage 2 classifies them. Two-stage detectors are generally more accurate but slower. One-stage detectors (YOLO, SSD) skip the proposal step entirely: they predict boxes and classes in a single pass.

Redmon et al. (2015) made a radical move with YOLO (You Only Look Once): frame detection as a single regression problem. Divide the image into an S×S grid and, in a single forward pass, predict all bounding boxes and class probabilities simultaneously. No proposals, no second stage: one neural network, one pass.

The result: 45 FPS on 2015 hardware, roughly 2,000× faster than R-CNN's 47 seconds per image. The trade-off was accuracy: YOLO struggled with small objects and with multiple objects falling in the same grid cell. But the speed was revolutionary for real-time applications like autonomous driving and robotics.

YOLOv3: grid division, multi-anchor predictions, multi-scale detection
[Diagram: example detections (car 0.94, person 0.91) and three detection scales: 13×13 grid → large objects, 26×26 grid → medium objects, 52×52 grid → small objects. Output at the coarsest scale: 13×13×(3×(5+80)) = 13×13×255. A single forward pass yields all detections at once.]
| Version | Year | Speed | mAP (VOC/COCO) | Key Feature |
|---|---|---|---|---|
| YOLOv1 | 2015 | 45 FPS | 63.4% VOC | Single-pass regression |
| YOLOv2 | 2016 | 40 FPS | 78.6% VOC | Anchor boxes, batch norm, multi-scale |
| YOLOv3 | 2018 | 30 FPS | 33.0% COCO | Multi-scale detection, Darknet-53 |
| YOLOv5 | 2020 | 30+ FPS (CPU) | 50.7% COCO | PyTorch, Ultralytics, easy API |
| YOLOv8 | 2023 | 80+ FPS | 53.9% COCO | Anchor-free, SOTA single-stage |
from ultralytics import YOLO
import cv2

# Load pre-trained model (downloads automatically)
model = YOLO('yolov8n.pt')  # 'n'=nano, 's'=small, 'm'=medium, 'l'=large, 'x'=xlarge

# Run inference on image
results = model('street.jpg', conf=0.5, iou=0.45)

# Parse results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])                  # class index
        conf = float(box.conf[0])              # confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box coords
        print(f"{model.names[cls]}: {conf:.2f} at [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Visualise
result_image = results[0].plot()
cv2.imwrite('detected.jpg', result_image)

# Fine-tune on custom dataset
model.train(
    data='custom_dataset.yaml',  # YAML with train/val paths and class names
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,                    # GPU device
)

SSD (Liu et al., 2016) combined YOLO's single-pass speed with an elegant multi-scale approach. Instead of predicting from just one feature map, SSD extracts predictions from 6 different feature map scales within the CNN. Early (larger) feature maps detect small objects; later (smaller) feature maps detect large objects.

This addressed YOLO's weakness with small objects: SSD300 runs at 59 FPS on 300×300 input with 74.3% mAP on PASCAL VOC 2007, giving better small-object detection at comparable speed.

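SSD: predictions from multiple feature map scales in a single pass
[Diagram: VGG backbone followed by feature maps of 38×38 (small objects), 19×19 (medium-small), 10×10 (medium), 5×5 (medium-large), 3×3 (large) and 1×1 (very large); predictions from all scales are merged by NMS into the final detections. Earlier layers have finer resolution, which suits small objects.]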

Modern Single-Stage Detectors

  • RetinaNet (2017): focal loss, which addresses class imbalance (background vs objects)
  • FCOS (2019): fully convolutional, anchor-free; predicts directly per pixel
  • DETR (2020): transformer-based detection; no anchors, no NMS needed

Two-Stage vs One-Stage Summary

  • Two-stage (Faster R-CNN): more accurate, slower (~5 FPS)
  • One-stage (YOLO, SSD): faster (30–90 FPS), slightly less accurate
  • Modern one-stage detectors have nearly closed the accuracy gap

Three concepts underpin detection evaluation: IoU measures box overlap quality, NMS removes duplicate detections, and mAP quantifies overall detector performance.

IoU (Intersection over Union)

Measures overlap between predicted and ground-truth box.

  • IoU > 0.5: correct (PASCAL VOC)
  • IoU > 0.75: stricter standard
  • COCO: average over IoU thresholds 0.5:0.05:0.95

NMS (Non-Max Suppression)

Removes duplicate overlapping boxes:

  • Sort boxes by confidence ↓
  • Keep highest-scoring box
  • Remove remaining boxes whose IoU with the kept box exceeds a threshold (e.g. 0.45)
  • Repeat for remaining boxes

mAP (Mean Average Precision)

The standard detection metric:

  • Per class: precision-recall curve
  • AP = area under PR curve
  • mAP = mean AP across all classes
  • COCO mAP averages over 10 IoU thresholds
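Detection Metrics
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
$$\mathrm{AP} = \int_0^1 p(r)\,dr \quad \text{(area under the precision-recall curve)}$$
$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c \quad \text{(mean over } C \text{ classes)}$$
PASCAL VOC uses IoU@0.5. COCO averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which makes it a much harder benchmark.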
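The AP integral is computed numerically in practice. The sketch below assumes detections for one class are already sorted by confidence and flagged as true or false positives at a fixed IoU threshold. It uses all-point interpolation, one common convention; benchmarks differ in the details. Function names and the toy inputs are illustrative.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one class via all-point interpolation.

    is_true_positive: flags for detections sorted by confidence, highest first.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)

    # Add sentinel points, then make precision monotonically non-increasing.
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Area under the curve: sum rectangles wherever recall increases.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

ap_car = average_precision([True, True, False, True, False], num_ground_truth=4)
ap_person = average_precision([True, False], num_ground_truth=1)
m_ap = mean_average_precision({"car": ap_car, "person": ap_person})
print(ap_car, ap_person, m_ap)
```

A COCO-style score would repeat this at each IoU threshold from 0.5 to 0.95 and average the results.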
IoU and Non-Maximum Suppression: removing duplicate detections
[Diagram, left: IoU calculation. Ground-truth box A and prediction B with intersection = 4900 px² and union = 17100 px², so IoU = 4900/17100 ≈ 0.29, below 0.5 and therefore an incorrect detection. Right: NMS step by step. Step 1: all predictions, with scores 0.95, 0.87, 0.72, 0.55. Step 2: keep 0.95. Step 3: remove boxes with IoU > 0.45 against it: 0.87 (IoU = 0.82) and 0.55 (IoU = 0.71) are removed, 0.72 (IoU = 0.12) is kept. Result: 2 clean detections (0.95 and 0.72); overlapping duplicates are removed and distinct objects kept.]
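A minimal pure-Python sketch of IoU and greedy NMS follows. The box coordinates are made up for illustration: the first pair is chosen to reproduce the 4900/17100 overlap from the worked example, and the NMS boxes only mimic its pattern (two duplicates of the top box plus one distinct object), not its exact IoU values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

ground_truth = (0, 0, 100, 110)
prediction = (30, 40, 130, 150)
example_iou = iou(ground_truth, prediction)  # intersection 4900, union 17100

boxes = [(10, 10, 110, 110), (15, 12, 115, 112), (200, 50, 300, 150), (5, 20, 105, 120)]
scores = [0.95, 0.87, 0.72, 0.55]
kept = nms(boxes, scores)
print(round(example_iou, 2), kept)
```

Production code would normally use a vectorised implementation, but the logic is the same loop.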
NMS Limitations

Standard NMS struggles when objects genuinely overlap (e.g., a crowd of people). It may suppress correct detections. Solutions: Soft-NMS (reduces confidence instead of removing), DETR (transformer-based, no NMS needed), or learnable NMS modules.

Chapter 6.3: Summary

  • Detection output: list of (class, confidence, x, y, w, h) per object in image
  • Anchor boxes: pre-defined shapes at each grid cell; model predicts offsets + class + objectness
  • R-CNN family: propose → extract → classify; Faster R-CNN adds RPN for end-to-end training (0.2s/img)
  • YOLO: single forward pass predicts all objects, real-time at 30–90 FPS; YOLOv8 (2023) is a leading single-stage detector
  • SSD: YOLO-like but uses multiple feature map scales → better small object detection
  • IoU: intersection / union measures predicted vs ground truth box overlap; threshold 0.5 (VOC) or 0.5:0.95 (COCO)
  • NMS: remove duplicate overlapping boxes keeping highest-confidence detections; Soft-NMS for crowded scenes
  • Modern trend: anchor-free (FCOS) and transformer-based (DETR) detectors eliminating hand-designed components
6.4
Chapter 6.4
Image Segmentation

Detection draws boxes. Segmentation colours every pixel. From U-Net's elegant encoder-decoder to Meta's Segment Anything Model, segmentation has evolved from a niche medical imaging task to a foundation capability: segment any object in any image with a single click.

All three segmentation types assign labels at the pixel level, but they differ fundamentally in what they distinguish.

Semantic Segmentation

Every pixel labelled with a class: road, car, sky, person.

  • All instances of same class get same colour
  • Two cats = one merged mask
  • Output: H×W label map (one int per pixel)

Instance Segmentation

Every pixel labelled with class AND instance ID.

  • Two cats = two separate masks (Cat₁, Cat₂)
  • Only "things" (countable objects)
  • Output: list of (mask, class, conf) per object

Panoptic Segmentation

Unifies semantic (stuff) + instance (things).

  • Every pixel: class + optional instance ID
  • "Stuff" (sky, road) + "things" (car₁, car₂)
  • Output: complete scene understanding
Semantic vs Instance vs Panoptic Segmentation
[Diagram: one input image (2 cars, 1 person, road, sky) shown three ways. Semantic: sky, road and every other class in one colour each, so same class = same colour. Instance: Car₁, Car₂, Person₁, each object with a unique mask. Panoptic: sky (stuff), road (stuff), Car₁, Car₂, so stuff + things give complete coverage. Legend: Car, Person, Sky, Road.]

The core challenge: classification networks reduce spatial resolution (pooling, striding) to build semantic understanding. Segmentation needs to restore it and predict a class for every single pixel. The solution: encoder-decoder architectures.

Encoder (Downsampling)

  • Standard CNN backbone (ResNet, VGG)
  • Loses spatial resolution via pooling/stride
  • Builds semantic understanding: "what" is here
  • 224×224 → 7×7 feature map

Decoder (Upsampling)

  • Reverses the downsampling
  • Restores spatial resolution to original size
  • Upsampling methods: bilinear interpolation (simple, no learned params) or transposed conv (learnable, can cause checkerboard artefacts)

Skip connections are critical: they connect encoder layers to decoder layers at matching resolutions. Without them, the decoder must reconstruct fine spatial detail (edges, boundaries) from the bottleneck alone, which is a lossy process. With skips, fine spatial detail flows directly from encoder to decoder.

Loss Functions

  • Pixel-wise cross-entropy: standard classification loss per pixel
  • Dice loss: 1 − 2·|A∩B| / (|A|+|B|), i.e. one minus the Dice coefficient; better for class imbalance
  • Combined: CE + Dice often used together in practice
  • Weighted CE: higher weight for rare classes (e.g., tumour vs background)
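To see why Dice-style losses help with class imbalance, the sketch below compares pixel accuracy and a soft Dice loss on a toy 10-pixel mask with only two foreground pixels. The `eps` smoothing term and the function name are assumptions made for illustration; real implementations work on tensors and often average per class.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

target = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]            # 2 "tumour" pixels out of 10
all_background = [0.0] * 10                        # predicts nothing
perfect = [float(t) for t in target]               # predicts the mask exactly
half_right = [0.0] * 7 + [1.0, 0.0, 0.0]           # finds one of the two pixels

# Pixel accuracy rewards the trivial all-background prediction...
accuracy_bg = sum(1 for p, t in zip(all_background, target) if round(p) == t) / len(target)

# ...while Dice loss stays near its maximum until foreground is actually found.
loss_bg = soft_dice_loss(all_background, target)
loss_half = soft_dice_loss(half_right, target)
loss_perfect = soft_dice_loss(perfect, target)
print(accuracy_bg, loss_bg, loss_half, loss_perfect)
```

Because the loss is normalised by the foreground size rather than the pixel count, background pixels cannot dominate it.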

Key Architectures

  • FCN (2014): first fully convolutional network; no FC layers, any input size
  • U-Net (2015): symmetric encoder-decoder + skip connections
  • DeepLab v3+ (2018): atrous convolutions + ASPP for multi-scale
  • SegFormer (2021): transformer-based encoder, MLP decoder

Ronneberger et al. (2015) designed U-Net for biomedical image segmentation, but it became the universal segmentation architecture. The U-shaped design features a symmetric encoder (contracting path) and decoder (expanding path) connected by skip connections at every level.

The critical innovation: skip connections don't just add features (like ResNet); they concatenate entire feature maps from encoder to decoder. This preserves the fine spatial detail (textures, edges) lost during downsampling. The decoder gets both the upsampled abstract features AND the original high-resolution details.

U-Net: Symmetric Encoder-Decoder with Skip Connections at Every Level
[Diagram: encoder of Conv×2 blocks with 64 channels at 256², 128 at 128², 256 at 64² and 512 at 32², each followed by pooling; a 1024-channel bottleneck at 16²; a mirrored decoder that upsamples back through 512 (32²), 256 (64²), 128 (128²) and 64 (256²) channels; and a final 1×1 conv producing the segmentation map. "Copy + concat" skip connections link each encoder level to the matching decoder level. Key insight: skip connections concatenate full feature maps, preserving edges and fine detail. The same pattern is used in Stable Diffusion.]
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """Two 3×3 conv + BatchNorm + ReLU blocks; padding=1 keeps H and W unchanged."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.conv(x)


class UNet(nn.Module):
    """Reduced 3-level U-Net for 1-channel input (H and W must be divisible by 8)."""

    def __init__(self, n_classes=2):
        super().__init__()
        # Encoder
        self.enc1 = DoubleConv(1, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bottleneck = DoubleConv(256, 512)
        # Decoder
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)  # 512 = 256 (up) + 256 (skip)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)
        self.out_conv = nn.Conv2d(64, n_classes, 1)  # 1×1 final

    def forward(self, x):
        e1 = self.enc1(x)                # skip 1
        e2 = self.enc2(self.pool(e1))    # skip 2
        e3 = self.enc3(self.pool(e2))    # skip 3
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))  # concat skip
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out_conv(d1)         # logits of shape (N, n_classes, H, W)
U-Net Beyond Medical Imaging

U-Net's encoder-decoder + skip connection pattern became foundational far beyond segmentation. Stable Diffusion uses a U-Net (with attention layers) as its denoising backbone. The same architecture that segments tumours also generates images from text prompts.

Mask R-CNN (He et al., Facebook AI, 2017) extends Faster R-CNN with a simple but powerful addition: a third prediction head that outputs a pixel-level mask for each detected object. Three heads run in parallel after RoI features are extracted:

Classification Head

FC layers → Softmax

Output: class label for this region

Box Regression Head

FC layers → (Δx, Δy, Δw, Δh)

Output: refined bounding box

Mask Head (NEW)

FCN → 28×28 binary mask per class

Output: pixel-level mask for the object

A critical improvement: RoI Align replaced RoI Pooling. RoI Pooling quantises coordinates (rounding to the nearest feature-map cell), causing spatial misalignment that matters little for bounding boxes but is catastrophic for pixel-accurate masks. RoI Align samples with bilinear interpolation: no quantisation, precise alignment.

Mask R-CNN: three prediction heads for class, box, and pixel mask
[Diagram: image → backbone (ResNet) + FPN → RPN proposals → RoI Align (no quantisation) → three parallel heads: classification (FC → Softmax → class), box regression (FC → Δx, Δy, Δw, Δh) and the mask head that is new in Mask R-CNN (FCN → 28×28×C masks). Per object: class label, refined box, pixel mask.]
Why RoI Align Matters

RoI Pooling rounds coordinates to the nearest integer on the feature map, creating up to 0.5 cells of misalignment. On a 7×7 feature map computed from a 224px image, each cell spans 32 input pixels, so that is an error of up to 16px: invisible for classification, catastrophic for pixel masks. RoI Align uses bilinear interpolation at exact floating-point positions, avoiding this quantisation error.
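The difference between rounding and interpolating can be shown on a tiny feature map. This is a simplified sketch of the sampling step only: real RoI Align averages several such samples per output bin, and the helper names here are illustrative.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x) position."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom

def rounded_sample(feature, y, x):
    """RoI Pooling-style lookup: snap the position to the nearest cell."""
    return feature[int(round(y))][int(round(x))]

# A feature map that varies linearly: value = 3*y + x.
feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]

aligned = bilinear_sample(feature, 0.4, 0.4)   # interpolates between 4 neighbours
quantised = rounded_sample(feature, 0.4, 0.4)  # snaps to cell (0, 0)

# Half a cell on a stride-32 feature map (224 px -> 7 cells) in input pixels.
stride = 224 // 7
max_error_px = 0.5 * stride
print(aligned, quantised, max_error_px)
```

On this linear map the interpolated value matches the underlying function exactly, while the rounded lookup is off by the full sub-cell shift.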

Panoptic segmentation (Kirillov et al., 2019) unifies semantic and instance segmentation into a single coherent output. Every pixel gets a class label. "Things" (countable: cars, people) also get unique instance IDs. "Stuff" (uncountable: sky, road, grass) gets class labels only.

"Things" (Instance)

  • Countable objects: person, car, dog, chair
  • Each instance gets a unique ID
  • car₁ ≠ car₂ even though both are "car"
  • Predicted by instance segmentation branch

"Stuff" (Semantic)

  • Amorphous regions: sky, road, grass, water
  • No instance distinction, just a class label
  • All sky pixels = "sky" (no sky₁, sky₂)
  • Predicted by semantic segmentation branch

Panoptic FPN (2019) attaches separate instance and semantic branches to a shared FPN backbone and merges their outputs. Mask2Former (2022) goes further with a unified architecture: it treats all segments (things + stuff) as mask queries processed by a single transformer decoder, and at publication it reported state-of-the-art results on all three segmentation tasks.

Panoptic Quality (PQ)
$$\mathrm{PQ} = \mathrm{SQ} \times \mathrm{RQ} = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP|} \times \frac{|TP|}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}$$
SQ = Segmentation Quality (average IoU of matched segments). RQ = Recognition Quality (F1 of matching). PQ combines both into a single metric.
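A small worked example may help read the formula. The sketch assumes segment matching has already been done (in the PQ definition a predicted and a ground-truth segment match when their IoU exceeds 0.5), so the inputs are just the IoUs of matched pairs plus counts of unmatched segments. The numbers are invented for illustration.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from matched-pair IoUs and unmatched segment counts."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                    # average IoU of matched segments
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)   # F1-style recognition score
    return sq * rq, sq, rq

# Two matched segments, one spurious prediction (FP), one missed segment (FN).
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))
```

Here SQ = 0.7 and RQ = 2/3, so PQ is about 0.467: good masks alone do not give a high score if segments are missed or hallucinated.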

Kirillov et al. (Meta AI, 2023) built a foundation model for segmentation: the Segment Anything Model (SAM). Trained on SA-1B (11 million images with 1.1 billion segmentation masks), SAM can segment any object in any image from a simple prompt: click a point or draw a box. Free-text prompts were explored in the paper only as a proof of concept.

Image Encoder

MAE pre-trained ViT-H (Vision Transformer Huge)

  • Encodes image into embedding
  • 256×64×64 feature map
  • Heavy: ~100ms (done once)

Prompt Encoder

Encodes user prompts as tokens:

  • Point click: segment object at that point
  • Bounding box: segment within the box
  • Mask: refine an existing mask
  • Text: explored in the SAM paper as a proof of concept; not supported by the released SAM or SAM 2 models

Mask Decoder

Lightweight transformer decoder

  • Outputs 3 candidate masks
  • Whole object / part / subpart
  • Fast: ~50ms per prompt
  • Interactive: prompt → mask almost instantly
SAM: Image Encoder (ViT) + Prompt Encoder + Lightweight Mask Decoder
[Diagram: image → image encoder (ViT-H) → 256×64×64 embedding, heavy at ~100ms but computed once. Prompt (click/box/text) → prompt encoder → prompt tokens. Both feed the mask decoder, a lightweight transformer taking ~50ms per prompt, which outputs three candidate masks with confidences: mask 1 (whole, 0.97), mask 2 (part, 0.89), mask 3 (subpart, 0.73). SAM2 (2024) extends this to video tracking.]
SA-1B: The Largest Segmentation Dataset

SAM was trained on 1.1 billion masks across 11 million images, over 100× larger than any previous segmentation dataset. The data engine used SAM itself in a loop: model assists human annotators → annotators correct → model improves → repeat. This "model-in-the-loop" approach is now standard for building large-scale datasets.

SAM Limitations

SAM excels at segmenting arbitrary objects but does NOT classify them. It tells you "here's an object boundary" but not "this is a cat." For applications needing both segmentation and classification, combine SAM with a classifier or use specialised models like Mask R-CNN or Mask2Former.

Chapter 6.4: Summary

  • Semantic segmentation: every pixel gets a class label; same mask for all instances of a class
  • Instance segmentation: each object gets a separate mask + class + confidence; distinguishes individual objects
  • Panoptic segmentation: unifies semantic (stuff) + instance (things) for complete scene understanding
  • U-Net: symmetric encoder-decoder with skip connections (copy + concat); preserves spatial detail; also used in Stable Diffusion
  • Mask R-CNN: Faster R-CNN + 28×28 mask head + RoI Align (no quantisation) for pixel-accurate masks; a standard instance-segmentation baseline for years after 2017
  • Panoptic Quality: PQ = SQ × RQ, a single metric combining segmentation accuracy and recognition accuracy
  • SAM: foundation model that segments anything from a point click; trained on 1.1B masks, interactive at ~50ms per prompt
  • SAM2 (2024): extends to video; track and segment objects across frames with interactive prompts
6.5
Chapter 6.5
Generative Adversarial Networks

Two neural networks locked in an adversarial game: one forges, the other detects. From blurry 64×64 bedrooms to photorealistic 1024×1024 faces, GANs evolved from an elegant mathematical idea into one of the most impactful generative frameworks in computer vision.

Goodfellow et al. (2014) introduced one of the most cited frameworks in deep learning: two networks in competition. The Generator G takes random noise z ~ N(0,I) and produces fake images G(z). The Discriminator D receives an image (real or fake) and outputs the probability that it's real.

Training alternates: update D for k steps (improve its detection ability), then update G for 1 step (improve its forgery). At Nash equilibrium, G produces perfect fakes and D outputs 0.5 for everything: it cannot tell real from fake.

GAN Minimax Objective
$$\min_G \max_D V(D,G) = \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$$
Generator wants: D(G(z)) → 1 (fool discriminator). Discriminator wants: D(x) → 1, D(G(z)) → 0 (classify correctly).
GAN Training Loop: adversarial updates between Generator and Discriminator
[Diagram: noise z ~ N(0, I) → Generator G → fake image G(z); real images from the training set and fakes both go to Discriminator D, which outputs P(real), e.g. 0.97 for a real image (D(x) → 1) and 0.03 for a fake (D(G(z)) → 0). Step 1: update D (k times) to maximise log D(x) + log(1 − D(G(z))). Step 2: update G (1 time) to minimise log(1 − D(G(z))). G improves forgery while D improves detection; at Nash equilibrium D(G(z)) = 0.5 for all inputs.]
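The objective can be probed numerically. The sketch below evaluates V(D, G) at the equilibrium point and compares the gradient G receives from log(1 − D(G(z))) with the gradient from the alternative "non-saturating" generator loss −log D(G(z)), which the original GAN paper suggests but this chapter does not cover. It treats D's output as a sigmoid of a single logit, which is an assumption made to keep the example scalar.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def value_fn(d_real, d_fake):
    """V(D, G) for one real and one fake sample: log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the minimax generator term."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(sigmoid(logit)), the alternative generator loss."""
    return -(1.0 - sigmoid(logit))

# At equilibrium D outputs 0.5 everywhere, so V = log 0.5 + log 0.5 = -log 4.
equilibrium_value = value_fn(0.5, 0.5)

# A confident discriminator: D(G(z)) = sigmoid(-6), roughly 0.0025.
confident_logit = -6.0
print(equilibrium_value)
print(saturating_grad(confident_logit), non_saturating_grad(confident_logit))
```

With a confident discriminator the minimax term yields a gradient near zero, while the non-saturating form still gives G a gradient close to −1.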

The original GAN loss has two critical failure modes. Vanishing gradients: when D becomes too good, D(G(z)) ≈ 0, so G's loss term log(1 − D(G(z))) sits near log(1) = 0, where it is almost flat, and G receives almost no useful gradient. Mode collapse: G finds a few outputs that fool D and keeps generating only those, ignoring the rest of the distribution.

Original GAN Loss

JS divergence-based.

  • Vanishes when distributions don't overlap
  • Unstable training dynamics
  • Loss curves not meaningful

WGAN (2017)

Wasserstein (Earth Mover's) distance.

  • Gradients stay informative even when the distributions do not overlap
  • D must be 1-Lipschitz
  • WGAN-GP: gradient penalty for stability

LSGAN

Least-squares loss for D.

  • MSE instead of log-likelihood
  • Penalises samples far from boundary
  • More stable, less mode collapse
WGAN Loss
$$L_D = \mathbb{E}_{z}[D(G(z))] - \mathbb{E}_{x}[D(x)] \quad \text{(minimised by the critic } D\text{; no log)}$$
$$L_G = -\mathbb{E}_{z}[D(G(z))] \quad \text{(minimised by } G\text{)}$$
Constraint: D must be 1-Lipschitz (‖∇D‖ ≤ 1 everywhere). WGAN-GP adds a gradient penalty to L_D: λ · 𝔼[(‖∇D‖₂ − 1)²].
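These losses can be traced by hand with a toy critic. The sketch assumes a linear critic D(x) = w · x, chosen because its gradient with respect to x is simply w, so the gradient penalty needs no autograd. A real WGAN-GP evaluates the gradient at points interpolated between real and fake samples; λ = 10 is the value commonly used with WGAN-GP. All data values are invented.

```python
import math

def critic(w, x):
    """Toy linear critic D(x) = w . x; its gradient w.r.t. x is w everywhere."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mean(values):
    return sum(values) / len(values)

def wgan_gp_critic_loss(w, real_batch, fake_batch, lam=10.0):
    """Return (L_D, gradient penalty); the critic minimises their sum."""
    loss_d = mean([critic(w, f) for f in fake_batch]) - mean([critic(w, r) for r in real_batch])
    grad_norm = math.sqrt(sum(wi * wi for wi in w))  # constant for a linear critic
    penalty = lam * (grad_norm - 1.0) ** 2
    return loss_d, penalty

def wgan_generator_loss(w, fake_batch):
    """L_G = -E[D(G(z))]; the generator minimises this."""
    return -mean([critic(w, f) for f in fake_batch])

real = [[1.0, 2.0], [2.0, 3.0]]
fake = [[0.0, 0.0], [1.0, 1.0]]

unit_w = [0.6, 0.8]                  # gradient norm 1: satisfies the constraint
loss_d, penalty = wgan_gp_critic_loss(unit_w, real, fake)
_, big_penalty = wgan_gp_critic_loss([3.0, 4.0], real, fake)   # gradient norm 5
loss_g = wgan_generator_loss(unit_w, fake)
print(loss_d, penalty, big_penalty, loss_g)
```

Without the penalty the critic could shrink L_D indefinitely just by scaling w up, which is why the Lipschitz constraint is needed.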

Radford et al. (2015) established the first stable recipe for CNN-based GANs. Before DCGAN, most GAN experiments produced noise or collapsed. DCGAN's design rules became gospel for all subsequent work:

DCGAN Design Rules

  • Replace pooling with strided convolutions (D) and transposed convolutions (G)
  • BatchNorm in both G and D (except G output and D input)
  • Remove fully connected hidden layers
  • ReLU in G (except output: Tanh), LeakyReLU in D

DCGAN Results

  • Generated 64×64 bedroom images: first realistic CNN-generated images
  • Meaningful latent space: interpolating between z vectors = smooth visual transitions
  • "Smiling woman" − "neutral woman" + "neutral man" = "smiling man"
  • Proved CNNs could learn rich image priors unsupervised
DCGAN Generator: noise vector upsampled through transposed convolutions to image
[Diagram: z (100-dim) → reshape to 4×4×1024 → 8×8×512 (BN + ReLU) → 16×16×256 (BN + ReLU) → 32×32×128 (BN + ReLU) → 64×64×3 RGB image (Tanh, no BN). Key pattern: spatial dims ×2 and channels ÷2 at each step; transposed conv = learnable upsampling; Tanh puts outputs in [-1, 1], so real images are also scaled to [-1, 1] to match.]
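The doubling pattern follows from the transposed-convolution size formula. The sketch assumes kernel 4, stride 2, padding 1, a setting often used in DCGAN implementations; the text above does not state the kernel size. The formula is the standard one for a transposed convolution with no output padding or dilation.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed conv (no output_padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

sizes = [4]          # spatial side length after the initial reshape
channels = [1024]    # channel count after the initial reshape
while sizes[-1] < 64:
    sizes.append(conv_transpose_out(sizes[-1]))
    channels.append(channels[-1] // 2)
channels[-1] = 3     # the final layer maps to RGB instead of halving again

for s, c in zip(sizes, channels):
    print(f"{s}x{s}x{c}")
```

With these settings each layer exactly doubles the side length, which gives the 4 → 8 → 16 → 32 → 64 progression.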

Vanilla GANs generate random images: you have no control over what comes out. Mirza & Osindero (2014) fixed this by feeding a condition label y to both G and D. Now G(z, y) generates an image of class y, and D(x, y) verifies it's a real image of that class.

Conditional GAN: class label y conditions both generator and discriminator
[Diagram: standard GAN: z → G → random image, with no control over the output. Conditional GAN: z and y = cat → G → cat image; D also receives y and asks "real cat?". Applications: pix2pix, text-to-image, class-conditional generation.]
pix2pix: Image-to-Image Translation

Isola et al. (2016) used conditional GANs for paired image-to-image translation: sketch → photo, edges → handbag, day → night, satellite → map. The condition is the input image itself. pix2pix showed that conditional GANs are a general-purpose framework for image transformation tasks.

Karras et al. (NVIDIA, 2019) produced the first truly photorealistic synthetic faces. The key insight: separate the latent code into a style that controls appearance at each resolution level, rather than feeding noise directly into the first layer.

StyleGAN Innovations

  • Mapping network: z → w (8-layer FC) for a less entangled intermediate latent space
  • AdaIN: inject style w at each resolution level via Adaptive Instance Normalisation
  • Progressive growing: train at 4×4, gradually grow to 1024×1024
  • Stochastic variation: per-pixel noise at each layer for fine details (hair, freckles)

StyleGAN Evolution

  • StyleGAN (2019): 1024×1024 photorealistic faces, FFHQ dataset
  • StyleGAN2 (2020): removes AdaIN artefacts, path length regularisation
  • StyleGAN3 (2021): alias-free, with proper translation/rotation equivariance
  • StyleGAN-XL (2022): scaled to ImageNet-level diversity
StyleGAN: Mapping network + per-resolution style injection via AdaIN
[Diagram: z (512) → mapping network (8× FC) → style code w (512). The synthesis network starts from a constant 4×4 input and grows through 8×8, 16×16, 64×64, 256×256 and 1024×1024, with AdaIN(w) style injection and per-pixel noise at each resolution. Styles at 4×4 to 8×8 control pose and face shape; 16×16 to 64×64 control hair style, eyes and mouth; 256×256 to 1024×1024 control colour, freckles and fine hair detail.]
AdaIN formula: $\mathrm{AdaIN}(x, w) = w_\sigma \cdot \frac{x - \mu}{\sigma} + w_\mu$, where w modulates the feature statistics at each resolution layer.
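The AdaIN formula can be checked on a single feature channel. The sketch below is pure Python on a flat list, with `style_scale` and `style_bias` standing in for w_σ and w_μ; in StyleGAN these come from a learned affine transform of w, which is omitted here. The `eps` term is an assumption added for numerical safety.

```python
import statistics

def adain(channel, style_scale, style_bias, eps=1e-5):
    """Normalise one feature channel, then apply the style's scale and bias."""
    mu = statistics.fmean(channel)
    sigma = statistics.pstdev(channel)
    return [style_scale * (x - mu) / (sigma + eps) + style_bias for x in channel]

feature_channel = [2.0, 4.0, 6.0, 8.0]
styled = adain(feature_channel, style_scale=0.5, style_bias=10.0)

# The channel's own statistics are replaced by the style's statistics.
print(statistics.fmean(styled), statistics.pstdev(styled))
```

Whatever mean and variance the channel had before, afterwards they are set by the style, which is how w controls appearance at each resolution.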

pix2pix requires paired training data: the exact same scene in both domains (e.g., the same street in day and night). This is often impossible to collect. Zhu et al. (2017) solved this with CycleGAN: learn translation between domains using only unpaired examples.

The trick: cycle consistency. Two generators (G_AB: A → B, G_BA: B → A) must satisfy G_BA(G_AB(a)) ≈ a. If you translate a horse to a zebra and back, you should get the original horse. This constraint prevents the generators from hallucinating arbitrary outputs.

CycleGAN: cycle consistency ensures translation is invertible (no paired data)
[Diagram: forward cycle: horse A → G_AB → fake zebra B̂ → G_BA → reconstructed horse Â, with cycle loss ‖A − Â‖₁ ≈ 0. Reverse cycle: zebra B → G_BA → fake horse Â → G_AB → reconstructed zebra B̂, with cycle loss ‖B − B̂‖₁ ≈ 0. No paired data is needed, only collections of horses and zebras.]
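The cycle loss itself is a few lines. In this sketch "images" are short lists of pixel values and the "generators" are hand-written functions rather than networks, purely to show what the loss rewards and penalises. Mean absolute error is used for the L1 term; all names and values are illustrative.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_loss(g_ab, g_ba, batch_a, batch_b):
    """Forward cycle A -> B -> A plus reverse cycle B -> A -> B."""
    forward = sum(l1(a, g_ba(g_ab(a))) for a in batch_a) / len(batch_a)
    backward = sum(l1(b, g_ab(g_ba(b))) for b in batch_b) / len(batch_b)
    return forward + backward

def brighten(img):      # toy stand-in for G_AB
    return [p + 0.2 for p in img]

def darken(img):        # toy stand-in for G_BA, the exact inverse of brighten
    return [p - 0.2 for p in img]

def collapse(img):      # a degenerate "generator" that ignores its input
    return [0.5 for _ in img]

batch_a = [[0.1, 0.3], [0.4, 0.6]]
batch_b = [[0.5, 0.7]]

consistent = cycle_loss(brighten, darken, batch_a, batch_b)
collapsed = cycle_loss(brighten, collapse, batch_a, batch_b)
print(consistent, collapsed)
```

The invertible pair scores essentially zero, while the generator that throws away its input is penalised, which is the behaviour cycle consistency is meant to discourage.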
CycleGAN Applications

Horse ↔ zebra, summer ↔ winter, photo ↔ Monet painting, day ↔ night, apple ↔ orange. CycleGAN works on any two unpaired image domains. The cycle consistency loss is also used in unsupervised machine translation (text) and audio style transfer.

Mode Collapse

G produces limited variety: it finds a few modes that fool D and ignores the rest.

  • "Same face no matter what z is"
  • Fix: mini-batch discrimination
  • Fix: unrolled GANs, WGAN

Training Instability

D and G must stay balanced: if one dominates, the other gets no gradient.

  • Loss curves are NOT meaningful
  • Fix: spectral normalisation
  • Fix: separate, carefully tuned learning rates for D and G

Evaluation Challenge

How to measure "realistic and diverse"?

  • FID: Fréchet Inception Distance
  • Lower FID = better quality + diversity
  • Needs thousands of samples to estimate reliably (50k is a common choice)
| Technique | Problem | How It Works | Used In |
|---|---|---|---|
| WGAN / WGAN-GP | Vanishing gradients | Wasserstein distance + gradient penalty | Most modern GANs |
| Spectral Normalisation | Training instability | Constrain weight matrix spectral norm | SN-GAN, BigGAN |
| Progressive Growing | High-res training | Start 4×4, gradually increase resolution | StyleGAN |
| Mini-batch Discrimination | Mode collapse | Pass statistics across batch to D | Original GAN improvements |
| Label Smoothing | Overconfident D | D target = 0.9 not 1.0 | Standard practice |
GANs vs Diffusion Models

By 2022, diffusion models had largely replaced GANs for image generation. Diffusion models are more stable to train, are far less prone to mode collapse, and often achieve better FID scores. GANs remain relevant for real-time generation (single forward pass vs diffusion's iterative denoising) and image-to-image translation (CycleGAN, pix2pix).

Chapter 6.5: Summary

  • GAN: Generator fools Discriminator; D trains to detect fakes; minimax game with Nash equilibrium at D(G(z)) = 0.5
  • WGAN: replaces JS divergence with Wasserstein distance to address vanishing gradients; WGAN-GP adds gradient penalty
  • DCGAN: strided convolutions + BatchNorm + LeakyReLU = first stable image GAN; established CNN-GAN design rules
  • Conditional GAN: label y conditions both G and D, enabling class-conditional generation and pix2pix image translation
  • StyleGAN: mapping network z → w + AdaIN style injection per resolution = first photorealistic 1024² face generation
  • CycleGAN: unpaired image translation via cycle consistency loss, G_BA(G_AB(a)) ≈ a; no paired data needed
  • Mode collapse: G generates limited variety; training instability: D/G balance is fragile; FID: standard evaluation metric
  • Modern trend: diffusion models replacing GANs for generation quality; GANs still best for real-time and image-to-image tasks
6.6
Chapter 6.6
Vision Transformers (ViT) & Modern Architectures

In 2020, a single idea flipped computer vision upside-down: what if we treated an image as a sequence of patches, just like tokens in a sentence? The Vision Transformer proved that convolutions are not necessary. Pure self-attention, given enough data and compute, learns to see.

CNNs have two hard-wired inductive biases: locality (convolutions see only a small neighbourhood) and translation equivariance (same filter applied everywhere). These are powerful priors, but they also limit expressiveness.

A 3×3 conv at layer 1 sees only 9 pixels. To see the whole image, information must pass through many layers of pooling, getting progressively diluted. Transformers have no such constraints: self-attention connects every patch to every other patch in a single operation. Global context from layer 1.
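How slowly a CNN's view grows can be computed directly. The sketch uses the standard receptive-field recurrence for a stack of identical convolutions; applying the same kernel and stride at every layer is a simplifying assumption, and the helper names are illustrative.

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Side length of the input region seen by one output unit."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump   # each layer widens the view by (k-1) * jump
        jump *= stride              # striding spreads later layers further apart
    return rf

def layers_to_cover(image_size, kernel=3):
    """Number of stride-1 conv layers needed before one unit sees the whole side."""
    layers = 0
    while receptive_field(layers, kernel) < image_size:
        layers += 1
    return layers

print(receptive_field(1))               # one 3x3 conv sees a 3x3 patch (9 pixels)
print(receptive_field(10))              # ten stacked 3x3 convs
print(layers_to_cover(224))             # stride-1 layers needed for a 224-px side
print(receptive_field(5, stride=2))     # downsampling grows the view much faster
```

This is why CNNs lean on pooling and striding to reach global context, whereas self-attention gets there in one layer.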

CNN Strengths

  • Data-efficient (inductive biases help)
  • Translation equivariant by design
  • Computationally efficient: O(HW), not O((HW)²)
  • Works well on small datasets
  • Fast inference on edge devices

CNN Limitations

  • Local receptive field by default
  • Long-range dependencies require many layers
  • Fixed spatial hierarchy (downsampling loses info)
  • Inductive biases may hurt on novel domains
  • Hard to model global context at early layers

Dosovitskiy et al. (Google Brain, 2020) applied the standard Transformer encoder, unchanged from NLP, directly to images. The trick: divide the image into fixed-size non-overlapping patches, flatten each patch, linearly project it, and treat the resulting sequence exactly like word tokens.

1. Patch: 224² → 196 patches (16×16)
2. Flatten: 16×16×3 = 768 values each
3. Embed: Linear 768 → D, plus position embedding
4. [CLS] + Transformer: 12 layers, global attention
5. Classify: MLP on [CLS] → class
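ViT Patch Embedding
$$x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}, \qquad N = \frac{HW}{P^2} = 196, \quad P = 16$$
$$z_0 = [x_{\mathrm{cls}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\mathrm{pos}}$$
$$E \in \mathbb{R}^{(P^2 \cdot C) \times D} \text{ (patch projection)}, \qquad E_{\mathrm{pos}} \in \mathbb{R}^{(N+1) \times D} \text{ (learnable 1D position embeddings)}$$
Learned 1D position embeddings suffice; 2D-aware embeddings are not needed.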
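The patch embedding is mostly a reshape followed by one matrix multiply. The NumPy sketch below uses random values for the image and for E, and zeros for the [CLS] token and position embeddings, purely to check shapes; in a trained model all of these are learned. The height-first (H, W, C) layout follows the NumPy convention mentioned earlier in the chapter.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, P, P, C)
    return grid.reshape(-1, patch * patch * c)    # (N, P*P*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)
patches = patchify(image)                         # 196 patches of 768 values

D = 768                                           # embedding width of ViT-Base
E = (rng.standard_normal((patches.shape[1], D)) * 0.02).astype(np.float32)
cls_token = np.zeros((1, D), dtype=np.float32)    # placeholder for learned [CLS]
pos_embed = np.zeros((patches.shape[0] + 1, D), dtype=np.float32)

# z0 = [cls; x_p E] + E_pos, matching the formula above.
z0 = np.concatenate([cls_token, patches @ E], axis=0) + pos_embed
print(patches.shape, z0.shape)
```

In practice this projection is often implemented as a single convolution with kernel size and stride both equal to the patch size, which is mathematically the same operation.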
ViT Pipeline: 16×16 patches → 196 tokens → Transformer → class label
[Diagram: 224×224 image → 196 patches (14×14 grid) → 196 × 768 flattened → prepend [CLS] and add E_pos → 197 × 768 → 12× Transformer with global self-attention → [CLS] output → MLP head → "cat 0.94".]
ViT Variants:

  • Tiny: D=192, 12 layers, 3 heads, 5.7M params
  • Small: D=384, 12 layers, 6 heads, 22M params
  • Base: D=768, 12 layers, 12 heads, 86M params
  • Large: D=1024, 24 layers, 16 heads, 307M params
  • Huge: D=1280, 32 layers, 16 heads, 632M params

Key insight: every patch attends to every other patch, giving global context from layer 1. No convolutions needed.
ViT Attention Maps: self-attention captures global structure immediately
[Diagram: input image of a cat on a background; [CLS] attention at layer 6 focuses on the cat region; attention from a head patch at layer 10 reaches the body and tail, showing long-range connections. Why this matters: a CNN needs 10+ layers to connect a head patch to a tail patch; ViT can do it in layer 1.]
import torch
import timm
from PIL import Image
from torchvision import transforms

# Load pre-trained ViT-Base/16 (ImageNet-21k → ImageNet-1k)
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # 86,567,656
print(f"Patch size: {model.patch_embed.patch_size}")                  # (16, 16)
print(f"Num patches: {model.patch_embed.num_patches}")                # 196
print(f"Embedding dim: {model.embed_dim}")                            # 768
print(f"Num heads: {model.blocks[0].attn.num_heads}")                 # 12
print(f"Num layers: {len(model.blocks)}")                             # 12

# Inference
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = transform(Image.open("cat.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)                              # (1, 1000)

probs = logits.softmax(-1)
top5 = probs.topk(5)
print("Top-5 predictions:", top5.indices)

The original paper's most critical finding: trained on ImageNet-1k only, ViT-Base reached just 77.9%, barely ahead of a ResNet-50 (76.1%) despite having more than three times the parameters, and behind comparably sized ResNets. But pre-trained on JFT-300M (Google's internal 300M-image dataset), the largest model (ViT-Huge/14) reached 88.55%, surpassing the best CNNs of the time. The conclusion: ViT needs massive data to overcome its lack of inductive biases.

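ViT vs CNN: ViT overtakes CNNs at ~14M pre-training images
[Chart: top-1 ImageNet accuracy (75% to 90%) against pre-training dataset size: ImageNet-1k (1.3M), ImageNet-21k (14M), JFT-300M (300M). Series: ResNet-152, ViT-Base, ViT-Large. The CNN wins on the smallest dataset, the curves cross at about 14M images, and ViT wins beyond that, peaking at 88.55%.]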
Why ViT Needs More Data

CNNs have built-in priors: locality and translation equivariance guide learning even with limited data. ViT must learn these properties from data. With enough examples, ViT discovers even better representations, but on small datasets it overfits to training patterns without discovering general visual structure.

Touvron et al. (Facebook, 2021) asked: can we train ViT on ImageNet-1k without Google's proprietary JFT-300M? The answer: yes, with three key tricks: knowledge distillation from a CNN teacher, aggressive data augmentation, and a careful training recipe.

DeiT adds a second special token, [DIST] (distillation token), alongside the standard [CLS]. [DIST] is trained to match the CNN teacher's soft labels, while [CLS] learns from ground-truth hard labels. At inference, the two heads' outputs are averaged. Result: DeiT-Base reaches 81.8% on ImageNet-1k, versus 77.9% for the original ViT-Base trained on the same data.
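The two-head setup can be sketched in a few lines of NumPy. This is a minimal illustration rather than the DeiT reference code: the function names, the equal weighting `alpha=0.5`, and the temperature `tau=3.0` are assumptions chosen for the example, and the distillation term is a soft-label cross-entropy against the teacher's softened distribution.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def deit_predict(cls_logits, dist_logits):
    """Inference: average the class probabilities of the [CLS] and [DIST] heads."""
    return 0.5 * (softmax(cls_logits) + softmax(dist_logits))

def deit_loss(cls_logits, dist_logits, labels, teacher_logits, tau=3.0, alpha=0.5):
    """(1 - alpha) * CE([CLS], hard labels) + alpha * CE([DIST], teacher soft labels)."""
    n = cls_logits.shape[0]
    # [CLS] head: cross-entropy against ground-truth hard labels
    log_p_cls = np.log(softmax(cls_logits))
    hard_ce = -log_p_cls[np.arange(n), labels].mean()
    # [DIST] head: cross-entropy against the teacher's temperature-softened outputs
    teacher_probs = softmax(teacher_logits / tau)
    log_p_dist = np.log(softmax(dist_logits / tau))
    soft_ce = -(teacher_probs * log_p_dist).sum(axis=1).mean()
    return (1.0 - alpha) * hard_ce + alpha * soft_ce

# Toy batch: 4 images, 10 classes (random logits stand in for real model outputs)
rng = np.random.default_rng(0)
cls_logits = rng.normal(size=(4, 10))
dist_logits = rng.normal(size=(4, 10))
teacher_logits = rng.normal(size=(4, 10))
labels = np.array([1, 2, 3, 4])

probs = deit_predict(cls_logits, dist_logits)
loss = deit_loss(cls_logits, dist_logits, labels, teacher_logits)
print(probs.shape, float(loss))
```

During training only the student's parameters receive gradients; the teacher stays frozen, which is why its logits appear here as plain inputs.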

DeiT Distillation: [DIST] token learns from CNN teacher alongside [CLS]
[Figure: the 198-token sequence ([CLS], 196 patch embeddings, [DIST]) passes through a 12-layer Transformer encoder. [CLS] feeds an MLP head trained with cross-entropy against ImageNet labels; [DIST] feeds a linear head trained with a distillation loss against the soft labels of a frozen RegNet teacher. Inference: avg(softmax(CLS), softmax(DIST)). Results on ImageNet-1k only: ViT-Base (no distillation) 77.9%, DeiT-Base (distillation) 81.8% (+3.9%), DeiT-S (22M params) 79.8%.]

Liu et al. (Microsoft, 2021) solved ViT's two biggest problems for dense prediction tasks: fixed single-scale output and quadratic attention cost. Swin Transformer (ICCV 2021 Best Paper) introduced window attention and hierarchical feature maps, making transformers practical for detection and segmentation.

Window Attention

Compute self-attention within local 7×7 windows instead of globally.

  • Cost: O(M²·HW) in total for M×M windows, versus O((HW)²) for global attention
  • Linear complexity in image size
  • Shifted windows alternate each layer
  • Cross-boundary info flow restored
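A short NumPy sketch of the window partition and the resulting cost difference, assuming a 56×56×96 stage-1 feature map and M = 7. The helper names are illustrative, and the attention mask that Swin applies after the cyclic shift is omitted for brevity.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping m x m windows.

    Returns an array of shape (num_windows, m * m, C).
    """
    h, w, c = x.shape
    assert h % m == 0 and w % m == 0, "H and W must be divisible by the window size"
    x = x.reshape(h // m, m, w // m, m, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (rows of windows, cols of windows, m, m, C)
    return x.reshape(-1, m * m, c)

def attention_pairs_global(h, w):
    """Number of query-key pairs when every token attends to every token."""
    return (h * w) ** 2

def attention_pairs_windowed(h, w, m):
    """Number of query-key pairs when attention is restricted to m x m windows."""
    return (h // m) * (w // m) * (m * m) ** 2

feat = np.arange(56 * 56 * 96, dtype=np.float32).reshape(56, 56, 96)
windows = window_partition(feat, 7)
print(windows.shape)                          # (64, 49, 96)

# Shifted windows: cyclically roll the map by M // 2 before partitioning again
shifted = np.roll(feat, shift=(-3, -3), axis=(0, 1))
shifted_windows = window_partition(shifted, 7)

print(attention_pairs_global(56, 56))         # 9834496
print(attention_pairs_windowed(56, 56, 7))    # 153664
```

For this feature map the windowed variant needs 64 times fewer query-key pairs, and that factor grows with image size because the windowed cost is linear in H·W.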

Hierarchical Features

Multi-scale feature maps like ResNet's pyramid.

  • Stage 1: 56×56 (fine, small objects)
  • Stage 2: 28×28 → Stage 3: 14×14
  • Stage 4: 7×7 (coarse, large objects)
  • Plug directly into FPN for detection
Swin Transformer: window attention (Layer ℓ) and shifted window (Layer ℓ+1)
[Figure, three panels. Layer ℓ, Window Attention: attention within windows only, no cross-boundary links, linear total cost. Layer ℓ+1, Shifted Window: windows shifted by (M/2, M/2), cross-boundary connections restored. Hierarchical Feature Maps (like ResNet): 56×56×96 (Stage 1) → merge → 28×28 (Stage 2) → 14×14 (Stage 3) → 7×7, giving multi-scale features for FPN, whereas ViT has a single 14×14 map with no hierarchy.]

Liu et al. (Facebook, 2022) asked: "What if we took ResNet and applied every Transformer design decision?" Starting from ResNet-50, they applied 7 systematic changes. The result: ConvNeXt-Base reaches 83.8%, matching Swin-Base (83.5%) without any attention mechanism.

ConvNeXt Block: ResNet + Transformer design principles
[Figure: Input (96 channels) → Depthwise 7×7 Conv (96 ch) → LayerNorm → 1×1 Conv (96 → 384, expand ×4) → GELU → 1×1 Conv (384 → 96, contract) → add skip connection. The narrow → wide → narrow inverted bottleneck is inspired by the Transformer FFN, with minimal activations and normalisation.]

7 upgrades, ResNet → ConvNeXt:

  • 1. Training: AdamW, Mixup, CutMix, longer schedule
  • 2. Stage ratio: (3,4,6,3) → (3,3,9,3)
  • 3. Patchify stem: 4×4 stride-4 conv
  • 4. Depthwise 7×7 conv (large receptive field)
  • 5. Inverted bottleneck: 96 → 384 → 96
  • 6. One GELU + one LayerNorm per block
  • 7. Separate downsampling layers
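To see why the depthwise 7×7 convolution keeps the block cheap, the parameter count can be worked out by hand. The sketch below assumes the 96-channel block from the figure with biases on every convolution; it ignores the per-channel layer-scale vector used in the official implementation, and the helper names are illustrative.

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Parameter count of a k x k convolution with the given number of groups."""
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)

def convnext_block_params(dim=96, expansion=4):
    hidden = dim * expansion
    counts = {
        "depthwise_7x7": conv_params(dim, dim, k=7, groups=dim),  # one 7x7 filter per channel
        "layernorm": 2 * dim,                                     # scale and shift
        "expand_1x1": conv_params(dim, hidden, k=1),              # 96 -> 384
        "contract_1x1": conv_params(hidden, dim, k=1),            # 384 -> 96
    }
    counts["total"] = sum(counts.values())
    return counts

counts = convnext_block_params()
dense_7x7 = conv_params(96, 96, k=7)  # a full (non-depthwise) 7x7 conv, for comparison

for name, value in counts.items():
    print(f"{name:14s} {value:>8,}")
print(f"{'dense_7x7':14s} {dense_7x7:>8,}")
```

The whole block comes to 79,200 parameters, while a single dense 7×7 convolution over the same 96 channels would already need 451,680. Almost all of the block's capacity sits in the two 1×1 layers, mirroring a Transformer's feed-forward sub-layer.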
| Model | Inductive Bias | Attention | Hierarchy | Best For | Params | Top-1 |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | Strong (conv) | None | Yes | Small data, fast inference | 25M | 76.1% |
| EfficientNetV2-M | Strong (conv+NAS) | None | Yes | Efficient production | 54M | 85.1% |
| ViT-Base/16 | None | Global | No | Large-scale pre-training | 86M | 81.8% |
| DeiT-Base/16 | Weak (distill) | Global | No | ImageNet-scale tasks | 86M | 81.8% |
| Swin-Base | Weak (window) | Local+shift | Yes | Detection, segmentation | 88M | 83.5% |
| ConvNeXt-Base | Moderate (conv) | None | Yes | All-around modern CNN | 89M | 83.8% |
| ViT-L/16 (MAE) | None | Global | No | Large-scale SOTA | 307M | 87.8% |
| SAM2 (ViT-H) | None | Global | No | Zero-shot segmentation | 641M | n/a |
Accuracy vs Throughput: finding the best architecture for your constraints
[Figure: scatter plot of ImageNet Top-1 (76–90%) against throughput (images/sec on an A100, 80–1400) for ResNet-50, EffNet-B4, ViT-B, DeiT-B, Swin-T, Swin-B, ConvNeXt-B, ViT-L, and ViT-H (88.6%), with the Pareto frontier marked.]

There is no single "best" architecture in 2024. For classification accuracy: ViT-Large with MAE pre-training. For efficiency: ConvNeXt or EfficientNetV2. For detection/segmentation: Swin or ConvNeXt backbone. For multimodal tasks: ViT dominates, since it connects naturally to language models via shared attention.

Chapter 6.6: Summary

  • ViT splits a 224×224 image into 196 non-overlapping 16×16 patches and treats them as a token sequence for a standard Transformer encoder
  • Self-attention is global from layer 1: there is no locality constraint, unlike CNNs, so every patch sees every other patch directly
  • ViT needs massive pre-training data: it outperforms CNNs only above ~14M images; on small data, CNNs still win
  • DeiT: trains ViT on ImageNet-1k via knowledge distillation from a CNN teacher plus a [DIST] token (81.8% vs 77.9%)
  • Swin Transformer: window attention (linear cost) + shifted windows + hierarchical feature maps = best backbone for detection/segmentation
  • ConvNeXt: ResNet updated with 7 Transformer design choices; matches Swin without any attention (83.8% vs 83.5%)
  • No single best architecture: ViT for accuracy, ConvNeXt for efficiency, Swin for dense prediction, ViT for multimodal
Chapter 6.7
Multimodal AI: CLIP, DALL-E & Vision-Language Models

For decades, vision and language models lived in separate worlds. CLIP changed that in 2021 by learning a shared embedding space where "a photo of a dog" and an actual photo of a dog sit close together. One model. Two modalities. Zero task-specific training.

Unimodal models process only one type of data, such as a text-only LLM or an image-only CNN. Multimodal models process and relate multiple data types simultaneously: text + image, image + audio, video + text. Real-world tasks are inherently multimodal: "describe what's in this photo" requires both vision and language.

🖼️→📝

Vision → Language

  • Image captioning (BLIP, CoCa)
  • Visual QA: "What colour is the car?"
  • Document parsing (GPT-4V, DocVQA)
  • Medical image report generation
  • Chart and figure understanding
📝→🖼️

Language → Vision

  • Text-to-image (DALL-E 3, SDXL, Midjourney)
  • Image editing by text instruction
  • Text-guided inpainting
  • Text-to-video (Sora, Runway)
  • Text-to-3D generation
💬🖼️→📝

Text + Image → Text

  • Multimodal chat (GPT-4V, LLaVA, Gemini)
  • "Explain this chart step by step"
  • Visual code debugging
  • Document Q&A with screenshots
🔍↔️

Text ↔ Image Retrieval

  • Find images matching a text query (CLIP)
  • Find text matching an image
  • Pinterest visual search, Google Lens
  • Open-vocabulary detection (GLIP, OWL-ViT)

Radford et al. (OpenAI, 2021) trained two encoders, one for images and one for text, jointly on 400 million (image, text) pairs scraped from the internet. No manual labels: web captions ARE the supervision. The goal: learn a shared embedding space where matching image-text pairs are close and non-matching pairs are far apart.

Image Encoder

  • ResNet-50 or ViT (ViT-B/32, ViT-L/14)
  • Input: 224×224 RGB image
  • Output: d-dim embedding (e.g., 512 or 1024)
  • Projected to shared embedding space via linear layer

Text Encoder

  • Transformer with masked self-attention (GPT-style)
  • Input: text caption (up to 77 tokens)
  • Output: d-dim embedding
  • [EOS] token representation projected to shared space
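CLIP InfoNCE Contrastive Loss

$$
L = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j)/\tau)} + \log \frac{\exp(\mathrm{sim}(T_i, I_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(T_i, I_j)/\tau)} \right]
$$

$$
\mathrm{sim}(I, T) = \frac{I \cdot T}{\lVert I \rVert \, \lVert T \rVert} \quad \text{(cosine similarity)}
$$

τ is a learnable temperature. The loss maximises the similarity of the N matching pairs and minimises that of the N² − N non-matching pairs. Actual batch size: 32,768 pairs.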
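The symmetric loss can be written compactly in NumPy. This is a sketch under simple assumptions: the embeddings are random stand-ins for encoder outputs, the temperature is fixed at 0.07 rather than learned, and the function names are illustrative.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of N matching (image, text) pairs."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                # (N, N) cosine similarities / temperature
    diag = np.arange(logits.shape[0])         # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))

aligned = clip_loss(emb, emb)                          # text i matches image i
mismatched = clip_loss(emb, np.roll(emb, 1, axis=0))   # text i belongs to another image
uniform = clip_loss(np.ones((4, 3)), np.ones((4, 3)))  # all pairs look identical

print(float(aligned), float(mismatched), float(uniform))
```

When every pair is indistinguishable the loss equals log N, the cost of guessing uniformly. Training pushes it below that value by raising the diagonal of the similarity matrix relative to the off-diagonal entries.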
CLIP Contrastive Training: align matching image-text pairs, separate non-matching
[Figure: four images (cat, dog, car, bird) pass through the image encoder to give I₁–I₄; four captions ("a photo of a cat", "a photo of a dog", "a red car", "a yellow bird") pass through the text encoder to give T₁–T₄. The 4×4 cosine similarity matrix has high diagonal values (about 0.9) for matching pairs, which training maximises, and low off-diagonal values (about 0.1) for non-matching pairs, which training minimises. Actual batch: 32,768 pairs.]
🎯

Zero-Shot Image Classification

Write text prompts: "a photo of a {class}" for each class. Embed all prompts. Compare with image embedding via cosine similarity. Nearest = predicted class.

  • 76.2% Top-1 on ImageNet with zero task-specific training
  • Competitive with supervised ResNet-50 (76.1%)
🔍

Visual Similarity Search

Embed a text query, find images by cosine similarity in shared space.

  • Pinterest visual search, Google Lens
  • Stock photo search by description
  • Medical image retrieval
🌐

Open-Vocabulary Detection

Combine CLIP with detection models → detect ANY category described in text.

  • OWL-ViT, GLIP, Grounding DINO
  • No retraining for new classes
  • "Find all red objects in this image"
🗂️

Data Filtering & Curation

Filter web-crawled images by semantic content using text queries.

  • CLIP filtering was used to build LAION-5B (5B image-text pairs)
  • LAION subsets supplied training data for Stable Diffusion
  • CLIP score filters low-quality pairs
CLIP Zero-Shot Classification: no retraining, just text prompts for each class
[Figure: a query image of a dog passes through the image encoder to give embedding I. Class prompts ("a photo of a dog", "a photo of a cat", "a photo of a car", "a photo of a bird", "a photo of a fish") pass through the text encoder. Cosine similarity scores: dog 0.91 (predicted), others 0.23, 0.08, 0.11, 0.06. CLIP zero-shot ImageNet accuracy is 76.2% with no task-specific training; a supervised ResNet-50 trained on labelled data reaches 76.1%.]
import torch
import open_clip
from PIL import Image

# Load pre-trained CLIP model
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

# Load and preprocess image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # (1, 3, 224, 224)

# Zero-shot classification: define class prompts
classes = ["cat", "dog", "car", "bird", "elephant"]
text_prompts = [f"a photo of a {c}" for c in classes]
text_tokens = tokenizer(text_prompts)  # (5, 77) token sequences

with torch.no_grad():
    image_features = model.encode_image(image)      # (1, 512)
    text_features = model.encode_text(text_tokens)  # (5, 512)

# Normalise to unit vectors before cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity: (1,512) @ (512,5) → (1,5)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, similarity[0]):
    print(f"{cls:12s}: {prob.item():.3f}")
# cat         : 0.823  ← highest
# dog         : 0.091
# car         : 0.031
# bird        : 0.034
# elephant    : 0.021
🔢

DALL-E 1 (2021)

  • Text → BPE tokens (256) + VQ-VAE image tokens (1024)
  • 12B-param autoregressive Transformer
  • Predicts image tokens sequentially
  • Creative combinations but limited fidelity
  • Slow: sequential generation
🌊

DALL-E 2 (2022)

  • Text → CLIP text embedding → Prior → CLIP image embedding
  • Diffusion decoder (unCLIP) → high-res image
  • Much higher fidelity than DALL-E 1
  • Supports image variations (encode β†’ re-decode)
  • Uses CLIP embeddings as bridge
📝

DALL-E 3 (2023)

  • Key innovation: highly descriptive synthetic captions
  • Re-captioned all training data with detailed descriptions
  • Much better instruction following
  • Integrated into ChatGPT
  • Handles complex prompts faithfully
DALL-E Architecture Evolution: autoregressive tokens → CLIP-guided diffusion
[Figure, three panels. DALL-E 1 (autoregressive): text input → BPE (256 tokens) → 12B autoregressive Transformer → 1024 VQ tokens → VQ-VAE decoder → image; slow because generation is sequential. DALL-E 2 (CLIP + diffusion, unCLIP): text input → CLIP text encoder → prior (text embedding → image embedding) → diffusion decoder → image; higher fidelity and supports variations. DALL-E 3 (descriptive recaptioning): web captions are short and vague ("sunset photo" misses time, location, mood, colours), so the entire training set is re-captioned with detailed synthetic descriptions, giving much better prompt adherence.]

VLMs accept both images and text as input and generate text as output. The core challenge: how do you connect a vision encoder to a language model? Three main approaches have emerged:

1. Feature Concatenation

Vision tokens prepended to text tokens before LLM.

  • LLM processes visual + text tokens together
  • Requires LLM pre-training on multimodal data
  • Example: Flamingo (cross-attention layers)
  • Limitation: LLM must learn vision interpretation

2. Projector / Adapter

MLP projector bridges vision encoder → LLM input space.

  • Most common approach
  • Freeze LLM (or LoRA), train only projector
  • Examples: LLaVA-1.5, InternVL, Qwen-VL
  • Efficient: minimal new parameters
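A NumPy sketch of the projector idea, using the sizes from the LLaVA-style figure in this chapter (256 visual tokens of width 1024, projected to an LLM width of 4096). The two-layer GELU MLP, the random weights, and the 12-token text prompt are assumptions for illustration; a real system would load trained weights and the LLM's own token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPProjector:
    """Two-layer MLP mapping vision features into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        self.w1 = rng.standard_normal((vision_dim, llm_dim), dtype=np.float32) * 0.02
        self.b1 = np.zeros(llm_dim, dtype=np.float32)
        self.w2 = rng.standard_normal((llm_dim, llm_dim), dtype=np.float32) * 0.02
        self.b2 = np.zeros(llm_dim, dtype=np.float32)

    def __call__(self, vision_tokens):
        hidden = gelu(vision_tokens @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

# Stand-ins for the frozen encoder output and the LLM's embedded text prompt
vision_features = rng.standard_normal((256, 1024), dtype=np.float32)  # from the ViT
text_embeddings = rng.standard_normal((12, 4096), dtype=np.float32)   # from the LLM

projector = MLPProjector()
visual_tokens = projector(vision_features)                      # (256, 4096)
llm_input = np.concatenate([visual_tokens, text_embeddings])    # (268, 4096)
print(visual_tokens.shape, llm_input.shape)
```

Only the projector's weights would be trained in this setup. The LLM then consumes the 268-row sequence exactly as it would a sequence of ordinary text embeddings.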

3. Native Multimodal

Vision and language trained together from scratch.

  • Best cross-modal reasoning
  • Most expensive to train
  • Examples: GPT-4o, Gemini
  • Unified architecture: no separate connector needed
VLM Architecture: ViT + MLP Projector + LLM (e.g., LLaVA-1.5 design)
[Figure: an input chart image passes through a frozen ViT-L image encoder (256 tokens × 1024), then a trainable MLP projector (1024 → 4096). The 256 visual tokens are placed alongside the text tokens for "What is the highest bar?" and fed to the LLM (Llama-3 / Vicuna), which is frozen or LoRA fine-tuned and treats visual tokens like text tokens. Output: "The highest bar is March".]
🌐

GPT-4V / GPT-4o (OpenAI)

  • First frontier VLM (2023)
  • Read documents, analyse charts
  • Solve visual math, describe scenes
  • GPT-4o: native voice + vision + text
  • 128K token context
💎

Gemini 1.5 Pro (Google)

  • Natively multimodal from pre-training
  • 1M token context window
  • Process 1 hour of video or 1000 images
  • Images, video, audio, code, text
  • Best for long-document & video tasks
🦙

Open-Source VLMs

  • LLaVA-1.5: CLIP + MLP + Vicuna, a strong baseline
  • InternVL-2: competitive with GPT-4V on benchmarks
  • Qwen-VL: multilingual, multi-image
  • PaliGemma: SigLIP + Gemma, efficient open model
| Model | Architecture | Vision Encoder | LLM Base | Context | Notable |
| --- | --- | --- | --- | --- | --- |
| GPT-4V / 4o | Proprietary | Undisclosed | GPT-4 | 128K | Best overall, native voice+vision |
| Gemini 1.5 Pro | Native multimodal | Proprietary | Gemini | 1M tokens | Long video, multi-image |
| Claude 3.5 Sonnet | Proprietary | Undisclosed | Claude 3.5 | 200K | Document analysis, charts |
| LLaVA-1.5 | ViT+Projector | CLIP ViT-L/336 | Vicuna-13B | 4K | Strong open baseline |
| InternVL-2 | ViT+MLP | InternViT-6B | InternLM2-20B | 8K | Near-frontier open |
| Qwen-VL-Plus | ViT+Adapter | Qwen ViT | Qwen-7B | 8K | Multilingual, multi-image |
| PaliGemma | ViT+Linear | SigLIP-So400M | Gemma-2B/9B | 8K | Open, small, efficient |
Multimodal Benchmark Tasks: chart QA, document QA, spatial reasoning, scene understanding
[Figure, four panels with approximate scores. Chart Understanding (ChartQA): bar chart for Jan–Apr, Q: "Highest month?"; GPT-4V 78.5%, open models ~75%. Document Understanding (DocVQA): a receipt totalling $24.00, Q: "What is total?"; GPT-4V 87.2%, open models ~83%. Spatial Reasoning (GQA / CLEVR): Q: "Circle left of square?"; GPT-4V ~82%, human ~90%. Scene Understanding (VQAv2): Q: "How many people?"; GPT-4V 77.2%, human 80.9%.]
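| Benchmark | Tests | Format | GPT-4V | Best Open | Human |
| --- | --- | --- | --- | --- | --- |
| VQAv2 | General visual QA | Open-ended | 77.2% | ~75% | 80.9% |
| TextVQA | Text in images | Open-ended | 78.0% | 76.1% | ~85% |
| DocVQA | Document understanding | Open-ended | 87.2% | 82.6% | 96% |
| ChartQA | Chart comprehension | Open-ended | 78.5% | 74.8% | 80.5% |
| MMMU | University-level multimodal | MCQ | 56.8% | 49.3% | 56.2% |
| MMBench | Comprehensive multimodal | MCQ | 75.8% | 72.4% | n/a |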
Benchmark Saturation

Many multimodal benchmarks are rapidly saturating: models now approach human-level scores on ChartQA (78.5% vs 80.5%) and VQAv2 (77.2% vs 80.9%), although the reported DocVQA score (87.2%) still trails the 96% human baseline. MMMU (university-level expert questions across 30 subjects) remains the most challenging, with GPT-4V's 56.8% only just matching the 56.2% human reference. New, harder benchmarks (MMMU-Pro, MathVista) are being developed constantly.

Chapter 6.7: Summary

  • CLIP: image and text encoders trained jointly via contrastive loss on 400M image-text pairs, producing a shared embedding space
  • Shared embedding: "a photo of a dog" and a dog photo have similar vectors, which is the key to zero-shot capabilities
  • CLIP zero-shot: 76.2% on ImageNet with no task-specific training, just text prompts per class
  • DALL-E 1: autoregressive over VQ-VAE tokens; DALL-E 2/3: CLIP + diffusion = higher fidelity + better instruction following
  • VLMs: vision encoder → MLP projector → LLM, with visual tokens treated exactly like text tokens inside the language model
  • Frontier models: GPT-4V, Gemini 1.5 Pro (1M context), Claude 3.5 Sonnet lead on benchmarks
  • Open alternatives: LLaVA-1.5, InternVL-2, PaliGemma, competitive with proprietary models on most benchmarks
  • Key benchmarks: VQAv2, DocVQA, ChartQA, MMMU; MMMU remains hardest, most others nearly saturated
Chapter 6.8
Video Understanding & 3D Vision

A video is not just a sequence of images: it is time, motion, causality, and physics. 3D vision goes further: understanding the world as volumetric space, not flat projections. These are the hardest problems in computer vision, and also the most important for autonomous systems that must act in the real world.

Video adds the temporal dimension to images: motion, change, causality, events. Processing frames independently with image models misses all temporal information, so you can't tell if a person is walking left or right. Adjacent frames are also ~95% identical pixels, making naive processing extremely redundant.

⏱️

Temporal Modelling

Must capture short-term motion (running, gestures) AND long-term events (scoring a goal over 10 seconds). Both time scales matter.

💻

Computational Cost

30fps × 1 min = 1,800 frames. Processing all of them at full resolution is impractical. Solutions: sampling, temporal pooling, sparse attention.
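The sampling and temporal-pooling options can be illustrated with NumPy. The sketch assumes a 60-second clip at 30fps and a hypothetical budget of 16 frames; the 768-dimensional per-frame features are random stand-ins for the output of an image backbone.

```python
import numpy as np

def uniform_sample_indices(num_frames, num_samples):
    """Pick the centre frame of each of num_samples equal-length segments."""
    edges = np.linspace(0, num_frames, num_samples + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return centres.astype(int)

total_frames = 30 * 60                        # 30fps x 60s = 1,800 frames
indices = uniform_sample_indices(total_frames, 16)
print(indices[:3], indices[-1])               # [ 56 168 281] 1743

# Temporal pooling: average per-frame features into one clip-level vector
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(len(indices), 768))   # one vector per sampled frame
clip_feature = frame_features.mean(axis=0)              # (768,)
print(clip_feature.shape)
```

Uniform sampling keeps coverage of the whole clip at a fixed cost, but fast motion between sampled frames is lost. That is the trade-off sparse attention and dual-rate designs try to soften.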

🔀

Temporal Alignment

Two videos of "making coffee" have the same steps in different order and timing. Models must be robust to temporal variation.

Video as Spatial-Temporal Volume: (T, C, H, W) tensor
[Figure: frames t=1 to t=6 stacked along the time axis, with H, W, and T labelled. One second of video is a (T, C, H, W) = (30, 3, 224, 224) tensor; motion is the change in pixel or region position across dimension T. At 30fps × 60s = 1,800 frames per minute, not every frame can be processed.]

Optical flow computes a dense per-pixel motion vector field between two consecutive frames. For each pixel (x, y) in frame t, it asks: where does this pixel move to in frame t+1? The result is an H×W×2 flow field (Δx, Δy per pixel). Used in action recognition, video stabilisation, compression, and motion segmentation.

Classical Methods

  • Lucas-Kanade (1981): sparse flow on corners/keypoints; fast, robust
  • Horn-Schunck (1981): dense regularised flow; smooth but slow
  • Farneback (2003): dense flow via polynomial expansion
  • All assume: brightness constancy + small motion
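The brightness-constancy assumption leads to the linear constraint Ix·u + Iy·v + It = 0 at every pixel, which Lucas-Kanade solves by least squares over a window. The NumPy sketch below estimates a single flow vector for one 64×64 window; the synthetic sinusoidal pattern and the sub-pixel shift of (0.5, 0.3) are assumptions chosen so the true motion is known.

```python
import numpy as np

def lucas_kanade_window(frame1, frame2):
    """Estimate one (dx, dy) flow vector for a whole window by least squares."""
    iy, ix = np.gradient(frame1)      # spatial gradients along rows (y) and columns (x)
    it = frame2 - frame1              # temporal gradient
    # Drop the border, where np.gradient falls back to one-sided differences
    ix, iy, it = (a[1:-1, 1:-1].ravel() for a in (ix, iy, it))
    a = np.stack([ix, iy], axis=1)    # one constraint row per pixel
    flow, *_ = np.linalg.lstsq(a, -it, rcond=None)
    return flow

def pattern(x, y):
    """Smooth synthetic texture, so that the small-motion linearisation holds."""
    return np.sin(x / 5.0) + np.cos(y / 7.0)

ys, xs = np.mgrid[0:64, 0:64].astype(np.float64)
true_dx, true_dy = 0.5, 0.3
frame1 = pattern(xs, ys)
frame2 = pattern(xs - true_dx, ys - true_dy)   # content moves by (+0.5, +0.3) pixels

flow = lucas_kanade_window(frame1, frame2)
print(flow)   # close to [0.5, 0.3]
```

The estimate is only approximate because the constraint is a first-order linearisation. For motions of several pixels, practical implementations apply the same step on a coarse-to-fine image pyramid.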

Deep Learning Methods

  • FlowNet (2015): first end-to-end CNN for optical flow
  • PWC-Net (2018): coarse-to-fine with cost volume
  • RAFT (2020): iterative refinement on 4D correlation volume; SOTA
  • RAFT generalises across domains with no explicit motion assumptions
Optical Flow: per-pixel motion vectors between consecutive frames
[Figure: Frame t shows an object at x=80; Frame t+1 shows it moved to x=110 (+30px). The flow field (Δx, Δy) has red arrows for rightward motion (Δx=+30, Δy≈-2) and grey for the near-zero static background. A colour wheel encodes direction: right, left, up, down.]

The history of video understanding mirrors the history of image understanding: hand-crafted features → CNNs → Transformers. Each generation solved the temporal modelling problem differently.

Video Architecture Evolution: Two-Stream → 3D Conv → Video Transformer
[Figure, three panels. 2014, Two-Stream Networks: RGB frames → CNN₁ and optical flow → CNN₂, fused → action class; slow because optical flow is expensive. 2015, 3D Convolutions (C3D / I3D): T stacked frames → 3D conv with K×K×T kernel → pool → FC → action class; a single network with no flow needed. 2021, Video Transformer (TimeSformer): T×N patch tokens → spatial attention (within frame) → temporal attention (across frames) → action class; best accuracy, scalable, no optical flow needed.]
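| Model | Year | Architecture | Temporal Modelling | Accuracy (Kinetics-400 unless noted) | Speed |
| --- | --- | --- | --- | --- | --- |
| Two-Stream | 2014 | Dual CNN | Optical flow | 88.0% (UCF-101) | Slow (flow) |
| C3D | 2015 | 3D CNN | 3D convolution | 79.9% | Moderate |
| I3D | 2017 | Inflated 3D | 3D conv (ImageNet init) | 95.6% (UCF-101) | Moderate |
| R(2+1)D | 2018 | Factorised 3D | 2D spatial + 1D temporal | 96.8% (UCF-101) | Moderate |
| SlowFast | 2019 | Dual-speed CNN | Slow (semantics) + Fast (motion) | 79.0% (val) | Fast |
| TimeSformer | 2021 | ViT + Attn | Factorised spatial+temporal attn | 80.7% | Moderate |
| VideoMAE-H | 2022 | ViT-H MAE | Masked video pre-training | 86.6% | Moderate |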

Video generation is dramatically harder than image generation: objects must stay consistent across hundreds of frames, motion must follow physics, and coherent storylines span many seconds. The field progressed rapidly from 2022 to 2024.

  • 2022: Imagen Video (Google), cascaded diffusion
  • 2022: Make-A-Video (Meta), spacetime attention
  • 2023: Gen-2 (Runway), commercial
  • 2023: Pika 1.0, accessible short clips
  • Feb 2024: Sora (OpenAI), 1 minute of coherent video ⭐

Sora Deep Dive: Spacetime Patches

Sora's key innovation: treat video as a sequence of spacetime patches rather than frames. A spacetime patch spans Δt × Δh × Δw, capturing motion intrinsically. These patches become tokens for a Diffusion Transformer (DiT), replacing the U-Net with a scalable Transformer architecture. This allows Sora to generate videos of variable duration, resolution, and aspect ratio from a single model.
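Cutting a video volume into spacetime patches is a reshape. The NumPy sketch below uses the (T, C, H, W) layout from earlier in the chapter; the patch size of 4 frames × 16 × 16 pixels and the 16-frame 64×64 clip are hypothetical, since Sora's actual patch dimensions are not given here.

```python
import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    """Split a (T, C, H, W) video into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * C * ph * pw), one row per token.
    """
    t, c, h, w = video.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0, "dims must divide evenly"
    x = video.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    x = x.transpose(0, 3, 5, 1, 2, 4, 6)   # (t-blocks, h-blocks, w-blocks, pt, C, ph, pw)
    return x.reshape(-1, pt * c * ph * pw)

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 3, 64, 64)).astype(np.float32)   # 16 RGB frames of 64x64

tokens = spacetime_patchify(video, pt=4, ph=16, pw=16)
print(tokens.shape)   # (64, 3072): 4 x 4 x 4 patches, each 4 * 3 * 16 * 16 values
```

Because each token already spans several frames, a longer or larger video simply yields more tokens. That property is what lets one Transformer handle variable duration and resolution.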

Sora's Spacetime Patches: video as 3D spatiotemporal token sequence
[Figure, two panels. Frame-by-Frame (old approach): frames t=1 to t=5 are processed individually and temporal consistency is forced afterwards, which is hard to maintain over many frames. Spacetime Patches (Sora): the video volume (T×H×W) is divided into 3D patches of size Δt×Δh×Δw that feed a Diffusion Transformer; motion is encoded natively in each patch, and any resolution or duration comes from the same model. Result: up to 1 minute of coherent video.]

Depth estimation predicts the distance from camera to each pixel, producing an H×W depth map. Monocular depth (single camera) is inherently ambiguous: a nearby toy car looks like a distant real car. Deep networks now resolve much of this ambiguity by training on stereo pairs or synthetic data, from which they learn scale cues such as perspective and familiar object size.

Monocular Depth

One RGB image → depth map. Scale is ambiguous, so learned priors are required.

  • MiDaS (Intel 2020): relative depth, robust
  • DPT (2021): ViT backbone for accuracy
  • Depth Anything v2 (2024): 62M images, SOTA foundation model

Stereo Depth

Two cameras with known baseline → triangulate from disparity (pixel shift).

  • Absolute scale available (unlike monocular)
  • Standard in autonomous driving hardware
  • IGEV-Stereo, RAFT-Stereo: learned stereo
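For a rectified stereo pair, triangulation reduces to depth = focal length × baseline / disparity. A minimal sketch follows; the 700-pixel focal length and 0.5 m baseline are illustrative values, not taken from any particular rig.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_px * baseline_m / disparity_px

focal_px = 700.0      # focal length in pixels (illustrative)
baseline_m = 0.5      # distance between the two cameras in metres (illustrative)

near = depth_from_disparity(35.0, focal_px, baseline_m)   # large shift -> close object
far = depth_from_disparity(7.0, focal_px, baseline_m)     # small shift -> distant object
print(near, far)      # 10.0 50.0
```

Depth is inversely proportional to disparity, so a one-pixel matching error matters far more for distant objects than for near ones. Wider baselines are used when long-range accuracy is needed.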

RGB-D Sensors

Direct depth measurement hardware.

  • Structured light: Kinect, Intel RealSense
  • Time-of-Flight: iPhone LiDAR, Velodyne
  • Used in: AR, robotics, autonomous vehicles
Depth Map: per-pixel distance from camera (red=near, blue=far)
[Figure: an RGB image with foreground, midground, and sky passes through a depth model to give a pseudo-colour depth map, with example depths of ~1.5m, ~10m, and ~∞ (sky) on a colour scale from near (0m) to far (50m+).]

A point cloud is an unordered set of 3D points, each an (x, y, z) coordinate. LiDAR sensors emit laser pulses and measure return time to build 3D point clouds, at rates of around 1.3M points/second on high-end sensors. Unlike images, point clouds have no grid structure, have variable density, and capture only visible surfaces.

Image Grid vs Point Cloud: ordered pixels vs unordered 3D points
[Figure, two panels. Image (ordered pixel grid): an H×W×3 ordered matrix, dense, uniform, 2D, and CNN-friendly. Point cloud (unordered 3D set): dense near the sensor, sparse far away, with occluded regions missing; with no grid, it needs a special network such as PointNet.]
PointNet Key Insight

PointNet (Qi et al., 2017): process each point independently with a shared MLP, then aggregate via max pooling. Max pooling over all points is permutation-invariant: it doesn't matter what order you feed the points in, you get the same result. PointNet++ extends this with hierarchical local feature extraction (like a CNN for point clouds).
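The permutation-invariance claim is easy to check numerically. The sketch below uses a two-layer shared MLP with random, untrained weights and layer widths of 64 and 128 chosen for illustration; the full PointNet also includes input and feature transform networks that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared MLP weights: the same matrices are applied to every point
w1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, 128)), np.zeros(128)

def pointnet_global_feature(points):
    """Map an (N, 3) point cloud to a single 128-dim global feature."""
    h = np.maximum(points @ w1 + b1, 0.0)   # per-point layer 1 + ReLU
    h = np.maximum(h @ w2 + b2, 0.0)        # per-point layer 2 + ReLU
    return h.max(axis=0)                    # symmetric aggregation over points

cloud = rng.normal(size=(1024, 3))
shuffled = cloud[rng.permutation(len(cloud))]   # same points, different order

feature = pointnet_global_feature(cloud)
feature_shuffled = pointnet_global_feature(shuffled)
print(feature.shape, np.allclose(feature, feature_shuffled))
```

Any symmetric function (max, sum, mean) would give order independence. Max pooling is the choice described above, and it makes the global feature depend on a small set of extreme points per channel.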

Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) learn a 3D scene representation from 2D photos: given any new camera viewpoint, the model synthesises a photorealistic image. A small MLP maps (x, y, z, direction) → (colour, density). Volume rendering integrates these values along camera rays. Given 20–100 posed photos, the model can synthesise any novel angle.
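The volume-rendering step can be sketched for a single ray using the standard discrete quadrature: each sample contributes its colour weighted by its own opacity and by the transmittance of everything in front of it. The four hand-picked samples below stand in for MLP outputs and are assumptions for illustration.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colours into one pixel colour.

    sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) segment lengths.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # opacity of each segment
    # Transmittance: fraction of light that survives all earlier segments
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Ray through empty space (two samples), then a dense red surface hiding a green one
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
deltas = np.full(4, 0.25)

rgb, weights = render_ray(sigmas, colors, deltas)
print(np.round(rgb, 3), np.round(weights, 3))   # pixel is essentially pure red
```

Empty samples get zero weight regardless of their colour, and the green sample is almost entirely occluded by the red one. Because every operation is differentiable, the photo reconstruction error can be backpropagated to the MLP.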

3D Gaussian Splatting (Kerbl et al., 2023) replaces NeRF's implicit MLP with explicit 3D Gaussians, each with a centre, covariance shape, colour, and opacity. Rendering projects Gaussians to 2D and rasterises directly on GPU. Result: real-time novel view synthesis at 30fps, better quality, and 30-minute training vs NeRF's hours.
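Each Gaussian's covariance shape is usually stored as a rotation quaternion plus per-axis scales, which are combined as Σ = R S Sᵀ Rᵀ so that the matrix stays symmetric positive semi-definite during optimisation. The sketch below assumes a (w, x, y, z) quaternion order and illustrative scale values.

```python
import numpy as np

def quat_to_rotation(q):
    """Rotation matrix from a quaternion given as (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)      # normalise so the result is a pure rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_covariance(quat, scales):
    """Sigma = R S S^T R^T for one 3D Gaussian."""
    rot = quat_to_rotation(np.asarray(quat, dtype=np.float64))
    m = rot @ np.diag(scales)
    return m @ m.T

scales = np.array([2.0, 1.0, 1.0])           # elongated along its local x axis

sigma_identity = gaussian_covariance([1.0, 0.0, 0.0, 0.0], scales)
half = np.sqrt(0.5)                          # cos(45 deg) = sin(45 deg)
sigma_rotated = gaussian_covariance([half, 0.0, 0.0, half], scales)  # 90 deg about z

print(np.round(sigma_identity, 3))           # diag(4, 1, 1)
print(np.round(sigma_rotated, 3))            # diag(1, 4, 1): long axis now along y
```

Optimising quaternions and scales instead of the six free covariance entries means gradient steps can never produce an invalid (non-PSD) covariance.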

🧠

NeRF (2020)

  • MLP: (x,y,z,dir) → (colour, density)
  • Volume rendering along camera rays
  • Training: hours, Rendering: seconds/frame
  • Many variants: Instant-NGP (fast), Mip-NeRF (anti-aliased)
  • Foundation for all novel-view synthesis methods
🌟

3D Gaussian Splatting (2023)

  • Explicit 3D Gaussians: centre + shape + colour + opacity
  • Rasterise to screen: GPU-native, differentiable
  • Training: ~30 min, Rendering: 30fps real-time
  • Better quality, sharper edges than NeRF
  • Most impactful 3D vision paper of 2023
NeRF vs 3D Gaussian Splatting: implicit neural vs explicit Gaussian representation
[Figure, two panels. NeRF (implicit MLP): photos from cam1, cam2, cam3 train an MLP (x,y,z,dir) → (c, σ) that renders a new view; training takes hours and rendering takes seconds per frame. 3D Gaussian Splatting: Gaussians are projected to the screen; training takes ~30 min and rendering runs at 30fps in real time. Both give any viewpoint from photos; NeRF is slow, Gaussian Splatting is real-time.]
# gsplat: fast differentiable Gaussian splatting renderer
import torch
from gsplat import rasterization

def render_gaussians(
    means: torch.Tensor,      # (N, 3): Gaussian centres in 3D
    quats: torch.Tensor,      # (N, 4): quaternion rotations
    scales: torch.Tensor,     # (N, 3): scale along each axis
    opacities: torch.Tensor,  # (N,)  : opacity values
    colors: torch.Tensor,     # (N, 3): RGB colours
    viewmat: torch.Tensor,    # (C, 4, 4): camera extrinsics
    K: torch.Tensor,          # (C, 3, 3): camera intrinsics
    width: int,
    height: int
) -> torch.Tensor:
    # Differentiable rasterisation via tile-based splatting
    renders, alphas, meta = rasterization(
        means=means, quats=quats, scales=scales,
        opacities=opacities, colors=colors,
        viewmats=viewmat, Ks=K,
        width=width, height=height
    )
    return renders  # (C, H, W, 3) rendered images

# Training: optimise Gaussian parameters to minimise L1 + SSIM vs training images
# Init from SfM point cloud → optimise ~30 min on RTX 4090 → real-time 30fps render

🎓 Domain 6 Complete: Computer Vision & Multimodal AI

  • Ch 6.1: Images = 3D tensors (C,H,W), batched as (N,C,H,W); always normalise with ImageNet mean/std for pre-trained models. Canny, HOG, and SIFT dominated before 2012.
  • Ch 6.2: AlexNet 2012 = the inflection point. ResNet skip connections F(x)+x solved depth degradation; EfficientNet compound-scales depth, width, and resolution jointly.
  • Ch 6.3: Detection = localise + classify all objects. YOLO: single forward pass predicts all boxes at 30–160fps. IoU and mAP are the standard metrics.
  • Ch 6.4: Segmentation = per-pixel labelling. U-Net: symmetric encoder-decoder with concatenation skip connections. SAM: promptable zero-shot segmentation (click any point, get a mask).
  • Ch 6.5: GAN: Generator fools Discriminator via minimax game. StyleGAN: mapping network + AdaIN style injection per resolution = photorealistic faces. CycleGAN: unpaired domain translation.
  • Ch 6.6: ViT: image as 16×16 patch tokens fed into a Transformer. Needs large pre-training data (~14M+); Swin adds hierarchy and shifted-window attention for detection/segmentation.
  • Ch 6.7: CLIP: shared image-text embedding via contrastive learning on 400M pairs → 76.2% zero-shot ImageNet. VLMs: ViT + MLP projector + LLM = visual question answering at scale.
  • Ch 6.8: Video = (T,C,H,W) tensor; Sora treats it as spacetime patches in a Diffusion Transformer. 3D Gaussian Splatting: real-time novel-view synthesis from photos. NeRF, depth maps, and LiDAR power autonomous systems.

Domain 6 traces how AI learned to see, understand, and recreate the visual world. The key progression: raw pixels → hand-crafted features (HOG, SIFT) → learned features (CNNs) → global attention (ViT) → language-grounded vision (CLIP, VLMs). Multimodal AI bridges Domain 6 and Domain 5: vision and language are converging into unified foundation models that process any modality through shared embedding spaces and transformer architectures.