AI Foundation · Domain 05 · Chapter 5.1

NLP Fundamentals

How computers learn to read — text processing, linguistic structure, and the pipeline from raw text to model input

Chapter 5.1

NLP Fundamentals & Text Preprocessing

Natural language is the most information-dense medium humans have ever created. Every sentence carries meaning at multiple levels simultaneously — lexical, syntactic, semantic, pragmatic. Teaching a machine to process all of these layers reliably is the central challenge of NLP.

What Is NLP? Core

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, process, and generate human language. This sounds straightforward until you confront what language actually is: an ambiguous, context-dependent, culturally loaded communication system that has been evolving for at least tens of thousands of years. Few sentences carry a single, unambiguous meaning independent of context, and a machine must learn to handle every level of this ambiguity simultaneously.

Language ambiguity operates at four distinct levels, each building on the one below. Lexical ambiguity arises when a single word has multiple meanings: "bank" can mean a financial institution or the edge of a river. Syntactic ambiguity occurs when sentence structure is unclear: "I saw the man with the telescope" leaves open whether the speaker or the man has the telescope. Semantic ambiguity involves sentences whose meaning is unclear even once the structure is resolved: "Every student read a book" may mean one shared book or a different book for each student. Pragmatic ambiguity is the deepest, because the intended meaning differs from the literal one: "Can you pass the salt?" is literally a question about capability but functions as a request, and "It's cold in here" is an observation that functions as a request to close the window, but only if you already know the conversational norms.

The history of NLP traces a clear arc. Rule-based systems (1950s–1980s) used hand-crafted grammars and dictionaries, which were brittle and language-specific. Statistical NLP (1990s–2000s) replaced rules with probabilities learned from corpora. Neural NLP (2013–2017) used word embeddings (Word2Vec, GloVe) and RNNs to learn representations directly from data. Transformer-era NLP (2018–present) introduced BERT, GPT, and their successors: models that learn remarkably general language representations from massive corpora and have displaced most earlier approaches on mainstream tasks.

🔤

NLP Understanding Tasks

Text classification
Named entity recognition
Sentiment analysis
Question answering
Natural language inference
Coreference resolution

✍️

NLP Generation Tasks

Machine translation
Text summarisation
Dialogue / chatbots
Text completion
Code generation
Data-to-text narration

Four Levels of Language Ambiguity — each layer builds on the one below

Text Preprocessing In-depth

Before any model can process text, it must be transformed from raw characters into a form the model understands. Classical NLP pipelines involve a series of hand-engineered preprocessing steps, each reducing noise and normalising vocabulary. We trace each step using: "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running."

Step 1: Lowercasing. Convert all characters to lowercase. "The" and "the" are the same word, and keeping both wastes vocabulary slots. On English text this can shrink the vocabulary noticeably (figures of 10–30% are often quoted, though the exact saving depends on the corpus). The cost is that case-based distinctions such as "US" vs "us" are lost.

Step 2: Punctuation & special character removal. Strip characters that carry no lexical meaning for bag-of-words models. Important caveat: this is not always appropriate, because punctuation carries meaning in some contexts (U.S.A, 3.14, emoticons, code). Remove selectively based on the task.

Step 3: Tokenisation. Split text into meaningful units (tokens). The naïve approach is whitespace splitting. Better approaches handle contractions ("they've" → ["they", "'ve"]) and punctuation. Chapter 5.2 covers subword tokenisation for neural models in depth.

Step 4: Stopword removal. Remove high-frequency words ("the", "is", "a") that carry little semantic weight in bag-of-words models. Critical warning: never remove stopwords for neural models or sequence tasks, where position and function words are often critical to meaning ("not" changes everything, yet it appears on most stopword lists).

Step 5: Stemming. Reduce words to a root form by stripping suffixes using heuristic rules. Porter Stemmer: "jumping" → "jump", "foxes" → "fox". Fast but imprecise: "university" → "univers", which is not a word and is also the stem of "universe". Two words with the same stem may not share meaning.

Step 6: Lemmatisation. Reduce words to their dictionary form (lemma) using linguistic knowledge. "better" → "good", "ran" → "run", "foxes" → "fox". More accurate than stemming, but it requires a lexicon such as WordNet and, ideally, a part-of-speech tag: "saw" → "see" (verb) or "saw" (noun) depending on the POS tag.

Step 7: Text normalisation. Expand contractions ("they've" → "they have"), normalise Unicode, standardise numbers, handle abbreviations and acronyms.

⚡ Modern NLP Note

Neural models and LLMs do NOT use most of these preprocessing steps. They process raw subword tokens (Chapter 5.2) directly from near-original text. These classical steps are for bag-of-words models, TF-IDF search engines, and traditional ML feature engineering. If you are building anything with BERT, GPT, or similar — skip everything except basic Unicode normalisation.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads of the NLTK data used below (newer NLTK releases need punkt_tab)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

text = "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running."

# Step 1: lowercase
text_lower = text.lower()
# → "the quick brown foxes are jumping over lazy dogs! they've been running."

# Step 7 (applied early): expand the contraction so "they've" does not collapse into "theyve"
text_expanded = text_lower.replace("'ve", " have")

# Step 2: remove punctuation
text_clean = re.sub(r"[^\w\s]", "", text_expanded)

# Step 3: tokenise
tokens = word_tokenize(text_clean)
# → ['the', 'quick', 'brown', 'foxes', 'are', 'jumping', ...]

# Step 4: remove stopwords (classical NLP only, NOT for LLMs)
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]
# → ['quick', 'brown', 'foxes', 'jumping', 'lazy', 'dogs', 'running']

# Step 6: lemmatise. pos='v' treats every token as a verb, a simplification
# that happens to work for this sentence; real pipelines pass each token's POS tag.
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
# "foxes" → "fox", "jumping" → "jump", "running" → "run"
# Final: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'run']

# For a Hugging Face / LLM pipeline, just pass the raw text:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens_bert = tokenizer(text, return_tensors="pt")  # return_tensors="pt" requires PyTorch
# BERT handles subword tokenisation internally, so no preprocessing is needed

Classical NLP Preprocessing Pipeline — NOT used with modern neural models

Linguistic Structure Core

Even if you never build a classical NLP pipeline, understanding key linguistic concepts will help you reason about what language models are learning, diagnose failure modes, and work effectively with hybrid systems.

Part-of-Speech (POS) Tagging. Label each word with its grammatical role: Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), Determiner (DT), Preposition (IN). "The cat sat on the mat" → [DT, NN, VBD, IN, DT, NN]. POS tags disambiguate words with multiple roles — "run" is a verb in "I run" but a noun in "a home run". Used in: NER, information extraction, classical feature engineering.

Named Entity Recognition (NER). Identify and classify spans of text as named entities: PERSON, ORG, GPE (geopolitical entity), DATE, MONEY. "Apple's Tim Cook announced on Monday that sales exceeded $100B" → [Apple: ORG], [Tim Cook: PERSON], [Monday: DATE], [$100B: MONEY]. Modern BERT-based NER achieves near-human F1 scores on standard benchmarks.

Dependency Parsing. Identify grammatical relationships between words — subject, object, modifier, etc. "The cat chased the mouse" → "cat" is the nominal subject (nsubj) of "chased"; "mouse" is the direct object (dobj). Dependency parses are directed graphs enabling extraction of who did what to whom.

Coreference Resolution. Determine which mentions refer to the same entity. "When Mary arrived, she said she was tired": both "she" instances refer to Mary. Crucial for coherent understanding across sentences. SpanBERT-based systems achieved state-of-the-art results when introduced by scoring candidate mention spans and the links between them.

NER and POS Tagging — identifying entities and grammatical roles in a sentence

Traditional NLP Methods Reference

Before neural networks dominated NLP, three core representation methods powered almost every text application. They remain useful for lightweight tasks, interpretable systems, and benchmarking. Understanding them explains why neural embeddings were such a dramatic improvement.

Bag of Words (BoW) represents a document as a vector of word counts, completely discarding word order. "The cat sat" and "sat cat the" produce identical BoW vectors. Works well for document classification and spam filtering because topic is often determined by which words appear, not their order. Critical weakness: "not good" and "good" look nearly identical.
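The order-blindness described above can be seen in a few lines. This is a minimal sketch using only the standard library; the helper name `bag_of_words` and the two review sentences are illustrative assumptions, not part of any library.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring order."""
    return Counter(text.lower().split())

bow_a = bag_of_words("The cat sat")
bow_b = bag_of_words("sat cat the")
print(bow_a == bow_b)  # True: word order is discarded

# Negation weakness: the two reviews differ by a single count.
good = bag_of_words("the film was good")
not_good = bag_of_words("the film was not good")
print(not_good - good)  # Counter({'not': 1})
```

A classifier built on these vectors sees the negated review as the positive one plus one extra, very common word.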

TF-IDF improves on raw counts by weighting words by how rare they are across the corpus. A word that appears frequently in one document but rarely elsewhere is a likely topic word. "The" appears in every document — TF-IDF assigns it near-zero weight. "Convolutional" in an AI paper is rare and gets high weight. Still discards order, but much more informative than raw counts.

n-grams partially recover word order by treating sequences of n adjacent words as features. Bigrams of "the cat sat": ["the cat", "cat sat"]. This captures local context, but the number of possible features grows exponentially with n (up to Vⁿ for a vocabulary of size V). Neural language models, building on dense embeddings such as Word2Vec, have since displaced n-gram language models in most applications.
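A short sketch of n-gram extraction, plus the arithmetic behind the vocabulary explosion. The function name `ngrams` and the vocabulary size of 10,000 are assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat".split()
print(ngrams(tokens, 2))  # ['the cat', 'cat sat']

# Upper bound on distinct n-grams for a hypothetical 10,000-word vocabulary.
V = 10_000
for n in (1, 2, 3):
    print(n, V ** n)
```

Most of those possible n-grams never occur in real text, which is why n-gram feature vectors are so sparse.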

TF-IDF Formula
TF-IDF(t, d) = TF(t, d) × IDF(t)
TF(t, d) = count(t in d) / total_words(d)
IDF(t) = log(N / df(t))
where N = total documents in the corpus and df(t) = number of documents containing term t.
High TF-IDF: the term appears frequently in this document but rarely corpus-wide, so it is likely a key topic term.
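The formula can be checked by hand on a tiny corpus. The sketch below assumes the natural logarithm (the formula does not fix a base) and uses three made-up documents; production libraries usually add smoothing terms that are omitted here.

```python
import math
from collections import Counter

# Three hypothetical documents, already tokenised.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the convolutional network classified the cat".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one doc."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[2], docs))            # 0.0: "the" is in every document
print(tf_idf("convolutional", docs[2], docs))  # (1/6) * ln(3), about 0.183
```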

Method	Captures Order	Vector Size	Sparse?	Semantic Meaning	Best For
Bag of Words	No	V (vocab size)	Yes — mostly 0s	No	Doc classification, spam
TF-IDF	No	V (vocab size)	Yes	No	Search, document similarity
n-grams	Local only	Vⁿ (explodes!)	Very sparse	No	Language models (pre-neural)
Word2Vec	No (fixed window)	d (e.g. 300)	No — dense	Yes	Semantic similarity, analogies

NLP Task Taxonomy Core

NLP encompasses a wide range of tasks grouped by the type of output they produce. Understanding this taxonomy helps you choose the right architecture (encoder-only, decoder-only, encoder-decoder), the right loss function, and the right evaluation metric for any given problem.

NLP Task Taxonomy — from classification to generation

Task	Input	Output	Key Metric	Modern Model
Sentiment Analysis	Review text	Positive / Negative / Neutral	Accuracy, F1	BERT, RoBERTa
Named Entity Recognition	Sentence	Token-level labels (B-I-O)	F1 per entity type	BERT-CRF, SpanBERT
Machine Translation	Source language text	Target language text	BLEU score	T5, NLLB-200, GPT-4
Summarisation	Long document	Short summary	ROUGE score	BART, Pegasus, GPT-4
Question Answering	Context + question	Answer span / free text	Exact Match, F1	GPT-4, Claude, Llama

∑ Chapter 5.1 Summary — NLP Fundamentals & Text Preprocessing

Language has four layers of ambiguity: lexical, syntactic, semantic, pragmatic — all must be handled, each building on the one below
NLP history: rule-based (1950s–1980s) → statistical (1990s–2000s) → neural embeddings (2013) → Transformer LLMs dominant from 2018
Classical preprocessing: lowercase → tokenise → remove stopwords → lemmatise — NOT used with neural models (BERT, GPT)
Stemming is fast but imprecise (heuristic suffix rules); lemmatisation is accurate but requires WordNet + POS context
BoW and TF-IDF: sparse, order-independent representations — still useful for search, lightweight classification, interpretable systems
Key linguistic annotations: POS tagging, NER, dependency parsing, coreference resolution — used in classical and hybrid NLP pipelines
NLP splits into: understanding tasks (classification, extraction) and generation tasks — different architectures, loss functions, and metrics

Chapter 5.2

Tokenisation — Words to Subwords

Tokenisation is the invisible foundation of every language model. Before a single parameter is trained, the tokeniser decides how text will be represented as integers — and that decision shapes what patterns the model can learn, how efficiently it processes different languages, and how much it costs to run at inference time.

Why Tokenise? Core

Neural language models operate on numbers, not text. Every word, character, or subword must be mapped to an integer ID from a fixed vocabulary before it can be fed into the model. Tokenisation is this mapping — it converts a raw string into a sequence of integers, each representing a "token" from the vocabulary. The choice of what constitutes a token has profound consequences for the model's capabilities.

The vocabulary dilemma has three corners. Word-level tokenisation uses whole words as tokens. This is intuitive, but English has 170,000+ words, and with proper nouns, compounds, and morphological variants the vocabulary explodes into millions. Words not seen during training become [UNK] (unknown): the out-of-vocabulary (OOV) problem. Character-level tokenisation uses individual characters, giving a tiny vocabulary of ~128 ASCII characters but very long sequences. "Hello world" is 11 characters; a document of 1,000 words becomes ~6,000 characters, and attention's O(n²) complexity makes this expensive. Subword tokenisation is the sweet spot: keep common words as whole tokens and split rare or unknown words into subword pieces. "unhappiness" → ["un", "##happi", "##ness"]: known pieces, no OOV, reasonable sequence length.

Modern LLMs almost universally use subword tokenisation, with vocabularies ranging from roughly 32,000 to well over 100,000 tokens. GPT-4 uses 100,277 tokens; LLaMA-3 uses 128,256. The vocabulary is fixed at training time and cannot be changed without retraining the model, which makes the tokeniser one of the most consequential design decisions in LLM development.

Tokenisation Granularity — Character vs Subword vs Word

Word & Character Tokenisers Core

Understanding why pure word and character tokenisers were abandoned helps clarify the design goals of modern subword tokenisers. Both extremes have fundamental problems that subword methods resolve.

Word tokenisation splits on whitespace and punctuation. Problems pile up quickly. Contractions: "don't" — is that one token or two ("do", "n't")? Hyphenated compounds: "state-of-the-art" — one or four? Morphological variants: "run", "running", "ran", "runs" require four separate vocabulary entries, even though they share meaning. Proper nouns, technical terms, and misspellings not seen during training become [UNK] — the model sees a blank where information should be. The English vocabulary alone exceeds 170,000 words; with all languages and domains, a truly universal word vocabulary would require millions of entries.

Character tokenisation has no OOV problem — the alphabet is fixed. But it fragments language into meaningless units from the model's perspective. The word "hello" becomes 5 separate tokens [h][e][l][l][o]. The model must learn from scratch that these 5 tokens together form a word unit — it cannot start with the useful prior that words are meaningful. More critically, sequence length explodes. A 1,000-word essay becomes ~6,000 character tokens. Transformer attention is O(n²) in sequence length — doubling the sequence length quadruples the compute cost. In practice, character-level models were impractical at scale.

Aspect	Word Tokenisation	Character Tokenisation
"unhappiness"	[unhappiness] (1 token)	[u][n][h][a][p][p][i][n][e][s][s] (11 tokens)
Out-of-vocabulary	"unhappinesses" → [UNK], information lost	None: any string is representable
Vocabulary	100K–1M+ words needed	~128 ASCII characters or 256 bytes
Sequence length	Short and efficient	Very long, with O(n²) attention cost
What is lost	Morphology: run/running/ran = 3 separate IDs	Word structure: model must learn groupings from scratch

⚡ Byte-Level Note

Byte-level tokenisation tokenises raw UTF-8 bytes (0–255) rather than characters. Every document is representable — there is no OOV at the byte level. GPT-2 used byte-level BPE: start from 256 byte tokens, then apply BPE merges. This handles multilingual text naturally and is fully language-agnostic. GPT-4's tiktoken also uses byte-level BPE.
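To make the byte-level idea concrete, the sketch below maps text to its UTF-8 byte values and back. The helper names are illustrative assumptions; a real byte-level BPE tokeniser would go on to merge frequent byte pairs into larger tokens.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Base vocabulary of byte-level tokenisation: one ID (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list[int]) -> str:
    return bytes(tokens).decode("utf-8")

for text in ["hello", "héllo", "你好"]:
    ids = to_byte_tokens(text)
    print(f"{text!r}: {len(text)} characters -> {len(ids)} byte tokens {ids}")

# Any string round-trips, so there is no out-of-vocabulary case.
assert from_byte_tokens(to_byte_tokens("你好, wörld")) == "你好, wörld"
```

ASCII characters take one byte each, "é" takes two, and each of the two Chinese characters takes three, which is why non-English text tends to produce more tokens before merges are learned.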

Byte-Pair Encoding (BPE) In-depth

Byte-Pair Encoding began as a data compression algorithm (Gage, 1994) and was adapted for subword vocabulary construction in NLP by Sennrich et al. (2016). It is used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Falcon, and most modern decoder-based language models. The key insight is elegant: let the training data decide what the vocabulary tokens should be, by iteratively merging the most frequent adjacent pairs.

The BPE algorithm is a simple loop. Initialise with a character vocabulary (or byte vocabulary for byte-level BPE). Count all adjacent token pairs across the entire training corpus. Merge the most frequent pair into a new single token and add it to the vocabulary. Repeat until the target vocabulary size is reached. The result: common words like "the", "and", "is" become single tokens; common morphological patterns like "-ing", "-tion", "un-" become tokens; rare words are split into recognisable subword pieces.

BPE Worked Example — Small Corpus

Corpus: "low"×2, "lower", "newer", "wider", "new"×2

Start: characters → [l][o][w] / [l][o][w][e][r] / [n][e][w][e][r] / ...

Count pairs: (e,r)=3, (l,o)=3, (o,w)=3, (n,e)=3, ...

Merge 1: (e,r) → "er" New vocab: {... er}

[l][o][w] / [l][o][w][er] / [n][e][w][er] / [w][i][d][er] / [n][e][w]

Merge 2: (l,o) → "lo" New vocab: {... er, lo}

[lo][w] / [lo][w][er] / [n][e][w][er] / [w][i][d][er] / [n][e][w]

Merge 3: (lo,w) → "low" New vocab: {... er, lo, low}

[low] / [low][er] / [n][e][w][er] → "lower" = ["low","er"]

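Final (after 3 merges): "lower" → ["low","er"] | "newer" → ["n","e","w","er"] | "new" → ["n","e","w"]. A fourth merge of (n,e), which still has count 3, would give "newer" → ["ne","w","er"] and "new" → ["ne","w"].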

BPE Algorithm — iterative merging of most frequent token pairs
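The merge loop from the worked example can be sketched as a toy trainer on the same corpus. This is an illustrative implementation, not the code of any production tokeniser; the function names are assumptions, and ties between equally frequent pairs are broken by first occurrence, so the merges happen in a different order from the worked example while producing the same three tokens.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """corpus maps each word to its frequency; returns (merges, segmented words)."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = {"low": 2, "lower": 1, "newer": 1, "wider": 1, "new": 2}
merges, words = train_bpe(corpus, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
for symbols, freq in words.items():
    print(symbols, freq)
```

After three merges "lower" is segmented as ("low", "er") and "newer" as ("n", "e", "w", "er"), matching the final state of the worked example.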

import tiktoken

# GPT-4 uses the cl100k_base tokeniser (100,277-token BPE vocabulary)
enc = tiktoken.get_encoding("cl100k_base")

examples = [
    "Hello, world!",
    "unhappiness",
    "ChatGPT",
    "supercalifragilisticexpialidocious",
    "1+1=2",
    "def fibonacci(n):",
]

for text in examples:
    tokens = enc.encode(text)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"{text!r:45s} → {len(tokens)} tokens: {decoded}")

# Sample output (exact splits depend on the encoding; run the loop to see them all):
# 'Hello, world!' → 4 tokens: ['Hello', ',', ' world', '!']
# Long or rare words such as 'supercalifragilisticexpialidocious' split into many subword pieces.

# Round-trip test: encode then decode should return the original text
text = "The quick brown fox jumps over the lazy dog."
assert enc.decode(enc.encode(text)) == text  # holds for byte-level BPE, which applies no lossy normalisation

WordPiece Core

WordPiece is the tokenisation algorithm used by BERT, DistilBERT, and multilingual BERT (mBERT). It is mechanically similar to BPE but uses a different merge criterion: rather than merging the most frequent pair, it merges the pair that most increases the likelihood of the corpus under a language model. In practice, the merge score is: score(A, B) = freq(A+B) / (freq(A) × freq(B)). This prefers pairs whose joint occurrence is disproportionately high relative to how often each piece appears alone, capturing meaningful linguistic units rather than just common bigrams.
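A small numeric illustration of how this score differs from raw pair frequency. All counts below are hypothetical, chosen only to show the effect.

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    """score(A, B) = freq(A+B) / (freq(A) * freq(B))"""
    return pair_freq / (freq_a * freq_b)

# Hypothetical counts: a frequent pair made of two very common pieces ...
common_pair = wordpiece_score(pair_freq=900, freq_a=10_000, freq_b=8_000)
# ... versus a rarer pair whose pieces almost always occur together.
tight_pair = wordpiece_score(pair_freq=400, freq_a=500, freq_b=450)

print(common_pair < tight_pair)  # True
```

Frequency-based BPE would merge the first pair (900 > 400), whereas the WordPiece score favours the second because its pieces rarely appear apart.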

WordPiece uses a distinctive notation: a ## prefix marks a continuation subword — a piece that is attached to the preceding token rather than starting a new word. "playing" → ["play", "##ing"]. The "play" token has no prefix (it starts a word); "##ing" is always a suffix, never a standalone word. This makes the tokenisation reversible and unambiguous: joining tokens without spaces and stripping ## gives back the original word.
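The ## convention can be sketched with a greedy longest-match-first segmenter, the procedure commonly used to apply a trained WordPiece vocabulary. The tiny vocabulary and function names here are assumptions for illustration; a real BERT vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary piece at each position."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the current word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

def wordpiece_join(tokens):
    """Undo the segmentation of a single word by stripping the ## markers."""
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)

vocab = {"play", "##ing", "##ed", "un", "##happi", "##ness"}  # toy vocabulary
print(wordpiece_tokenize("playing", vocab))      # ['play', '##ing']
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("xyz", vocab))          # ['[UNK]']
print(wordpiece_join(["play", "##ing"]))         # playing
```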

The standard BERT vocabulary contains 30,522 tokens. Unknown characters that cannot be represented by any combination of vocabulary tokens are mapped to [UNK] — rare for English but can occur with unusual Unicode characters. The vocabulary also includes special tokens: [CLS] (classification, prepended to all inputs), [SEP] (separator, marks sentence boundaries), [MASK] (for masked language modelling), and [PAD] (padding to fixed length).

WordPiece Tokenisation — ## marks continuation of a word

SentencePiece Core

SentencePiece (Kudo & Richardson, 2018) is a language-agnostic tokenisation framework used by T5, LLaMA-1/2, Mistral, Gemma, XLNet, and others (LLaMA-3 moved to a tiktoken-style byte-level BPE). Its key architectural difference from BPE and WordPiece is that it operates on raw text including whitespace, without any pre-tokenisation step. BPE and WordPiece typically split on whitespace first (which bakes in the language-specific assumption that spaces separate words), then apply subword segmentation within each word. SentencePiece treats spaces as regular characters: the text "Hello world" is processed as a single stream of characters that includes the space.

To make tokenisation reversible, SentencePiece uses a special ▁ (U+2581, lower one-eighth block) character to mark word boundaries. The space before a word is encoded as ▁: "Hello world" → ["▁Hello", "▁world"]. Decoding is trivial: replace ▁ with a space, concatenate. This approach means the same tokeniser works identically for languages with no spaces (Japanese, Chinese) and languages with regular spacing (English, French) — making it the preferred choice for multilingual models.
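The whitespace marker can be illustrated without the SentencePiece library itself. In this sketch the piece lists are written by hand rather than produced by a trained model, and the helper names are assumptions.

```python
MARK = "\u2581"  # ▁ LOWER ONE EIGHTH BLOCK

def mark_spaces(text: str) -> str:
    """Replace spaces with the marker and add one at the start of the text."""
    return MARK + text.replace(" ", MARK)

def decode_pieces(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    text = "".join(pieces).replace(MARK, " ")
    return text[1:] if text.startswith(" ") else text

print(mark_spaces("Hello world"))                # ▁Hello▁world
print(decode_pieces(["▁Hello", "▁world"]))       # Hello world
print(decode_pieces(["▁こんにちは", "世界"]))      # こんにちは世界
```

Because the space is carried inside the tokens, decoding needs no language-specific rule about where spaces belong.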

SentencePiece supports two underlying algorithms. BPE mode is the same bottom-up merge algorithm as before. Unigram Language Model mode takes the opposite approach: start with a large vocabulary (e.g. all substrings up to length 16), then iteratively remove tokens whose removal least decreases corpus likelihood, until the target size is reached. Unigram produces multiple possible segmentations of a word and assigns probabilities to them — during training, samples are drawn from the distribution, providing a natural form of tokenisation regularisation.
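Under a unigram model, the score of a segmentation is the product of its piece probabilities, and the best segmentation can be found with Viterbi-style dynamic programming. The sketch below assumes a hand-made, unnormalised probability table purely for illustration; it does not cover the vocabulary pruning step or sampling.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` and its log-probability."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, start index of last piece)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:  # walk the back-pointers from the end of the text
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n][0]

# Hypothetical piece probabilities: a few subwords plus single-character fallbacks.
probs = {"un": 0.1, "happi": 0.05, "ness": 0.1}
probs.update({ch: 0.01 for ch in "unhapies"})
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unhappiness", log_probs)
print(pieces)  # ['un', 'happi', 'ness']
```

Segmentations that fall back to single characters are still possible but score far lower, which is what lets the model assign probabilities to alternative splits.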

Tokeniser	Algorithm	Vocab Marker	Vocab Size	Used By	Language Agnostic
BPE (byte-level)	Bottom-up merge (byte pairs)	None	50K–128K	GPT-2, GPT-3, LLaMA-3	Yes (bytes)
WordPiece	Likelihood-based merge	## (continuation)	30K (BERT)	BERT, DistilBERT, mBERT	Partial
SentencePiece BPE	Bottom-up, raw text	▁ (word start)	32K	LLaMA-1/2, Mistral	Yes
SentencePiece Unigram	Top-down pruning	▁ (word start)	30K–32K	ALBERT, XLNet, T5	Yes
tiktoken	BPE on bytes	None	100K (cl100k_base), 200K (o200k_base)	GPT-3.5, GPT-4 (cl100k_base); GPT-4o (o200k_base)	Yes

tiktoken & Practical Tokenisation In-depth

OpenAI's tiktoken is a fast BPE tokeniser library used with OpenAI's GPT models. Its encodings include r50k_base (GPT-3, 50,257 tokens, the same vocabulary as GPT-2), p50k_base (Codex), cl100k_base (GPT-3.5-turbo and GPT-4, 100,277 tokens), and o200k_base (GPT-4o, roughly 200,000 tokens). The cl100k vocabulary was designed to handle code and multilingual text more efficiently than its predecessors: for example, runs of whitespace used for indentation and common keywords often map to single tokens.

The rough practical rule is ¾ of a word per token for English text: 1,000 tokens ≈ 750 English words ≈ 4–5 average paragraphs. This ratio degrades significantly for non-English text. Chinese and Japanese characters are typically 1–4 tokens per character (since each character is a complex glyph encoded as multiple UTF-8 bytes). Arabic script runs 2–3 tokens per word. Code is token-efficient for English keywords but indentation and special characters add tokens. Understanding these ratios is essential for prompt engineering and cost estimation at scale.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokeniser (GPT-4o uses o200k_base)

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Token quirks every practitioner should know
tests = [
    ("hello world", "lowercase"),
    ("Hello World", "capitalised: different token IDs"),
    (" hello world", "leading space can differ"),
    ("1234567890", "numbers split by digit groups"),
    ("你好世界", "Chinese: often more than one token per character"),
    ("    def foo():", "indented code: spaces are tokens!"),
]

for text, note in tests:
    toks = enc.encode(text)
    print(f"{text!r:30s} → {len(toks):3d} tokens  # {note}")

# Cost estimation (approximate; the rates below are illustrative and prices change frequently)
def estimate_cost(input_tokens: int, output_tokens: int, model="gpt-4o"):
    rates = {"gpt-4o": (0.005, 0.015)}  # ($/1K input, $/1K output)
    r_in, r_out = rates.get(model, (0.01, 0.03))
    return input_tokens / 1000 * r_in + output_tokens / 1000 * r_out

print(f"1K in + 500 out: ${estimate_cost(1000, 500):.4f}")  # ~$0.0125

Token Counting & Context Windows In-depth

The context window is the maximum number of tokens a model can process in a single forward pass; both the input prompt and the generated output count against this limit. A request that exceeds the window is either rejected by the API or truncated by the client to fit, and in the second case information is lost silently. Context window sizes have grown dramatically: from GPT-3's 2,048 tokens (2020) to GPT-4o's 128,000 (2024), Claude 3.5's 200,000, and Gemini 1.5 Pro's 1,000,000. However, larger context windows don't mean models use all context equally well: empirically, LLMs attend more strongly to the beginning and end of long contexts (the "lost in the middle" effect).

Token counting matters for three practical reasons. Cost: commercial LLM APIs price per token (on the order of $5–$30 per million tokens for frontier models), and multiplied by millions of API calls this adds up. Context management: if your application truncates over-long prompts to fit the window, information disappears silently, a bug that is easy to miss and hard to debug. Latency: generation time is roughly proportional to the number of output tokens, so every unnecessary token in the response costs inference time and money. Practitioners routinely count tokens in prompt templates, conversation histories, and retrieved documents before sending API requests.
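A pre-flight budget check along these lines might look like the sketch below. It uses the rough English-only rule of ¾ of a word per token instead of a real tokeniser, so the estimate is approximate; the function names and the example limits are assumptions.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough English estimate: about 4 tokens for every 3 words."""
    n_words = len(text.split())
    return math.ceil(n_words * 4 / 3)

def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Input and output share the same window, so both count against it."""
    return prompt_tokens + max_output_tokens <= context_window

prompt = "word " * 750                      # stand-in for a 750-word prompt
prompt_tokens = estimate_tokens(prompt)
print(prompt_tokens)                        # 1000

print(fits_context(prompt_tokens, 500, context_window=2048))   # True
print(fits_context(prompt_tokens, 1200, context_window=2048))  # False
```

In production the estimate would be replaced by an exact count from the model's own tokeniser, since the ratio degrades for code and non-English text.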

Context Window Sizes — from 8K tokens to 1M tokens (log scale)

⚠ Common Pitfalls — Tokenisation in Production

1. Leading spaces change token IDs. " hello" and "hello" produce different token IDs in tiktoken, which matters for prompt formatting.
2. Numbers split unexpectedly. "GPT-4" → ["G", "PT", "-", "4"]; phone numbers, dates, and prices consume far more tokens than you'd expect.
3. Non-English is expensive. A Chinese prompt of 100 characters may cost anywhere from 100 to 400 tokens depending on the tokeniser, often several times more than the equivalent English text.
4. Markdown inflates count. Headers, bold markers, code fences, and bullet points all consume tokens. Strip unnecessary formatting from retrieved context before sending.
5. Chat format overhead. OpenAI's chat completions API adds roughly 3–4 tokens per message for role/structure overhead, which matters for high-frequency, fine-grained API calls.

∑ Chapter 5.2 Summary — Tokenisation

Tokenisation maps raw text to integer IDs from a fixed vocabulary (typically 32K–128K tokens) before it can enter a neural model
Subword is the sweet spot: no OOV, balanced sequence length, handles morphology — all modern LLMs use it
BPE: iteratively merge the most frequent adjacent token pair — used by GPT-2, GPT-3, GPT-4, LLaMA; byte-level BPE is fully language-agnostic
WordPiece: ## marks continuations; likelihood-based merges; vocabulary = 30K — used by BERT, DistilBERT, mBERT
SentencePiece: operates on raw text including spaces; ▁ marks word starts; supports BPE and Unigram LM; used by T5, LLaMA-1/2, Mistral, Gemma
Practical rule: ~¾ word per token for English; non-English and numbers use 2–4× more tokens — critical for cost and context management

Chapter 5.3

Word Embeddings — Meaning as Geometry

Word embeddings did not just improve NLP performance; they changed how we think about language. When Mikolov et al. showed in 2013 that "king − man + woman ≈ queen" held in a 300-dimensional vector space, it suggested that semantic relationships could be captured as geometric transformations. This was striking early evidence that neural representations were not just feature maps: they were encoding structured knowledge about the world.

The Distributional Hypothesis Core

The theoretical foundation of all word embedding methods is the Distributional Hypothesis, which goes back to Zellig Harris (1954) and was memorably summed up by linguist J.R. Firth in 1957: "You shall know a word by the company it keeps." The idea is deceptively simple: words that appear in similar linguistic contexts tend to have similar meanings. "Dog" and "cat" both appear near words like "pet", "feed", "vet", "fur", "owner", "breed", and this co-occurrence pattern reflects their shared semantic category. "Dog" and "quantum" do not share contexts, and they do not share meaning.

This hypothesis transforms the problem of meaning into a problem of statistics. Instead of defining what "happy" means philosophically, we can simply observe that "happy" appears with "smile", "joy", "content", "pleased", "glad" — and "sad" appears with "cry", "grief", "unhappy", "depressed" — and that these two distributional profiles are measurably different. The distributional hypothesis gives us a way to measure semantic similarity without any human annotation: compute the similarity between two words' context distributions.
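The idea of comparing context distributions can be tried directly with raw co-occurrence counts. The five-sentence corpus and the helper names below are assumptions made up for illustration; real embeddings are learned from vastly more text.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, already tokenised.
corpus = [
    "i feed my dog every morning".split(),
    "i feed my cat every morning".split(),
    "the vet examined the dog".split(),
    "the vet examined the cat".split(),
    "physicists study quantum mechanics".split(),
]

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within +/- window positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vecs = context_vectors(corpus)
print(cosine(vecs["dog"], vecs["cat"]))      # close to 1: identical contexts here
print(cosine(vecs["dog"], vecs["quantum"]))  # 0.0: no shared contexts
```

No labels or definitions were supplied; the similarity falls out of where the words occur.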

Every word embedding method — Word2Vec, GloVe, FastText, and even the contextual embeddings of BERT — is an implementation of this hypothesis. They differ in how they model context (local window vs global matrix, character-level vs word-level, static vs contextual), but they all share the core insight: context distribution = meaning.

📖

The Key Insight

You do not need to define what words mean. Observe where they appear, and the geometry of the embedding space will capture the rest. No hand-crafted ontologies, no linguistic rules — just patterns in text.

🔍

Context Window

Most embedding methods use a fixed window of ±k surrounding words as "context". Window size k=5 means the 5 words before and after each target word. Larger k → more topical similarity. Smaller k → more syntactic similarity.

Word2Vec In-depth

Mikolov et al. (Google Brain, 2013) introduced Word2Vec — a family of shallow neural networks that learn word representations by predicting word context. The key insight was framing representation learning as a self-supervised prediction task: given a word, predict its surrounding words. No labels are needed — the text itself provides the training signal. Train on enough text (Google News, 100 billion words) and the resulting vectors encode semantic structure as geometry.

The architecture is deliberately simple: a single-layer neural network with no non-linearity in the hidden layer. The input is a one-hot vector of vocabulary size V. The single hidden layer projects this to a dense vector of dimension d (typically 300). The output layer projects back to V dimensions and applies softmax to produce a probability distribution over the vocabulary. The weight matrix of the hidden layer — shape V × d — is the embedding matrix. After training, each row is the embedding vector for one word.
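Because the input is one-hot, multiplying it by the V × d weight matrix simply selects one row, which is why the hidden layer doubles as an embedding table. The sketch below uses plain Python lists and a made-up 5 × 3 matrix to show the equivalence.

```python
V, d = 5, 3  # toy vocabulary size and embedding dimension

# Hypothetical hidden-layer weights: one row of d numbers per vocabulary word.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [-0.2, 0.0, 0.7],
    [0.9, -0.1, 0.2],
    [0.3, 0.3, 0.3],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(x, weights):
    """Vector-matrix product x · W, giving a d-dimensional hidden vector."""
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(len(weights[0]))]

hidden = project(one_hot(2, V), W)
print(hidden)          # [-0.2, 0.0, 0.7]
print(hidden == W[2])  # True: the projection is just a row lookup
```

Real implementations skip the multiplication and index the row directly.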

Word2Vec uses two architectural variants: Skip-gram and CBOW (Continuous Bag of Words). Skip-gram predicts context words from a centre word and works better on small datasets and rare words. CBOW predicts the centre word from its context and trains faster on large corpora. Both are trained with a practical approximation — negative sampling — rather than full softmax over the entire vocabulary (computing softmax over 50,000+ words every step is prohibitively expensive).

With negative sampling, the objective becomes: for each training pair (centre, context), maximise the probability of the true pair while minimising the probability of k randomly sampled negative pairs. This reduces the per-step computation from O(V) to O(k), where k is typically 5–20. The result is a practical algorithm that can be trained on billions of words in hours on a single machine.
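For a single training pair, the negative-sampling loss is −log σ(u_context · v_centre) minus the sum of log σ(−u_negative · v_centre) over the k sampled negatives. The sketch below evaluates it for made-up 2-dimensional vectors; the function names and values are illustrative assumptions, and no gradient step is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(centre, context, negatives):
    """Loss for one (centre, context) pair with k sampled negative words."""
    loss = -math.log(sigmoid(dot(context, centre)))    # pull the true pair together
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, centre)))   # push sampled pairs apart
    return loss

centre = [1.0, 0.0]

# Context aligned with the centre word, negatives pointing elsewhere: low loss.
good = negative_sampling_loss(centre, [2.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
# Context opposed to the centre word, one negative aligned with it: high loss.
bad = negative_sampling_loss(centre, [-2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

print(good < bad)  # True
```

Only k + 1 dot products are computed per pair, which is the O(k) cost mentioned above, instead of one per vocabulary word.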

Skip-gram & CBOW Architectures In-depth

Given the sentence "The quick brown fox jumps over the lazy dog" with window size 2: Skip-gram takes the centre word "brown" and tries to predict each context word — ("brown", "quick"), ("brown", "The"), ("brown", "fox"), ("brown", "jumps"). One training pair for each context word in the window. CBOW takes all context words ["The", "quick", "fox", "jumps"] and averages their embeddings, then tries to predict the centre word "brown". CBOW is faster (averages context, one prediction per window); Skip-gram trains on more pairs and handles rare words better.
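The two ways of slicing a sentence into training examples can be generated in a few lines. This is a sketch of the data preparation only, with assumed function names; it does not train a model.

```python
def skipgram_pairs(tokens, window):
    """One (centre, context) pair for every context word in the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """One (context words, centre) example per position."""
    examples = []
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, centre))
    return examples

sentence = "The quick brown fox jumps over the lazy dog".split()

brown_pairs = [p for p in skipgram_pairs(sentence, 2) if p[0] == "brown"]
print(brown_pairs)
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]

print(cbow_examples(sentence, 2)[2])
# (['The', 'quick', 'fox', 'jumps'], 'brown')
```

Skip-gram yields four training pairs for "brown" where CBOW yields a single example, which is the source of the speed and rare-word trade-off described above.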

Word2Vec: Skip-gram and CBOW — two ways to learn from context
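As a rough sketch of how the two variants turn the example sentence into training data, the helpers below generate Skip-gram pairs and CBOW examples with a symmetric window. The function names are hypothetical, and real implementations add details such as subsampling of frequent words that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every position in the sentence."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, centre) examples, one per position."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples

sentence = "The quick brown fox jumps over the lazy dog".split()
brown_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "brown"]
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
brown_cbow = cbow_examples(sentence)[2]
# (['The', 'quick', 'fox', 'jumps'], 'brown')
```

Skip-gram yields four training pairs for "brown" where CBOW yields one example, which matches the speed versus coverage trade-off described above.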

import gensim.downloader as api

# Load pre-trained Word2Vec (Google News, 300d, 3M vocabulary)
wv = api.load("word2vec-google-news-300")

# Cosine similarity between word vectors
# (values in the comments are approximate)
print(wv.similarity("dog", "cat"))           # → about 0.76 (semantically close)
print(wv.similarity("king", "queen"))        # → about 0.65
print(wv.similarity("apple", "motorcycle"))  # → low (unrelated)
print(wv.similarity("hot", "cold"))          # → moderate: antonyms share context!

# Most similar words
print(wv.most_similar("python", topn=5))
# Neighbours reflect how the word is used in the training corpus; in news text
# they may relate to the snake rather than to programming languages

# The famous analogy: king − man + woman = ?
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # → 'queen' ranks first (similarity about 0.71)

# Train your own Word2Vec on custom corpus
from gensim.models import Word2Vec
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "by", "the", "fire"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10)
cat_vec = model.wv["cat"]  # shape: (100,)

GloVe — Global Vectors Core

Pennington, Socher, and Manning (Stanford NLP, 2014) pointed out a conceptual limitation of Word2Vec: it trains on individual context windows, one at a time, effectively ignoring the global statistics of word co-occurrence across the entire corpus. If "ice" and "steam" both co-occur with "water" but "ice" co-occurs with "solid" and "steam" with "gas", this distinction should be captured — but Word2Vec only sees local windows, not the global ratio structure.

GloVe directly factorises the global word–word co-occurrence matrix X, where Xᵢⱼ is the count of how many times word j appears in the context of word i across the entire corpus. The objective is to find word vectors such that their dot product (plus bias terms) approximates the log of the co-occurrence count. The weighting function f(Xᵢⱼ) ensures that very frequent pairs (like "the–the") don't dominate the loss: it grows with the count up to a threshold and is capped at 1 above it. It also down-weights very rare pairs, whose counts are noisy, and pairs that never co-occur are left out of the sum because log 0 is undefined.

In practice, GloVe and Word2Vec produce embeddings of similar quality. GloVe often edges ahead on analogy tasks (because it explicitly models co-occurrence ratios); Word2Vec can be more efficient to train on very large corpora with negative sampling. Both have been largely superseded for downstream tasks by contextual embeddings (BERT, GPT), but GloVe remains popular as a lightweight baseline and for interpretability research.

GloVe Objective
J = ∑ᵢⱼ f(Xᵢⱼ) (wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)²
Xᵢⱼ = co-occurrence count of word i and word j across entire corpus
wᵢ, w̃ⱼ = word vectors (word i and context word j)
f(Xᵢⱼ) = weighting function that caps influence of very frequent co-occurrences
Goal: dot product of word vectors ≈ log co-occurrence count → geometry encodes statistics
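To make the objective concrete, here is a small NumPy sketch of the weighting function and the loss. The defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper rather than anything stated in this chapter, and the function names are made up for this example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 at or above it."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares loss, summed over non-zero co-occurrence counts."""
    i_idx, j_idx = np.nonzero(X)          # pairs that never co-occur are skipped
    counts = X[i_idx, j_idx]
    pred = np.sum(W[i_idx] * W_ctx[j_idx], axis=1) + b[i_idx] + b_ctx[j_idx]
    return np.sum(glove_weight(counts) * (pred - np.log(counts)) ** 2)

# Toy 3-word vocabulary with random 4-d vectors.
rng = np.random.default_rng(0)
X = np.array([[0.0, 12.0, 3.0],
              [12.0, 0.0, 250.0],
              [3.0, 250.0, 0.0]])
W, W_ctx = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b, b_ctx = np.zeros(3), np.zeros(3)
loss = glove_loss(W, W_ctx, b, b_ctx, X)
```

In this sketch the pair with count 250 receives weight 1, the same as any pair at or above the cap, while the pair with count 3 is weighted far less.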

FastText — Subword Embeddings Core

Bojanowski et al. (Facebook AI Research, 2016) identified a critical gap in both Word2Vec and GloVe: they treat words as atomic units. "Run", "running", "runner", "runs" each get their own independent vector — the morphological relationship between them is invisible to the model. For languages with rich morphology (Finnish, Turkish, German, Arabic), this is devastating: a model may never see the exact form "Freundschaftsbezeigungen" (German for "demonstrations of friendship") but it shares meaningful subwords with common words.

FastText represents each word as a bag of character n-grams. For "where" with n=3: the word is decomposed as ["<wh", "whe", "her", "ere", "re>", "<where>"] (with boundary markers < and >, plus the whole word as a special sequence). In practice n-grams of several lengths are used together (3 to 6 characters in the original paper). The final word vector is the sum of all its n-gram vectors. Each n-gram has its own embedding, and these are what get trained. The boundary markers distinguish prefixes and suffixes from word-internal character sequences: the trigram "her" inside "where" is a different unit from the whole word "her", which is represented as "<her>".

The payoff: FastText can produce meaningful vectors for words never seen during training, including misspellings, technical jargon, and morphological variants. "Antidisestablishmentarianism" will share n-grams with "establish", "establishment", "disestablish", "ism", etc., and their combined embedding will be semantically meaningful. FastText is still used in production where domain vocabulary is highly variable — scientific text, social media, multilingual pipelines.
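The decomposition step can be sketched in a few lines of Python. This is an illustration only: the function name `char_ngrams` is hypothetical, and the real fastText library additionally hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    if wrapped not in grams:
        grams.append(wrapped)   # the whole word as its own special sequence
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# Morphological variants share n-grams, so they share parts of their vectors.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
```

An unseen word gets a vector by summing the embeddings of whichever of its n-grams were seen during training.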

Word2Vec / GloVe	FastText
Word is atomic unit: one vector per word	Word = sum of character n-gram vectors
OOV → [UNK]: information completely lost	No OOV: any word representable by its n-grams
Rare words get poor embeddings (few training examples)	Rare words inherit n-gram vectors from common words
Best for: high-frequency common vocabulary tasks	Best for: multilingual, medical, social media, morphological languages
Inference: single lookup O(1)	Inference: sum of n-gram vectors, slightly slower

Vector Arithmetic & Analogies In-depth

The most celebrated discovery in word embeddings is that semantic relationships are encoded as consistent directions in vector space. The vector from "man" to "woman" is approximately the same as the vector from "king" to "queen", from "uncle" to "aunt", from "actor" to "actress". This "gender direction" is a consistent geometric transformation across the embedding space. Similarly, there is a "capital city direction" (France→Paris ≈ Germany→Berlin ≈ Japan→Tokyo), a "superlative direction" (big→biggest ≈ small→smallest), and a "past tense direction" (run→ran ≈ walk→walked).

This property emerged from training; it was not engineered in. It suggests that the distributional statistics of language contain enough signal to implicitly encode the relational structure of the world. The analogy task became a standard benchmark: given A:B::C:?, find D such that B−A+C ≈ D in vector space. Word2Vec achieves ~65% accuracy on the Google Analogy Dataset (roughly 19,500 analogy questions across semantic and syntactic categories), far above what was thought possible with shallow models.

Famous Word Vector Analogies

king − man + woman ≈ queen # gender direction

Paris − France + Germany ≈ Berlin # capital city direction

biggest − big + small ≈ smallest # superlative direction

running − run + walk ≈ walking # verb tense direction

doctor − man + woman ≈ nurse # ⚠ encodes gender bias!

Word Embedding Space — semantic relationships encoded as geometry
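The arithmetic behind the analogy benchmark can be sketched with a handful of hand-made vectors. The 2-d values below are invented purely for illustration (axis 0 loosely "royalty", axis 1 loosely "gender"); real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Toy vectors, chosen by hand so the geometry is easy to follow.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by ranking words on cosine similarity to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # The query words themselves are excluded, as in the standard benchmark.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman", vectors)  # 'queen'
```

Excluding the query words matters: without that step the nearest neighbour of B−A+C is often B itself.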

Embedding Properties & Limitations Core

Word embeddings inherit — and amplify — the biases present in their training data. Bolukbasi et al. (2016) demonstrated that in Word2Vec trained on Google News: "doctor − nurse ≈ man − woman", "programmer − homemaker ≈ man − woman", "brilliant − dull ≈ man − woman". These gender stereotypes are encoded as geometric structure in the embedding space. When downstream models use these embeddings, the bias propagates: a résumé classifier using Word2Vec may discriminate based on field-specific vocabulary that encodes gender. Debiasing techniques exist (projecting out the gender direction) but are only partially effective — the bias is distributed across the space, not concentrated in one direction.
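The "projecting out the gender direction" idea mentioned above amounts to removing one vector component. The sketch below uses invented 3-d toy vectors and a hypothetical helper name; it shows only the projection step, not a full debiasing pipeline.

```python
import numpy as np

def remove_direction(vec, direction):
    """Project out the component of vec that lies along direction."""
    unit = direction / np.linalg.norm(direction)
    return vec - (vec @ unit) * unit

# Toy vectors invented for this example.
man = np.array([0.2, 1.0, 0.1])
woman = np.array([0.2, -1.0, 0.1])
doctor = np.array([0.9, 0.3, 0.5])

gender_direction = man - woman
doctor_debiased = remove_direction(doctor, gender_direction)
# doctor_debiased has no component left along gender_direction
```

As the paragraph above notes, this removes only what lies along a single direction, which is one reason the technique is only partially effective.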

The most fundamental limitation of all static word embeddings is context independence: every word has a single vector regardless of usage. The word "bank" receives exactly the same 300-dimensional vector in "She went to the bank to withdraw money" and in "She sat on the river bank". The model averages the two senses of "bank" into one representation, losing the information needed to distinguish them. This polysemy problem is unresolvable within the static embedding framework, no matter how large the training corpus. It is the primary motivation for contextual embeddings: BERT, GPT, and their successors assign each word occurrence a different vector based on its surrounding context (Chapter 5.4).

⚠ Critical Limitation — Static Embeddings Cannot Handle Polysemy

Word2Vec, GloVe, and FastText give every word one vector for all contexts. "I deposited money at the bank" and "I fished at the river bank" produce the same "bank" embedding: the vector is a weighted average of all senses. For tasks requiring word-sense disambiguation, coreference resolution, or semantic role labelling, static embeddings hit a ceiling that no amount of data or dimensions can overcome. This is why BERT (2018) was a watershed: it made deeply bidirectional, context-dependent representations the standard, effectively making static embeddings obsolete for most NLP tasks.

Method	Training Objective	Context	OOV	Typical Dim	Still Used?
Word2Vec	Predict context / centre	Local window	[UNK]	100–300	Baselines, feature eng
GloVe	Factorise co-occurrence matrix	Global corpus	[UNK]	100–300	NLP baselines
FastText	Subword n-gram sum	Local window	No OOV	300	Multilingual, rare vocab
BERT (contextual)	Masked language model	Full sentence	Subword	768	Yes — encoder tasks
GPT (contextual)	Causal language model	Causal window	Subword	768–12288	Yes — generation

∑ Chapter 5.3 Summary — Word Embeddings

Distributional hypothesis: words with similar contexts have similar meaning — foundation of all embedding methods (Firth, 1957)
Word2Vec: two architectures trained by predicting word context — Skip-gram (centre → context, better for rare words) and CBOW (context → centre, faster)
"king − man + woman ≈ queen" — semantic relationships are directions in geometry; emerged from training, not engineered
GloVe: factorises the global co-occurrence matrix — explicitly models co-occurrence ratios across the entire corpus
FastText: word = sum of character n-gram vectors — handles OOV, morphologically rich languages, and rare vocabulary
Critical limitation: static embeddings give the same vector regardless of context — "bank" is identical in "river bank" and "bank account" — solved by BERT (Chapter 5.4)

5.4

Chapter 5.4

Contextual Embeddings & Pre-trained Language Models

Static word embeddings were a revolution — but they hit a ceiling. The same vector for "bank" in every sentence is a fundamental architectural limit, not a training data problem. The field needed representations that compute word meaning dynamically based on context. ELMo, ULMFiT, and then BERT answered that need — and in doing so, established the pre-train/fine-tune paradigm that defines modern NLP.

The Problem with Static Embeddings Core

Word2Vec, GloVe, and FastText assign each word type exactly one vector, shared across all its occurrences. This is adequate for words with a single dominant sense — "elephant" nearly always means the same thing. But English has thousands of polysemous words: "bank" (financial institution / river edge), "bat" (cricket equipment / flying mammal), "light" (not heavy / illumination / a lamp), "book" (a publication / to reserve), "well" (healthy / a water source / interjection). The Word2Vec vector for "bank" is a weighted average of all its senses — useful for neither.

The polysemy ceiling is not solvable by training on more data. No matter how large the corpus, a single vector must average all contexts. The architecture itself is the limitation: static embeddings compute representations before seeing the sentence. What is needed is a model that reads the full sentence, then assigns each word a representation based on its role in that specific sentence. This is exactly what contextual embedding models provide.

Static vs Contextual Embeddings — context disambiguates polysemous words

ELMo — Embeddings from Language Models Core

Peters et al. (AllenNLP, 2018) introduced ELMo (Embeddings from Language Models), the first widely adopted contextual word embedding. The architecture is a two-layer bidirectional LSTM language model pre-trained on 1 billion words (1 Billion Word Benchmark). Two passes through the sentence: a forward LM reads left to right and learns to predict the next word; a backward LM reads right to left and learns to predict the previous word. For each token, the forward and backward hidden states are concatenated at every layer, and the layers are then combined with a weighted sum whose weights are learned per task, producing a context-sensitive representation.

ELMo representations are used as frozen features — the ELMo model is not fine-tuned on downstream tasks. Instead, the pre-computed ELMo vectors are concatenated to the input of existing task-specific models (NER taggers, QA systems, coreference models). This "feature-based" approach produced large, consistent improvements across NLP benchmarks — the first empirical proof that language model pre-training transfers broadly. ELMo improved the state-of-the-art on 6 NLP tasks simultaneously, which was extraordinary at the time.

ELMo's key limitation: the underlying architecture is an LSTM, which processes tokens one after another (O(n) sequential steps for a sequence of length n) and cannot parallelise across token positions. Training is slow and the representations are computed sequentially at inference. The Transformer architecture (Chapter 5.5), which needs only O(1) sequential operations per layer and parallelises fully across positions via self-attention, replaced the LSTM backbone in every subsequent contextual embedding model.

ELMo — bidirectional LSTM creates context-dependent word representations

The ULMFiT Pre-train / Fine-tune Paradigm In-depth

Howard & Ruder (2018) introduced ULMFiT (Universal Language Model Fine-Tuning) — the paper that established the three-stage paradigm now universal in NLP. Where ELMo froze the language model and used it as a feature extractor, ULMFiT's insight was that the language model itself should be fine-tuned end-to-end on downstream tasks. This shifts the mental model from "use LM features" to "adapt a pre-trained LM for each task". BERT and GPT made this paradigm dominant — but ULMFiT proved it worked first.

ULMFiT introduced two now-standard fine-tuning techniques. Discriminative fine-tuning: assign different learning rates to each layer — earlier layers (which capture general syntax and morphology) are updated very slowly; later layers (which capture task-specific semantics) are updated faster. Gradual unfreezing: start fine-tuning only the last layer, then progressively unfreeze earlier layers one at a time. This prevents catastrophic forgetting — the phenomenon where fine-tuning on a small task dataset destroys the broad language knowledge acquired during pre-training.
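Both techniques reduce to simple schedules, sketched below in plain Python. The decay factor of 2.6 between adjacent layers is the value suggested in the ULMFiT paper rather than something stated in this chapter; the function names and the top-layer learning rate are assumptions for the example.

```python
def discriminative_lrs(num_layers, top_lr=1e-3, decay=2.6):
    """Learning rate per layer; index 0 is the earliest layer, the last index the top layer."""
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]

def unfreezing_schedule(num_layers):
    """For each epoch, the indices of the layers that are trainable."""
    return [list(range(num_layers - 1 - epoch, num_layers))
            for epoch in range(num_layers)]

lrs = discriminative_lrs(4)
# Earliest layer gets the smallest rate, top layer gets top_lr.
schedule = unfreezing_schedule(4)
# [[3], [2, 3], [1, 2, 3], [0, 1, 2, 3]]
```

In a real training loop these lists would be mapped onto optimizer parameter groups and onto each layer's trainable flag.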

The three stages generalise directly to all modern pre-trained language models. Stage 1 (pre-training) is expensive but done once and shared via model hubs. Stage 2 (domain adaptation) is optional but valuable for specialised domains (biomedical, legal, code). Stage 3 (task fine-tuning) is cheap — hours on a single GPU with hundreds to thousands of labelled examples, compared to millions required to train from scratch. This cost asymmetry is the fundamental economic argument for the transformer pre-training paradigm.

Pre-train → Domain Adapt → Task Fine-tune — the modern NLP paradigm

Self-Supervised Pre-training Tasks In-depth

Causal Language Modelling (CLM) — used by GPT, GPT-2, GPT-3, LLaMA: predict the next token given all previous tokens. Objective: maximise P(xₜ | x₁,...,xₜ₋₁). The model never sees future tokens during training — it processes left-to-right with a causal (triangular) attention mask. This makes it naturally suited to generation: at inference, repeatedly predict the next token and append it to the sequence.
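The next-token objective can be written as a shifted cross-entropy. The NumPy sketch below assumes the model has already produced a matrix of logits; the function name and the toy token ids are illustrative only.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits[t] holds the scores for the token at position t + 1,
    so logits has shape (len(token_ids) - 1, vocab_size).
    """
    targets = np.asarray(token_ids[1:])                    # inputs shifted left by one
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

token_ids = [2, 0, 3, 1]             # toy sequence over a 4-token vocabulary
uniform_logits = np.zeros((3, 4))    # a model that knows nothing
loss = causal_lm_loss(uniform_logits, token_ids)   # equals log(4)
```

Minimising this loss is the same as maximising P(xₜ | x₁,...,xₜ₋₁) averaged over positions.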

Masked Language Modelling (MLM), used by BERT, RoBERTa, DeBERTa: randomly select 15% of input tokens for prediction (most of them are replaced with [MASK]), then predict the original token using the full surrounding context. Objective: maximise P(masked | all other tokens). Because both left and right context are available simultaneously, the model builds bidirectional representations: excellent for understanding tasks (classification, NER, QA) but not for generation.

Next Sentence Prediction (NSP) was used in original BERT alongside MLM: given two text segments, predict whether they appear consecutively in the source document. Later analysis (RoBERTa, 2019) showed NSP adds little benefit and can hurt performance by forcing artificially short segments, so it was removed in subsequent models.

Replaced Token Detection (RTD), used by ELECTRA: a small generator network creates plausible but fake token replacements; the main discriminator must identify which tokens were replaced. Every token gets a training signal (not just 15% as in MLM), making ELECTRA roughly 4× more efficient for the same computational budget.

Self-Supervised Pre-training Tasks — CLM, MLM, NSP, RTD

[Figure: RTD panel. The generator replaces "cat" with "dog" in "… cat sat on mat"; the discriminator labels each token as original or replaced, so all tokens are trained (4× more efficient than MLM).]

Sentence Embeddings In-depth

Word-level contextual embeddings (one vector per token) are essential for token-level tasks like NER and POS tagging — but many applications require a single vector for an entire sentence, paragraph, or document. How do you pool a variable-length sequence of token vectors into one fixed-size representation? Three approaches have been widely used.

[CLS] token pooling (BERT's approach): prepend a special [CLS] (classification) token to every input. The Transformer processes all tokens together with full self-attention. In theory, the [CLS] token's output representation aggregates information from the entire sequence. In practice, this works well after fine-tuning on a specific task — but out-of-the-box BERT [CLS] vectors perform poorly on semantic similarity benchmarks, because BERT was not trained to produce meaningful sentence-level representations in [CLS].

Mean pooling: average all token embeddings in the final layer. Surprisingly effective as a baseline: it often outperforms [CLS] pooling on zero-shot semantic similarity without fine-tuning. Simple to implement and parameter-free.

Sentence-BERT (SBERT, Reimers & Gurevych, 2019) improves on both approaches by fine-tuning BERT with a siamese / triplet network objective on sentence pairs, training the model to produce similar vectors for semantically similar sentences. SBERT dramatically outperforms naive pooling on STS benchmarks. It is also far faster for large-scale pairwise comparison: vanilla BERT needs a joint forward pass for every pair, whereas SBERT embeds each sentence once and compares the vectors with cosine similarity. The SBERT paper reports that finding the most similar pair among 10,000 sentences drops from about 65 hours to about 5 seconds.
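The two pooling strategies described above differ by only a few lines. The NumPy sketch below assumes token embeddings have already been computed and that an attention mask marks padding with 0; the function names and toy values are illustrative.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

def cls_pool(token_embeddings):
    """Use the vector at position 0 (the [CLS] token) as the sentence vector."""
    return token_embeddings[:, 0, :]

embeddings = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # last row is padding
mask = np.array([[1, 1, 0]])
sentence_vec = mean_pool(embeddings, mask)   # [[2.0, 2.0]]
```

Ignoring the mask is a common mistake: padding vectors would otherwise be averaged into the sentence representation.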

Sentence Embeddings — [CLS] pooling vs Mean pooling

Embedding Models & Semantic Search Core

Sentence embeddings power semantic search — retrieving documents by meaning rather than keyword overlap. A query like "What city is France's capital?" should retrieve "Paris is the seat of French government" even though it shares no keywords. The approach: embed all documents once into a vector index; at query time, embed the query and retrieve the closest document vectors by cosine similarity. This is the retrieval component of Retrieval-Augmented Generation (RAG) systems.

Model	Provider	Dimensions	Max Tokens	Best For	MTEB Score
text-embedding-3-large	OpenAI	3072	8,191	General purpose, RAG	~65
text-embedding-3-small	OpenAI	1536	8,191	Cost-efficient RAG	~62
text-embedding-ada-002	OpenAI	1536	8,191	Legacy, widely used	~61
all-MiniLM-L6-v2	SBERT / HuggingFace	384	256	Fast, lightweight, on-device	~56
e5-large-v2	Microsoft	1024	512	Strong open-source baseline	~63
BGE-M3	BAAI	1024	8,192	Multilingual, long documents	~65

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus of documents to index
docs = [
    "The cat sat on the mat",
    "Dogs are loyal companions",
    "Paris is the capital of France",
    "Neural networks learn from data",
    "The Eiffel Tower is in Paris",
]

# Encode all documents once (production: store in a vector DB)
doc_embeddings = model.encode(docs)  # shape: (5, 384)

# At query time: encode the question
query = "What city is the capital of France?"
query_emb = model.encode([query])  # shape: (1, 384)

# Rank by cosine similarity
scores = cosine_similarity(query_emb, doc_embeddings)[0]
results = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in results:
    print(f"{score:.3f}: {doc}")
# Illustrative output (exact scores depend on the model version):
# → 0.852: Paris is the capital of France   ← correct top result
# → 0.701: The Eiffel Tower is in Paris
# → 0.212: Neural networks learn from data

# OpenAI embedding API (for production; requires an API key in OPENAI_API_KEY)
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Hello world"]
)
vec = response.data[0].embedding  # list of 1536 floats

∑ Chapter 5.4 Summary — Contextual Embeddings & Pre-trained LMs

Static embeddings fail on polysemy: same vector for "bank" in all contexts — fundamental architecture limit, not a data problem
ELMo: first practical contextual embeddings using bidirectional LSTM LM — forward + backward hidden states concatenated per token
ULMFiT established the pre-train / fine-tune paradigm: expensive once, cheap adaptation — with discriminative LR and gradual unfreezing
CLM (GPT): predict next token — left-to-right, generative. MLM (BERT): mask 15%, predict — bidirectional, understanding
RTD (ELECTRA): discriminate original vs replaced tokens — 4× more compute-efficient than MLM
Sentence embeddings: [CLS] pooling or mean pooling → SBERT fine-tuning dramatically improves semantic similarity quality for RAG and search

5.5

Chapter 5.5

The GPT Family

From GPT-1 to GPT-4 — how decoder-only transformers and scale created the generative AI era. The GPT lineage proved a single, deceptively simple idea: train a very large decoder-only Transformer on very large data with a next-token-prediction objective, and intelligence-like capabilities emerge.

GPT Architecture In-Depth

GPT = Generative Pre-trained Transformer — a decoder-only Transformer. Unlike BERT (encoder-only, bidirectional), GPT uses causal (masked) self-attention: each token can attend ONLY to previous tokens. This constraint is not a limitation — it's the design that enables generation. You can't look at future tokens while generating them.

The key architectural choices that distinguish GPT from BERT:

Causal Self-Attention

Each token attends only to tokens before it. Implemented via a triangular mask that sets future positions to −∞ before softmax. This makes the model autoregressive — it can generate one token at a time, left to right.

No Encoder

GPT uses a single stack of N transformer decoder blocks. No encoder–decoder cross-attention. The entire prompt and generated text flow through the same stack. Simplicity at scale.

Autoregressive Generation

Given a prompt, predict the next token → append it → repeat. Each forward pass produces one token. Generation is sequential by nature — you can't parallelise the generation of future tokens (though prompt processing is parallel).

Why Decoder-Only Wins for Generation

Encoder-only models (BERT) see the full context bidirectionally — great for understanding, but can't generate. Decoder-only enforces the causal constraint that makes autoregressive generation coherent and consistent.

GPT Decoder-Only Architecture — causal attention, autoregressive generation
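The triangular mask from the Causal Self-Attention card above can be sketched directly in NumPy. This shows only the masking and softmax step on a precomputed score matrix; queries, keys, values, and multiple heads are left out, and the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in a (seq_len, seq_len) score matrix, then softmax each row."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)          # future positions set to -inf
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)                            # exp(-inf) = 0
    return weights / weights.sum(axis=1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# Row t spreads its attention evenly over positions 0..t and gives 0 to later positions.
```

Because every row is computed at once, prompt processing is parallel even though generation proceeds one token at a time.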

Neural Scaling Laws In-Depth

Kaplan et al. (OpenAI, 2020) discovered that language model loss decreases as a smooth power law as you increase model size (N), dataset size (D), or compute (C). This isn't a vague trend: it's a precise mathematical relationship, L(N) ∝ N^−α where α ≈ 0.076. Double the parameters and loss falls by a predictable factor of 2^−0.076 ≈ 0.95, about 5% lower.

Three factors drive scaling: N (parameters), D (dataset tokens), and C (compute in FLOPs). The breakthrough insight: you must scale all three together. Scaling parameters alone while holding data fixed gives diminishing returns.

The Chinchilla Finding (Hoffmann et al., 2022)

Optimal scaling grows parameters AND training tokens in equal proportion as compute increases, which works out to roughly 20 training tokens per parameter. GPT-3 was undertrained by this standard: 175B params trained on only 300B tokens (about 1.7 tokens per parameter). Chinchilla-optimal would be ~3.5T tokens. LLaMA-2 7B trained on 2T tokens, far more tokens per parameter than GPT-3, and performed remarkably well. Practical implication: "Train a smaller model on more data", which is also better for inference costs.

Kaplan Scaling Laws (2020)
L(N) ≈ (N_c/N)^α_N
L(D) ≈ (D_c/D)^α_D
L(C) ≈ (C_c/C)^α_C
N = parameters, D = dataset tokens, C = compute (FLOPs); N_c, D_c, C_c and the exponents α are fitted constants

Chinchilla Optimal Scaling (2022)
N_opt ∝ C^0.5
D_opt ∝ C^0.5
For every 2× increase in compute → scale BOTH model size AND training tokens by √2 ≈ 1.4× (doubling both requires about 4× the compute)
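A few lines of arithmetic make these relationships tangible. The sketch below relies on two rules of thumb that are not stated in this chapter and should be treated as assumptions: roughly 20 training tokens per parameter for Chinchilla-optimal training, and the common approximation C ≈ 6·N·D for training FLOPs. The function names are made up for the example.

```python
def loss_ratio_when_scaling_params(factor, alpha=0.076):
    """L(N) ∝ N^(-alpha): ratio of new loss to old loss when N is multiplied by factor."""
    return factor ** (-alpha)

def chinchilla_tokens(params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count (assumed ~20 tokens per parameter)."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Common approximation C ≈ 6 · N · D (an assumption, not from this chapter)."""
    return 6 * params * tokens

doubling = loss_ratio_when_scaling_params(2)   # about 0.949, i.e. roughly 5% lower loss
gpt3_optimal = chinchilla_tokens(175e9)        # 3.5e12 tokens, i.e. ~3.5T
gpt3_ratio = 300e9 / 175e9                     # about 1.7 tokens per parameter
growth_per_compute_doubling = 2 ** 0.5         # N_opt and D_opt each grow ~1.41×
```

Under C ≈ 6·N·D, multiplying both N and D by √2 doubles C, which is consistent with N_opt ∝ C^0.5 and D_opt ∝ C^0.5.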

Neural Scaling Laws — Loss vs Model Size (log-log plot)

GPT-1 to GPT-4 Timeline In-Depth

📝

GPT-1 — June 2018

117M parameters, trained on BookCorpus. First generative pre-training paper. Demonstrated that unsupervised pre-training + supervised fine-tuning on 12 tasks produced strong NLU results. Proof of concept.

🚀

GPT-2 — Feb 2019

1.5B parameters, trained on WebText (40GB of Reddit-filtered web pages). First model to show zero-shot capabilities — performing tasks with no task-specific training. The "too dangerous to release" controversy put LLMs in the public consciousness.

⚡

GPT-3 — June 2020

175B parameters, trained on 300B tokens. Introduced few-shot learning from the prompt alone — no gradient updates needed. In-context learning: provide examples in the prompt, and GPT-3 generalises. This changed everything.

🎯

InstructGPT — Jan 2022

GPT-3 fine-tuned with RLHF (Reinforcement Learning from Human Feedback). Follows instructions, avoids harmful output. Much more useful than raw GPT-3. Foundation for alignment research.

💬

ChatGPT — Nov 2022

GPT-3.5 + RLHF + chat interface. Reached an estimated 100 million users in about two months, at the time the fastest consumer product adoption on record. Made LLMs accessible to non-technical users. Started the "AI moment".

🧠

GPT-4 — March 2023

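Multimodal (image + text). Parameter count undisclosed; unofficial estimates are around 1 trillion. Professional exam performance as reported by OpenAI: a simulated bar exam at around the 90th percentile, plus strong results on the SAT and medical licensing questions. Step change in reasoning quality.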

🌐

GPT-4o — 2024

Native voice + vision, fast inference, GPT-4 quality at lower cost. "Omni" model — unified multimodal architecture. Real-time conversation with vision understanding.

🔗

o1, o3 — 2024–2025

Reasoning models with chain-of-thought. New frontier: models that "think" before answering, spending more compute at inference time. Trade speed for accuracy on complex tasks.

GPT Family — exponential parameter growth from 117M to ~1T

Emergent Capabilities In-Depth

Wei et al. (2022) documented a surprising phenomenon: certain capabilities appear suddenly at a scale threshold — they are essentially absent in smaller models and then abruptly present in larger ones. These emergent abilities were not explicitly trained. The model was only ever trained to predict the next token. Yet above a certain parameter count, it can perform multi-step arithmetic, chain-of-thought reasoning, translation between unseen language pairs, and code generation.

Multi-step Arithmetic

Below ~10B params ≈ random performance. Above 100B → suddenly works with high accuracy. The model learns to decompose calculations despite never being explicitly taught arithmetic.

Chain-of-thought Reasoning

Appears around 100B parameters. Prompting "Let's think step by step" has zero effect on small models but dramatically improves large model accuracy on multi-step reasoning tasks.

Unseen Language Translation

Models trained primarily on English data can translate between language pairs never seen during training. This capability emerges at scale — evidence of internal multilingual representations.

Code Generation

Near-zero at 1B, functional at 10B, excellent at 100B+. Models go from generating syntactic garbage to writing correct, complex programs — a phase transition in capability.

The Debate: Are Emergent Abilities Real?

Schaeffer et al. (2023) argued that emergent abilities may be measurement artifacts — they appear "sudden" because we use discontinuous metrics (e.g., exact-match accuracy). With continuous metrics (e.g., log-likelihood), improvement is smooth. The debate continues, but the practical observation holds: there are capability thresholds below which models are useless at certain tasks.

Emergent Abilities — sudden capability jumps at scale thresholds

Inference & Sampling Strategies Core

How does an LLM actually generate text? At each step, the model outputs a probability distribution over the entire vocabulary. The decoding strategy determines which token to pick from that distribution. This choice dramatically affects output quality, diversity, and creativity.

Greedy

Always pick the most probable token. Deterministic, often repetitive for long texts. argmax at every step.

Beam Search

Maintain top-k sequences at each step, pick the best overall. Better quality than greedy, still not diverse.

Sampling

Sample randomly from the full distribution. Diverse but can produce incoherent text — low-probability tokens get chosen.

Top-k Sampling

Sample only from the top k most likely tokens. k=50 is common. Balances diversity and coherence.

Top-p / Nucleus

Sample from the smallest set of tokens whose cumulative probability ≥ p. Adaptive vocabulary — more tokens when distribution is flat, fewer when peaked.

Temperature

Divide logits by T before softmax. T<1 → sharper (more deterministic). T>1 → flatter (more creative/chaotic). T→0 approaches greedy.

LLM Decoding Strategies — greedy, top-k, top-p, temperature
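The strategies above compose naturally: temperature reshapes the distribution, top-k or top-p truncates it, and a token is then sampled from what remains. The NumPy sketch below is a simplified illustration with hypothetical function names and toy logits, not the decoding code of any particular library.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def apply_temperature(logits, temperature):
    """Divide logits by T, then softmax (T must be > 0)."""
    return softmax(logits / temperature)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is at least p."""
    order = np.argsort(probs)[::-1]                 # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])            # toy 4-token vocabulary
greedy_token = int(np.argmax(logits))               # greedy decoding
rng = np.random.default_rng(0)
probs = top_p_filter(apply_temperature(logits, 0.7), p=0.9)
next_token = int(rng.choice(len(probs), p=probs))   # nucleus sampling at T = 0.7
```

Because the nucleus size depends on the shape of the distribution, the same p keeps few tokens when the model is confident and many when it is not.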

Open-Source LLMs Core

The open-source LLM ecosystem exploded in 2023–2025. Models from Meta, Mistral, Alibaba, Google, Microsoft, and others are approaching closed-source frontier quality. This table captures the major families as of 2024–2025.

Model	Provider	Params	Context	License	Notable
Llama 3.1 8B/70B/405B	Meta	8B–405B	128K	Llama 3.1 Community	Leading open-weight family of 2024
Mistral 7B / 8×7B	Mistral AI	7B / ~45B	32K	Apache 2.0	Efficient MoE (Mixtral)
Qwen2.5 7B/72B	Alibaba	7B–72B	128K	Qwen	Strong multilingual
Gemma 2 9B/27B	Google	9B / 27B	8K	Gemma	Strong at size
Phi-3 mini/small	Microsoft	3.8B / 7B	128K	MIT	Small but capable
DeepSeek-R1	DeepSeek	7B–671B	64K	MIT	Reasoning-focused
Command-R+	Cohere	104B	128K	CC BY-NC	RAG-optimised

∑ Chapter 5.5 Summary — The GPT Family

GPT = decoder-only Transformer + causal (left-to-right) attention = autoregressive generation
Scaling laws: loss decreases as power law with parameters, data, and compute
Chinchilla: optimal training scales parameters AND training tokens in equal proportion (roughly 20 tokens per parameter)
Emergent abilities: capabilities appear suddenly at scale thresholds — not trained explicitly
Inference: top-p (nucleus) sampling at temperature 0.7–1.0 is the typical LLM generation setting
Open-source LLMs (LLaMA 3, Mistral, Qwen) are approaching closed-source frontier quality

5.6

Chapter 5.6

BERT & Encoder Models

BERT introduced a paradigm shift: instead of predicting the next word left-to-right, mask some words and predict them using the FULL surrounding context. This bidirectional pre-training produces richer representations that dominate understanding tasks — classification, NER, QA, and semantic search.

BERT Architecture In-Depth

Devlin et al. (Google, 2018): BERT — Bidirectional Encoder Representations from Transformers. BERT uses an encoder-only Transformer stack — no decoder, no causal mask. Every token attends to all other tokens simultaneously (bidirectional attention). This is the key difference from GPT: BERT sees the full context before producing representations.

BERT-base

12 Transformer layers
768 hidden dimension
12 attention heads
110M parameters

BERT-large

24 Transformer layers
1024 hidden dimension
16 attention heads
340M parameters

BERT Model Sizes
BERT-base: L=12 layers, H=768 dim, A=12 heads, 110M parameters
BERT-large: L=24 layers, H=1024 dim, A=16 heads, 340M parameters
Input: [CLS] sentence_A [SEP] sentence_B [SEP]
Output: Contextual embedding for every input token (shape: seq_len × H, i.e. seq_len × 768 for BERT-base)
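The parameter counts above can be reproduced approximately from the layer sizes. The sketch below assumes details not given in this chapter: a 30,522-token WordPiece vocabulary (as in bert-base-uncased), 512 positions, 2 segment types, a feed-forward width of 4×H, and a pooler layer on top. The function name is hypothetical.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512,
                         segments=2, ffn_mult=4):
    """Count weights and biases in a BERT-style encoder, including the pooler."""
    layer_norm = 2 * hidden                                    # scale + shift
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm    # Q, K, V, output projections
    ffn_hidden = ffn_mult * hidden
    feed_forward = ((hidden * ffn_hidden + ffn_hidden)
                    + (ffn_hidden * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_parameter_count(layers=12, hidden=768)     # 109,482,240 (quoted as 110M)
large = bert_parameter_count(layers=24, hidden=1024)   # 335,141,888 (quoted as 340M)
```

The number of attention heads does not appear in the formula: splitting H across more heads changes how the projections are used, not how many weights they contain.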

BERT vs GPT Attention — bidirectional vs causal masking

BERT Special Inputs Core

BERT's input representation is the sum of three embedding types (not concatenated). Every input is prepended with [CLS] and sentence pairs are separated by [SEP].

🏷️

[CLS] Token

Classification token prepended to every input. Its final hidden state is used as the aggregate sequence representation for classification tasks.

✂️

[SEP] Token

Separator token between sentence A and sentence B. Also appended at the end of the input sequence.

🎭

[MASK] Token

15% of tokens are selected for prediction during pre-training. Of those 15%: 80% → [MASK], 10% → random token, 10% → kept unchanged.

Segment embeddings tell BERT which sentence each token belongs to (Sentence A vs Sentence B). The three input components are summed element-wise: Token Embedding + Positional Embedding + Segment Embedding.

BERT Input = Token Embedding + Positional Embedding + Segment Embedding
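The 80/10/10 masking rule described for the [MASK] token can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: it works on string tokens rather than WordPiece IDs, selects each token independently with probability 0.15 instead of choosing exactly 15%, and the function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Apply BERT-style masking; returns (masked_tokens, labels)."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = position is ignored by the loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue                       # special tokens are never selected
        if rng.random() >= mask_rate:
            continue                       # token not selected for prediction
        labels[i] = tok                    # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(tokens, vocab=["dog", "tree", "ran"], mask_rate=0.5))
```

Keeping some selected tokens unchanged or randomised means the model cannot assume that only [MASK] positions need a meaningful prediction, which matters because [MASK] never appears at fine-tuning time.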

Fine-tuning BERT In-Depth

BERT's power lies in fine-tuning: take the pre-trained backbone and add a thin task-specific head. All BERT weights are updated during fine-tuning (with a small learning rate). Four canonical task types:

📄

Sequence Classification

Sentiment, topic, NLI. Add FC layer on top of [CLS] embedding → class probabilities. Fine-tune all BERT weights + FC layer.

🏷️

Token Classification

NER, POS tagging. Add FC layer on every token embedding → per-token labels. Each token gets a label independently.

❓

Extractive Question Answering

Input: [CLS] question [SEP] passage [SEP]. Output: start + end position — which span in the passage is the answer. Two vectors classify each token as answer-start or answer-end.

🔗

Sentence Pair Tasks

Similarity, entailment. Input: [CLS] sentence A [SEP] sentence B [SEP]. Use [CLS] embedding as pair representation.

BERT Fine-tuning Tasks — one backbone, four output heads
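For extractive question answering, the model emits one start score and one end score per token, and the answer is the span whose combined score is highest. The sketch below shows one common way to decode that span, assuming the logits are already available as arrays; `best_span` and `max_answer_len` are illustrative names, not a library API.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end] with start <= end."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits for a 5-token passage: token 2 looks like a start, token 3 like an end.
start = np.array([0.1, 0.2, 5.0, 0.3, 0.1])
end = np.array([0.0, 0.1, 0.2, 4.0, 0.3])
print(best_span(start, end))
```

Constraining the search to `start <= end` avoids the invalid spans that would result from taking the argmax of each vector independently.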

Code — BERT Fine-tuning with HuggingFace

from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
# Load pre-trained BERT + add classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Tokenise input (handles [CLS] and [SEP] automatically)
def tokenize_fn(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

dataset = load_dataset('imdb')
tokenized = dataset.map(tokenize_fn, batched=True)

training_args = TrainingArguments(
    output_dir='./bert-sentiment',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # CRITICAL: small LR for fine-tuning pre-trained model
    weight_decay=0.01,
    evaluation_strategy='epoch',  # named eval_strategy in recent transformers releases
    warmup_steps=500
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()

The BERT Family Core

Model	Key Innovation	Params	Training Data	Notable Improvement
BERT-base	Bidirectional Transformer + MLM + NSP	110M	BookCorpus + Wikipedia	Baseline
RoBERTa (Facebook)	Remove NSP, larger batches, more data, longer training	125M	160GB text	5–10% improvement on GLUE
DistilBERT (HuggingFace)	Knowledge distillation from BERT (40% smaller)	66M	Same	60% faster, 97% of BERT's performance
ALBERT (Google)	Cross-layer parameter sharing, sentence order prediction	12M–235M	Same	Same performance, fraction of params
DeBERTa (Microsoft)	Disentangled attention (separate content + position)	86M–1.5B	160GB	State-of-the-art on SuperGLUE
ELECTRA (Google)	Replaced Token Detection (more efficient training)	14M–335M	Same	4× more efficient than BERT

BERT vs GPT

Aspect	BERT (Encoder)	GPT (Decoder)
Attention	Bidirectional (all tokens)	Causal (left-to-right only)
Pre-training	Masked LM + NSP	Next-token prediction
Best For	Understanding (classify, NER, QA)	Generation (chat, completion)
Output	Contextual embeddings	Generated text
Fine-tuning	Add task head, small dataset OK	Prompt-based, few-shot
Scale	110M–1.5B	117M–1.8T+

When to Use Encoders vs Decoders Core

Use BERT / Encoder Models

→ Understanding tasks (classification, NER, QA)
→ Sentence embeddings for semantic search
→ NLI and entailment
→ Smaller, faster fine-tuning
→ Bidirectional context needed
→ Tasks with fixed input→label format

Use GPT / Decoder Models

→ Generation tasks (chat, completion, summarisation)
→ Zero/few-shot prompting
→ Reasoning over long contexts
→ Instruction following
→ Tasks requiring flexible output format
→ When you have no task-specific labels

∑ Chapter 5.6 Summary — BERT & Encoder Models

BERT: encoder-only, bidirectional attention — each token sees all tokens simultaneously
Pre-training: Masked LM (predict 15% masked tokens) + NSP on Wikipedia + BookCorpus
Fine-tuning: add task-specific head on [CLS] (classification) or all tokens (NER)
RoBERTa improves BERT by: more data, remove NSP, larger batches, longer training
DistilBERT: 40% smaller, 60% faster, 97% of BERT performance via knowledge distillation
Use BERT for understanding tasks; use GPT for generation and instruction following

5.7

Chapter 5.7

Prompt Engineering

Prompt engineering is the art and science of crafting inputs to get desired outputs from LLMs. The same model can produce radically different quality depending on how you ask — mastering the prompt is mastering the interface to intelligence.

What Is Prompt Engineering? Core

LLMs are not search engines — they are conditional probability machines. Given your prompt as the beginning of a document, they predict what comes next. The quality and structure of that beginning determines everything about the continuation.

Compare: "What is the capital of France?" vs "Answer as a geography teacher giving a detailed explanation: What is the capital of France?" — same factual answer but very different style and depth.

Mental Model

You are writing the beginning of a document that the LLM will continue. The better the beginning, the better the continuation.

Five prompt components:

① Instruction

What you want the model to do. Be specific and explicit: "Summarise in 3 bullets" not "Summarise".

② Context

Background information the model needs: domain, audience, constraints, prior conversation.

③ Input Data

The actual content to process: text to classify, code to review, question to answer.

④ Output Format

How you want the answer: JSON, bullet list, table, single word, code block.

⑤ Examples

Demonstrations of desired input→output pairs (few-shot). The model learns the pattern in-context.

Zero-Shot & Few-Shot Prompting In-Depth

Zero-shot: ask the model without any examples. Works well for simple, well-defined tasks.

Few-shot (in-context learning): provide 2–5 examples before asking. Brown et al. (GPT-3, 2020) showed that providing examples dramatically improves performance. The model is not fine-tuned — it adapts to the task from examples in its context window.

One-shot: exactly one example — sometimes all you need for well-defined tasks.

Key Insight

Example selection matters: choose examples that cover edge cases and represent the full range of expected inputs. Diverse examples outperform similar ones.

Zero-Shot, One-Shot, Few-Shot — In-Context Learning
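Zero-shot, one-shot and few-shot prompts differ only in how many demonstrations precede the real query. A minimal sketch of a prompt builder follows; the function name and the "Input:/Output:" layout are assumptions chosen for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, (input, output) demonstrations, and the query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # End with an open "Output:" so the model completes the pattern.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

demos = [("Great film, loved it", "positive"), ("Dull plot, weak acting", "negative")]
print(build_few_shot_prompt("Classify the sentiment.", demos, "A surprisingly moving story"))
```

Passing an empty `examples` list yields a zero-shot prompt, and a single pair yields a one-shot prompt, so the same helper covers all three settings.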

Chain-of-Thought Prompting In-Depth

Wei et al. (Google, 2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". The key insight: including worked examples with intermediate reasoning steps in the prompt, so that the model reasons step by step before answering, dramatically improves accuracy on math, logic, and multi-step problems.

Why it works: the model generates intermediate steps → each step conditions the next → less error accumulation. The chain of reasoning acts as a scratchpad that keeps the model on track.

Standard Prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?

A: 11

↑ Direct answer — works for simple tasks, fails for multi-step

CoT Prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?

A: Roger started with 5. Each can has 3 balls, so 2 cans = 6 balls. 5 + 6 = 11. The answer is 11.

↑ Explicit steps — each step conditions the next

Zero-Shot CoT (Kojima et al., 2022)

Just add "Let's think step by step" to any prompt — no examples needed. This simple suffix unlocks reasoning in large models.

Chain-of-Thought — explicit reasoning steps dramatically improve accuracy

Structured Prompting In-Depth

Structure your prompts for reliable, parseable output. Three key techniques:

📋

Output Formatting

Ask for JSON, XML, or specific structure. Example: "Return as JSON: {"sentiment": "...", "confidence": 0-1}"

→ Parse programmatically, no regex hacks

🎭

Role Prompting

"You are an expert Python developer with 10 years of experience."

→ Sets persona, knowledge domain, and response style

🔒

Delimiters

Use triple backticks, XML tags, or --- to separate instruction from data.

→ Reduces prompt-injection risk, clarifies boundaries

Code — Structured Entity Extraction with OpenAI API

import openai, json

def extract_entities(text: str) -> dict:
    response = openai.chat.completions.create(
        model="gpt-4o",            # JSON mode requires a model that supports response_format
        messages=[
            {"role": "system", "content": "You extract named entities from text. Return JSON only."},
            {"role": "user", "content": f"""Extract entities from:
```{text}```

Return format: {{"people": [...], "organizations": [...], "locations": [...], "dates": [...]}}"""}
        ],
        temperature=0.0,           # minimises sampling randomness for structured output
        response_format={"type": "json_object"}   # JSON mode: constrains the reply to valid JSON
    )
    return json.loads(response.choices[0].message.content)

result = extract_entities("Tim Cook announced Apple's Q3 earnings in Cupertino on Tuesday, August 1st.")
print(json.dumps(result, indent=2))
# Example output (exact values may vary between runs):
# {"people": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["August 1st"]}

System Prompts

In chat-based APIs (OpenAI, Anthropic, etc.), the system prompt sets the model's behaviour, persona, and constraints before the user speaks. It's the most powerful lever for controlling output quality.

What Goes in System Prompts

Role and persona definition
Output format requirements
Constraints and guardrails
Domain knowledge or context
Tone and style instructions

Best Practices

Be explicit and specific — don't assume the model infers intent
Put constraints up front (format, length, language)
Use delimiters to separate user content from instructions
Test with adversarial inputs
Don't rely on system prompt secrecy for security

System Prompt Template
You are [ROLE] with expertise in [DOMAIN].
Your task: [INSTRUCTION]
Rules: 1) [CONSTRAINT] 2) [FORMAT] 3) [GUARDRAIL]
If uncertain, say "I don't know"; do NOT hallucinate.
Always put the most important constraints first, because models attend more strongly to the beginning.

Prompt Patterns Catalogue Core

Pattern	When to Use	Template	Example
Role Pattern	Need domain expertise	"You are a [role] with [experience]..."	"You are a senior Python engineer reviewing code for bugs"
Step-by-Step	Multi-step reasoning, math	"Think step by step..."	"Solve this problem step by step: ..."
Output Format	Need structured data	"Return as JSON/list/table..."	"Return as JSON: {fields}"
Few-Shot	Task hard to specify, need examples	"[Example 1]→[Output 1]\n[Input]→?"	Sentiment, classification, entity extraction
Chain-of-Thought	Reasoning, math, logic	"[problem] Let's think step by step"	Math word problems, logical puzzles
Delimiter	Long context, avoid injection	"Summarise: ```{text}```"	Document processing, code review
Self-Ask	Complex multi-hop questions	"Are there any follow-up questions?"	Research synthesis, fact verification

Common Pitfalls Core

⚠️

Prompt Injection

Malicious input overrides instructions: "Ignore previous instructions and..."

Mitigation: use delimiters, validate inputs, separate system and user content.

⚠️

Prompt Leaking

User can extract system prompt: "Repeat all your instructions"

Mitigation: don't rely on prompt secrecy for security, use proper access controls.

💡

Ambiguous Instructions

Vague prompts → inconsistent outputs. Be explicit: "Respond in 3 bullet points of max 20 words each".

💡

Lost in the Middle

LLMs attend better to start and end of context. Put most important info first or last. (Liu et al., 2023: "Lost in the Middle" phenomenon)

∑ Chapter 5.7 Summary — Prompt Engineering

Few-shot in-context learning: examples in the prompt teach the task — no gradient updates needed
Chain-of-thought: "Let's think step by step" — explicit reasoning steps reduce errors
Structured output: specify JSON/XML format → parse programmatically
ReAct pattern: Thought→Action→Observation loop — foundation of tool-using agents (Domain 8)
Prompt injection: user input can override instructions — always use delimiters to separate content
"Lost in the Middle": LLMs attend best to start and end of context — put key info there

5.8

Chapter 5.8

Retrieval-Augmented Generation

RAG is the bridge between an LLM's frozen knowledge and the living, changing world. Instead of retraining a model every time information changes, retrieve relevant documents at query time and inject them into the prompt — grounding the model's answers in real, verifiable sources.

Why RAG? Core

Three fundamental LLM limitations make RAG essential for production systems:

📅

Knowledge Cutoff

GPT-4's training data has a cutoff date. Any event after it is unknown. "What happened at the UN Security Council yesterday?" → hallucination.

RAG fix: retrieve yesterday's news, inject into context.

🌀

Hallucination on Specifics

LLMs confabulate details — addresses, phone numbers, dates, internal policies. "What is our Q3 refund policy?" → makes something up.

RAG fix: retrieve actual policy document, ground the answer.

🔒

Private Knowledge

Your internal docs, contracts, code, Slack history — not in any LLM. "Summarise our client contract with Acme Corp" → impossible.

RAG fix: embed and retrieve from your private document store.

RAG Architecture In-Depth

RAG has two phases: an offline indexing pipeline (run once or periodically) and an online query pipeline (run at every user question). Both share the same embedding model and vector database.

Indexing Phase (Offline)

Load documents (PDFs, web pages, Word, Slack, etc.)
Chunk into smaller pieces (e.g., 512 tokens each)
Generate embedding vector for each chunk
Store vectors in vector database

Query Phase (Online)

User asks a question
Embed the question (same model)
Vector search: find top-k similar chunks
Inject retrieved chunks into LLM prompt
LLM generates grounded answer

RAG Architecture — Indexing (offline) and Query (online) pipelines

Code — Simple RAG with LangChain

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# === INDEXING PHASE ===
# Load and chunk documents
with open("company_policy.txt", "r") as f:
    text = f.read()

# Note: chunk_size and chunk_overlap are measured in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.create_documents([text])
print(f"Created {len(chunks)} chunks")

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# === QUERY PHASE ===
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks

llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True     # show which docs were retrieved
)

result = qa_chain.invoke("What is the return policy for electronics?")
print(result["result"])
print("Sources:", [d.metadata for d in result["source_documents"]])

Vector Databases In-Depth

Vector databases are specialised stores for embedding vectors, optimised for similarity search. The core operation is Approximate Nearest Neighbour (ANN) search: "find the k vectors closest to this query vector." Exact search costs O(n·d) per query (n vectors of dimension d); ANN indexes such as HNSW trade a small loss in recall for sub-linear search time.

Vector Search — k-nearest neighbour retrieval by cosine similarity
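To see what an ANN index approximates, it helps to write out the exact version. The sketch below performs brute-force top-k retrieval by cosine similarity with NumPy; it is the O(n·d) baseline, not how any particular vector database is implemented, and the function name and toy vectors are illustrative.

```python
import numpy as np

def top_k_cosine(query, vectors, k=3):
    """Exact nearest-neighbour search: return (indices, similarities) of the k most similar rows."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity of every stored vector to the query
    order = np.argsort(-sims)[:k]     # indices sorted by descending similarity
    return order, sims[order]

# Four toy 2-d "embeddings" and a query pointing mostly along the first axis.
store = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = top_k_cosine(np.array([1.0, 0.1]), store, k=2)
print(idx, scores)
```

Every query touches all n stored vectors here, which is exactly the cost that graph-based or inverted-file indexes avoid at scale.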

Database	Type	ANN Algorithm	Filtering	Best For	Hosted?
Chroma	Open source	HNSW	Metadata	Dev / prototyping	Self-hosted / Cloud
Pinecone	Cloud-native	Proprietary	Metadata + hybrid	Production, scale	Cloud only
Weaviate	Open source	HNSW	GraphQL	Hybrid search, graphs	Both
Qdrant	Open source	HNSW	Payload	High performance	Both
pgvector	PostgreSQL ext	IVF / HNSW	Full SQL	Existing PG infra	Self-hosted
Milvus	Open source	IVF, HNSW	Scalar	Billion-scale	Self-hosted / Cloud

Chunking Strategies In-Depth

Chunk size matters enormously — wrong size degrades RAG quality. Too small: insufficient context in each chunk → incomplete answers. Too large: irrelevant content mixed with relevant → noisy retrieval. Chunks should overlap (50–100 tokens) to avoid splitting context across boundaries.

📏

Fixed-Size

Split every 512 tokens regardless of content. Simple but crude — may cut mid-sentence.

📝

Sentence-Based

Split on sentence boundaries. Better semantic coherence, preserves complete thoughts.

🔄

Recursive

Try paragraphs, then sentences, then words. LangChain's default. Best general-purpose strategy.

🧠

Semantic

Use embedding similarity to detect topic changes. Expensive but produces the most coherent chunks.

📑

Document-Aware

Use document structure (headers, sections, tables). Best for structured documents like reports and manuals.

Chunking Strategies — how you split documents impacts retrieval quality
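The simplest of these strategies, fixed-size chunking with overlap, can be sketched directly. This is an illustrative helper, not LangChain's implementation; it assumes the text has already been split into a list of tokens (a whitespace split stands in for a real tokeniser here).

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

words = ("lorem ipsum " * 500).split()   # 1,000 stand-in tokens
chunks = chunk_with_overlap(words, chunk_size=512, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some tokens twice.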

Advanced RAG Patterns In-Depth

Naive RAG (embed → search → generate) works for many cases. These advanced patterns dramatically improve precision and recall for production systems:

🔀

Hybrid Search

Combine dense (embedding) + sparse (BM25 keyword) search. Better recall for exact phrase matches AND semantic similarity. Use Reciprocal Rank Fusion (RRF) to merge result lists.

🏆

Re-Ranking

Initial retrieval: top-20 by fast ANN → re-rank with expensive cross-encoder → return top-5. Cross-encoders read query + document together for much better relevance.

🔄

Query Transformation

Rewrite query before retrieval. HyDE: generate hypothetical answer, then embed that. Multi-query: generate 3 variants → retrieve for all → merge results.

👨‍👧

Parent-Child Chunks

Index small child chunks for precision retrieval. Return larger parent chunk for context. Best of both worlds — precise matching with rich context.

Advanced RAG: Two-Stage Retrieve-Then-Rerank Pipeline
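Reciprocal Rank Fusion, mentioned under hybrid search, merges ranked lists using only each document's rank, so dense and sparse scores never need to be put on a common scale. A minimal sketch follows; the constant k=60 is a commonly used default and the document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits = ["a", "b", "c"]    # e.g. from embedding search
sparse_hits = ["b", "d", "a"]   # e.g. from BM25 keyword search
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that appear near the top of both lists accumulate the largest scores, which is why a result ranked well by both retrievers outranks one that only a single retriever found.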

RAG vs Fine-Tuning Core

RAG

✓ Knowledge updatable at any time
✓ Cites sources, verifiable answers
✓ No training required
✓ Lower cost than fine-tuning
✗ Retrieval quality is the bottleneck
✗ Context window limits
✗ Latency of retrieval step

Fine-Tuning

✓ Knowledge baked into weights
✓ No retrieval latency
✓ Better for style / format / behaviour
✗ Knowledge is static (needs retraining)
✗ Can't cite specific sources
✗ Expensive to update frequently

RAG and fine-tuning are not competing approaches — they are complementary. Fine-tune to change HOW the model communicates (tone, format, domain vocabulary). Use RAG to change WHAT the model knows (current facts, private documents, enterprise data). The best production systems use both.

∑ Chapter 5.8 Summary — Retrieval-Augmented Generation

RAG solves: knowledge cutoff, hallucination on specifics, private/proprietary data
Pipeline: Chunk docs → Embed → Store in vector DB → At query: embed query → ANN search → inject → generate
Chunking: 512 tokens with 50-token overlap is a reasonable default — recursive splitting preserves structure
Vector DB: stores embeddings for ANN similarity search — cosine similarity finds semantically similar chunks
Advanced RAG: hybrid search + re-ranking dramatically improves retrieval precision
RAG vs fine-tuning: use both — RAG for dynamic knowledge, fine-tuning for style/behaviour

5.9

Chapter 5.9

Hallucination, Alignment & Evaluation

LLMs are trained to produce fluent, probable text — not factual text. Understanding why they hallucinate, how alignment steers them toward human values, and how to rigorously evaluate their output is essential for responsible deployment.

Hallucination In-Depth

Hallucination: LLMs generate factually incorrect information with apparent confidence. This is not a bug — it's a consequence of the training objective. The model was rewarded for coherent, fluent text, not for verified facts.

📛

Factual Hallucination

"Einstein won the Nobel Prize for relativity" — he actually won for the photoelectric effect. Plausible, confident, wrong.

📚

Citation Hallucination

Fabricated paper titles, non-existent authors, wrong DOIs. The model generates citation-shaped text that looks real but doesn't exist.

👤

Entity Hallucination

Made-up people, places, company names that sound real. "Westbrook Medical Center" — doesn't exist but sounds plausible.

🧩

Reasoning Hallucination

Correct-sounding reasoning leading to a wrong conclusion. Each step looks valid, but the chain produces an incorrect answer.

Intrinsic vs extrinsic hallucination:

Intrinsic Hallucination

Contradicts the provided context. The document says population = 5M, but the answer says 10M. Detectable by comparing output to source.

Extrinsic Hallucination

Fabricated content not in context. Generated from the model's world knowledge — may or may not be true. Harder to detect without external verification.

Hallucination Taxonomy — intrinsic (contradicts context) vs extrinsic (fabricated)

Why LLMs Hallucinate Core

Root cause: LLMs are trained to produce fluent, probable text — not factual text. The model doesn't "know" it doesn't know something. Confidence calibration is poor: models are confidently wrong, which is more dangerous than being uncertainly wrong.

🎯

Training Objective

Maximise next-token probability → rewarded for coherent text, not verified facts. The loss function doesn't distinguish true from plausible.

🧠

Memorisation vs Generalisation

Facts not seen enough times in training → model interpolates between facts. It generates a blend of real knowledge and pattern-matched confabulation.

😊

Sycophancy

Models trained with RLHF learn to tell users what they want to hear. If you suggest a wrong answer, the model may agree rather than correct you.

Critical Danger

The hallucination-confidence problem: models are confidently wrong. A model that says "I'm not sure" is safer than one that states a fabricated fact with full certainty. This is why calibration research is critical.

Alignment In-Depth

Alignment problem: ensure AI systems behave according to human values and intentions. A highly capable but misaligned AI is dangerous — capability without alignment amplifies harm.

The specification problem: how do you formally specify "what humans actually want"? Even well-intentioned reward functions can be gamed — the model maximises the metric in unintended ways (reward hacking).

RLHF (Partial Solution)

Human preferences act as a proxy for values. Humans rank model outputs → reward model trained on rankings → PPO optimises policy. Imperfect but significant improvement over base models.

Constitutional AI (Anthropic)

Model learns from its own self-critique using a constitution of principles. Generate → critique against principles → revise. Scales better than human labelling.

Key alignment challenges:

Distributional Shift

Behaves well in training distribution but fails on out-of-distribution deployment inputs.

Reward Hacking

Satisfies the letter but not the spirit of the reward. Finds loopholes in the reward function.

Deceptive Alignment

Appears aligned during evaluation, behaves differently when deployed. The hardest failure mode to detect.

Capability vs Alignment — the central tension in LLM development

Helpful, Harmless, Honest (HHH) Core

Anthropic's HHH framework defines the three axes of aligned model behaviour. The tension: being more helpful sometimes means being slightly less cautious (and vice versa). Constitutional AI addresses this tension by giving the model explicit principles to follow.

🤝

Helpful

Genuinely helps users accomplish tasks. Unhelpfulness is never trivially "safe" — a model that refuses everything harms users who have legitimate needs.

🛡️

Harmless

Avoids generating content that causes real-world harm. Calibrated — not reflexively refusing edge cases. Context matters: medical information for a nurse vs a stranger.

🔍

Honest

Doesn't claim certainty it doesn't have. Proactively shares relevant information. Doesn't pursue hidden agendas or deceive about its nature.

Constitutional AI Process

Generate → critique against principles → revise → repeat. The model becomes its own alignment judge, guided by a written constitution of values. Scales far better than per-output human labelling.

NLP Evaluation Metrics In-Depth

Automatic metrics enable scalable evaluation, but each has significant blind spots. Understanding their strengths and weaknesses is essential for trustworthy evaluation.

📊

BLEU (Translation)

Precision of n-gram overlap between generated and reference text. Range: 0–1. Weakness: doesn't capture meaning, penalises valid paraphrase.

📝

ROUGE (Summarisation)

Recall of n-gram overlap with reference. ROUGE-N: n-gram recall. ROUGE-L: longest common subsequence. Weakness: length bias, synonym-blind.

🌐

METEOR

Combines precision, recall, and semantic matching via WordNet synonyms. Better correlation with human judgement than BLEU alone.

🤖

BERTScore

Uses BERT embeddings to measure semantic similarity. More robust than n-gram metrics — captures paraphrase and meaning equivalence.

Key Metric Formulas
BLEU = BP · exp(∑ₙ wₙ log pₙ), where pₙ = n-gram precision and BP = brevity penalty
ROUGE-N = (# overlapping n-grams) / (# n-grams in reference)
Perplexity = exp(H(p,q)) = exp(-(1/N) ∑ᵢ log q(xᵢ))
Lower perplexity = better language model. Measures how "surprised" the model is by the test text.
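The building blocks of these formulas are easy to compute by hand. The sketch below implements clipped n-gram precision (the pₙ term inside BLEU), ROUGE-N recall, and perplexity from per-token probabilities. It is a simplified illustration: full BLEU also needs the brevity penalty and the weighted combination over n, and the function names are assumptions rather than a library API.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: overlapping n-grams / n-grams in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

def rouge_n(candidate, reference, n):
    """ROUGE-N recall: overlapping n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def perplexity(token_probs):
    """exp of the average negative log-probability the model assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(ngram_precision(cand, ref, 1), ngram_precision(cand, ref, 2), rouge_n(cand, ref, 1))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns probability 0.25 to every token is as uncertain as a uniform choice among four options, which is why its perplexity comes out at 4.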

Automatic vs Human Evaluation — speed-accuracy tradeoff

LLM Benchmarks In-Depth

Benchmark	Tests	Format	Human Baseline	Note
MMLU	57-subject knowledge (~16K questions)	Multiple choice	~89%	Knowledge breadth
HumanEval	Python function generation (164 problems)	Code generation	~75%	Coding
GSM8K	Grade school math (8.5K problems)	Multi-step reasoning	~95%	Math
MATH	Competition math (12.5K problems)	Multi-step hard math	~40% (students)	Hard math
ARC-AGI	Visual pattern reasoning	Novel test patterns	~85%	Novel reasoning
GPQA Diamond	PhD-level science (198 questions; the full GPQA main set has 448)	Multiple choice	~65%	Expert knowledge
MT-Bench	Multi-turn dialogue quality	GPT-4 as judge	—	Chat quality
Chatbot Arena	Head-to-head human preference	Elo rating	—	Real-world preference

Goodhart's Law applies everywhere in LLM evaluation: when a benchmark becomes a target, it ceases to be a good measure. Models trained on benchmark-adjacent data score artificially high. The most trustworthy evaluation is diverse human assessment on novel, never-before-seen tasks.

∑ Chapter 5.9 Summary — Hallucination, Alignment & Evaluation

Hallucination: LLMs generate confident falsehoods — trained for fluency, not factual accuracy
Types: factual, citation, entity, reasoning hallucinations; RAG grounding reduces them, while temperature=0 only removes sampling noise and does not guarantee factuality
Alignment: ensure models behave according to human values (Helpful, Harmless, Honest)
RLHF and Constitutional AI: current best approaches to alignment — imperfect but significant improvement
BLEU/ROUGE: n-gram metrics for translation/summarisation — fast but miss semantic equivalence
Human evaluation remains the gold standard — automatic metrics can be gamed

5.10

Chapter 5.10

LLM Fine-Tuning in Practice

Fine-tuning is the final tool in the LLM adaptation toolkit, used when prompting and RAG aren't enough. Modern techniques like QLoRA make it possible to fine-tune a 70B parameter model on a single 48GB GPU.

When to Fine-Tune (vs RAG vs Prompting) In-Depth

Decision framework — try in order (cheapest first):

1️⃣

Prompt Engineering (Free)

Can you get the desired behaviour with a better prompt? Better system prompt, clearer instructions, output format specification. Always try this first.

2️⃣

Few-Shot Examples (Free)

Add 3–10 examples to the prompt. Often dramatically improves output quality for classification, extraction, and formatting tasks.

3️⃣

RAG (Moderate Cost)

Does the model need access to external, updated, or private knowledge? RAG grounds answers in retrieved documents without any training.

4️⃣

Fine-Tuning (Higher Cost)

Does the model need to change its behaviour, style, or domain expertise? Fine-tuning bakes capabilities into the weights.

Fine-tuning IS the right choice when:

Consistent Format

Always return JSON in a specific schema, or follow a strict output template.

Domain Vocabulary

Medical jargon, legal language, internal code style that prompts can't reliably teach.

Reduce Token Usage

Bake instructions into weights that would otherwise consume context window space.

Faster Inference

A smaller fine-tuned model can outperform a larger prompted model — lower cost per query.

500+ Examples

You have high-quality labelled data. Without data, fine-tuning can't help.

When to Fine-Tune — try prompting and RAG first

Data Preparation In-Depth

The #1 determinant of fine-tuning quality is data quality. Garbage in, garbage out — but amplified by the power of gradient descent. Format: instruction-following datasets use the chat message format (system / user / assistant).

Minimum Dataset Sizes

500 examples — for format/style changes
1,000+ examples — for new capability
5,000+ examples — for complex domain tasks

Data Quality Checklist

Diverse inputs (edge cases, different phrasings)
Consistent output quality (human-reviewed)
No duplicate or near-duplicate examples
Balanced classes (for classification)

Synthetic Data Generation

Use GPT-4 to generate training examples, then human spot-check 10–20%. This is the fastest way to build a high-quality dataset. Filter aggressively — 500 excellent examples beat 5,000 mediocre ones.

Code — Data Preparation for Fine-Tuning (JSONL Chat Format)

import json
from pathlib import Path

# Chat format for instruction fine-tuning (OpenAI / LLaMA format)
def create_training_example(instruction: str, input_text: str, output: str) -> dict:
    messages = [
        {"role": "system", "content": "You are a helpful assistant specialised in contract analysis."},
        {"role": "user",   "content": f"{instruction}\n\n{input_text}" if input_text else instruction},
        {"role": "assistant", "content": output}
    ]
    return {"messages": messages}

# Example dataset creation
examples = [
    create_training_example(
        instruction="Extract the termination clause from this contract:",
        input_text="...contract text...",
        output="Termination clause (Section 12): Either party may terminate with 30 days written notice..."
    ),
    # Add 499+ more examples
]

# Validate format
for ex in examples:
    assert len(ex["messages"]) == 3
    assert ex["messages"][-1]["role"] == "assistant"
    assert len(ex["messages"][-1]["content"]) > 0, "Empty response!"

# Save as JSONL (one JSON per line)
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(f"Training examples: {len(examples)}")
print(f"Avg response length: {sum(len(e['messages'][-1]['content']) for e in examples) / len(examples):.0f} chars")

Supervised Fine-Tuning (SFT) In-Depth

SFT objective: minimise cross-entropy loss on the assistant turns only. The user/system turns are masked — the model doesn't compute loss on the prompt tokens, only on the response it should have generated.

SFT Loss Function
L = -(1/Nᵣ) ∑ₜ log P(aₜ | s, u, a₁,...,aₜ₋₁)
Only computed over the Nᵣ response (assistant) tokens a₁,...,aₙ (n = Nᵣ). System prompt s and user message u are MASKED.
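The masking in the SFT loss amounts to averaging the negative log-probability over response positions only. A minimal NumPy sketch follows, assuming the per-token log-probabilities of the correct tokens have already been extracted from the model; the function name and the toy numbers are illustrative.

```python
import numpy as np

def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over assistant tokens; prompt tokens are excluded."""
    log_probs = np.asarray(token_log_probs, dtype=float)
    mask = np.asarray(is_response, dtype=bool)
    if not mask.any():
        raise ValueError("at least one response token is required")
    return -log_probs[mask].mean()

# Two prompt tokens (system/user) followed by two assistant tokens.
log_probs = [-5.0, -5.0, -1.0, -3.0]
is_response = [0, 0, 1, 1]
print(sft_loss(log_probs, is_response))
```

The poorly predicted prompt tokens do not affect the result, so gradient updates are spent on producing the response rather than on reproducing the prompt. Training libraries typically achieve the same effect by setting the labels of prompt positions to an ignore index.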

Key hyperparameters:

Parameter	Typical Range	Notes
Epochs	1–3	More = overfitting, memorisation
Learning Rate	1e-5 to 2e-4	Low end for full fine-tuning, high end for LoRA adapters
Batch Size	8–64	Use gradient accumulation for limited VRAM
Warmup	3–10% of steps	Prevents early instability
Max Seq Length	2048–4096	Match model's typical context

Catastrophic Forgetting

Fine-tuning on a narrow task → model forgets general capabilities. Mitigation: use LoRA (only updates a small fraction of weights), or mix in general instruction-following data alongside your task data.

LoRA & QLoRA in Practice In-Depth

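QLoRA (Dettmers et al., 2023): LoRA applied on top of a frozen, 4-bit quantised base model, with only the small adapter matrices trained in higher precision. The paper reports fine-tuning a 65B model on a single 48GB GPU, and the same approach brings 70B-class models within reach of a single high-memory GPU, which is impractical at 16-bit precision.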
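A back-of-the-envelope calculation shows why quantisation makes the difference. The helper `weight_memory_gb` is a hypothetical name, and the estimate covers the base weights only: it ignores quantisation constants, LoRA adapters, optimiser state, and activations, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"70B weights at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 70B weights at 16-bit: 140 GB
# 70B weights at 8-bit: 70 GB
# 70B weights at 4-bit: 35 GB
```

At 16-bit the weights alone exceed any single 48GB or 80GB card, while at 4-bit they leave headroom for adapters and activations.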

QLoRA Hardware Requirements — which GPU can fine-tune which model size

Code — QLoRA Fine-Tuning with Unsloth (2× faster)

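# Unsloth: faster fine-tuning with memory-efficient kernels (the project reports ~2x)
# Import unsloth before trl/transformers so its optimisations are applied.
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model in 4-bit quantisation
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,      # QLoRA: 4-bit quantised base
    dtype=None              # auto-detect compute dtype
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                   # LoRA rank
    lora_alpha=16,          # scaling factor; update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth"   # Unsloth reports ~30% lower VRAM
)

# Load the JSONL written earlier and render each conversation into one
# "text" string using the model's own chat template.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")
train_dataset = train_dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

# Prefer bf16 where the GPU supports it; fall back to fp16 otherwise.
use_bf16 = torch.cuda.is_bf16_supported()

# Note: argument names follow older TRL releases; newer releases move
# dataset_text_field / max_seq_length into SFTConfig and rename `tokenizer`.
# As written, loss is computed over the whole rendered text (prompt included);
# response-only masking needs an extra step.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,    # effective batch = 2 x 4 = 8
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not use_bf16,
        bf16=use_bf16,
        logging_steps=10,
        output_dir="./output",
        warmup_ratio=0.05,
        lr_scheduler_type="cosine"
    )
)

trainer.train()
# Saves only the LoRA adapter weights (tens to a couple of hundred MB at r=16,
# depending on dtype), not the 8B base model.
model.save_pretrained("./my-llama3-ft")
tokenizer.save_pretrained("./my-llama3-ft")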
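To see why the saved adapter is so small compared with the base model, the sketch below counts the trainable LoRA parameters for the configuration above. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, rank, num_layers):
    """Each adapted (d_in x d_out) matrix adds A (d_in x r) and B (r x d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed Llama 3 8B dimensions (verify against the model's config.json).
HIDDEN, KV_DIM, MLP = 4096, 1024, 14336
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

n_lora = lora_param_count(shapes, rank=16, num_layers=32)
print(f"trainable LoRA params: {n_lora:,}")                # 41,943,040
print(f"share of an 8B model:  {n_lora / 8e9:.2%}")
print(f"adapter size at 16-bit: {n_lora * 2 / 1e6:.0f} MB")
print(f"adapter size at 32-bit: {n_lora * 4 / 1e6:.0f} MB")
```

Under these assumptions the adapters are roughly half a percent of the base model's parameters, and the file size on disk depends on the precision in which they are saved.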

DPO: Direct Preference Optimisation Core

Rafailov et al. (2023): "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." RLHF requires three components: SFT model + reward model + PPO training — complex, unstable, expensive.

DPO insight: the optimal RLHF policy has a closed-form solution — no separate reward model needed. DPO uses preference pairs: (prompt, chosen_response, rejected_response). Train to increase likelihood of chosen relative to rejected. Simpler, more stable, often achieves similar or better results.

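DPO Loss Function

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ = chosen response, $y_l$ = rejected response, $\pi_\theta$ = the model being fine-tuned, $\pi_{\text{ref}}$ = the frozen reference SFT model, and $\beta$ = a temperature-like coefficient that sets how strongly the policy is held close to the reference.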
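For a single preference pair, the loss reduces to a few lines of arithmetic on sequence log-probabilities. The sketch below is illustrative: `dpo_loss` and `log_sigmoid` are hypothetical helper names, the log-probabilities are made-up numbers, and β = 0.1 is an assumed value rather than a recommendation from the source.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the total log-probability of a full response given the prompt.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, loss is log 2.
loss_start = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy has moved the wrong way.
loss_worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)

print(round(loss_start, 4), round(loss_better, 4), round(loss_worse, 4))
```

Because only log-probability ratios against the reference enter the loss, the policy is rewarded for changing its relative preference between the two responses, not for simply making both more likely.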

RLHF

→ Requires reward model training
→ PPO: complex RL algorithm
→ 3 separate models to maintain
→ Compute intensive
→ Hyperparameter-sensitive
→ Gold standard for alignment

DPO

→ No separate reward model
→ Direct gradient on preference pairs
→ Only 2 models (policy + reference)
→ Simpler, more stable
→ Fewer hyperparameters
→ Increasingly preferred (2023–2025)

Full Fine-Tuning Pipeline Core

1️⃣ Define Task & Collect Data
500–2,000 examples in JSONL chat format. Ensure quality over quantity.

2️⃣ Data Validation
Check format, deduplicate, quality filter. Remove examples with empty or low-quality responses.

3️⃣ Choose Base Model
LLaMA 3, Mistral, Qwen: pick based on size, language, licence, and your hardware.

4️⃣ SFT with QLoRA
Unsloth or HuggingFace TRL. r=16, lr=2e-4, 1–3 epochs. Monitor loss convergence.

5️⃣ Evaluate on Hold-out
Task-specific metrics + human evaluation. Check for catastrophic forgetting on general tasks.

6️⃣ Optionally: DPO
Preference tuning on failure cases. Collect chosen/rejected pairs from model outputs.

7️⃣ Merge & Quantise
Merge LoRA adapters into full model. Quantise to GGUF (4-bit) for efficient inference.

8️⃣ Deploy
Ollama (local), vLLM (server), or cloud API. Monitor quality in production, collect feedback.
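The data-validation and hold-out steps of this pipeline can be sketched with the standard library alone. The helper names (`dedupe_and_filter`, `split_holdout`), the 20-character minimum response length, and the 10% hold-out share are illustrative assumptions; real quality filters are usually task-specific.

```python
import hashlib
import json
import random


def dedupe_and_filter(examples, min_response_chars=20):
    """Drop exact duplicates and examples whose assistant reply is missing or too short."""
    seen, kept = set(), []
    for ex in examples:
        last = ex["messages"][-1]
        if last["role"] != "assistant" or len(last["content"].strip()) < min_response_chars:
            continue
        key = hashlib.sha256(
            json.dumps(ex["messages"], sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def split_holdout(examples, holdout_fraction=0.1, seed=0):
    """Shuffle once, then reserve a fraction that is never used for training."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def make_example(question, answer):
    # Hypothetical helper producing the chat format used throughout this chapter.
    return {"messages": [{"role": "user", "content": question},
                         {"role": "assistant", "content": answer}]}


raw = [make_example(f"Question {i}", f"A sufficiently long answer number {i}.") for i in range(10)]
raw.append(make_example("Question 0", "A sufficiently long answer number 0."))  # duplicate
raw.append(make_example("Question 99", ""))                                     # empty reply

clean = dedupe_and_filter(raw)
train, holdout = split_holdout(clean)
print(len(raw), len(clean), len(train), len(holdout))
```

Splitting after deduplication matters: if duplicates survive, the same example can land in both the training and hold-out sets and inflate the evaluation.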

🎓 Domain 5 Complete — NLP & Large Language Models

Ch 5.1: NLP = four ambiguity layers: lexical, syntactic, semantic, pragmatic. Classical preprocessing (stopwords, stemming) is NOT used with neural models.
Ch 5.2: BPE tokenisation: iteratively merge most frequent pairs. GPT-4 uses 100K-vocab BPE (~¾ word per token).
Ch 5.3: Word2Vec: learns embeddings by predicting a word from its context (CBOW) or the context from a word (skip-gram). "king − man + woman ≈ queen": geometry encodes meaning.
Ch 5.4: Contextual embeddings: same word, different vector per context. Pre-train then fine-tune = the modern NLP paradigm.
Ch 5.5: GPT = decoder-only, causal attention, autoregressive. Scaling laws: loss ∝ N^-α. Chinchilla: for a fixed compute budget, scale parameters and training tokens in roughly equal proportion (about 20 tokens per parameter).
Ch 5.6: BERT = encoder-only, bidirectional attention, MLM pre-training. Use for understanding; GPT for generation.
Ch 5.7: Few-shot ICL: examples in prompt adapt behaviour. Chain-of-thought: "think step by step" dramatically improves reasoning.
Ch 5.8: RAG: Chunk → Embed → Vector DB → Retrieve → Generate. Addresses knowledge cutoff, reduces hallucination on specifics, and gives access to private data.
Ch 5.9: Hallucination = LLMs generate confident falsehoods — trained for fluency not facts. HHH: Helpful, Harmless, Honest.
Ch 5.10: Fine-tune when prompt+RAG isn't enough. QLoRA fine-tunes 70B on a single GPU. DPO replaces RLHF's complexity.

🚀 Go Deeper — Fine-Tuning LLMs

Most applications today avoid fine-tuning and instead use prompting or RAG — faster, cheaper, and no training infrastructure needed.

Fine-tuning becomes important when:

Strict behaviour control is needed — consistent output format, tone, or safety guardrails
Domain-specific patterns must be learned — legal contracts, medical notes, proprietary code styles

→ Covered in depth: Fine-Tuning LLMs (Advanced)

Domain 5 is where theory meets the frontier. The GPT family and BERT established the modern NLP paradigm that all of AI now follows. Prompt engineering, RAG, and fine-tuning are the three tools every AI practitioner uses daily. Domain 8 (Agentic AI) will show how LLMs with tools become autonomous agents. Domain 9 (AI Ethics) will address the alignment and hallucination challenges at scale.