AI Foundation · Domain 05 · Chapter 5.1

NLP Fundamentals

How computers learn to read — text processing, linguistic structure, and the pipeline from raw text to model input

5.1
Chapter 5.1
NLP Fundamentals & Text Preprocessing

Natural language is the most information-dense medium humans have ever created. Every sentence carries meaning at multiple levels simultaneously — lexical, syntactic, semantic, pragmatic. Teaching a machine to process all of these layers reliably is the central challenge of NLP.

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, process, and generate human language. This sounds straightforward until you confront what language actually is: an ambiguous, context-dependent, culturally-loaded communication system that humans have spent millions of years evolving. No sentence carries a single, unambiguous meaning independent of context. A machine must learn to handle every level of this ambiguity simultaneously.

Language ambiguity operates at four distinct levels, each building on the one below. Lexical ambiguity arises when a single word has multiple meanings — "bank" can mean a financial institution or the edge of a river. Syntactic ambiguity occurs when sentence structure is unclear — "I saw the man with the telescope" leaves open whether the speaker or the man possesses the telescope. Semantic ambiguity involves phrases whose meaning is unclear even with structure resolved — "Can you pass the salt?" is literally a question about capability but functions as a request. Pragmatic ambiguity is the deepest: "It's cold in here" is an observation that functions as a request to close the window, but only if you already know the conversational norms.

The history of NLP traces a clear arc: Rule-based systems (1950s–1980s) used hand-crafted grammars and dictionaries — brittle and language-specific. Statistical NLP (1990s–2000s) replaced rules with probabilities learned from corpora. Neural NLP (2013–2017) used word embeddings (Word2Vec, GloVe) and RNNs to learn representations directly from data. Transformer-era (2018–present) introduced BERT, GPT, and their successors — models that learn language representations of staggering generality from massive corpora, making almost all previous approaches obsolete.

🔤

NLP Understanding Tasks

  • Text classification
  • Named entity recognition
  • Sentiment analysis
  • Question answering
  • Natural language inference
  • Coreference resolution
✍️

NLP Generation Tasks

  • Machine translation
  • Text summarisation
  • Dialogue / chatbots
  • Text completion
  • Code generation
  • Data-to-text narration
Four Levels of Language Ambiguity — each layer builds on the one below
LEXICAL word meanings, synonyms, polysemy SYNTACTIC grammar structure, parse trees SEMANTIC meaning of phrases and sentences PRAGMATIC meaning in context and intent "bank" — financial or river? "man with telescope" — who? "Can you pass the salt?" "It's cold" → close window Each layer depends on correctly resolving all layers below it

Before any model can process text, it must be transformed from raw characters into a form the model understands. Classical NLP pipelines involve a series of hand-engineered preprocessing steps, each reducing noise and normalising vocabulary. We trace each step using: "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running."

Step 1 — Lowercasing. Convert all characters to lowercase. "The" and "the" are the same word — keeping both wastes vocabulary slots. This alone can reduce vocabulary size by 10–30% for English text.

Step 2 — Punctuation & special character removal. Strip characters that carry no lexical meaning for bag-of-words models. Important caveat: not always appropriate — punctuation carries meaning in some contexts (U.S.A, 3.14, emoticons, code). Remove selectively based on the task.

Step 3 — Tokenisation. Split text into meaningful units (tokens). The naïve approach is whitespace splitting. Better approaches handle contractions ("they've" → ["they", "'ve"]) and punctuation. Chapter 5.2 covers subword tokenisation for neural models in depth.

Step 4 — Stopword removal. Remove high-frequency words ("the", "is", "a") that carry little semantic weight in bag-of-words models. Critical warning: never remove stopwords for neural models or sequence tasks — position and function words are often critical to meaning ("not" changes everything).

Step 5 — Stemming. Reduce words to root form by stripping suffixes using heuristic rules. Porter Stemmer: "jumping" → "jump", "foxes" → "fox". Fast but imprecise — "university" → "univers". Two words with the same stem may not share meaning.

Step 6 — Lemmatisation. Morphologically reduce words to their dictionary form (lemma) using linguistic knowledge. "better" → "good", "ran" → "run", "foxes" → "fox". More accurate than stemming but requires WordNet. "Saw" → "see" (verb) or "saw" (noun) depending on POS tag.

Step 7 — Text normalisation. Expand contractions ("they've" → "they have"), normalise Unicode, standardise numbers, handle abbreviations and acronyms.

⚡ Modern NLP Note

Neural models and LLMs do NOT use most of these preprocessing steps. They process raw subword tokens (Chapter 5.2) directly from near-original text. These classical steps are for bag-of-words models, TF-IDF search engines, and traditional ML feature engineering. If you are building anything with BERT, GPT, or similar — skip everything except basic Unicode normalisation.

import re import nltk from nltk.tokenize import word_tokenize from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer text = "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running." # Step 1: lowercase text_lower = text.lower() # → "the quick brown foxes are jumping over lazy dogs! they've been running." # Step 2: remove punctuation text_clean = re.sub(r"[^\w\s]", "", text_lower) # Step 3: tokenise tokens = word_tokenize(text_clean) # → ['the', 'quick', 'brown', 'foxes', 'are', 'jumping', ...] # Step 4: remove stopwords (classical NLP only — NOT for LLMs) stop_words = set(stopwords.words('english')) tokens = [t for t in tokens if t not in stop_words] # → ['quick', 'brown', 'foxes', 'jumping', 'lazy', 'dogs', 'running'] # Step 5/6: lemmatise lemmatizer = WordNetLemmatizer() tokens = [lemmatizer.lemmatize(t, pos='v') for t in tokens] # "foxes" → "fox", "jumping" → "jump", "running" → "run" # Final: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'run'] # For HuggingFace / LLM pipeline — just pass raw text: from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") tokens_bert = tokenizer(text, return_tensors="pt") # BERT handles subword tokenisation internally — no preprocessing needed
Classical NLP Preprocessing Pipeline — NOT used with modern neural models
Raw Text The Quick Brown... Lowercase the quick brown... Clean no punct the quick Tokenise ['the', 'quick'...] Remove Stopwords ['quick', 'brown'...] Lemmatise ['quick', 'fox','jump'] Feature Vector [0,1,1,0,1...] ⚠ This entire pipeline is skipped for neural models (BERT, GPT) — they use raw subword tokens Used in: TF-IDF search, spam filters, bag-of-words classifiers, traditional ML feature engineering

Even if you never build a classical NLP pipeline, understanding key linguistic concepts will help you reason about what language models are learning, diagnose failure modes, and work effectively with hybrid systems.

Part-of-Speech (POS) Tagging. Label each word with its grammatical role: Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), Determiner (DT), Preposition (IN). "The cat sat on the mat" → [DT, NN, VBD, IN, DT, NN]. POS tags disambiguate words with multiple roles — "run" is a verb in "I run" but a noun in "a home run". Used in: NER, information extraction, classical feature engineering.

Named Entity Recognition (NER). Identify and classify spans of text as named entities: PERSON, ORG, GPE (geopolitical entity), DATE, MONEY. "Apple's Tim Cook announced on Monday that sales exceeded $100B" → [Apple: ORG], [Tim Cook: PERSON], [Monday: DATE], [$100B: MONEY]. Modern BERT-based NER achieves near-human F1 scores on standard benchmarks.

Dependency Parsing. Identify grammatical relationships between words — subject, object, modifier, etc. "The cat chased the mouse" → "cat" is the nominal subject (nsubj) of "chased"; "mouse" is the direct object (dobj). Dependency parses are directed graphs enabling extraction of who did what to whom.

Coreference Resolution. Determine which mentions refer to the same entity. "When Mary arrived, she said she was tired" — both "she" instances refer to Mary. Crucial for coherent understanding across sentences. SpanBERT achieves state-of-the-art by jointly scoring mention pairs.

NER and POS Tagging — identifying entities and grammatical roles in a sentence
Apple 's CEO Tim Cook announced profits on Tuesday in Cupertino NNP POS NNP NNP NNP VBD NNS IN NNP IN NNP ORG PERSON DATE GPE NER LEGEND ORG PERSON DATE GPE (Location) "Apple's CEO Tim Cook announced profits on Tuesday in Cupertino."

Before neural networks dominated NLP, three core representation methods powered almost every text application. They remain useful for lightweight tasks, interpretable systems, and benchmarking. Understanding them explains why neural embeddings were such a dramatic improvement.

Bag of Words (BoW) represents a document as a vector of word counts, completely discarding word order. "The cat sat" and "sat cat the" produce identical BoW vectors. Works well for document classification and spam filtering because topic is often determined by which words appear, not their order. Critical weakness: "not good" and "good" look nearly identical.

TF-IDF improves on raw counts by weighting words by how rare they are across the corpus. A word that appears frequently in one document but rarely elsewhere is a likely topic word. "The" appears in every document — TF-IDF assigns it near-zero weight. "Convolutional" in an AI paper is rare and gets high weight. Still discards order, but much more informative than raw counts.

n-grams partially recover word order by treating sequences of n adjacent words as features. Bigrams of "the cat sat": ["the cat", "cat sat"]. Captures local context but explodes vocabulary size exponentially with n. Word2Vec and subsequent neural embeddings made n-gram language models obsolete for most applications.

TF-IDF Formula TF-IDF(t, d) = TF(t, d) × IDF(t) TF(t, d) = count(t in d) / total_words(d) IDF(t) = log(N / df(t)) N = total documents in corpus  ·  df(t) = documents containing term t High TF-IDF: term appears frequently in this doc but rarely corpus-wide → key topic term
Method Captures Order Vector Size Sparse? Semantic Meaning Best For
Bag of Words No V (vocab size) Yes — mostly 0s No Doc classification, spam
TF-IDF No V (vocab size) Yes No Search, document similarity
n-grams Local only Vn (explodes!) Very sparse No Language models (pre-neural)
Word2Vec No (fixed window) d (e.g. 300) No — dense Yes Semantic similarity, analogies

NLP encompasses a wide range of tasks grouped by the type of output they produce. Understanding this taxonomy helps you choose the right architecture (encoder-only, decoder-only, encoder-decoder), the right loss function, and the right evaluation metric for any given problem.

NLP Task Taxonomy — from classification to generation
NLP Tasks All NLP problems fit one of ↓ Classification Sentiment · NLI Spam · Topic BERT, RoBERTa Extraction NER · Relations Events · Coref BERT-CRF Generation Translation · Summary Dialogue · Stories GPT, T5, BART Structured Prediction POS · Parsing BERT-CRF Question Answering Extractive · Open GPT-4, Claude Metric: Accuracy, F1 Input → Label "great film" → POS Metric: F1 per entity Token → Label span → [PERSON] Metric: BLEU, ROUGE Seq → Seq document → summary Metric: LAS, UAS Token → Structure word → [nsubj] Metric: EM, F1 Context+Q → Answer SQuAD, TriviaQA Architecture: Encoder-only (BERT) for understanding · Decoder-only (GPT) for generation · Encoder-Decoder (T5) for seq2seq
Task Input Output Key Metric Modern Model
Sentiment Analysis Review text Positive / Negative / Neutral Accuracy, F1 BERT, RoBERTa
Named Entity Recognition Sentence Token-level labels (B-I-O) F1 per entity type BERT-CRF, SpanBERT
Machine Translation Source language text Target language text BLEU score T5, NLLB-200, GPT-4
Summarisation Long document Short summary ROUGE score BART, Pegasus, GPT-4
Question Answering Context + question Answer span / free text Exact Match, F1 GPT-4, Claude, Llama

∑ Chapter 5.1 Summary — NLP Fundamentals & Text Preprocessing

  • Language has four layers of ambiguity: lexical, syntactic, semantic, pragmatic — all must be handled, each building on the one below
  • NLP history: rule-based (1960s) → statistical (1990s) → neural embeddings (2013) → Transformer LLMs dominant from 2018
  • Classical preprocessing: lowercase → tokenise → remove stopwords → lemmatise — NOT used with neural models (BERT, GPT)
  • Stemming is fast but imprecise (heuristic suffix rules); lemmatisation is accurate but requires WordNet + POS context
  • BoW and TF-IDF: sparse, order-independent representations — still useful for search, lightweight classification, interpretable systems
  • Key linguistic annotations: POS tagging, NER, dependency parsing, coreference resolution — used in classical and hybrid NLP pipelines
  • NLP splits into: understanding tasks (classification, extraction) and generation tasks — different architectures, loss functions, and metrics
5.2
Chapter 5.2
Tokenisation — Words to Subwords

Tokenisation is the invisible foundation of every language model. Before a single parameter is trained, the tokeniser decides how text will be represented as integers — and that decision shapes what patterns the model can learn, how efficiently it processes different languages, and how much it costs to run at inference time.

Neural language models operate on numbers, not text. Every word, character, or subword must be mapped to an integer ID from a fixed vocabulary before it can be fed into the model. Tokenisation is this mapping — it converts a raw string into a sequence of integers, each representing a "token" from the vocabulary. The choice of what constitutes a token has profound consequences for the model's capabilities.

The vocabulary dilemma has three corners. Word-level tokenisation uses whole words as tokens — intuitive, but English has 170,000+ words and with proper nouns, compounds, and morphological variants, the vocabulary explodes into millions. Words not seen during training become [UNK] (unknown) — the out-of-vocabulary problem. Character-level tokenisation uses individual characters — tiny vocabulary of ~128 ASCII characters, but sequences become very long. "Hello world" is 11 characters; a document of 1,000 words becomes ~6,000 characters. Attention's O(n²) complexity makes this expensive. Subword tokenisation is the sweet spot: split common words into whole tokens, rare or unknown words into subword pieces. "unhappiness" → ["un", "##happy", "##ness"] — known pieces, no OOV, reasonable sequence length.

Modern LLMs universally use subword tokenisation with vocabularies of 32,000–100,000 tokens. GPT-4 uses 100,277 tokens; LLaMA-3 uses 128,256. The vocabulary is fixed at training time and cannot be changed without retraining the model — making the tokeniser one of the most consequential design decisions in LLM development.

Tokenisation Granularity — Character vs Subword vs Word
Character ★ Subword Word H e l l o ... un ##happy ##ness unhappiness Vocab: ~128 Seq: very long OOV: none ✓ Vocab: 32K–100K Seq: balanced OOV: none ✓ Vocab: 1M+ Seq: short OOV: yes ✗ Tradeoff: smaller vocab → longer sequences & more computation · larger vocab → OOV risk

Understanding why pure word and character tokenisers were abandoned helps clarify the design goals of modern subword tokenisers. Both extremes have fundamental problems that subword methods resolve.

Word tokenisation splits on whitespace and punctuation. Problems pile up quickly. Contractions: "don't" — is that one token or two ("do", "n't")? Hyphenated compounds: "state-of-the-art" — one or four? Morphological variants: "run", "running", "ran", "runs" require four separate vocabulary entries, even though they share meaning. Proper nouns, technical terms, and misspellings not seen during training become [UNK] — the model sees a blank where information should be. The English vocabulary alone exceeds 170,000 words; with all languages and domains, a truly universal word vocabulary would require millions of entries.

Character tokenisation has no OOV problem — the alphabet is fixed. But it fragments language into meaningless units from the model's perspective. The word "hello" becomes 5 separate tokens [h][e][l][l][o]. The model must learn from scratch that these 5 tokens together form a word unit — it cannot start with the useful prior that words are meaningful. More critically, sequence length explodes. A 1,000-word essay becomes ~6,000 character tokens. Transformer attention is O(n²) in sequence length — doubling the sequence length quadruples the compute cost. In practice, character-level models were impractical at scale.

Word Tokenisation
Character Tokenisation
"unhappiness" → [unhappiness] — 1 token
"unhappiness" → [u][n][h][a][p][p][i][n][e][s][s] — 11 tokens
OOV: "unhappinesses" → [UNK] — information lost
OOV: zero — any string is representable
Vocabulary: 100K–1M+ words needed
Vocabulary: ~128 ASCII or 256 bytes
Sequence: short and efficient
Sequence: very long — O(n²) attention cost
Morphology lost: run/running/ran = 3 separate IDs
Word structure lost: model must learn groupings from scratch
⚡ Byte-Level Note

Byte-level tokenisation tokenises raw UTF-8 bytes (0–255) rather than characters. Every document is representable — there is no OOV at the byte level. GPT-2 used byte-level BPE: start from 256 byte tokens, then apply BPE merges. This handles multilingual text naturally and is fully language-agnostic. GPT-4's tiktoken also uses byte-level BPE.

Byte-Pair Encoding was introduced for NLP by Sennrich et al. (2016) as a data compression algorithm adapted for subword vocabulary construction. It is used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Falcon, and most modern decoder-based language models. The key insight is elegant: let the training data decide what the vocabulary tokens should be, by iteratively merging the most frequently co-occurring pairs.

The BPE algorithm is a simple loop. Initialise with a character vocabulary (or byte vocabulary for byte-level BPE). Count all adjacent token pairs across the entire training corpus. Merge the most frequent pair into a new single token and add it to the vocabulary. Repeat until the target vocabulary size is reached. The result: common words like "the", "and", "is" become single tokens; common morphological patterns like "-ing", "-tion", "un-" become tokens; rare words are split into recognisable subword pieces.

BPE Worked Example — Small Corpus
Corpus: "low"×2, "lower", "newer", "wider", "new"×2
Start: characters → [l][o][w] / [l][o][w][e][r] / [n][e][w][e][r] / ...
  Count pairs: (e,r)=3, (l,o)=3, (o,w)=3, (n,e)=3, ...
Merge 1: (e,r) → "er"   New vocab: {... er}
  [l][o][w] / [l][o][w][er] / [n][e][w][er] / [w][i][d][er] / [n][e][w]
Merge 2: (l,o) → "lo"   New vocab: {... er, lo}
  [lo][w] / [lo][w][er] / [n][e][w][er] / [w][i][d][er] / [n][e][w]
Merge 3: (lo,w) → "low"  New vocab: {... er, lo, low}
  [low] / [low][er] / [n][e][w][er] → "lower" = ["low","er"]
Final: "lower" → ["low","er"]  |  "newer" → ["ne","w","er"]  |  "new" → ["n","e","w"]
BPE Algorithm — iterative merging of most frequent token pairs
STEP "low" "lower" VOCABULARY ADDED Start l o w l o w e r most freq×3 l, o, w, e, r, ... Merge 1 (e,r)→er l o w l o w er next: (l,o)×3 + er Merge 2 (l,o)→lo lo w lo w er next: (lo,w)×3 + lo Merge 3 (lo,w)→low low low er + low ← "lower" = low+er ✓ VOCABULARY GROWTH Base: l, o, w, e, r, n, i, d, ... +er → covers: lower, newer, wider +lo → covers: low, lower +low → "lower" = 2 tokens! Continue until vocab target (e.g. 50K) BPE never discards character tokens — unknown words always decomposable to characters or bytes
import tiktoken # GPT-4 uses cl100k_base tokeniser (100,277-token BPE vocabulary) enc = tiktoken.get_encoding("cl100k_base") examples = [ "Hello, world!", "unhappiness", "ChatGPT", "supercalifragilisticexpialidocious", "1+1=2", "def fibonacci(n):", ] for text in examples: tokens = enc.encode(text) decoded = [enc.decode([t]) for t in tokens] print(f"{text!r:45s} → {len(tokens)} tokens: {decoded}") # Sample output: # 'Hello, world!' → 3 tokens: ['Hello', ',', ' world!'] # 'unhappiness' → 3 tokens: ['un', 'happiness', ''] # 'supercalifragilisticexpialidocious' → 11 tokens (splits into subwords) # 'def fibonacci(n):' → 5 tokens: ['def', ' fib', 'on', 'acci', '(n):'] # Round-trip test: encode then decode should return original text text = "The quick brown fox jumps over the lazy dog." assert enc.decode(enc.encode(text)) == text # always true for BPE

WordPiece is the tokenisation algorithm used by BERT, DistilBERT, ALBERT, and multilingual BERT (mBERT). It is mechanically similar to BPE but uses a different merge criterion: rather than merging the most frequent pair, it merges the pair that most increases the likelihood of the corpus under a language model. In practice, the merge score is: score(A, B) = freq(A+B) / (freq(A) × freq(B)). This prefers pairs where the joint occurrence is disproportionately high relative to how often each appears alone — capturing meaningful linguistic units rather than just common bigrams.

WordPiece uses a distinctive notation: a ## prefix marks a continuation subword — a piece that is attached to the preceding token rather than starting a new word. "playing" → ["play", "##ing"]. The "play" token has no prefix (it starts a word); "##ing" is always a suffix, never a standalone word. This makes the tokenisation reversible and unambiguous: joining tokens without spaces and stripping ## gives back the original word.

The standard BERT vocabulary contains 30,522 tokens. Unknown characters that cannot be represented by any combination of vocabulary tokens are mapped to [UNK] — rare for English but can occur with unusual Unicode characters. The vocabulary also includes special tokens: [CLS] (classification, prepended to all inputs), [SEP] (separator, marks sentence boundaries), [MASK] (for masked language modelling), and [PAD] (padding to fixed length).

WordPiece Tokenisation — ## marks continuation of a word
playing play ##ing unhappiness un ##happy ##ness tokenization token ##ization ChatGPT Chat ##G ##PT antidisestablish- mentarianism anti ##dis ##establish ##ment ##arian ##ism Word start (no ##) ## continuation Decode: strip ## and join → "play" + "ing" = "playing" · BERT vocab: 30,522 tokens

SentencePiece (Kudo & Richardson, 2018) is a language-agnostic tokenisation framework used by T5, LLaMA-1/2/3, Mistral, Gemma, XLNet, and others. Its key architectural difference from BPE and WordPiece is that it operates on raw text including whitespace, without any pre-tokenisation step. BPE and WordPiece typically split on whitespace first (giving the language-specific assumption that spaces separate words), then apply subword segmentation within each word. SentencePiece treats spaces as regular characters — the text "Hello world" is tokenised as a single stream of characters including the space character.

To make tokenisation reversible, SentencePiece uses a special ▁ (U+2581, lower one-eighth block) character to mark word boundaries. The space before a word is encoded as ▁: "Hello world" → ["▁Hello", "▁world"]. Decoding is trivial: replace ▁ with a space, concatenate. This approach means the same tokeniser works identically for languages with no spaces (Japanese, Chinese) and languages with regular spacing (English, French) — making it the preferred choice for multilingual models.

SentencePiece supports two underlying algorithms. BPE mode is the same bottom-up merge algorithm as before. Unigram Language Model mode takes the opposite approach: start with a large vocabulary (e.g. all substrings up to length 16), then iteratively remove tokens whose removal least decreases corpus likelihood, until the target size is reached. Unigram produces multiple possible segmentations of a word and assigns probabilities to them — during training, samples are drawn from the distribution, providing a natural form of tokenisation regularisation.

Tokeniser Algorithm Vocab Marker Vocab Size Used By Language Agnostic
BPE (byte-level) Bottom-up merge (byte pairs) None 50K / 100K GPT-2, GPT-3, GPT-4, LLaMA Yes (bytes)
WordPiece Likelihood-based merge ## (continuation) 30K BERT, DistilBERT, ALBERT Partial
SentencePiece BPE Bottom-up, raw text ▁ (word start) 32K–128K LLaMA-2/3, T5, Gemma Yes
SentencePiece Unigram Top-down pruning ▁ (word start) 32K mBERT, XLNet, T5 Yes
tiktoken BPE on bytes None 100K (cl100k) GPT-4, GPT-4o, Codex Yes

OpenAI's tiktoken is a fast BPE tokeniser library used by all GPT models. It supports three encodings: r50k_base (GPT-2/3, 50,257 tokens), p50k_base (Codex), and cl100k_base (GPT-4/GPT-4o, 100,277 tokens). The cl100k vocabulary was specifically designed to handle code and multilingual text more efficiently — common programming patterns like function definitions and import statements are often single tokens.

The rough practical rule is ¾ of a word per token for English text: 1,000 tokens ≈ 750 English words ≈ 4–5 average paragraphs. This ratio degrades significantly for non-English text. Chinese and Japanese characters are typically 1–4 tokens per character (since each character is a complex glyph encoded as multiple UTF-8 bytes). Arabic script runs 2–3 tokens per word. Code is token-efficient for English keywords but indentation and special characters add tokens. Understanding these ratios is essential for prompt engineering and cost estimation at scale.

import tiktoken enc = tiktoken.get_encoding("cl100k_base") # GPT-4 tokeniser def count_tokens(text: str) -> int: return len(enc.encode(text)) # Token quirks every practitioner should know tests = [ ("hello world", "lowercase"), ("Hello World", "capitalised — same count but different IDs"), (" hello world", "leading space can differ"), ("1234567890", "numbers split by digit groups"), ("你好世界", "Chinese: ~2-4 tokens PER CHARACTER"), (" def foo():", "indented code — spaces are tokens!"), ] for text, note in tests: toks = enc.encode(text) print(f"{text!r:30s} → {len(toks):3d} tokens # {note}") # Cost estimation (approximate, prices change frequently) def estimate_cost(input_tokens: int, output_tokens: int, model="gpt-4o"): rates = {"gpt-4o": (0.005, 0.015)} # ($/1K input, $/1K output) r_in, r_out = rates.get(model, (0.01, 0.03)) return input_tokens / 1000 * r_in + output_tokens / 1000 * r_out print(f"1K in + 500 out: ${estimate_cost(1000, 500):.4f}") # ~$0.0125

The context window is the maximum number of tokens a model can process in a single forward pass — both input prompt and generated output count against this limit. Exceeding the context window truncates input, silently losing information. Context window sizes have grown dramatically: from GPT-3's 2,048 tokens (2020) to GPT-4o's 128,000 (2024), Claude 3.5's 200,000, and Gemini 1.5 Pro's 1,000,000. However, larger context windows don't mean models use all context equally well — empirically, LLMs attend more strongly to the beginning and end of long contexts ("lost in the middle" effect).

Token counting matters for three practical reasons. Cost: commercial LLM APIs price per token — $5–$30 per million tokens for frontier models, multiplied by millions of API calls adds up. Context management: overflow silently truncates your prompt — a bug that is easy to miss and hard to debug. Latency: generation cost is proportional to output tokens; every unnecessary token in the response costs inference time and money. Practitioners routinely count tokens in prompt templates, conversation histories, and retrieved documents before sending API requests.

Context Window Sizes — from 8K tokens to 1M tokens (log scale)
Context window (tokens) — logarithmic scale: 8K → 1M LLaMA 3 70B  8,192 tokens (~6 pages) GPT-3.5  16,384 tokens (~12 pages) GPT-4o  128K (~100 pages) Claude 3.5  200K (~150 pages) Gemini 1.5 Pro 1M 1,000,000 tokens (~750 pages / ~1 full novel)
⚠ Common Pitfalls — Tokenisation in Production

1. Leading spaces change token IDs. " hello" and "hello" produce different token IDs in tiktoken — relevant for prompt formatting. 2. Numbers split unexpectedly. "GPT-4" → ["G", "PT", "-", "4"]; phone numbers, dates, and prices consume far more tokens than you'd expect. 3. Non-English is expensive. A Chinese prompt of 100 characters may cost 200–400 tokens — 2–4× more than the equivalent English text. 4. Markdown inflates count. Headers, bold markers, code fences, and bullet points all consume tokens. Strip unnecessary formatting from retrieved context before sending. 5. Chat format overhead. OpenAI's chat completions API adds ~4 tokens per message for role/structure overhead — relevant for high-frequency fine-grained API calls.

∑ Chapter 5.2 Summary — Tokenisation

  • Tokenisation maps raw text to integer IDs from a fixed vocabulary (32K–100K tokens) before it can enter a neural model
  • Subword is the sweet spot: no OOV, balanced sequence length, handles morphology — all modern LLMs use it
  • BPE: iteratively merge the most frequent adjacent token pair — used by GPT-2, GPT-3, GPT-4, LLaMA; byte-level BPE is fully language-agnostic
  • WordPiece: ## marks continuations; likelihood-based merges; vocabulary = 30K — used by BERT, DistilBERT, mBERT
  • SentencePiece: operates on raw text including spaces; ▁ marks word starts; supports BPE and Unigram LM — used by T5, LLaMA, Gemma
  • Practical rule: ~¾ word per token for English; non-English and numbers use 2–4× more tokens — critical for cost and context management
5.3
Chapter 5.3
Word Embeddings — Meaning as Geometry

Word embeddings did not just improve NLP performance — they changed how we think about language. When Mikolov et al. showed in 2013 that "king − man + woman ≈ queen" held in a 300-dimensional vector space, it suggested that semantic relationships could be captured as geometric transformations. This was the first evidence that neural representations were not just feature maps — they were encoding structured knowledge about the world.

The theoretical foundation of all word embedding methods is the Distributional Hypothesis, stated by linguist J.R. Firth in 1957: "You shall know a word by the company it keeps." The idea is deceptively simple: words that appear in similar linguistic contexts tend to have similar meanings. "Dog" and "cat" both appear near words like "pet", "feed", "vet", "fur", "owner", "breed" — and this co-occurrence pattern reflects their shared semantic category. "Dog" and "quantum" do not share contexts, and they do not share meaning.

This hypothesis transforms the problem of meaning into a problem of statistics. Instead of defining what "happy" means philosophically, we can simply observe that "happy" appears with "smile", "joy", "content", "pleased", "glad" — and "sad" appears with "cry", "grief", "unhappy", "depressed" — and that these two distributional profiles are measurably different. The distributional hypothesis gives us a way to measure semantic similarity without any human annotation: compute the similarity between two words' context distributions.

Every word embedding method — Word2Vec, GloVe, FastText, and even the contextual embeddings of BERT — is an implementation of this hypothesis. They differ in how they model context (local window vs global matrix, character-level vs word-level, static vs contextual), but they all share the core insight: context distribution = meaning.

📖

The Key Insight

You do not need to define what words mean. Observe where they appear, and the geometry of the embedding space will capture the rest. No hand-crafted ontologies, no linguistic rules — just patterns in text.

🔍

Context Window

Most embedding methods use a fixed window of ±k surrounding words as "context". Window size k=5 means the 5 words before and after each target word. Larger k → more topical similarity. Smaller k → more syntactic similarity.

Mikolov et al. (Google Brain, 2013) introduced Word2Vec — a family of shallow neural networks that learn word representations by predicting word context. The key insight was framing representation learning as a self-supervised prediction task: given a word, predict its surrounding words. No labels are needed — the text itself provides the training signal. Train on enough text (Google News, 100 billion words) and the resulting vectors encode semantic structure as geometry.

The architecture is deliberately simple: a single-layer neural network with no non-linearity in the hidden layer. The input is a one-hot vector of vocabulary size V. The single hidden layer projects this to a dense vector of dimension d (typically 300). The output layer projects back to V dimensions and applies softmax to produce a probability distribution over the vocabulary. The weight matrix of the hidden layer — shape V × d — is the embedding matrix. After training, each row is the embedding vector for one word.

Word2Vec uses two architectural variants: Skip-gram and CBOW (Continuous Bag of Words). Skip-gram predicts context words from a centre word and works better on small datasets and rare words. CBOW predicts the centre word from its context and trains faster on large corpora. Both are trained with a practical approximation — negative sampling — rather than full softmax over the entire vocabulary (computing softmax over 50,000+ words every step is prohibitively expensive).

With negative sampling, the objective becomes: for each training pair (centre, context), maximise the probability of the true pair while minimising the probability of k randomly sampled negative pairs. This reduces the per-step computation from O(V) to O(k), where k is typically 5–20. The result is a practical algorithm that can be trained on billions of words in hours on a single machine.

Given the sentence "The quick brown fox jumps over the lazy dog" with window size 2: Skip-gram takes the centre word "brown" and tries to predict each context word — ("brown", "quick"), ("brown", "The"), ("brown", "fox"), ("brown", "jumps"). One training pair for each context word in the window. CBOW takes all context words ["The", "quick", "fox", "jumps"] and averages their embeddings, then tries to predict the centre word "brown". CBOW is faster (averages context, one prediction per window); Skip-gram trains on more pairs and handles rare words better.

Word2Vec: Skip-gram and CBOW — two ways to learn from context
SKIP-GRAM centre word → predict context brown centre word embed quick The fox jumps king dog Learns: which words share contexts with "brown" → dense vector captures this CBOW context words → predict centre The quick fox jumps avg embed brown predicted Context vectors averaged → predicts "brown" Faster training; Skip-gram better for rare words Both train by self-supervision: no labels needed — the text itself is the signal
import gensim.downloader as api # Load pre-trained Word2Vec (Google News, 300d, 3M vocabulary) wv = api.load("word2vec-google-news-300") # Cosine similarity between word vectors print(wv.similarity("dog", "cat")) # → 0.76 (semantically close) print(wv.similarity("king", "queen")) # → 0.73 print(wv.similarity("apple", "motorcycle")) # → 0.04 (unrelated) print(wv.similarity("hot", "cold")) # → 0.36 — antonyms share context! # Most similar words print(wv.most_similar("python", topn=5)) # → [('ruby', 0.78), ('java', 0.76), ('perl', 0.74), ('php', 0.73), ...] # The famous analogy: king − man + woman = ? result = wv.most_similar( positive=["king", "woman"], negative=["man"], topn=3) print(result) # → [('queen', 0.71), ('princess', 0.65), ('monarch', 0.62)] # Train your own Word2Vec on custom corpus from gensim.models import Word2Vec sentences = [ ["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat", "by", "the", "fire"], ] model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10) cat_vec = model.wv["cat"] # shape: (100,)

Pennington, Socher, and Manning (Stanford NLP, 2014) pointed out a conceptual limitation of Word2Vec: it trains on individual context windows, one at a time, effectively ignoring the global statistics of word co-occurrence across the entire corpus. If "ice" and "steam" both co-occur with "water" but "ice" co-occurs with "solid" and "steam" with "gas", this distinction should be captured — but Word2Vec only sees local windows, not the global ratio structure.

GloVe directly factorises the global word–word co-occurrence matrix X, where Xᵢⱼ is the count of how many times word j appears in the context of word i across the entire corpus. The objective is to find word vectors such that their dot product approximates the log of the co-occurrence count. The weighting function f(Xᵢⱼ) ensures that very frequent pairs (like "the–the") don't dominate the loss — pairs with Xᵢⱼ above a threshold are capped.

In practice, GloVe and Word2Vec produce embeddings of similar quality. GloVe often edges ahead on analogy tasks (because it explicitly models co-occurrence ratios); Word2Vec can be more efficient to train on very large corpora with negative sampling. Both have been largely superseded for downstream tasks by contextual embeddings (BERT, GPT), but GloVe remains popular as a lightweight baseline and for interpretability research.

GloVe Objective J = ∑ᵢⱼ f(Xᵢⱼ) (wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)² Xᵢⱼ = co-occurrence count of word i and word j across entire corpus wᵢ, w̃ⱼ = word vectors (word i and context word j) f(Xᵢⱼ) = weighting function that caps influence of very frequent co-occurrences Goal: dot product of word vectors ≈ log co-occurrence count → geometry encodes statistics

Bojanowski et al. (Facebook AI Research, 2016) identified a critical gap in both Word2Vec and GloVe: they treat words as atomic units. "Run", "running", "runner", "runs" each get their own independent vector — the morphological relationship between them is invisible to the model. For languages with rich morphology (Finnish, Turkish, German, Arabic), this is devastating: a model may never see the exact form "Freundschaftsbezeigungen" (German for "demonstrations of friendship") but it shares meaningful subwords with common words.

FastText represents each word as a bag of character n-grams. For "where" with n=3: the word is decomposed as ["<wh", "whe", "her", "ere", "re>", "<where>"] (with boundary markers < and >). The final word vector is the sum of all its n-gram vectors. Each n-gram has its own embedding — these are what get trained. The boundary markers ensure "whe" in "where" and "whe" in "elsewhere" contribute differently because they occur in different boundary contexts.

The payoff: FastText can produce meaningful vectors for words never seen during training, including misspellings, technical jargon, and morphological variants. "Antidisestablishmentarianism" will share n-grams with "establish", "establishment", "disestablish", "ism", etc., and their combined embedding will be semantically meaningful. FastText is still used in production where domain vocabulary is highly variable — scientific text, social media, multilingual pipelines.

Word2Vec / GloVe
FastText
Word is atomic unit — one vector per word
Word = sum of character n-gram vectors
OOV → [UNK] — information completely lost
No OOV — any word representable by its n-grams
Rare words get poor embeddings (few training examples)
Rare words inherit n-gram vectors from common words
Best for: high-frequency common vocabulary tasks
Best for: multilingual, medical, social media, morphological languages
Inference: single lookup O(1)
Inference: sum of n-gram vectors — slightly slower

The most celebrated discovery in word embeddings is that semantic relationships are encoded as consistent directions in vector space. The vector from "man" to "woman" is approximately the same as the vector from "king" to "queen", from "uncle" to "aunt", from "actor" to "actress". This "gender direction" is a consistent geometric transformation across the embedding space. Similarly, there is a "capital city direction" (France→Paris ≈ Germany→Berlin ≈ Japan→Tokyo), a "superlative direction" (big→biggest ≈ small→smallest), and a "past tense direction" (run→ran ≈ walk→walked).

This property emerged from training — it was not engineered in. It suggests that the distributional statistics of language contain enough signal to implicitly encode the relational structure of the world. The analogy task became a standard benchmark: given A:B::C:?, find D such that B−A+C ≈ D in vector space. Word2Vec achieves ~65% accuracy on the Google Analogy Dataset (20,000 analogies across semantic and syntactic categories) — far above what was thought possible with shallow models.

Famous Word Vector Analogies
king − man + woman ≈ queen         # gender direction
Paris − France + Germany ≈ Berlin     # capital city direction
biggest − big + small ≈ smallest      # superlative direction
running − run + walk ≈ walking        # verb tense direction
doctor − man + woman ≈ nurse          # ⚠ encodes gender bias!
Word Embedding Space — semantic relationships encoded as geometry
king − man + woman ≈ queen  |  Paris − France + Germany ≈ Berlin  |  geometry encodes semantic relations ROYALTY man woman king queen −man+woman prince princess COUNTRIES & CAPITALS france paris germany berlin japan tokyo ANIMALS dog cat puppy kitten pet VERB TENSE run running ran walk walked SENTIMENT good great joy bad awful sad pos | neg 2D projection of 300-dimensional embedding space — actual geometry is high-dimensional but these cluster separations hold

Word embeddings inherit — and amplify — the biases present in their training data. Bolukbasi et al. (2016) demonstrated that in Word2Vec trained on Google News: "doctor − nurse ≈ man − woman", "programmer − homemaker ≈ man − woman", "brilliant − dull ≈ man − woman". These gender stereotypes are encoded as geometric structure in the embedding space. When downstream models use these embeddings, the bias propagates: a résumé classifier using Word2Vec may discriminate based on field-specific vocabulary that encodes gender. Debiasing techniques exist (projecting out the gender direction) but are only partially effective — the bias is distributed across the space, not concentrated in one direction.

The most fundamental limitation of all static word embeddings is context independence: every word has a single vector regardless of usage. The word "bank" in "She went to the bank to withdraw money" and "She sat on the river bank" produce the exact same 300-dimensional vector. The model averages the two senses of "bank" into one representation — losing the information needed to distinguish them. This polysemy problem is unresolvable within the static embedding framework, no matter how large the training corpus. It is the primary motivation for contextual embeddings: BERT, GPT, and their successors assign each word occurrence a different vector based on its surrounding context (Chapter 5.4).

⚠ Critical Limitation — Static Embeddings Cannot Handle Polysemy

Word2Vec, GloVe, and FastText give every word one vector for all contexts. "I deposited money at the bank" and "I fished at the river bank" produce the same "bank" embedding — the vector is a weighted average of all senses. For tasks requiring word-sense disambiguation, coreference resolution, or semantic role labelling, static embeddings hit a ceiling that no amount of data or dimensions can overcome. This is why BERT (2018) was a watershed: it introduced position-and-context-dependent representations, effectively making static embeddings obsolete for most NLP tasks.

Method Training Objective Context OOV Typical Dim Still Used?
Word2Vec Predict context / centre Local window [UNK] 100–300 Baselines, feature eng
GloVe Factorise co-occurrence matrix Global corpus [UNK] 100–300 NLP baselines
FastText Subword n-gram sum Local window No OOV 300 Multilingual, rare vocab
BERT (contextual) Masked language model Full sentence Subword 768 Yes — encoder tasks
GPT (contextual) Causal language model Causal window Subword 768–12288 Yes — generation

∑ Chapter 5.3 Summary — Word Embeddings

  • Distributional hypothesis: words with similar contexts have similar meaning — foundation of all embedding methods (Firth, 1957)
  • Word2Vec: two architectures trained by predicting word context — Skip-gram (centre → context, better for rare words) and CBOW (context → centre, faster)
  • "king − man + woman ≈ queen" — semantic relationships are directions in geometry; emerged from training, not engineered
  • GloVe: factorises the global co-occurrence matrix — explicitly models co-occurrence ratios across the entire corpus
  • FastText: word = sum of character n-gram vectors — handles OOV, morphologically rich languages, and rare vocabulary
  • Critical limitation: static embeddings give the same vector regardless of context — "bank" is identical in "river bank" and "bank account" — solved by BERT (Chapter 5.4)
5.4
Chapter 5.4
Contextual Embeddings & Pre-trained Language Models

Static word embeddings were a revolution — but they hit a ceiling. The same vector for "bank" in every sentence is a fundamental architectural limit, not a training data problem. The field needed representations that compute word meaning dynamically based on context. ELMo, ULMFiT, and then BERT answered that need — and in doing so, established the pre-train/fine-tune paradigm that defines modern NLP.

Word2Vec, GloVe, and FastText assign each word type exactly one vector, shared across all its occurrences. This is adequate for words with a single dominant sense — "elephant" nearly always means the same thing. But English has thousands of polysemous words: "bank" (financial institution / river edge), "bat" (cricket equipment / flying mammal), "light" (not heavy / illumination / a lamp), "book" (a publication / to reserve), "well" (healthy / a water source / interjection). The Word2Vec vector for "bank" is a weighted average of all its senses — useful for neither.

The polysemy ceiling is not solvable by training on more data. No matter how large the corpus, a single vector must average all contexts. The architecture itself is the limitation: static embeddings compute representations before seeing the sentence. What is needed is a model that reads the full sentence, then assigns each word a representation based on its role in that specific sentence. This is exactly what contextual embedding models provide.

Static vs Contextual Embeddings — context disambiguates polysemous words
STATIC (Word2Vec) "She visited the river bank " "She went to the bank for a loan" EMBEDDING SPACE bank one vector Same vector — meaning averaged, ambiguous CONTEXTUAL (BERT) "She visited the river bank " "She went to the bank for a loan" EMBEDDING SPACE bank₁ (nature) bank₂ (finance) Different vectors — context captured ✓

Peters et al. (AllenNLP, 2018) introduced ELMo — Embeddings from Language Models — the first widely adopted contextual word embedding. The architecture is a two-layer bidirectional LSTM language model pre-trained on 1 billion words (1 Billion Word Benchmark). Two passes through the sentence: a forward LM reads left to right and learns to predict the next word; a backward LM reads right to left and learns to predict the previous word. For each token, the forward and backward hidden states from all layers are concatenated, producing a context-sensitive representation.

ELMo representations are used as frozen features — the ELMo model is not fine-tuned on downstream tasks. Instead, the pre-computed ELMo vectors are concatenated to the input of existing task-specific models (NER taggers, QA systems, coreference models). This "feature-based" approach produced large, consistent improvements across NLP benchmarks — the first empirical proof that language model pre-training transfers broadly. ELMo improved the state-of-the-art on 6 NLP tasks simultaneously, which was extraordinary at the time.

ELMo's key limitation: the underlying architecture is an LSTM, which processes sequences sequentially (O(n) depth) and cannot parallelise across token positions. Training is slow and the representations are computed sequentially at inference. The Transformer architecture (Chapter 5.5), with O(1) depth and full parallelism via self-attention, replaced the LSTM backbone in every subsequent contextual embedding model.

ELMo — bidirectional LSTM creates context-dependent word representations
Input sentence: "The bank lent money" The bank lent money FWD h₁→ h₂→ h₃→ h₄→ knows "The" knows "The bank" knows left ctx BWD ←h₁ ←h₂ ←h₃ ←h₄ knows "money" knows "lent money" knows right ctx ELMo("bank") = [h₂→ ; ←h₂] ← "bank" after "The" AND before "lent money" → finance sense captured

Howard & Ruder (2018) introduced ULMFiT (Universal Language Model Fine-Tuning) — the paper that established the three-stage paradigm now universal in NLP. Where ELMo froze the language model and used it as a feature extractor, ULMFiT's insight was that the language model itself should be fine-tuned end-to-end on downstream tasks. This shifts the mental model from "use LM features" to "adapt a pre-trained LM for each task". BERT and GPT made this paradigm dominant — but ULMFiT proved it worked first.

ULMFiT introduced two now-standard fine-tuning techniques. Discriminative fine-tuning: assign different learning rates to each layer — earlier layers (which capture general syntax and morphology) are updated very slowly; later layers (which capture task-specific semantics) are updated faster. Gradual unfreezing: start fine-tuning only the last layer, then progressively unfreeze earlier layers one at a time. This prevents catastrophic forgetting — the phenomenon where fine-tuning on a small task dataset destroys the broad language knowledge acquired during pre-training.

The three stages generalise directly to all modern pre-trained language models. Stage 1 (pre-training) is expensive but done once and shared via model hubs. Stage 2 (domain adaptation) is optional but valuable for specialised domains (biomedical, legal, code). Stage 3 (task fine-tuning) is cheap — hours on a single GPU with hundreds to thousands of labelled examples, compared to millions required to train from scratch. This cost asymmetry is the fundamental economic argument for the transformer pre-training paradigm.

Pre-train → Domain Adapt → Task Fine-tune — the modern NLP paradigm
STAGE 1 Pre-training Language model on large general corpus Wikipedia + Books + web text (100B+ words) Expensive: weeks, 100s of GPUs ✓ Done once, shared publicly transfer STAGE 2 Domain Adapt Fine-tune LM on domain text (optional) legal / medical / code Moderate: days/GPU adapt STAGE 3 Task Fine-tuning Train on task labels Classifier / NER / QA Cheap: hours, one GPU Stage 1 knowledge transfers to stages 2 and 3 — vast language understanding, cheap specialisation

Causal Language Modelling (CLM) — used by GPT, GPT-2, GPT-3, LLaMA: predict the next token given all previous tokens. Objective: maximise P(xₜ | x₁,...,xₜ₋₁). The model never sees future tokens during training — it processes left-to-right with a causal (triangular) attention mask. This makes it naturally suited to generation: at inference, repeatedly predict the next token and append it to the sequence.

Masked Language Modelling (MLM) — used by BERT, RoBERTa, DeBERTa: randomly mask 15% of input tokens (replacing them with [MASK]), then predict the original token using the full surrounding context. Objective: maximise P(masked | all other tokens). Because both left and right context is available simultaneously, the model builds bidirectional representations — excellent for understanding tasks (classification, NER, QA) but not for generation.

Next Sentence Prediction (NSP) was used in original BERT alongside MLM: given two text segments, predict whether they appear consecutively in the source document. Later analysis (RoBERTa, 2019) showed NSP adds little benefit and can hurt performance by forcing artificially short segments — it was removed in subsequent models. Replaced Token Detection (RTD) — used by ELECTRA: a small generator network creates plausible but fake token replacements; the main discriminator must identify which tokens were replaced. Every token gets a training signal (not just 15% as in MLM), making ELECTRA 4× more efficient for the same computational budget.

Self-Supervised Pre-training Tasks — CLM, MLM, NSP, RTD
CLM — CAUSAL LANGUAGE MODEL (GPT family) The cat sat on the ??? predict Seen: The cat sat on the Predict: "mat" (next token) P(xₜ | x₁...xₜ₋₁) — left-to-right only Model: GPT, GPT-2, GPT-3, GPT-4, LLaMA MLM — MASKED LANGUAGE MODEL (BERT family) The [MASK] sat on mat → cat 15% of tokens masked randomly Predict: "cat" using BOTH sides P(masked | all other tokens) — bidirectional Model: BERT, RoBERTa, DeBERTa, ALBERT NSP — NEXT SENTENCE PREDICTION (original BERT) Sentence A: "The cat sat." B (True): "It was comfortable." B (False): "Apple is a tech company." Binary: IsNext? Removed in RoBERTa RTD — REPLACED TOKEN DETECTION (ELECTRA) The orig dog replaced! sat orig on orig mat orig Original: "The cat sat on mat" Generator replaced "cat" → "dog" Discriminator: label each token orig/replaced ALL tokens trained — 4× more efficient than MLM

Word-level contextual embeddings (one vector per token) are essential for token-level tasks like NER and POS tagging — but many applications require a single vector for an entire sentence, paragraph, or document. How do you pool a variable-length sequence of token vectors into one fixed-size representation? Three approaches have been widely used.

[CLS] token pooling (BERT's approach): prepend a special [CLS] (classification) token to every input. The Transformer processes all tokens together with full self-attention. In theory, the [CLS] token's output representation aggregates information from the entire sequence. In practice, this works well after fine-tuning on a specific task — but out-of-the-box BERT [CLS] vectors perform poorly on semantic similarity benchmarks, because BERT was not trained to produce meaningful sentence-level representations in [CLS].

Mean pooling: average all token embeddings in the final layer. Surprisingly effective as a baseline — often outperforms [CLS] pooling on zero-shot semantic similarity without fine-tuning. Simple to implement and parameter-free. Sentence-BERT (SBERT, Reimers & Gurevych, 2019) addresses both approaches by fine-tuning BERT with a siamese / triplet network objective on sentence pairs — training the model to produce similar vectors for semantically similar sentences. SBERT dramatically outperforms naive pooling on STS benchmarks and is 20–30× faster for pair-wise similarity computation than vanilla BERT (which requires a separate forward pass for every pair).

Sentence Embeddings — [CLS] pooling vs Mean pooling
[CLS] TOKEN POOLING [CLS] I love Paris [SEP] BERT Encoder (12 layers, self-attention) [CLS] ↑ used ignored [CLS] output → Sentence Vector (768-dim) MEAN POOLING I love Paris BERT Encoder all 3 used ∑/n Average all tokens → Sentence Vector (768-dim) SBERT: siamese fine-tuned BERT significantly outperforms both on semantic similarity — the standard for production RAG

∑ Chapter 5.4 Summary — Contextual Embeddings & Pre-trained LMs

  • Static embeddings fail on polysemy: same vector for "bank" in all contexts — fundamental architecture limit, not a data problem
  • ELMo: first practical contextual embeddings using bidirectional LSTM LM — forward + backward hidden states concatenated per token
  • ULMFiT established the pre-train / fine-tune paradigm: expensive once, cheap adaptation — with discriminative LR and gradual unfreezing
  • CLM (GPT): predict next token — left-to-right, generative. MLM (BERT): mask 15%, predict — bidirectional, understanding
  • RTD (ELECTRA): discriminate original vs replaced tokens — 4× more compute-efficient than MLM
  • Sentence embeddings: [CLS] pooling or mean pooling → SBERT fine-tuning dramatically improves semantic similarity quality for RAG and search
5.5
Chapter 5.5
The GPT Family

From GPT-1 to GPT-4 — how decoder-only transformers and scale created the generative AI era. The GPT lineage proved a single, deceptively simple idea: train a very large decoder-only Transformer on very large data with a next-token-prediction objective, and intelligence-like capabilities emerge.

GPT = Generative Pre-trained Transformer — a decoder-only Transformer. Unlike BERT (encoder-only, bidirectional), GPT uses causal (masked) self-attention: each token can attend ONLY to previous tokens. This constraint is not a limitation — it's the design that enables generation. You can't look at future tokens while generating them.

The key architectural choices that distinguish GPT from BERT:

Causal Self-Attention

Each token attends only to tokens before it. Implemented via a triangular mask that sets future positions to −∞ before softmax. This makes the model autoregressive — it can generate one token at a time, left to right.

No Encoder

GPT uses a single stack of N transformer decoder blocks. No encoder–decoder cross-attention. The entire prompt and generated text flow through the same stack. Simplicity at scale.

Autoregressive Generation

Given a prompt, predict the next token → append it → repeat. Each forward pass produces one token. Generation is sequential by nature — you can't parallelise the generation of future tokens (though prompt processing is parallel).

Why Decoder-Only Wins for Generation

Encoder-only models (BERT) see the full context bidirectionally — great for understanding, but can't generate. Decoder-only enforces the causal constraint that makes autoregressive generation coherent and consistent.

GPT Decoder-Only Architecture — causal attention, autoregressive generation
Token Embedding + Positional Encoding Masked Self-Attn → Add&Norm → FFN → Add&Norm Block 1 Masked Self-Attn → Add&Norm → FFN → Add&Norm Block 2 … N Masked Self-Attn → Add&Norm → FFN → Add&Norm Block N Linear → Softmax → Next Token Probs Causal Mask −∞ −∞ −∞ 0 0 0 Each token sees only past Autoregressive Generation "The cat" → [GPT] → "sat" (p=0.42) "The cat sat" → [GPT] → "on" (p=0.71) "The cat sat on" → [GPT] → "the" (p=0.83) token by token

Kaplan et al. (OpenAI, 2020) discovered that language model loss decreases as a smooth power law as you increase model size (N), dataset size (D), or compute (C). This isn't a vague trend — it's a precise mathematical relationship: L(N) ∝ N−α where α ≈ 0.076. Double the parameters and loss drops predictably.

Three factors drive scaling: N (parameters), D (dataset tokens), and C (compute in FLOPs). The breakthrough insight: you must scale all three together. Scaling parameters alone while holding data fixed gives diminishing returns.

The Chinchilla Finding (Hoffmann et al., 2022)

Optimal scaling allocates equal compute budget to parameters AND data. GPT-3 was undertrained: 175B params trained on only 300B tokens. Chinchilla-optimal would be ~3.5T tokens. LLaMA-2 7B trained on 2T tokens — far more tokens per parameter than GPT-3 — and performed remarkably well. Practical implication: "Train a smaller model on more data" — better for inference costs.

Kaplan Scaling Laws (2020) L(N) ≈ (Nc/N)αN    L(D) ≈ (Dc/D)αD    L(C) ≈ (Cc/C)αC N = parameters, D = dataset tokens, C = compute (FLOPs)
Chinchilla Optimal Scaling (2022) Nopt ∝ C0.5    Dopt ∝ C0.5 For every 2× increase in compute → double BOTH model size AND training tokens
Neural Scaling Laws — Loss vs Model Size (log-log plot)
Model Parameters (log scale) Validation Loss 100M 1B 10B 100B 1T 1.0 2.0 3.0 4.0 Kaplan Zone Chinchilla Zone GPT-2 (1.5B) GPT-3 (175B) GPT-4 (~1T est.) LLaMA-3 70B Chinchilla 70B Every 10× params → ~40% loss reduction
📝

GPT-1 — June 2018

117M parameters, trained on BookCorpus. First generative pre-training paper. Demonstrated that unsupervised pre-training + supervised fine-tuning on 12 tasks produced strong NLU results. Proof of concept.

🚀

GPT-2 — Feb 2019

1.5B parameters, trained on WebText (40GB of Reddit-filtered web pages). First model to show zero-shot capabilities — performing tasks with no task-specific training. The "too dangerous to release" controversy put LLMs in the public consciousness.

GPT-3 — June 2020

175B parameters, trained on 300B tokens. Introduced few-shot learning from the prompt alone — no gradient updates needed. In-context learning: provide examples in the prompt, and GPT-3 generalises. This changed everything.

🎯

InstructGPT — Jan 2022

GPT-3 fine-tuned with RLHF (Reinforcement Learning from Human Feedback). Follows instructions, avoids harmful output. Much more useful than raw GPT-3. Foundation for alignment research.

💬

ChatGPT — Nov 2022

GPT-3.5 + RLHF + chat interface. 100 million users in 60 days — fastest product adoption in history. Made LLMs accessible to non-technical users. Started the "AI moment".

🧠

GPT-4 — March 2023

Multimodal (image + text), estimated ~1 trillion parameters. Professional exam performance: passed the bar exam (90th percentile), SAT, medical licensing. Step change in reasoning quality.

🌐

GPT-4o — 2024

Native voice + vision, fast inference, GPT-4 quality at lower cost. "Omni" model — unified multimodal architecture. Real-time conversation with vision understanding.

🔗

o1, o3 — 2024–2025

Reasoning models with chain-of-thought. New frontier: models that "think" before answering, spending more compute at inference time. Trade speed for accuracy on complex tasks.

GPT Family — exponential parameter growth from 117M to ~1T
Model Parameters (log) 100M 1B 10B 100B 1T 117M GPT-1 1.5B GPT-2 175B GPT-3 ~1T (est.) GPT-4 70B LLaMA-3 7B Mistral 13× 117× GPT (closed) Open-source

Wei et al. (2022) documented a surprising phenomenon: certain capabilities appear suddenly at a scale threshold — they are essentially absent in smaller models and then abruptly present in larger ones. These emergent abilities were not explicitly trained. The model was only ever trained to predict the next token. Yet above a certain parameter count, it can perform multi-step arithmetic, chain-of-thought reasoning, translation between unseen language pairs, and code generation.

Multi-step Arithmetic

Below ~10B params ≈ random performance. Above 100B → suddenly works with high accuracy. The model learns to decompose calculations despite never being explicitly taught arithmetic.

Chain-of-thought Reasoning

Appears around 100B parameters. Prompting "Let's think step by step" has zero effect on small models but dramatically improves large model accuracy on multi-step reasoning tasks.

Unseen Language Translation

Models trained primarily on English data can translate between language pairs never seen during training. This capability emerges at scale — evidence of internal multilingual representations.

Code Generation

Near-zero at 1B, functional at 10B, excellent at 100B+. Models go from generating syntactic garbage to writing correct, complex programs — a phase transition in capability.

The Debate: Are Emergent Abilities Real?

Schaeffer et al. (2023) argued that emergent abilities may be measurement artifacts — they appear "sudden" because we use discontinuous metrics (e.g., exact-match accuracy). With continuous metrics (e.g., log-likelihood), improvement is smooth. The debate continues, but the practical observation holds: there are capability thresholds below which models are useless at certain tasks.

Emergent Abilities — sudden capability jumps at scale thresholds
Model Scale — Parameters (log) Task Performance (%) 1B 10B 100B 1T 0% 25% 50% 75% 100% random Multi-digit arithmetic Chain-of-thought reasoning Unseen language translation ~10B threshold ~100B threshold

How does an LLM actually generate text? At each step, the model outputs a probability distribution over the entire vocabulary. The decoding strategy determines which token to pick from that distribution. This choice dramatically affects output quality, diversity, and creativity.

Greedy

Always pick the most probable token. Deterministic, often repetitive for long texts. argmax at every step.

Beam Search

Maintain top-k sequences at each step, pick the best overall. Better quality than greedy, still not diverse.

Sampling

Sample randomly from the full distribution. Diverse but can produce incoherent text — low-probability tokens get chosen.

Top-k Sampling

Sample only from the top k most likely tokens. k=50 is common. Balances diversity and coherence.

Top-p / Nucleus

Sample from the smallest set of tokens whose cumulative probability ≥ p. Adaptive vocabulary — more tokens when distribution is flat, fewer when peaked.

Temperature

Scale logits by T before softmax. T<1 → sharper (more deterministic). T>1 → flatter (more creative/chaotic). T=0 ≈ greedy.

LLM Decoding Strategies — greedy, top-k, top-p, temperature
Prompt: "The cat sat on the" Greedy (argmax) mat (0.42) ✓ floor (0.25) roof → Always "mat" — deterministic, repetitive Top-k (k=3) mat (0.42) floor (0.25) roof → Sample from top 3 — diverse, controlled Top-p (p=0.9) mat (0.42) floor (0.25) roof bed → Nucleus: smallest set ≥ 90% — adaptive Temp = 0.1 mat (0.91) → Very sharp — nearly deterministic Temp = 2.0 mat floor roof bed sky → Very flat — creative / chaotic Typical production setting: top-p = 0.9, temperature = 0.7–1.0 Low temp for factual tasks, high temp for creative tasks

The open-source LLM ecosystem exploded in 2023–2025. Models from Meta, Mistral, Alibaba, Google, Microsoft, and others are approaching closed-source frontier quality. This table captures the major families as of 2024–2025.

Model Provider Params Context License Notable
LLaMA 3 8B/70B/405B Meta 8B–405B 128K Llama 3 Best open-source 2024
Mistral 7B / 8×7B Mistral AI 7B / ~45B 32K Apache 2.0 Efficient MoE (Mixtral)
Qwen2.5 7B/72B Alibaba 7B–72B 128K Qwen Strong multilingual
Gemma 2 9B/27B Google 9B / 27B 8K Gemma Strong at size
Phi-3 mini/small Microsoft 3.8B / 7B 128K MIT Small but capable
DeepSeek-R1 DeepSeek 7B–671B 64K MIT Reasoning-focused
Command-R+ Cohere 104B 128K CC BY-NC RAG-optimised

∑ Chapter 5.5 Summary — The GPT Family

  • GPT = decoder-only Transformer + causal (left-to-right) attention = autoregressive generation
  • Scaling laws: loss decreases as power law with parameters, data, and compute
  • Chinchilla: optimal training = equal compute budget for parameters AND data
  • Emergent abilities: capabilities appear suddenly at scale thresholds — not trained explicitly
  • Inference: top-p (nucleus) sampling at temperature 0.7–1.0 is the typical LLM generation setting
  • Open-source LLMs (LLaMA 3, Mistral, Qwen) are approaching closed-source frontier quality
5.6
Chapter 5.6
BERT & Encoder Models

BERT introduced a paradigm shift: instead of predicting the next word left-to-right, mask some words and predict them using the FULL surrounding context. This bidirectional pre-training produces richer representations that dominate understanding tasks — classification, NER, QA, and semantic search.

Devlin et al. (Google, 2018): BERT — Bidirectional Encoder Representations from Transformers. BERT uses an encoder-only Transformer stack — no decoder, no causal mask. Every token attends to all other tokens simultaneously (bidirectional attention). This is the key difference from GPT: BERT sees the full context before producing representations.

BERT-base

  • 12 Transformer layers
  • 768 hidden dimension
  • 12 attention heads
  • 110M parameters

BERT-large

  • 24 Transformer layers
  • 1024 hidden dimension
  • 16 attention heads
  • 340M parameters
BERT Model Sizes BERT-base: L=12 layers, H=768 dim, A=12 heads, 110M parameters BERT-large: L=24 layers, H=1024 dim, A=16 heads, 340M parameters Input: [CLS] sentence_A [SEP] sentence_B [SEP] Output: Contextual embedding for every input token (shape: seq_len × 768)
BERT vs GPT Attention — bidirectional vs causal masking
BERT — each token sees ALL tokens The cat sat on the mat The cat sat on the mat GPT — each token sees only PAST tokens The cat sat on the mat The cat sat on the mat

BERT's input representation is the sum of three embedding types (not concatenated). Every input is prepended with [CLS] and sentence pairs are separated by [SEP].

🏷️

[CLS] Token

Classification token prepended to every input. Its final hidden state is used as the aggregate sequence representation for classification tasks.

✂️

[SEP] Token

Separator token between sentence A and sentence B. Also appended at the end of the input sequence.

🎭

[MASK] Token

Replaces 15% of tokens during pre-training. Of those 15%: 80% → [MASK], 10% → random word, 10% → kept unchanged.

Segment embeddings tell BERT which sentence each token belongs to (Sentence A vs Sentence B). The three input components are summed element-wise: Token Embedding + Positional Embedding + Segment Embedding.

BERT Input = Token Embedding + Positional Embedding + Segment Embedding
TOKEN EMBEDDING [CLS] I love Paris [SEP] It's beautiful [SEP] SEGMENT EMBEDDING Segment A Segment B POSITIONAL EMBEDDING pos 0 pos 1 pos 2 pos 3 pos 4 pos 5 pos 6 pos 7 + Sum of all three = BERT input SUMMED (not concatenated)

BERT's power lies in fine-tuning: take the pre-trained backbone and add a thin task-specific head. All BERT weights are updated during fine-tuning (with a small learning rate). Four canonical task types:

📄

Sequence Classification

Sentiment, topic, NLI. Add FC layer on top of [CLS] embedding → class probabilities. Fine-tune all BERT weights + FC layer.

🏷️

Token Classification

NER, POS tagging. Add FC layer on every token embedding → per-token labels. Each token gets a label independently.

Extractive Question Answering

Input: [CLS] question [SEP] passage [SEP]. Output: start + end position — which span in the passage is the answer. Two vectors classify each token as answer-start or answer-end.

🔗

Sentence Pair Tasks

Similarity, entailment. Input: [CLS] sentence A [SEP] sentence B [SEP]. Use [CLS] embedding as pair representation.

BERT Fine-tuning Tasks — one backbone, four output heads
SEQ CLASSIFICATION Tokens BERT [CLS] only FC → Softmax Class Label TOKEN CLASSIFICATION Tokens BERT ALL token outputs FC per token Per-token Labels EXTRACTIVE QA [CLS] Q [SEP] P [SEP] BERT Passage tokens Start / End cls Answer Span SENTENCE PAIR [CLS] A [SEP] B [SEP] BERT [CLS] only Similarity Score Score / Label
Code — BERT Fine-tuning with HuggingFace
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments from datasets import load_dataset # Load pre-trained BERT + add classification head tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2) # Tokenise input (handles [CLS] and [SEP] automatically) def tokenize_fn(examples): return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128) dataset = load_dataset('imdb') tokenized = dataset.map(tokenize_fn, batched=True) training_args = TrainingArguments( output_dir='./bert-sentiment', num_train_epochs=3, per_device_train_batch_size=16, learning_rate=2e-5, # CRITICAL: small LR for fine-tuning pre-trained model weight_decay=0.01, evaluation_strategy='epoch', warmup_steps=500 ) trainer = Trainer(model=model, args=training_args, train_dataset=tokenized['train'], eval_dataset=tokenized['test']) trainer.train()
Model Key Innovation Params Training Data Notable Improvement
BERT-base Bidirectional Transformer + MLM + NSP 110M BookCorpus + Wikipedia Baseline
RoBERTa (Facebook) Remove NSP, larger batches, more data, longer training 125M 160GB text 5–10% improvement on GLUE
DistilBERT (HuggingFace) Knowledge distillation from BERT (40% smaller) 66M Same 60% faster, 97% of BERT's performance
ALBERT (Google) Cross-layer parameter sharing, sentence order prediction 12M–235M Same Same performance, fraction of params
DeBERTa (Microsoft) Disentangled attention (separate content + position) 86M–1.5B 160GB State-of-the-art on SuperGLUE
ELECTRA (Google) Replaced Token Detection (more efficient training) 14M–335M Same 4× more efficient than BERT
Aspect BERT (Encoder) GPT (Decoder)
Attention Bidirectional (all tokens) Causal (left-to-right only)
Pre-training Masked LM + NSP Next-token prediction
Best For Understanding (classify, NER, QA) Generation (chat, completion)
Output Contextual embeddings Generated text
Fine-tuning Add task head, small dataset OK Prompt-based, few-shot
Scale 110M–1.5B 117M–1.8T+
Use BERT / Encoder Models
Use GPT / Decoder Models
  • → Understanding tasks (classification, NER, QA)
  • → Sentence embeddings for semantic search
  • → NLI and entailment
  • → Smaller, faster fine-tuning
  • → Bidirectional context needed
  • → Tasks with fixed input→label format
  • → Generation tasks (chat, completion, summarisation)
  • → Zero/few-shot prompting
  • → Reasoning over long contexts
  • → Instruction following
  • → Tasks requiring flexible output format
  • → When you have no task-specific labels

∑ Chapter 5.6 Summary — BERT & Encoder Models

  • BERT: encoder-only, bidirectional attention — each token sees all tokens simultaneously
  • Pre-training: Masked LM (predict 15% masked tokens) + NSP on Wikipedia + BookCorpus
  • Fine-tuning: add task-specific head on [CLS] (classification) or all tokens (NER)
  • RoBERTa improves BERT by: more data, remove NSP, larger batches, longer training
  • DistilBERT: 40% smaller, 60% faster, 97% of BERT performance via knowledge distillation
  • Use BERT for understanding tasks; use GPT for generation and instruction following
5.7
Chapter 5.7
Prompt Engineering

Prompt engineering is the art and science of crafting inputs to get desired outputs from LLMs. The same model can produce radically different quality depending on how you ask — mastering the prompt is mastering the interface to intelligence.

LLMs are not search engines — they are conditional probability machines. Given your prompt as the beginning of a document, they predict what comes next. The quality and structure of that beginning determines everything about the continuation.

Compare: "What is the capital of France?" vs "Answer as a geography teacher giving a detailed explanation: What is the capital of France?" — same factual answer but very different style and depth.

Mental Model

You are writing the beginning of a document that the LLM will continue. The better the beginning, the better the continuation.

Five prompt components:

① Instruction

What you want the model to do. Be specific and explicit: "Summarise in 3 bullets" not "Summarise".

② Context

Background information the model needs: domain, audience, constraints, prior conversation.

③ Input Data

The actual content to process: text to classify, code to review, question to answer.

④ Output Format

How you want the answer: JSON, bullet list, table, single word, code block.

⑤ Examples

Demonstrations of desired input→output pairs (few-shot). The model learns the pattern in-context.

Zero-shot: ask the model without any examples. Works well for simple, well-defined tasks.

Few-shot (in-context learning): provide 2–5 examples before asking. Brown et al. (GPT-3, 2020) showed that providing examples dramatically improves performance. The model is not fine-tuned — it adapts to the task from examples in its context window.

One-shot: exactly one example — sometimes all you need for well-defined tasks.

Key Insight

Example selection matters: choose examples that cover edge cases and represent the full range of expected inputs. Diverse examples outperform similar ones.

Zero-Shot, One-Shot, Few-Shot — In-Context Learning
ZERO-SHOT — no examples Classify the sentiment: 'I love this product!' → → "Positive" ✓ Simple task — works without examples ONE-SHOT — one example Example: 'The food was terrible' → Negative Classify: 'I love this product!' → → "Positive" ✓ One example sets the pattern FEW-SHOT — 3 examples, better coverage Example 1: 'The food was terrible' → Negative Example 2: 'Service was excellent' → Positive Example 3: 'It was okay I guess' → Neutral Classify: 'I love this product!' → → "Positive" ✓✓ (more reliable) More examples = better performance on ambiguous inputs

Wei et al. (Google, 2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". The key insight: prompting LLMs to "think step by step" dramatically improves reasoning accuracy on math, logic, and multi-step problems.

Why it works: the model generates intermediate steps → each step conditions the next → less error accumulation. The chain of reasoning acts as a scratchpad that keeps the model on track.

Standard Prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?

A: 11

↑ Direct answer — works for simple tasks, fails for multi-step

CoT Prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?

A: Roger started with 5. Each can has 3 balls, so 2 cans = 6 balls. 5 + 6 = 11. The answer is 11.

↑ Explicit steps — each step conditions the next

Zero-Shot CoT (Kojima et al., 2022)

Just add "Let's think step by step" to any prompt — no examples needed. This simple suffix unlocks reasoning in large models.

Chain-of-Thought — explicit reasoning steps dramatically improve accuracy
STANDARD PROMPTING Q: If a store sells 3 items at $4 each and 2 items at $7 each, what is the total? "$26" ✓ Direct answer — higher error rate on complex multi-step problems GSM8K: ~18% accuracy CHAIN-OF-THOUGHT Q: If a store sells 3 items at $4 each and 2 items at $7 each, what is the total? Step 1: 3 items × $4 = $12 Step 2: 2 items × $7 = $14 Step 3: $12 + $14 = $26 ✓ GSM8K: ~57% accuracy Multi-step math: Standard 18% → CoT 57% (GSM8K benchmark)

Structure your prompts for reliable, parseable output. Three key techniques:

📋

Output Formatting

Ask for JSON, XML, or specific structure. Example: "Return as JSON: {"sentiment": "...", "confidence": 0-1}"

→ Parse programmatically, no regex hacks

🎭

Role Prompting

"You are an expert Python developer with 10 years of experience."

→ Sets persona, knowledge domain, and response style

🔒

Delimiters

Use triple backticks, XML tags, or --- to separate instruction from data.

→ Prevents prompt injection, clarifies boundaries

Code — Structured Entity Extraction with OpenAI API
import openai, json def extract_entities(text: str) -> dict: response = openai.chat.completions.create( model="gpt-4", messages=[ {"role": "system", "content": "You extract named entities from text. Return JSON only."}, {"role": "user", "content": f"""Extract entities from: ```{text}``` Return format: {{"people": [...], "organizations": [...], "locations": [...], "dates": [...]}}"} ], temperature=0.0, # deterministic for structured output response_format={"type": "json_object"} # GPT-4 JSON mode ) return json.loads(response.choices[0].message.content) result = extract_entities("Tim Cook announced Apple's Q3 earnings in Cupertino on Tuesday, August 1st.") print(json.dumps(result, indent=2)) # {"people": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["August 1st"]}

In chat-based APIs (OpenAI, Anthropic, etc.), the system prompt sets the model's behaviour, persona, and constraints before the user speaks. It's the most powerful lever for controlling output quality.

What Goes in System Prompts

  • Role and persona definition
  • Output format requirements
  • Constraints and guardrails
  • Domain knowledge or context
  • Tone and style instructions

Best Practices

  • Be explicit and specific — don't assume the model infers intent
  • Put constraints up front (format, length, language)
  • Use delimiters to separate user content from instructions
  • Test with adversarial inputs
  • Don't rely on system prompt secrecy for security
System Prompt Template You are [ROLE] with expertise in [DOMAIN]. Your task: [INSTRUCTION] Rules: 1) [CONSTRAINT] 2) [FORMAT] 3) [GUARDRAIL] If uncertain, say "I don't know" — do NOT hallucinate. Always put the most important constraints first — models attend more strongly to the beginning.
Pattern When to Use Template Example
Role Pattern Need domain expertise "You are a [role] with [experience]..." "You are a senior Python engineer reviewing code for bugs"
Step-by-Step Multi-step reasoning, math "Think step by step..." "Solve this problem step by step: ..."
Output Format Need structured data "Return as JSON/list/table..." "Return as JSON: {fields}"
Few-Shot Task hard to specify, need examples "[Example 1]→[Output 1]\n[Input]→?" Sentiment, classification, entity extraction
Chain-of-Thought Reasoning, math, logic "[problem] Let's think step by step" Math word problems, logical puzzles
Delimiter Long context, avoid injection "Summarise: ```{text}```" Document processing, code review
Self-Ask Complex multi-hop questions "Are there any follow-up questions?" Research synthesis, fact verification
⚠️

Prompt Injection

Malicious input overrides instructions: "Ignore previous instructions and..."

Mitigation: use delimiters, validate inputs, separate system and user content.

⚠️

Prompt Leaking

User can extract system prompt: "Repeat all your instructions"

Mitigation: don't rely on prompt secrecy for security, use proper access controls.

💡

Ambiguous Instructions

Vague prompts → inconsistent outputs. Be explicit: "Respond in 3 bullet points of max 20 words each".

💡

Lost in the Middle

LLMs attend better to start and end of context. Put most important info first or last. (Liu et al., 2023: "Lost in the Middle" phenomenon)

∑ Chapter 5.7 Summary — Prompt Engineering

  • Few-shot in-context learning: examples in the prompt teach the task — no gradient updates needed
  • Chain-of-thought: "Let's think step by step" — explicit reasoning steps reduce errors
  • Structured output: specify JSON/XML format → parse programmatically
  • ReAct pattern: Thought→Action→Observation loop — foundation of tool-using agents (Domain 8)
  • Prompt injection: user input can override instructions — always use delimiters to separate content
  • "Lost in the Middle": LLMs attend best to start and end of context — put key info there
5.8
Chapter 5.8
Retrieval-Augmented Generation

RAG is the bridge between an LLM's frozen knowledge and the living, changing world. Instead of retraining a model every time information changes, retrieve relevant documents at query time and inject them into the prompt — grounding the model's answers in real, verifiable sources.

Three fundamental LLM limitations make RAG essential for production systems:

📅

Knowledge Cutoff

GPT-4's training data has a cutoff date. Any event after it is unknown. "What happened at the UN Security Council yesterday?" → hallucination.

RAG fix: retrieve yesterday's news, inject into context.

🌀

Hallucination on Specifics

LLMs confabulate details — addresses, phone numbers, dates, internal policies. "What is our Q3 refund policy?" → makes something up.

RAG fix: retrieve actual policy document, ground the answer.

🔒

Private Knowledge

Your internal docs, contracts, code, Slack history — not in any LLM. "Summarise our client contract with Acme Corp" → impossible.

RAG fix: embed and retrieve from your private document store.

RAG has two phases: an offline indexing pipeline (run once or periodically) and an online query pipeline (run at every user question). Both share the same embedding model and vector database.

Indexing Phase (Offline)

  • Load documents (PDFs, web pages, Word, Slack, etc.)
  • Chunk into smaller pieces (e.g., 512 tokens each)
  • Generate embedding vector for each chunk
  • Store vectors in vector database

Query Phase (Online)

  • User asks a question
  • Embed the question (same model)
  • Vector search: find top-k similar chunks
  • Inject retrieved chunks into LLM prompt
  • LLM generates grounded answer
RAG Architecture — Indexing (offline) and Query (online) pipelines
INDEXING PIPELINE (OFFLINE) Documents PDFs, Web, DB Chunking 512 tokens Embedding Model text → 768-dim vector Vector DB Pinecone / Chroma "Our return policy allows..." → [0.2, -0.5, 0.8, ...] (768-dim) → stored in Vector DB Vector DB shared by both pipelines QUERY PIPELINE (ONLINE) User Query "Return policy?" Embed ANN Search top-k chunks LLM + Context GPT-4 / Claude Grounded Answer User: "What is the return policy?" → Answer: "Our return policy is 30 days as stated in section 4.2..." (with source citation from retrieved document)
Code — Simple RAG with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain_community.vectorstores import Chroma from langchain_openai import OpenAIEmbeddings, ChatOpenAI from langchain.chains import RetrievalQA # === INDEXING PHASE === # Load and chunk documents with open("company_policy.txt", "r") as f: text = f.read() splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50) chunks = splitter.create_documents([text]) print(f"Created {len(chunks)} chunks") # Embed and store embeddings = OpenAIEmbeddings(model="text-embedding-3-small") vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db") # === QUERY PHASE === retriever = vectorstore.as_retriever(search_kwargs={"k": 4}) # top 4 chunks llm = ChatOpenAI(model="gpt-4o", temperature=0.0) qa_chain = RetrievalQA.from_chain_type( llm=llm, retriever=retriever, return_source_documents=True # show which docs were retrieved ) result = qa_chain.invoke("What is the return policy for electronics?") print(result["result"]) print("Sources:", [d.metadata for d in result["source_documents"]])

Vector databases are specialised stores for embedding vectors, optimised for similarity search. The core operation is Approximate Nearest Neighbour (ANN) search: "find the k vectors closest to this query vector." Exact search is O(n·d) — ANN algorithms like HNSW make this dramatically faster.

Vector Search — k-nearest neighbour retrieval by cosine similarity
Query: "return policy?" cos=0.92 cos=0.89 cos=0.85 Top-3 Retrieved Chunks: 1. "Return policy: 30 days..." (cos=0.92) 2. "Customer refund guidelines..." (cos=0.89) 3. "Warranty and returns..." (cos=0.85) Finance docs Tech docs Legal docs ANN search finds semantically similar chunks in milliseconds
Database Type ANN Algorithm Filtering Best For Hosted?
Chroma Open source HNSW Metadata Dev / prototyping Self-hosted / Cloud
Pinecone Cloud-native Proprietary Metadata + hybrid Production, scale Cloud only
Weaviate Open source HNSW GraphQL Hybrid search, graphs Both
Qdrant Open source HNSW Payload High performance Both
pgvector PostgreSQL ext IVF / HNSW Full SQL Existing PG infra Self-hosted
Milvus Open source IVF, HNSW Scalar Billion-scale Self-hosted / Cloud

Chunk size matters enormously — wrong size degrades RAG quality. Too small: insufficient context in each chunk → incomplete answers. Too large: irrelevant content mixed with relevant → noisy retrieval. Chunks should overlap (50–100 tokens) to avoid splitting context across boundaries.

📏

Fixed-Size

Split every 512 tokens regardless of content. Simple but crude — may cut mid-sentence.

📝

Sentence-Based

Split on sentence boundaries. Better semantic coherence, preserves complete thoughts.

🔄

Recursive

Try paragraphs, then sentences, then words. LangChain's default. Best general-purpose strategy.

🧠

Semantic

Use embedding similarity to detect topic changes. Expensive but produces the most coherent chunks.

📑

Document-Aware

Use document structure (headers, sections, tables). Best for structured documents like reports and manuals.

Chunking Strategies — how you split documents impacts retrieval quality
Same document, three chunking methods: FIXED-SIZE 512 tok 512 tok 512 tok May split mid-sentence Simple but lossy SENTENCE-BASED Sent 1 Sent 2 Sent 3 Sent 4 Sent 5 Sent 6 . . Preserves sentence coherence Better semantic boundaries RECURSIVE + OVERLAP Paragraph 1 (3 sentences) Paragraph 2 (2 sentences) overlap Structure-aware + overlap Best general-purpose Recommended default: Recursive, 512 tokens, 50-token overlap

Naive RAG (embed → search → generate) works for many cases. These advanced patterns dramatically improve precision and recall for production systems:

🔀

Hybrid Search

Combine dense (embedding) + sparse (BM25 keyword) search. Better recall for exact phrase matches AND semantic similarity. Use Reciprocal Rank Fusion (RRF) to merge result lists.

🏆

Re-Ranking

Initial retrieval: top-20 by fast ANN → re-rank with expensive cross-encoder → return top-5. Cross-encoders read query + document together for much better relevance.

🔄

Query Transformation

Rewrite query before retrieval. HyDE: generate hypothetical answer, then embed that. Multi-query: generate 3 variants → retrieve for all → merge results.

👨‍👧

Parent-Child Chunks

Index small child chunks for precision retrieval. Return larger parent chunk for context. Best of both worlds — precise matching with rich context.

Advanced RAG: Two-Stage Retrieve-Then-Rerank Pipeline
User Query Embed ANN Search Vector DB Top 20 results (fast, ~5ms) Cross-Encoder Re-Ranker Top 5 results (slow, ~100ms, precise) LLM Answer Stage 1: Recall Stage 2: Precision Two-stage: fast recall then precise re-ranking
RAG
Fine-Tuning
  • ✓ Knowledge updatable at any time
  • ✓ Cites sources, verifiable answers
  • ✓ No training required
  • ✓ Lower cost than fine-tuning
  • ✗ Retrieval quality is the bottleneck
  • ✗ Context window limits
  • ✗ Latency of retrieval step
  • ✓ Knowledge baked into weights
  • ✓ No retrieval latency
  • ✓ Better for style / format / behaviour
  • ✗ Knowledge is static (needs retraining)
  • ✗ Can't cite specific sources
  • ✗ Expensive to update frequently

RAG and fine-tuning are not competing approaches — they are complementary. Fine-tune to change HOW the model communicates (tone, format, domain vocabulary). Use RAG to change WHAT the model knows (current facts, private documents, enterprise data). The best production systems use both.

∑ Chapter 5.8 Summary — Retrieval-Augmented Generation

  • RAG solves: knowledge cutoff, hallucination on specifics, private/proprietary data
  • Pipeline: Chunk docs → Embed → Store in vector DB → At query: embed query → ANN search → inject → generate
  • Chunking: 512 tokens with 50-token overlap is a reasonable default — recursive splitting preserves structure
  • Vector DB: stores embeddings for ANN similarity search — cosine similarity finds semantically similar chunks
  • Advanced RAG: hybrid search + re-ranking dramatically improves retrieval precision
  • RAG vs fine-tuning: use both — RAG for dynamic knowledge, fine-tuning for style/behaviour
5.9
Chapter 5.9
Hallucination, Alignment & Evaluation

LLMs are trained to produce fluent, probable text — not factual text. Understanding why they hallucinate, how alignment steers them toward human values, and how to rigorously evaluate their output is essential for responsible deployment.

Hallucination: LLMs generate factually incorrect information with apparent confidence. This is not a bug — it's a consequence of the training objective. The model was rewarded for coherent, fluent text, not for verified facts.

📛

Factual Hallucination

"Einstein won the Nobel Prize for relativity" — he actually won for the photoelectric effect. Plausible, confident, wrong.

📚

Citation Hallucination

Fabricated paper titles, non-existent authors, wrong DOIs. The model generates citation-shaped text that looks real but doesn't exist.

👤

Entity Hallucination

Made-up people, places, company names that sound real. "Westbrook Medical Center" — doesn't exist but sounds plausible.

🧩

Reasoning Hallucination

Correct-sounding reasoning leading to a wrong conclusion. Each step looks valid, but the chain produces an incorrect answer.

Intrinsic vs extrinsic hallucination:

Intrinsic Hallucination

Contradicts the provided context. The document says population = 5M, but the answer says 10M. Detectable by comparing output to source.

Extrinsic Hallucination

Fabricated content not in context. Generated from the model's world knowledge — may or may not be true. Harder to detect without external verification.

Hallucination Taxonomy — intrinsic (contradicts context) vs extrinsic (fabricated)
LLM Hallucination Intrinsic (contradicts source) "Source says 5M, answer says 10M" Extrinsic (fabricated content) Factual Wrong dates, events Citation Fake papers Entity Fake places Reasoning Wrong logic Mitigations → RAG: ground in retrieved facts → Constitutional AI: self-critique → Temperature=0: more deterministic

Root cause: LLMs are trained to produce fluent, probable text — not factual text. The model doesn't "know" it doesn't know something. Confidence calibration is poor: models are confidently wrong, which is more dangerous than being uncertainly wrong.

🎯

Training Objective

Maximise next-token probability → rewarded for coherent text, not verified facts. The loss function doesn't distinguish true from plausible.

🧠

Memorisation vs Generalisation

Facts not seen enough times in training → model interpolates between facts. It generates a blend of real knowledge and pattern-matched confabulation.

😊

Sycophancy

Models trained with RLHF learn to tell users what they want to hear. If you suggest a wrong answer, the model may agree rather than correct you.

Critical Danger

The hallucination-confidence problem: models are confidently wrong. A model that says "I'm not sure" is safer than one that states a fabricated fact with full certainty. This is why calibration research is critical.

Alignment problem: ensure AI systems behave according to human values and intentions. A highly capable but misaligned AI is dangerous — capability without alignment amplifies harm.

The specification problem: how do you formally specify "what humans actually want"? Even well-intentioned reward functions can be gamed — the model maximises the metric in unintended ways (reward hacking).

RLHF (Partial Solution)

Human preferences act as a proxy for values. Humans rank model outputs → reward model trained on rankings → PPO optimises policy. Imperfect but significant improvement over base models.

Constitutional AI (Anthropic)

Model learns from its own self-critique using a constitution of principles. Generate → critique against principles → revise. Scales better than human labelling.

Key alignment challenges:

Distributional Shift

Behaves well in training distribution but fails on out-of-distribution deployment inputs.

Reward Hacking

Satisfies the letter but not the spirit of the reward. Finds loopholes in the reward function.

Deceptive Alignment

Appears aligned during evaluation, behaves differently when deployed. The hardest failure mode to detect.

Capability vs Alignment — the central tension in LLM development
Capability (benchmark scores) → Alignment / Safety → Base LLMs RLHF-aligned GPT-3 base InstructGPT GPT-4 (goal) Alignment Tax RLHF may reduce raw capability slightly but increases safety

Anthropic's HHH framework defines the three axes of aligned model behaviour. The tension: being more helpful sometimes means being slightly less cautious (and vice versa). Constitutional AI resolves this by giving the model explicit principles to follow.

🤝

Helpful

Genuinely helps users accomplish tasks. Unhelpfulness is never trivially "safe" — a model that refuses everything harms users who have legitimate needs.

🛡️

Harmless

Avoids generating content that causes real-world harm. Calibrated — not reflexively refusing edge cases. Context matters: medical information for a nurse vs a stranger.

🔍

Honest

Doesn't claim certainty it doesn't have. Proactively shares relevant information. Doesn't pursue hidden agendas or deceive about its nature.

Constitutional AI Process

Generate → critique against principles → revise → repeat. The model becomes its own alignment judge, guided by a written constitution of values. Scales far better than per-output human labelling.

Automatic metrics enable scalable evaluation, but each has significant blind spots. Understanding their strengths and weaknesses is essential for trustworthy evaluation.

📊

BLEU (Translation)

Precision of n-gram overlap between generated and reference text. Range: 0–1. Weakness: doesn't capture meaning, penalises valid paraphrase.

📝

ROUGE (Summarisation)

Recall of n-gram overlap with reference. ROUGE-N: n-gram recall. ROUGE-L: longest common subsequence. Weakness: length bias, synonym-blind.

🌐

METEOR

Combines precision, recall, and semantic matching via WordNet synonyms. Better correlation with human judgement than BLEU alone.

🤖

BERTScore

Uses BERT embeddings to measure semantic similarity. More robust than n-gram metrics — captures paraphrase and meaning equivalence.

Key Metric Formulas BLEU = BP · exp(∑ wₙ log pₙ) Where: pₙ = n-gram precision, BP = brevity penalty ROUGE-N = (# overlapping n-grams) / (# n-grams in reference) Perplexity = exp(H(p,q)) = exp(-(1/N) ∑ log q(xᵢ)) Lower perplexity = better language model. Measures how "surprised" the model is by the test text.
Automatic vs Human Evaluation — speed-accuracy tradeoff
Accuracy Speed Cost (low=good) Coverage Human Corr. Automatic (BLEU/ROUGE/BERTScore) Human Evaluation Auto: fast, cheap, scalable but misses semantic nuance Human: accurate, gold standard but slow, expensive, not scalable
Benchmark Tests Format Human Baseline Note
MMLU 57-subject knowledge (57K questions) Multiple choice ~89% Knowledge breadth
HumanEval Python function generation (164 problems) Code generation ~75% Coding
GSM8K Grade school math (8.5K problems) Multi-step reasoning ~95% Math
MATH Competition math (12.5K problems) Multi-step hard math ~40% (students) Hard math
ARC-AGI Visual pattern reasoning Novel test patterns ~85% Novel reasoning
GPQA Diamond PhD-level science (448 questions) Multiple choice ~65% Expert knowledge
MT-Bench Multi-turn dialogue quality GPT-4 as judge Chat quality
Chatbot Arena Head-to-head human preference ELO rating Real-world preference

Goodhart's Law applies everywhere in LLM evaluation: when a benchmark becomes a target, it ceases to be a good measure. Models trained on benchmark-adjacent data score artificially high. The most trustworthy evaluation is diverse human assessment on novel, never-before-seen tasks.

∑ Chapter 5.9 Summary — Hallucination, Alignment & Evaluation

  • Hallucination: LLMs generate confident falsehoods — trained for fluency, not factual accuracy
  • Types: factual, citation, entity, reasoning hallucinations — RAG and temperature=0 reduce them
  • Alignment: ensure models behave according to human values (Helpful, Harmless, Honest)
  • RLHF and Constitutional AI: current best approaches to alignment — imperfect but significant improvement
  • BLEU/ROUGE: n-gram metrics for translation/summarisation — fast but miss semantic equivalence
  • Human evaluation remains the gold standard — automatic metrics can be gamed
5.10
Chapter 5.10
LLM Fine-Tuning in Practice

Fine-tuning is the final tool in the LLM adaptation toolkit — used when prompting and RAG aren't enough. Modern techniques like QLoRA make it possible to fine-tune a 70B parameter model on a single consumer GPU.

Decision framework — try in order (cheapest first):

1️⃣

Prompt Engineering (Free)

Can you get the desired behaviour with a better prompt? Better system prompt, clearer instructions, output format specification. Always try this first.

2️⃣

Few-Shot Examples (Free)

Add 3–10 examples to the prompt. Often dramatically improves output quality for classification, extraction, and formatting tasks.

3️⃣

RAG (Moderate Cost)

Does the model need access to external, updated, or private knowledge? RAG grounds answers in retrieved documents without any training.

4️⃣

Fine-Tuning (Higher Cost)

Does the model need to change its behaviour, style, or domain expertise? Fine-tuning bakes capabilities into the weights.

Fine-tuning IS the right choice when:

Consistent Format

Always return JSON in a specific schema, or follow a strict output template.

Domain Vocabulary

Medical jargon, legal language, internal code style that prompts can't reliably teach.

Reduce Token Usage

Bake instructions into weights that would otherwise consume context window space.

Faster Inference

A smaller fine-tuned model can outperform a larger prompted model — lower cost per query.

500+ Examples

You have high-quality labelled data. Without data, fine-tuning can't help.

When to Fine-Tune — try prompting and RAG first
Need to improve LLM performance? Better prompt / instructions work? YES Prompt Engineering ✓ (free) NO Needs current / private knowledge? YES RAG ✓ (moderate) NO Needs format / style / domain vocab? YES Fine-Tuning ✓ Have GPU → QLoRA | API-only → OpenAI FT NO More / better training data needed Cost increases →

The #1 determinant of fine-tuning quality is data quality. Garbage in, garbage out — but amplified by the power of gradient descent. Format: instruction-following datasets use the chat message format (system / user / assistant).

Minimum Dataset Sizes

  • 500 examples — for format/style changes
  • 1,000+ examples — for new capability
  • 5,000+ examples — for complex domain tasks

Data Quality Checklist

  • Diverse inputs (edge cases, different phrasings)
  • Consistent output quality (human-reviewed)
  • No duplicate or near-duplicate examples
  • Balanced classes (for classification)
Synthetic Data Generation

Use GPT-4 to generate training examples, then human spot-check 10–20%. This is the fastest way to build a high-quality dataset. Filter aggressively — 500 excellent examples beat 5,000 mediocre ones.

Code — Data Preparation for Fine-Tuning (JSONL Chat Format)
import json from pathlib import Path # Chat format for instruction fine-tuning (OpenAI / LLaMA format) def create_training_example(instruction: str, input_text: str, output: str) -> dict: messages = [ {"role": "system", "content": "You are a helpful assistant specialised in contract analysis."}, {"role": "user", "content": f"{instruction}\n\n{input_text}" if input_text else instruction}, {"role": "assistant", "content": output} ] return {"messages": messages} # Example dataset creation examples = [ create_training_example( instruction="Extract the termination clause from this contract:", input_text="...contract text...", output="Termination clause (Section 12): Either party may terminate with 30 days written notice..." ), # Add 499+ more examples ] # Validate format for ex in examples: assert len(ex["messages"]) == 3 assert ex["messages"][-1]["role"] == "assistant" assert len(ex["messages"][-1]["content"]) > 0, "Empty response!" # Save as JSONL (one JSON per line) with open("train.jsonl", "w") as f: for ex in examples: f.write(json.dumps(ex) + "\n") print(f"Training examples: {len(examples)}") print(f"Avg response length: {sum(len(e['messages'][-1]['content']) for e in examples) / len(examples):.0f} chars")

SFT objective: minimise cross-entropy loss on the assistant turns only. The user/system turns are masked — the model doesn't compute loss on the prompt tokens, only on the response it should have generated.

SFT Loss Function L = -(1/Nᵣ) ∑ log P(aₜ | s, u, a₁,...,aₜ₋₁) Only computed over response (assistant) tokens a₁,...,aₙ. System prompt s and user message u are MASKED.

Key hyperparameters:

Parameter Typical Range Notes
Epochs 1–3 More = overfitting, memorisation
Learning Rate 1e-5 to 2e-4 10–100× lower than pre-training
Batch Size 8–64 Use gradient accumulation for limited VRAM
Warmup 3–10% of steps Prevents early instability
Max Seq Length 2048–4096 Match model's typical context
Catastrophic Forgetting

Fine-tuning on a narrow task → model forgets general capabilities. Mitigation: use LoRA (only updates a small fraction of weights), or mix in general instruction-following data alongside your task data.

QLoRA (Dettmers et al., 2023): LoRA applied on a 4-bit quantised base model. This makes it possible to fine-tune a 70B model on a single 48GB GPU — impossible without quantisation.

QLoRA Hardware Requirements — which GPU can fine-tune which model size
Model Size RTX 3090 (24GB) RTX 4090 (24GB) A100 (40GB) A100 (80GB) 7B (~6GB) 13B (~10GB) 34B (~22GB) 70B (~40GB) VRAM shown is for QLoRA (4-bit) fine-tuning with batch size 1–2
Code — QLoRA Fine-Tuning with Unsloth (2× faster)
# Unsloth: 2x faster fine-tuning with memory-efficient kernels from unsloth import FastLanguageModel from trl import SFTTrainer from transformers import TrainingArguments # Load model in 4-bit quantisation model, tokenizer = FastLanguageModel.from_pretrained( model_name="meta-llama/Meta-Llama-3-8B-Instruct", max_seq_length=2048, load_in_4bit=True, # QLoRA: 4-bit quantised base dtype=None ) # Apply LoRA adapters model = FastLanguageModel.get_peft_model( model, r=16, # LoRA rank lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], lora_dropout=0.0, bias="none", use_gradient_checkpointing="unsloth" # saves 30% VRAM ) trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=train_dataset, dataset_text_field="text", max_seq_length=2048, args=TrainingArguments( per_device_train_batch_size=2, gradient_accumulation_steps=4, # effective batch = 8 num_train_epochs=2, learning_rate=2e-4, fp16=True, logging_steps=10, output_dir="./output", warmup_ratio=0.05, lr_scheduler_type="cosine" ) ) trainer.train() model.save_pretrained("./my-llama3-ft") # saves only LoRA weights (~50MB)

Rafailov et al. (2023): "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." RLHF requires three components: SFT model + reward model + PPO training — complex, unstable, expensive.

DPO insight: the optimal RLHF policy has a closed-form solution — no separate reward model needed. DPO uses preference pairs: (prompt, chosen_response, rejected_response). Train to increase likelihood of chosen relative to rejected. Simpler, more stable, often achieves similar or better results.

DPO Loss Function L = -E[log σ(β log(πₖ(yₑ|x)/πᵣ(yₑ|x)) - β log(πₖ(yₗ|x)/πᵣ(yₗ|x)))] yₑ = chosen response, yₗ = rejected response, πₖ = fine-tuned model, πᵣ = reference SFT model, β = temperature
RLHF
DPO
  • → Requires reward model training
  • → PPO: complex RL algorithm
  • → 3 separate models to maintain
  • → Compute intensive
  • → Hyperparameter-sensitive
  • → Gold standard for alignment
  • → No separate reward model
  • → Direct gradient on preference pairs
  • → Only 2 models (policy + reference)
  • → Simpler, more stable
  • → Fewer hyperparameters
  • → Increasingly preferred (2023–2025)
1️⃣

Define Task & Collect Data

500–2,000 examples in JSONL chat format. Ensure quality over quantity.

2️⃣

Data Validation

Check format, deduplicate, quality filter. Remove examples with empty or low-quality responses.

3️⃣

Choose Base Model

LLaMA 3, Mistral, Qwen — pick based on size, language, licence, and your hardware.

4️⃣

SFT with QLoRA

Unsloth or HuggingFace TRL. r=16, lr=2e-4, 1–3 epochs. Monitor loss convergence.

5️⃣

Evaluate on Hold-out

Task-specific metrics + human evaluation. Check for catastrophic forgetting on general tasks.

6️⃣

Optionally: DPO

Preference tuning on failure cases. Collect chosen/rejected pairs from model outputs.

7️⃣

Merge & Quantise

Merge LoRA adapters into full model. Quantise to GGUF (4-bit) for efficient inference.

8️⃣

Deploy

Ollama (local), vLLM (server), or cloud API. Monitor quality in production, collect feedback.

🎓 Domain 5 Complete — NLP & Large Language Models

  • Ch 5.1: NLP = four ambiguity layers: lexical, syntactic, semantic, pragmatic. Classical preprocessing (stopwords, stemming) is NOT used with neural models.
  • Ch 5.2: BPE tokenisation: iteratively merge most frequent pairs. GPT-4 uses 100K-vocab BPE (~¾ word per token).
  • Ch 5.3: Word2Vec: context predicts embedding. "king − man + woman ≈ queen" — geometry encodes meaning.
  • Ch 5.4: Contextual embeddings: same word, different vector per context. Pre-train then fine-tune = the modern NLP paradigm.
  • Ch 5.5: GPT = decoder-only, causal attention, autoregressive. Scaling laws: loss ∝ N. Chinchilla: equal budget for params and data.
  • Ch 5.6: BERT = encoder-only, bidirectional attention, MLM pre-training. Use for understanding; GPT for generation.
  • Ch 5.7: Few-shot ICL: examples in prompt adapt behaviour. Chain-of-thought: "think step by step" dramatically improves reasoning.
  • Ch 5.8: RAG: Chunk → Embed → Vector DB → Retrieve → Generate. Solves knowledge cutoff, hallucination on specifics, and private data.
  • Ch 5.9: Hallucination = LLMs generate confident falsehoods — trained for fluency not facts. HHH: Helpful, Harmless, Honest.
  • Ch 5.10: Fine-tune when prompt+RAG isn't enough. QLoRA fine-tunes 70B on a single GPU. DPO replaces RLHF's complexity.
🚀 Go Deeper — Fine-Tuning LLMs

Most applications today avoid fine-tuning and instead use prompting or RAG — faster, cheaper, and no training infrastructure needed.

Fine-tuning becomes important when:

  • Strict behaviour control is needed — consistent output format, tone, or safety guardrails
  • Domain-specific patterns must be learned — legal contracts, medical notes, proprietary code styles

→ Covered in depth: Fine-Tuning LLMs (Advanced)

Domain 5 is where theory meets the frontier. The GPT family and BERT established the modern NLP paradigm that all of AI now follows. Prompt engineering, RAG, and fine-tuning are the three tools every AI practitioner uses daily. Domain 8 (Agentic AI) will show how LLMs with tools become autonomous agents. Domain 9 (AI Ethics) will address the alignment and hallucination challenges at scale.