NLP Fundamentals
How computers learn to read — text processing, linguistic structure, and the pipeline from raw text to model input
Natural language is the most information-dense medium humans have ever created. Every sentence carries meaning at multiple levels simultaneously — lexical, syntactic, semantic, pragmatic. Teaching a machine to process all of these layers reliably is the central challenge of NLP.
What Is NLP? Core
Natural Language Processing (NLP) is the branch of AI that enables computers to understand, process, and generate human language. This sounds straightforward until you confront what language actually is: an ambiguous, context-dependent, culturally loaded communication system shaped over many thousands of years of human use. Very few sentences carry a single, unambiguous meaning independent of context, so a machine must learn to handle several levels of ambiguity at once.
Language ambiguity operates at four distinct levels, each building on the one below. Lexical ambiguity arises when a single word has multiple meanings: "bank" can mean a financial institution or the edge of a river. Syntactic ambiguity occurs when sentence structure is unclear: "I saw the man with the telescope" leaves open whether the speaker or the man has the telescope. Semantic ambiguity involves sentences whose meaning is unclear even with structure resolved: "Every student read a book" leaves open whether they all read the same book or each read a different one. Pragmatic ambiguity is the deepest, because the intended meaning departs from the literal one: "Can you pass the salt?" is literally a question about capability but functions as a request, and "It's cold in here" is an observation that functions as a request to close the window, but only if you already know the conversational norms.
The history of NLP traces a clear arc. Rule-based systems (1950s–1980s) used hand-crafted grammars and dictionaries, which were brittle and language-specific. Statistical NLP (1990s–2000s) replaced rules with probabilities learned from corpora. Neural NLP (2013–2017) used word embeddings (Word2Vec, GloVe) and RNNs to learn representations directly from data. The Transformer era (2018–present) introduced BERT, GPT, and their successors: models that learn highly general language representations from massive corpora and have displaced most earlier approaches on mainstream tasks.
NLP Understanding Tasks
- Text classification
- Named entity recognition
- Sentiment analysis
- Question answering
- Natural language inference
- Coreference resolution
NLP Generation Tasks
- Machine translation
- Text summarisation
- Dialogue / chatbots
- Text completion
- Code generation
- Data-to-text narration
Text Preprocessing In-depth
Before any model can process text, it must be transformed from raw characters into a form the model understands. Classical NLP pipelines involve a series of hand-engineered preprocessing steps, each reducing noise and normalising vocabulary. We trace each step using: "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running."
Step 1 — Lowercasing. Convert all characters to lowercase. "The" and "the" are the same word — keeping both wastes vocabulary slots. This alone can reduce vocabulary size by 10–30% for English text.
Step 2 — Punctuation & special character removal. Strip characters that carry no lexical meaning for bag-of-words models. Important caveat: not always appropriate — punctuation carries meaning in some contexts (U.S.A, 3.14, emoticons, code). Remove selectively based on the task.
Step 3 — Tokenisation. Split text into meaningful units (tokens). The naïve approach is whitespace splitting. Better approaches handle contractions ("they've" → ["they", "'ve"]) and punctuation. Chapter 5.2 covers subword tokenisation for neural models in depth.
Step 4 — Stopword removal. Remove high-frequency words ("the", "is", "a") that carry little semantic weight in bag-of-words models. Critical warning: never remove stopwords for neural models or sequence tasks — position and function words are often critical to meaning ("not" changes everything).
Step 5 — Stemming. Reduce words to root form by stripping suffixes using heuristic rules. Porter Stemmer: "jumping" → "jump", "foxes" → "fox". Fast but imprecise — "university" → "univers". Two words with the same stem may not share meaning.
Step 6 — Lemmatisation. Morphologically reduce words to their dictionary form (lemma) using linguistic knowledge. "better" → "good", "ran" → "run", "foxes" → "fox". More accurate than stemming but requires WordNet. "Saw" → "see" (verb) or "saw" (noun) depending on POS tag.
Step 7 — Text normalisation. Expand contractions ("they've" → "they have"), normalise Unicode, standardise numbers, handle abbreviations and acronyms.
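The steps above can be sketched end to end in plain Python. This is a minimal illustration rather than a production pipeline: the stopword set, the contraction table, and the `crude_stem` helper are all hypothetical stand-ins (a real system would use a proper stemmer or lemmatiser), and contraction expansion is applied before punctuation removal so that the apostrophe is still present.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "are", "over", "they", "have", "been", "is", "a"}
# Hypothetical contraction table covering only the example sentence.
CONTRACTIONS = {"they've": "they have"}


def crude_stem(token: str) -> str:
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling after "-ing": "runn" -> "run".
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return token


def preprocess(text: str) -> list[str]:
    text = text.lower()                                 # Step 1: lowercase
    for short, full in CONTRACTIONS.items():            # Step 7: expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # Step 2: drop punctuation
    tokens = text.split()                               # Step 3: whitespace tokenise
    tokens = [t for t in tokens if t not in STOPWORDS]  # Step 4: remove stopwords
    return [crude_stem(t) for t in tokens]              # Step 5: crude stemming


sentence = "The Quick Brown Foxes are JUMPING over lazy dogs! They've been running."
print(preprocess(sentence))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'run']
```

The toy stemmer is deliberately heuristic and shows the same weakness as real stemmers: a rule that fixes "running" would also turn "falling" into "fal".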
Neural models and LLMs do NOT use most of these preprocessing steps. They process raw subword tokens (Chapter 5.2) directly from near-original text. These classical steps are for bag-of-words models, TF-IDF search engines, and traditional ML feature engineering. If you are building anything with BERT, GPT, or similar — skip everything except basic Unicode normalisation.
Linguistic Structure Core
Even if you never build a classical NLP pipeline, understanding key linguistic concepts will help you reason about what language models are learning, diagnose failure modes, and work effectively with hybrid systems.
Part-of-Speech (POS) Tagging. Label each word with its grammatical role: Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), Determiner (DT), Preposition (IN). "The cat sat on the mat" → [DT, NN, VBD, IN, DT, NN]. POS tags disambiguate words with multiple roles — "run" is a verb in "I run" but a noun in "a home run". Used in: NER, information extraction, classical feature engineering.
Named Entity Recognition (NER). Identify and classify spans of text as named entities: PERSON, ORG, GPE (geopolitical entity), DATE, MONEY. "Apple's Tim Cook announced on Monday that sales exceeded $100B" → [Apple: ORG], [Tim Cook: PERSON], [Monday: DATE], [$100B: MONEY]. Modern BERT-based NER achieves near-human F1 scores on standard benchmarks.
Dependency Parsing. Identify grammatical relationships between words — subject, object, modifier, etc. "The cat chased the mouse" → "cat" is the nominal subject (nsubj) of "chased"; "mouse" is the direct object (dobj). Dependency parses are directed graphs enabling extraction of who did what to whom.
Coreference Resolution. Determine which mentions refer to the same entity. In "When Mary arrived, she said she was tired", both "she" instances refer to Mary. Crucial for coherent understanding across sentences. SpanBERT-based systems reported state-of-the-art results on standard coreference benchmarks when introduced, by jointly scoring candidate mention pairs.
Traditional NLP Methods Reference
Before neural networks dominated NLP, three core representation methods powered almost every text application. They remain useful for lightweight tasks, interpretable systems, and benchmarking. Understanding them explains why neural embeddings were such a dramatic improvement.
Bag of Words (BoW) represents a document as a vector of word counts, completely discarding word order. "The cat sat" and "sat cat the" produce identical BoW vectors. Works well for document classification and spam filtering because topic is often determined by which words appear, not their order. Critical weakness: "not good" and "good" look nearly identical.
TF-IDF improves on raw counts by weighting words by how rare they are across the corpus. A word that appears frequently in one document but rarely elsewhere is a likely topic word. "The" appears in every document — TF-IDF assigns it near-zero weight. "Convolutional" in an AI paper is rare and gets high weight. Still discards order, but much more informative than raw counts.
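To make the two representations concrete, here is a small sketch using only the standard library. The three-document corpus is invented for illustration, and the weighting uses the plain `tf × log(N / df)` form; real libraries usually add smoothing and normalisation on top of this.

```python
import math
from collections import Counter

# Invented toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the convolutional network classified the cat",
]


def bag_of_words(text: str) -> Counter:
    """Word counts with all order information discarded."""
    return Counter(text.lower().split())


def tf_idf(corpus: list[str]) -> list[dict[str, float]]:
    """Unsmoothed TF-IDF: term frequency times log(N / document frequency)."""
    tokenised = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Word order is invisible to BoW.
print(bag_of_words("The cat sat") == bag_of_words("sat cat the"))  # True

weights = tf_idf(docs)
# "the" occurs in every document, so its weight is exactly zero;
# "convolutional" occurs in one document and gets the largest weight there.
print(weights[2]["the"], weights[2]["convolutional"] > weights[2]["cat"])
```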
n-grams partially recover word order by treating sequences of n adjacent words as features. Bigrams of "the cat sat": ["the cat", "cat sat"]. This captures local context, but the number of possible features grows exponentially with n (up to V^n for a vocabulary of size V). Neural language models and embeddings, from Word2Vec onwards, have since displaced n-gram models in most applications.
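Extracting n-grams is a one-line sliding window. The sketch below (the function name `ngrams` is just illustrative) also shows why bigrams help with negation, where plain word counts struggle.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """All runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("the cat sat".split(), 2))      # ['the cat', 'cat sat']

# With bigrams, "not good" becomes a feature of its own,
# so it no longer looks the same as a bare "good".
print(ngrams("this is not good".split(), 2))  # ['this is', 'is not', 'not good']
```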
| Method | Captures Order | Vector Size | Sparse? | Semantic Meaning | Best For |
|---|---|---|---|---|---|
| Bag of Words | No | V (vocab size) | Yes — mostly 0s | No | Doc classification, spam |
| TF-IDF | No | V (vocab size) | Yes | No | Search, document similarity |
| n-grams | Local only | V^n (explodes!) | Very sparse | No | Language models (pre-neural) |
| Word2Vec | No (fixed window) | d (e.g. 300) | No — dense | Yes | Semantic similarity, analogies |
NLP Task Taxonomy Core
NLP encompasses a wide range of tasks grouped by the type of output they produce. Understanding this taxonomy helps you choose the right architecture (encoder-only, decoder-only, encoder-decoder), the right loss function, and the right evaluation metric for any given problem.
| Task | Input | Output | Key Metric | Modern Model |
|---|---|---|---|---|
| Sentiment Analysis | Review text | Positive / Negative / Neutral | Accuracy, F1 | BERT, RoBERTa |
| Named Entity Recognition | Sentence | Token-level labels (B-I-O) | F1 per entity type | BERT-CRF, SpanBERT |
| Machine Translation | Source language text | Target language text | BLEU score | T5, NLLB-200, GPT-4 |
| Summarisation | Long document | Short summary | ROUGE score | BART, Pegasus, GPT-4 |
| Question Answering | Context + question | Answer span / free text | Exact Match, F1 | GPT-4, Claude, Llama |
∑ Chapter 5.1 Summary — NLP Fundamentals & Text Preprocessing
- Language has four layers of ambiguity: lexical, syntactic, semantic, pragmatic — all must be handled, each building on the one below
- NLP history: rule-based (1950s–1980s) → statistical (1990s–2000s) → neural embeddings (2013–2017) → Transformer LLMs dominant from 2018
- Classical preprocessing: lowercase → tokenise → remove stopwords → lemmatise; largely skipped for neural models (BERT, GPT), which apply their own subword tokeniser to near-raw text
- Stemming is fast but imprecise (heuristic suffix rules); lemmatisation is accurate but requires WordNet + POS context
- BoW and TF-IDF: sparse, order-independent representations — still useful for search, lightweight classification, interpretable systems
- Key linguistic annotations: POS tagging, NER, dependency parsing, coreference resolution — used in classical and hybrid NLP pipelines
- NLP splits into: understanding tasks (classification, extraction) and generation tasks — different architectures, loss functions, and metrics
Tokenisation is the invisible foundation of every language model. Before a single parameter is trained, the tokeniser decides how text will be represented as integers — and that decision shapes what patterns the model can learn, how efficiently it processes different languages, and how much it costs to run at inference time.
Why Tokenise? Core
Neural language models operate on numbers, not text. Every word, character, or subword must be mapped to an integer ID from a fixed vocabulary before it can be fed into the model. Tokenisation is this mapping — it converts a raw string into a sequence of integers, each representing a "token" from the vocabulary. The choice of what constitutes a token has profound consequences for the model's capabilities.
The vocabulary dilemma has three corners. Word-level tokenisation uses whole words as tokens. It is intuitive, but English has 170,000+ words, and with proper nouns, compounds, and morphological variants the vocabulary explodes into millions. Words not seen during training become [UNK] (unknown): the out-of-vocabulary (OOV) problem. Character-level tokenisation uses individual characters. The vocabulary is tiny (~128 characters for ASCII), but sequences become very long: "Hello world" is 11 characters, and a document of 1,000 words becomes ~6,000 characters. Attention's O(n²) complexity makes this expensive. Subword tokenisation is the sweet spot: keep common words as whole tokens and split rare or unknown words into subword pieces. "unhappiness" → ["un", "##happi", "##ness"] uses only known pieces, so there is no OOV and sequence length stays reasonable.
Modern LLMs almost universally use subword tokenisation, with vocabularies ranging from about 32,000 to well over 100,000 tokens. GPT-4 uses 100,277 tokens; LLaMA-3 uses 128,256. The vocabulary is fixed at training time and cannot be changed without retraining the model, which makes the tokeniser one of the most consequential design decisions in LLM development.
Word & Character Tokenisers Core
Understanding why pure word and character tokenisers were abandoned helps clarify the design goals of modern subword tokenisers. Both extremes have fundamental problems that subword methods resolve.
Word tokenisation splits on whitespace and punctuation. Problems pile up quickly. Contractions: "don't" — is that one token or two ("do", "n't")? Hyphenated compounds: "state-of-the-art" — one or four? Morphological variants: "run", "running", "ran", "runs" require four separate vocabulary entries, even though they share meaning. Proper nouns, technical terms, and misspellings not seen during training become [UNK] — the model sees a blank where information should be. The English vocabulary alone exceeds 170,000 words; with all languages and domains, a truly universal word vocabulary would require millions of entries.
Character tokenisation has no OOV problem — the alphabet is fixed. But it fragments language into meaningless units from the model's perspective. The word "hello" becomes 5 separate tokens [h][e][l][l][o]. The model must learn from scratch that these 5 tokens together form a word unit — it cannot start with the useful prior that words are meaningful. More critically, sequence length explodes. A 1,000-word essay becomes ~6,000 character tokens. Transformer attention is O(n²) in sequence length — doubling the sequence length quadruples the compute cost. In practice, character-level models were impractical at scale.
Byte-level tokenisation tokenises raw UTF-8 bytes (0–255) rather than characters. Every document is representable — there is no OOV at the byte level. GPT-2 used byte-level BPE: start from 256 byte tokens, then apply BPE merges. This handles multilingual text naturally and is fully language-agnostic. GPT-4's tiktoken also uses byte-level BPE.
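The starting point of byte-level tokenisation, before any merges are learned, can be shown with Python's built-in UTF-8 encoding. The helper names are illustrative only.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Base byte-level tokens: one ID in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids: list[int]) -> str:
    """Inverse mapping; lossless for any valid UTF-8 byte sequence."""
    return bytes(ids).decode("utf-8")


for text in ["hello", "你好", "café"]:
    ids = to_byte_tokens(text)
    print(f"{text!r}: {len(text)} characters -> {len(ids)} byte tokens")

# Nothing is ever out of vocabulary, but non-ASCII text costs more base tokens:
# each of the two Chinese characters above occupies three bytes.
```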
Byte-Pair Encoding (BPE) In-depth
Byte-Pair Encoding began as a data compression algorithm and was adapted for subword vocabulary construction in NLP by Sennrich et al. (2016). It is used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Falcon, and most modern decoder-based language models. The key insight is elegant: let the training data decide what the vocabulary tokens should be, by iteratively merging the most frequent adjacent pair of tokens.
The BPE algorithm is a simple loop. Initialise with a character vocabulary (or byte vocabulary for byte-level BPE). Count all adjacent token pairs across the entire training corpus. Merge the most frequent pair into a new single token and add it to the vocabulary. Repeat until the target vocabulary size is reached. The result: common words like "the", "and", "is" become single tokens; common morphological patterns like "-ing", "-tion", "un-" become tokens; rare words are split into recognisable subword pieces.
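The loop described above fits in a few dozen lines. The following is a character-level sketch on an invented word-frequency table, not the implementation used by any particular model; production tokenisers add pre-tokenisation, byte fallback, and much faster data structures. Ties between equally frequent pairs are broken by first occurrence here, which is an arbitrary choice.

```python
from collections import Counter


def get_pair_counts(words: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs: dict[str, int], num_merges: int):
    """Learn up to `num_merges` merges from a word -> frequency table."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Invented toy corpus statistics.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, num_merges=4)
print(merges)     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segmented)
```

After four merges the frequent word "low" and the shared suffix "est" have become single tokens, while rarer material such as "wid" is still spelled out character by character.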
WordPiece Core
WordPiece is the tokenisation algorithm used by BERT, DistilBERT, and multilingual BERT (mBERT). It is mechanically similar to BPE but uses a different merge criterion: rather than merging the most frequent pair, it merges the pair that most increases the likelihood of the corpus under a language model. In practice, the merge score is: score(A, B) = freq(A+B) / (freq(A) × freq(B)). This prefers pairs whose joint occurrence is disproportionately high relative to how often each part appears alone, so it tends to capture meaningful linguistic units rather than just common bigrams.
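The difference between the two merge criteria is easiest to see with numbers. The counts below are hypothetical, chosen so that one pair is more frequent in absolute terms while the other is more strongly associated.

```python
def wordpiece_score(pair_freq: int, left_freq: int, right_freq: int) -> float:
    """score(A, B) = freq(A+B) / (freq(A) * freq(B))."""
    return pair_freq / (left_freq * right_freq)


# Hypothetical counts: (freq of pair, freq of left piece, freq of right piece).
candidates = {
    ("t", "##h"): (900, 5000, 4000),  # very frequent pair, but both parts are common
    ("q", "##u"): (300, 310, 2000),   # rarer pair, but "q" almost always precedes "u"
}

by_frequency = max(candidates, key=lambda p: candidates[p][0])
by_score = max(candidates, key=lambda p: wordpiece_score(*candidates[p]))

print(by_frequency)  # ('t', '##h')  <- what plain BPE would merge first
print(by_score)      # ('q', '##u')  <- what the WordPiece score prefers
```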
WordPiece uses a distinctive notation: a ## prefix marks a continuation subword — a piece that is attached to the preceding token rather than starting a new word. "playing" → ["play", "##ing"]. The "play" token has no prefix (it starts a word); "##ing" is always a suffix, never a standalone word. This makes the tokenisation reversible and unambiguous: joining tokens without spaces and stripping ## gives back the original word.
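The ## convention can be sketched with a greedy longest-match-first segmenter, which is the strategy BERT-style WordPiece tokenisers apply at inference time. The five-entry vocabulary is invented for the example, and the function names are illustrative.

```python
def wordpiece_tokenise(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens


def wordpiece_join(tokens: list[str]) -> str:
    """Rebuild the word by stripping ## and concatenating."""
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)


vocab = {"play", "##ing", "un", "##happi", "##ness"}  # invented toy vocabulary
print(wordpiece_tokenise("playing", vocab))      # ['play', '##ing']
print(wordpiece_tokenise("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_join(["play", "##ing"]))         # playing
```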
The standard BERT vocabulary contains 30,522 tokens. Unknown characters that cannot be represented by any combination of vocabulary tokens are mapped to [UNK] — rare for English but can occur with unusual Unicode characters. The vocabulary also includes special tokens: [CLS] (classification, prepended to all inputs), [SEP] (separator, marks sentence boundaries), [MASK] (for masked language modelling), and [PAD] (padding to fixed length).
SentencePiece Core
SentencePiece (Kudo & Richardson, 2018) is a language-agnostic tokenisation framework used by T5, LLaMA-1/2, Mistral, Gemma, XLNet, and others (LLaMA-3 moved to a tiktoken-style byte-level BPE). Its key architectural difference from BPE and WordPiece is that it operates on raw text including whitespace, without any pre-tokenisation step. BPE and WordPiece typically split on whitespace first (which bakes in the language-specific assumption that spaces separate words), then apply subword segmentation within each word. SentencePiece treats spaces as regular characters: the text "Hello world" is tokenised as a single stream of characters that includes the space.
To make tokenisation reversible, SentencePiece uses a special ▁ (U+2581, lower one-eighth block) character to mark word boundaries. The space before a word is encoded as ▁: "Hello world" → ["▁Hello", "▁world"]. Decoding is trivial: replace ▁ with a space, concatenate. This approach means the same tokeniser works identically for languages with no spaces (Japanese, Chinese) and languages with regular spacing (English, French) — making it the preferred choice for multilingual models.
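The ▁ round trip can be illustrated without the SentencePiece library itself. In this sketch the segmentation into pieces is supplied by hand as a stand-in for a trained model; only the space-marking and decoding steps are shown, and both helper names are illustrative.

```python
MARK = "\u2581"  # "▁", the SentencePiece word-boundary marker


def mark_spaces(text: str) -> str:
    """Turn spaces into visible markers, with one marker in front of the text."""
    return MARK + text.replace(" ", MARK)


def decode(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    text = "".join(pieces).replace(MARK, " ")
    return text[1:] if text.startswith(" ") else text


print(mark_spaces("Hello world"))  # ▁Hello▁world

# Any segmentation of the marked string decodes to the same text,
# whether pieces align with words or cut through them.
print(decode(["▁Hello", "▁world"]))      # Hello world
print(decode(["▁Hel", "lo", "▁world"]))  # Hello world
```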
SentencePiece supports two underlying algorithms. BPE mode is the same bottom-up merge algorithm as before. Unigram Language Model mode takes the opposite approach: start with a large vocabulary (e.g. all substrings up to length 16), then iteratively remove tokens whose removal least decreases corpus likelihood, until the target size is reached. Unigram produces multiple possible segmentations of a word and assigns probabilities to them — during training, samples are drawn from the distribution, providing a natural form of tokenisation regularisation.
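The segmentation side of the Unigram model can be sketched as a small dynamic program: given a probability for each vocabulary piece, find the split of a word with the highest total log-probability. The probabilities below are invented, and the sketch covers only the best-path search, not the vocabulary pruning or the sampling used for regularisation.

```python
import math


def best_segmentation(word: str, logp: dict[str, float]):
    """Viterbi search for the most probable split of `word` into known pieces."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Invented piece probabilities: three subwords plus single-character fallbacks.
probs = {"un": 0.10, "happi": 0.05, "ness": 0.08}
probs.update({ch: 0.01 for ch in "unhapies"})
logp = {piece: math.log(p) for piece, p in probs.items()}

print(best_segmentation("unhappiness", logp))  # ['un', 'happi', 'ness']
```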
| Tokeniser | Algorithm | Vocab Marker | Vocab Size | Used By | Language Agnostic |
|---|---|---|---|---|---|
| BPE (byte-level) | Bottom-up merge (byte pairs) | None | 50K–128K | GPT-2, GPT-3, GPT-4, LLaMA-3 | Yes (bytes) |
| WordPiece | Likelihood-based merge | ## (continuation) | 30K | BERT, DistilBERT | Partial |
| SentencePiece BPE | Bottom-up, raw text | ▁ (word start) | 32K | LLaMA-1/2, Mistral | Yes |
| SentencePiece Unigram | Top-down pruning | ▁ (word start) | 30K–32K | ALBERT, XLNet, T5 | Yes |
| tiktoken | BPE on bytes | None | 100K (cl100k_base), 200K (o200k_base) | GPT-3.5/GPT-4 (cl100k_base), GPT-4o (o200k_base) | Yes |
tiktoken & Practical Tokenisation In-depth
OpenAI's tiktoken is a fast BPE tokeniser library for OpenAI's GPT models. Its encodings include r50k_base (GPT-2/3, 50,257 tokens), p50k_base (Codex), cl100k_base (GPT-3.5-turbo and GPT-4, 100,277 tokens), and o200k_base (GPT-4o, roughly 200,000 tokens). The cl100k vocabulary was designed to handle code and multilingual text more efficiently than its predecessors: common programming keywords and runs of indentation are often single tokens.
The rough practical rule is ¾ of a word per token for English text: 1,000 tokens ≈ 750 English words ≈ 4–5 average paragraphs. This ratio degrades significantly for non-English text. Chinese and Japanese characters typically cost 1–3 tokens each, because every character occupies several UTF-8 bytes and only the more common ones have been merged into single tokens. Arabic script runs 2–3 tokens per word. Code is token-efficient for English keywords, but indentation and special characters add tokens. Understanding these ratios is essential for prompt engineering and cost estimation at scale.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokeniser

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    # Token quirks every practitioner should know
    tests = [
        ("hello world", "lowercase"),
        ("Hello World", "capitalised: same count but different IDs"),
        (" hello world", "leading space can differ"),
        ("1234567890", "numbers split by digit groups"),
        ("你好世界", "Chinese: often more than one token per character"),
        ("    def foo():", "indented code: spaces are tokens!"),
    ]
    for text, note in tests:
        toks = enc.encode(text)
        print(f"{text!r:30s} → {len(toks):3d} tokens  # {note}")

    # Cost estimation (approximate, prices change frequently)
    def estimate_cost(input_tokens: int, output_tokens: int, model="gpt-4o"):
        rates = {"gpt-4o": (0.005, 0.015)}  # ($/1K input, $/1K output)
        r_in, r_out = rates.get(model, (0.01, 0.03))
        return input_tokens / 1000 * r_in + output_tokens / 1000 * r_out

    print(f"1K in + 500 out: ${estimate_cost(1000, 500):.4f}")  # ~$0.0125

Token Counting & Context Windows In-depth
The context window is the maximum number of tokens a model can process in a single forward pass; both the input prompt and the generated output count against this limit. Exceeding it either makes the request fail or, in pipelines that truncate automatically, silently drops part of the input. Context window sizes have grown dramatically: from GPT-3's 2,048 tokens (2020) to GPT-4o's 128,000 (2024), Claude 3.5's 200,000, and Gemini 1.5 Pro's 1,000,000. However, a larger context window does not mean the model uses all of its context equally well. Empirically, LLMs make better use of information near the beginning and end of long contexts than of information in the middle (the "lost in the middle" effect).
Token counting matters for three practical reasons. Cost: commercial LLM APIs price per token, and $5–$30 per million tokens for frontier models adds up quickly across millions of API calls. Context management: overflow either fails the request or truncates your prompt, and silent truncation is a bug that is easy to miss and hard to debug. Latency: generation time grows roughly in proportion to output tokens, so every unnecessary token in the response costs inference time and money. Practitioners routinely count tokens in prompt templates, conversation histories, and retrieved documents before sending API requests.
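A pre-flight budget check of the kind described here might look like the sketch below. Everything in it is illustrative: `approx_tokens` applies the ¾-word-per-token rule of thumb as a stand-in for a real tokeniser, and `trim_history` keeps the most recent messages that fit, which is only one of several reasonable policies.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate from the 3/4-word-per-token rule (rounded up)."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Prompt and reserved output must fit in the window together."""
    return prompt_tokens + max_output_tokens <= context_window


def trim_history(messages: list[str], budget: int, count=approx_tokens) -> list[str]:
    """Keep the newest messages whose combined token count stays within budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return kept[::-1]


history = [
    "one two three",                    # ~4 tokens
    "four five six seven eight nine",   # ~8 tokens
    "ten eleven twelve",                # ~4 tokens
]
print(trim_history(history, budget=12))        # the two most recent messages
print(fits_context(120_000, 4_096, 128_000))   # True
print(fits_context(126_000, 4_096, 128_000))   # False
```

Swapping `count` for an exact tokeniser-based counter makes the same logic usable against a real model limit.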
1. Leading spaces change token IDs. " hello" and "hello" produce different token IDs in tiktoken, which is relevant for prompt formatting.
2. Numbers split unexpectedly. "GPT-4" → ["G", "PT", "-", "4"]; phone numbers, dates, and prices consume far more tokens than you'd expect.
3. Non-English is expensive. A Chinese prompt of 100 characters may cost 100–300 tokens, often well above the count for the equivalent English text.
4. Markdown inflates count. Headers, bold markers, code fences, and bullet points all consume tokens. Strip unnecessary formatting from retrieved context before sending.
5. Chat format overhead. OpenAI's chat completions API adds ~4 tokens per message for role/structure overhead, which matters for high-frequency, fine-grained API calls.
∑ Chapter 5.2 Summary — Tokenisation
- Tokenisation maps raw text to integer IDs from a fixed vocabulary (typically 32K to 100K+ tokens) before it can enter a neural model
- Subword is the sweet spot: no OOV, balanced sequence length, handles morphology — all modern LLMs use it
- BPE: iteratively merge the most frequent adjacent token pair — used by GPT-2, GPT-3, GPT-4, LLaMA; byte-level BPE is fully language-agnostic
- WordPiece: ## marks continuations; likelihood-based merges; vocabulary = 30K — used by BERT, DistilBERT, mBERT
- SentencePiece: operates on raw text including spaces; ▁ marks word starts; supports BPE and Unigram LM; used by T5, LLaMA-1/2, Gemma
- Practical rule: ~¾ word per token for English; non-English text and numbers often use several times more tokens, which matters for cost and context management
Word embeddings did not just improve NLP performance; they changed how we think about language. When Mikolov et al. showed in 2013 that "king − man + woman ≈ queen" held in a 300-dimensional vector space, it suggested that semantic relationships could be captured as geometric transformations. It was striking early evidence that neural representations were not just feature maps but were encoding structured knowledge about the world.
The Distributional Hypothesis Core
The theoretical foundation of all word embedding methods is the Distributional Hypothesis, usually traced to Zellig Harris (1954) and best known through linguist J.R. Firth's 1957 formulation: "You shall know a word by the company it keeps." The idea is deceptively simple: words that appear in similar linguistic contexts tend to have similar meanings. "Dog" and "cat" both appear near words like "pet", "feed", "vet", "fur", "owner", "breed", and this co-occurrence pattern reflects their shared semantic category. "Dog" and "quantum" do not share contexts, and they do not share meaning.
This hypothesis transforms the problem of meaning into a problem of statistics. Instead of defining what "happy" means philosophically, we can simply observe that "happy" appears with "smile", "joy", "content", "pleased", "glad" — and "sad" appears with "cry", "grief", "unhappy", "depressed" — and that these two distributional profiles are measurably different. The distributional hypothesis gives us a way to measure semantic similarity without any human annotation: compute the similarity between two words' context distributions.
Every word embedding method — Word2Vec, GloVe, FastText, and even the contextual embeddings of BERT — is an implementation of this hypothesis. They differ in how they model context (local window vs global matrix, character-level vs word-level, static vs contextual), but they all share the core insight: context distribution = meaning.
The Key Insight
You do not need to define what words mean. Observe where they appear, and the geometry of the embedding space will capture the rest. No hand-crafted ontologies, no linguistic rules — just patterns in text.
Context Window
Most embedding methods use a fixed window of ±k surrounding words as "context". Window size k=5 means the 5 words before and after each target word. Larger k → more topical similarity. Smaller k → more syntactic similarity.
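The hypothesis can be tested directly on a toy corpus: collect each word's context counts within a ±k window and compare words by the cosine similarity of those count vectors. The three sentences are invented so that "dog" and "cat" share contexts while "quantum" does not; the helper names are illustrative.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences: list[str], k: int = 2) -> dict[str, Counter]:
    """For every word, count the words seen within k positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented toy corpus.
corpus = [
    "feed the dog at the vet",
    "feed the cat at the vet",
    "quantum physics describes particles",
]
vecs = context_vectors(corpus, k=2)
print(cosine(vecs["dog"], vecs["cat"]))      # close to 1: identical contexts
print(cosine(vecs["dog"], vecs["quantum"]))  # 0.0: no shared contexts
```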
Word2Vec In-depth
Mikolov et al. (Google Brain, 2013) introduced Word2Vec — a family of shallow neural networks that learn word representations by predicting word context. The key insight was framing representation learning as a self-supervised prediction task: given a word, predict its surrounding words. No labels are needed — the text itself provides the training signal. Train on enough text (Google News, 100 billion words) and the resulting vectors encode semantic structure as geometry.
The architecture is deliberately simple: a single-layer neural network with no non-linearity in the hidden layer. The input is a one-hot vector of vocabulary size V. The single hidden layer projects this to a dense vector of dimension d (typically 300). The output layer projects back to V dimensions and applies softmax to produce a probability distribution over the vocabulary. The weight matrix of the hidden layer — shape V × d — is the embedding matrix. After training, each row is the embedding vector for one word.
Word2Vec uses two architectural variants: Skip-gram and CBOW (Continuous Bag of Words). Skip-gram predicts context words from a centre word and works better on small datasets and rare words. CBOW predicts the centre word from its context and trains faster on large corpora. Both are trained with a practical approximation — negative sampling — rather than full softmax over the entire vocabulary (computing softmax over 50,000+ words every step is prohibitively expensive).
With negative sampling, the objective becomes: for each training pair (centre, context), maximise the probability of the true pair while minimising the probability of k randomly sampled negative pairs. This reduces the per-step computation from O(V) to O(k), where k is typically 5–20. The result is a practical algorithm that can be trained on billions of words in hours on a single machine.
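Written out for a single training pair, the negative-sampling objective is a sum of k + 1 logistic losses. The sketch below uses plain lists as vectors and hand-picked values; it computes only the loss, not the gradient updates, and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(centre, context, negatives) -> float:
    """-log sigma(u_ctx . v_c) - sum over negatives of log sigma(-u_neg . v_c)."""
    loss = -math.log(sigmoid(dot(centre, context)))    # pull the true pair together
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(centre, neg)))   # push sampled pairs apart
    return loss


centre = [1.0, 0.0]
negatives = [[-1.0, 0.0]]  # k = 1 sampled negative, for brevity

aligned = negative_sampling_loss(centre, [2.0, 0.0], negatives)
opposed = negative_sampling_loss(centre, [-2.0, 0.0], negatives)
print(aligned < opposed)  # True: the loss rewards a high score for the true pair
```

Each step touches only the centre vector, one context vector, and k negative vectors, which is where the O(k) cost per step comes from.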
Skip-gram & CBOW Architectures In-depth
Given the sentence "The quick brown fox jumps over the lazy dog" with window size 2: Skip-gram takes the centre word "brown" and tries to predict each context word — ("brown", "quick"), ("brown", "The"), ("brown", "fox"), ("brown", "jumps"). One training pair for each context word in the window. CBOW takes all context words ["The", "quick", "fox", "jumps"] and averages their embeddings, then tries to predict the centre word "brown". CBOW is faster (averages context, one prediction per window); Skip-gram trains on more pairs and handles rare words better.
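The two ways of slicing a sentence into training examples can be generated mechanically. This sketch reproduces the "brown" example above; the function names are illustrative.

```python
def window_indices(i: int, length: int, window: int) -> list[int]:
    """Positions within `window` of position i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]


def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """One (centre, context) pair per context word."""
    return [
        (centre, tokens[j])
        for i, centre in enumerate(tokens)
        for j in window_indices(i, len(tokens), window)
    ]


def cbow_examples(tokens: list[str], window: int) -> list[tuple[list[str], str]]:
    """One (context words, centre) example per position."""
    return [
        ([tokens[j] for j in window_indices(i, len(tokens), window)], centre)
        for i, centre in enumerate(tokens)
    ]


tokens = "The quick brown fox jumps over the lazy dog".split()

print([p for p in skipgram_pairs(tokens, 2) if p[0] == "brown"])
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]

print(cbow_examples(tokens, 2)[2])
# (['The', 'quick', 'fox', 'jumps'], 'brown')

print(len(skipgram_pairs(tokens, 2)), len(cbow_examples(tokens, 2)))  # 30 9
```

The final line shows the trade-off in numbers: for this nine-word sentence Skip-gram produces 30 training pairs while CBOW produces 9 examples.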
GloVe — Global Vectors Core
Pennington, Socher, and Manning (Stanford NLP, 2014) pointed out a conceptual limitation of Word2Vec: it trains on individual context windows, one at a time, effectively ignoring the global statistics of word co-occurrence across the entire corpus. If "ice" and "steam" both co-occur with "water" but "ice" co-occurs with "solid" and "steam" with "gas", this distinction should be captured — but Word2Vec only sees local windows, not the global ratio structure.
GloVe directly factorises the global word–word co-occurrence matrix X, where Xᵢⱼ is the count of how many times word j appears in the context of word i across the entire corpus. The objective is to find word vectors such that their dot product (plus a bias term per word) approximates the log of the co-occurrence count. The weighting function f(Xᵢⱼ) ensures that very frequent pairs (like "the–the") don't dominate the loss: the weight grows with Xᵢⱼ but is capped at 1 once the count passes a threshold, and pairs that never co-occur (Xᵢⱼ = 0) get zero weight.
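The weighting function and the per-pair loss term are short enough to write out. The defaults `x_max = 100` and `alpha = 0.75` are the values reported in the GloVe paper; the vectors and counts in the example are made up, and the sketch evaluates a single term rather than training anything.

```python
import math


def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """f(x) = (x / x_max)^alpha below the threshold, capped at 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i: float, b_j: float, x_ij: float) -> float:
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 for one co-occurring pair."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2


print(glove_weight(10))     # below the cap: about 0.18
print(glove_weight(5000))   # capped: 1.0

# The loss vanishes when the score matches log X_ij exactly
# and grows when the vectors are wrong about the co-occurrence count.
print(glove_pair_loss([1.0, 0.0], [math.log(50), 0.0], 0.0, 0.0, 50))
print(glove_pair_loss([1.0, 0.0], [0.0, 0.0], 0.0, 0.0, 50))
```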
In practice, GloVe and Word2Vec produce embeddings of similar quality. GloVe often edges ahead on analogy tasks (because it explicitly models co-occurrence ratios); Word2Vec can be more efficient to train on very large corpora with negative sampling. Both have been largely superseded for downstream tasks by contextual embeddings (BERT, GPT), but GloVe remains popular as a lightweight baseline and for interpretability research.
FastText — Subword Embeddings Core
Bojanowski et al. (Facebook AI Research, 2016) identified a critical gap in both Word2Vec and GloVe: they treat words as atomic units. "Run", "running", "runner", "runs" each get their own independent vector — the morphological relationship between them is invisible to the model. For languages with rich morphology (Finnish, Turkish, German, Arabic), this is devastating: a model may never see the exact form "Freundschaftsbezeigungen" (German for "demonstrations of friendship") but it shares meaningful subwords with common words.
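FastText represents each word as a bag of character n-grams. For "where" with n=3, the word is decomposed as ["<wh", "whe", "her", "ere", "re>"] plus the whole-word sequence "<where>" (with boundary markers < and >); in practice n-grams of several lengths, typically 3 to 6, are used. The final word vector is the sum of all its n-gram vectors. Each n-gram has its own embedding, and these are what get trained. The boundary markers distinguish prefixes and suffixes from word-internal sequences: the trigram "her" inside "where" is a different unit from the whole word "<her>", while an n-gram such as "whe" is shared between "where" and "elsewhere", which is how related words end up with related vectors.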
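The decomposition itself is simple string slicing. The sketch below extracts the n-grams only (no embeddings are trained or summed), and the function name and defaults are illustrative.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 3) -> list[str]:
    """Character n-grams of <word>, plus the whole wrapped word as its own unit."""
    wrapped = f"<{word}>"
    grams = [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    if wrapped not in grams:
        grams.append(wrapped)
    return grams


print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

# Related words overlap in their n-grams, so they share trained parameters.
shared = set(char_ngrams("where")) & set(char_ngrams("elsewhere"))
print(sorted(shared))  # ['ere', 'her', 're>', 'whe']

# The boundary markers keep the word "her" apart from the trigram inside "where".
print(char_ngrams("her"))  # ['<he', 'her', 'er>', '<her>']
```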
The payoff: FastText can produce meaningful vectors for words never seen during training, including misspellings, technical jargon, and morphological variants. "Antidisestablishmentarianism" will share n-grams with "establish", "establishment", "disestablish", "ism", etc., and their combined embedding will be semantically meaningful. FastText is still used in production where domain vocabulary is highly variable — scientific text, social media, multilingual pipelines.
Vector Arithmetic & Analogies In-depth
The most celebrated discovery in word embeddings is that semantic relationships are encoded as consistent directions in vector space. The vector from "man" to "woman" is approximately the same as the vector from "king" to "queen", from "uncle" to "aunt", from "actor" to "actress". This "gender direction" is a consistent geometric transformation across the embedding space. Similarly, there is a "capital city direction" (France→Paris ≈ Germany→Berlin ≈ Japan→Tokyo), a "superlative direction" (big→biggest ≈ small→smallest), and a "past tense direction" (run→ran ≈ walk→walked).
This property emerged from training; it was not engineered in. It suggests that the distributional statistics of language contain enough signal to implicitly encode the relational structure of the world. The analogy task became a standard benchmark: given A:B::C:?, find D such that B−A+C ≈ D in vector space. Word2Vec achieves ~65% accuracy on the Google Analogy Dataset (roughly 19,500 analogy questions across semantic and syntactic categories), far above what was thought possible with shallow models.
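The analogy procedure is a nearest-neighbour search around B − A + C. The two-dimensional vectors below are hand-built (one axis for royalty, one for gender) purely to make the arithmetic visible; real embeddings have hundreds of dimensions and the answer is only approximately right.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a: str, b: str, c: str, vectors: dict[str, list[float]]) -> str:
    """Solve a : b :: c : ? by finding the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-built toy vectors: [royalty, gender].
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

print(analogy("man", "woman", "king", toy))   # queen
print(analogy("king", "queen", "man", toy))   # woman
```

Excluding the three input words from the candidates is standard practice in this benchmark; without it the nearest neighbour is often one of the inputs.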
Embedding Properties & Limitations Core
Word embeddings inherit — and amplify — the biases present in their training data. Bolukbasi et al. (2016) demonstrated that in Word2Vec trained on Google News: "doctor − nurse ≈ man − woman", "programmer − homemaker ≈ man − woman", "brilliant − dull ≈ man − woman". These gender stereotypes are encoded as geometric structure in the embedding space. When downstream models use these embeddings, the bias propagates: a résumé classifier using Word2Vec may discriminate based on field-specific vocabulary that encodes gender. Debiasing techniques exist (projecting out the gender direction) but are only partially effective — the bias is distributed across the space, not concentrated in one direction.
The most fundamental limitation of all static word embeddings is context independence: every word has a single vector regardless of usage. The word "bank" receives exactly the same 300-dimensional vector in "She went to the bank to withdraw money" and in "She sat on the river bank". The model blends the two senses of "bank" into one representation, losing the information needed to distinguish them. This polysemy problem is unresolvable within the static embedding framework, no matter how large the training corpus. It is the primary motivation for contextual embeddings: BERT, GPT, and their successors assign each word occurrence a different vector based on its surrounding context (Chapter 5.4).
Word2Vec, GloVe, and FastText give every word one vector for all contexts. "I deposited money at the bank" and "I fished at the river bank" produce the same "bank" embedding, which is effectively a weighted average of all senses. For tasks requiring word-sense disambiguation, coreference resolution, or semantic role labelling, static embeddings hit a ceiling that no amount of data or dimensions can overcome. This is why BERT (2018) was a watershed: building on earlier contextual models such as ELMo, it made position- and context-dependent representations the default, effectively making static embeddings obsolete for most NLP tasks.
| Method | Training Objective | Context | OOV | Typical Dim | Still Used? |
|---|---|---|---|---|---|
| Word2Vec | Predict context / centre | Local window | [UNK] | 100–300 | Baselines, feature eng |
| GloVe | Factorise co-occurrence matrix | Global corpus | [UNK] | 100–300 | NLP baselines |
| FastText | Subword n-gram sum | Local window | No OOV | 300 | Multilingual, rare vocab |
| BERT (contextual) | Masked language model | Full sentence | Subword | 768 | Yes — encoder tasks |
| GPT (contextual) | Causal language model | Causal window | Subword | 768–12288 | Yes — generation |
∑ Chapter 5.3 Summary — Word Embeddings
- Distributional hypothesis: words with similar contexts have similar meaning — foundation of all embedding methods (Firth, 1957)
- Word2Vec: two architectures trained by predicting word context — Skip-gram (centre → context, better for rare words) and CBOW (context → centre, faster)
- "king − man + woman ≈ queen" — semantic relationships are directions in geometry; emerged from training, not engineered
- GloVe: factorises the global co-occurrence matrix — explicitly models co-occurrence ratios across the entire corpus
- FastText: word = sum of character n-gram vectors — handles OOV, morphologically rich languages, and rare vocabulary
- Critical limitation: static embeddings give the same vector regardless of context — "bank" is identical in "river bank" and "bank account" — solved by BERT (Chapter 5.4)
Static word embeddings were a revolution — but they hit a ceiling. The same vector for "bank" in every sentence is a fundamental architectural limit, not a training data problem. The field needed representations that compute word meaning dynamically based on context. ELMo, ULMFiT, and then BERT answered that need — and in doing so, established the pre-train/fine-tune paradigm that defines modern NLP.
The Problem with Static Embeddings Core
Word2Vec, GloVe, and FastText assign each word type exactly one vector, shared across all its occurrences. This is adequate for words with a single dominant sense — "elephant" nearly always means the same thing. But English has thousands of polysemous words: "bank" (financial institution / river edge), "bat" (cricket equipment / flying mammal), "light" (not heavy / illumination / a lamp), "book" (a publication / to reserve), "well" (healthy / a water source / interjection). The Word2Vec vector for "bank" is a weighted average of all its senses — useful for neither.
The polysemy ceiling is not solvable by training on more data. No matter how large the corpus, a single vector must average all contexts. The architecture itself is the limitation: static embeddings compute representations before seeing the sentence. What is needed is a model that reads the full sentence, then assigns each word a representation based on its role in that specific sentence. This is exactly what contextual embedding models provide.
ELMo — Embeddings from Language Models Core
Peters et al. (Allen Institute for AI, 2018) introduced ELMo (Embeddings from Language Models), the first widely adopted contextual word embedding. The architecture is a two-layer bidirectional LSTM language model pre-trained on the 1 Billion Word Benchmark. It makes two passes through the sentence: a forward LM reads left to right and learns to predict the next word; a backward LM reads right to left and learns to predict the previous word. For each token, the forward and backward hidden states are concatenated within each layer, and the downstream task learns a weighted combination of the layers, producing a context-sensitive representation.
ELMo representations are used as frozen features — the ELMo model is not fine-tuned on downstream tasks. Instead, the pre-computed ELMo vectors are concatenated to the input of existing task-specific models (NER taggers, QA systems, coreference models). This "feature-based" approach produced large, consistent improvements across NLP benchmarks — the first empirical proof that language model pre-training transfers broadly. ELMo improved the state-of-the-art on 6 NLP tasks simultaneously, which was extraordinary at the time.
ELMo's key limitation: the underlying architecture is an LSTM, which must process a sequence one token at a time (O(n) sequential steps) and cannot parallelise across token positions. Training is slow, and representations are also computed sequentially at inference. The Transformer architecture (Chapter 5.5), which needs only O(1) sequential steps per layer because self-attention processes all positions in parallel, replaced the LSTM backbone in nearly every subsequent contextual embedding model.
The ULMFiT Pre-train / Fine-tune Paradigm In-depth
Howard & Ruder (2018) introduced ULMFiT (Universal Language Model Fine-Tuning) — the paper that established the three-stage paradigm now universal in NLP. Where ELMo froze the language model and used it as a feature extractor, ULMFiT's insight was that the language model itself should be fine-tuned end-to-end on downstream tasks. This shifts the mental model from "use LM features" to "adapt a pre-trained LM for each task". BERT and GPT made this paradigm dominant — but ULMFiT proved it worked first.
ULMFiT introduced several fine-tuning techniques that are now standard, including the following two. Discriminative fine-tuning: assign different learning rates to each layer, so that earlier layers (which capture general syntax and morphology) are updated very slowly while later layers (which capture task-specific semantics) are updated faster. Gradual unfreezing: start fine-tuning only the last layer, then progressively unfreeze earlier layers one at a time. This prevents catastrophic forgetting, the phenomenon where fine-tuning on a small task dataset destroys the broad language knowledge acquired during pre-training.
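Both techniques reduce to simple schedules, sketched below without any deep learning framework. The function names are hypothetical, and the per-layer decay factor of 2.6 is the value suggested in the ULMFiT paper; treat it as a starting point rather than a fixed rule.

```python
def discriminative_lrs(top_lr: float, num_layers: int,
                       decay: float = 2.6) -> list[float]:
    """Learning rate per layer; index 0 is the earliest layer.

    Each layer is trained `decay` times more slowly than the layer above it.
    """
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]


def unfreezing_schedule(num_layers: int) -> list[list[int]]:
    """Trainable layer indices per epoch: last layer first, one more each epoch."""
    return [list(range(num_layers - 1 - epoch, num_layers))
            for epoch in range(num_layers)]


for layer, lr in enumerate(discriminative_lrs(top_lr=1e-3, num_layers=4)):
    print(f"layer {layer}: lr = {lr:.6f}")

for epoch, layers in enumerate(unfreezing_schedule(4)):
    print(f"epoch {epoch}: train layers {layers}")
```

In a real training loop these values would typically be passed to the optimiser as one parameter group per layer, with frozen layers excluded from gradient updates.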
The three stages generalise directly to all modern pre-trained language models. Stage 1 (pre-training) is expensive but done once and shared via model hubs. Stage 2 (domain adaptation) is optional but valuable for specialised domains (biomedical, legal, code). Stage 3 (task fine-tuning) is cheap — hours on a single GPU with hundreds to thousands of labelled examples, compared to millions required to train from scratch. This cost asymmetry is the fundamental economic argument for the transformer pre-training paradigm.
Self-Supervised Pre-training Tasks In-depth
Causal Language Modelling (CLM) — used by GPT, GPT-2, GPT-3, LLaMA: predict the next token given all previous tokens. Objective: maximise P(xₜ | x₁,...,xₜ₋₁). The model never sees future tokens during training — it processes left-to-right with a causal (triangular) attention mask. This makes it naturally suited to generation: at inference, repeatedly predict the next token and append it to the sequence.
Masked Language Modelling (MLM) — used by BERT, RoBERTa, DeBERTa: randomly mask 15% of input tokens (replacing them with [MASK]), then predict the original token using the full surrounding context. Objective: maximise P(masked | all other tokens). Because both left and right context is available simultaneously, the model builds bidirectional representations — excellent for understanding tasks (classification, NER, QA) but not for generation.
Next Sentence Prediction (NSP) was used in original BERT alongside MLM: given two text segments, predict whether they appear consecutively in the source document. Later analysis (RoBERTa, 2019) showed NSP adds little benefit and can hurt performance by forcing artificially short segments, so it was removed in subsequent models.
Replaced Token Detection (RTD), used by ELECTRA: a small generator network creates plausible but fake token replacements; the main discriminator must identify which tokens were replaced. Every token gets a training signal (not just the 15% selected in MLM), which the ELECTRA authors report makes pre-training roughly 4× more efficient for the same computational budget.
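The replaced-token-detection target is easy to state in code. The sketch below only builds the discriminator's labels from an original and a corrupted sequence; the generator that proposes replacements is omitted, and the function name `rtd_labels` is hypothetical.

```python
def rtd_labels(original: list[str], corrupted: list[str]) -> list[int]:
    """Label each position: 1 if the token was replaced, 0 if it is original."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator swapped one token

labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position carries a label, which is the source of the efficiency gain: the discriminator learns from all five tokens here, whereas an MLM objective would learn only from the selected positions.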
Sentence Embeddings In-depth
Word-level contextual embeddings (one vector per token) are essential for token-level tasks like NER and POS tagging — but many applications require a single vector for an entire sentence, paragraph, or document. How do you pool a variable-length sequence of token vectors into one fixed-size representation? Three approaches have been widely used.
[CLS] token pooling (BERT's approach): prepend a special [CLS] (classification) token to every input. The Transformer processes all tokens together with full self-attention. In theory, the [CLS] token's output representation aggregates information from the entire sequence. In practice, this works well after fine-tuning on a specific task — but out-of-the-box BERT [CLS] vectors perform poorly on semantic similarity benchmarks, because BERT was not trained to produce meaningful sentence-level representations in [CLS].
Mean pooling: average all token embeddings in the final layer. Surprisingly effective as a baseline, it often outperforms [CLS] pooling on zero-shot semantic similarity without fine-tuning. It is simple to implement and parameter-free.
Sentence-BERT (SBERT, Reimers & Gurevych, 2019) improves on both approaches by fine-tuning BERT with a siamese / triplet network objective on sentence pairs, training the model to produce similar vectors for semantically similar sentences. SBERT dramatically outperforms naive pooling on STS benchmarks and makes large-scale similarity search practical: each sentence is encoded once and compared by cosine similarity, whereas vanilla BERT needs a separate forward pass for every pair. The authors report that finding the most similar pair among 10,000 sentences drops from about 65 hours with BERT to about 5 seconds with SBERT.
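The mean pooling baseline described above is worth seeing with padding taken into account, since batched inputs are padded to a common length and padding vectors should not dilute the average. This is a minimal pure-Python sketch; the function name `mean_pool` and the toy numbers are illustrative only.

```python
def mean_pool(token_vectors: list[list[float]],
              attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask is 1; padding positions are ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```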
Embedding Models & Semantic Search Core
Sentence embeddings power semantic search — retrieving documents by meaning rather than keyword overlap. A query like "What city is France's capital?" should retrieve "Paris is the seat of French government" even though it shares no keywords. The approach: embed all documents once into a vector index; at query time, embed the query and retrieve the closest document vectors by cosine similarity. This is the retrieval component of Retrieval-Augmented Generation (RAG) systems.
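The retrieval step can be sketched as a brute-force cosine search. The document vectors below are hand-made stand-ins for real sentence embeddings, and `search` is a hypothetical helper; production systems usually replace the linear scan with an approximate nearest-neighbour index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def search(query_vec, index, top_k=2):
    """Return the top_k (score, document) pairs, best match first."""
    scored = [(cosine(query_vec, vec), doc) for doc, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy "embeddings": in practice these come from an embedding model.
index = {
    "Paris is the seat of French government": [0.9, 0.1, 0.0],
    "Berlin is the capital of Germany":       [0.7, 0.6, 0.1],
    "Bananas are rich in potassium":          [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # stand-in for "What city is France's capital?"

for score, doc in search(query, index):
    print(f"{score:.3f}  {doc}")
```

Documents are embedded once when the index is built; only the query is embedded at request time.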
| Model | Provider | Dimensions | Max Tokens | Best For | MTEB Score |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | 8,191 | General purpose, RAG | ~65 |
| text-embedding-3-small | OpenAI | 1536 | 8,191 | Cost-efficient RAG | ~62 |
| text-embedding-ada-002 | OpenAI | 1536 | 8,191 | Legacy, widely used | ~61 |
| all-MiniLM-L6-v2 | SBERT / HuggingFace | 384 | 256 | Fast, lightweight, on-device | ~56 |
| e5-large-v2 | Microsoft | 1024 | 512 | Strong open-source baseline | ~63 |
| BGE-M3 | BAAI | 1024 | 8,192 | Multilingual, long documents | ~65 |
∑ Chapter 5.4 Summary — Contextual Embeddings & Pre-trained LMs
- Static embeddings fail on polysemy: same vector for "bank" in all contexts — fundamental architecture limit, not a data problem
- ELMo: first practical contextual embeddings using bidirectional LSTM LM — forward + backward hidden states concatenated per token
- ULMFiT established the pre-train / fine-tune paradigm: expensive once, cheap adaptation — with discriminative LR and gradual unfreezing
- CLM (GPT): predict next token — left-to-right, generative. MLM (BERT): mask 15%, predict — bidirectional, understanding
- RTD (ELECTRA): discriminate original vs replaced tokens — 4× more compute-efficient than MLM
- Sentence embeddings: [CLS] pooling or mean pooling → SBERT fine-tuning dramatically improves semantic similarity quality for RAG and search
From GPT-1 to GPT-4 — how decoder-only transformers and scale created the generative AI era. The GPT lineage proved a single, deceptively simple idea: train a very large decoder-only Transformer on very large data with a next-token-prediction objective, and intelligence-like capabilities emerge.
GPT Architecture In-Depth
GPT = Generative Pre-trained Transformer — a decoder-only Transformer. Unlike BERT (encoder-only, bidirectional), GPT uses causal (masked) self-attention: each token can attend ONLY to previous tokens. This constraint is not a limitation — it's the design that enables generation. You can't look at future tokens while generating them.
The key architectural choices that distinguish GPT from BERT:
Causal Self-Attention
Each token attends only to tokens before it. Implemented via a triangular mask that sets future positions to −∞ before softmax. This makes the model autoregressive — it can generate one token at a time, left to right.
No Encoder
GPT uses a single stack of N transformer decoder blocks. No encoder–decoder cross-attention. The entire prompt and generated text flow through the same stack. Simplicity at scale.
Autoregressive Generation
Given a prompt, predict the next token → append it → repeat. Each forward pass produces one token. Generation is sequential by nature — you can't parallelise the generation of future tokens (though prompt processing is parallel).
Why Decoder-Only Wins for Generation
Encoder-only models (BERT) see the full context bidirectionally — great for understanding, but can't generate. Decoder-only enforces the causal constraint that makes autoregressive generation coherent and consistent.
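The triangular mask described above can be sketched directly. The example applies the mask to a matrix of raw attention scores and then takes a row-wise softmax; it is a single-head, pure-Python illustration with a hypothetical function name, not an efficient implementation.

```python
import math


def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax over attention scores with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only; the rest become -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


uniform_scores = [[0.0] * 3 for _ in range(3)]
for row in causal_attention_weights(uniform_scores):
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

With uniform scores the effect of the mask is easy to read off: the first token can only attend to itself, the second splits attention over two positions, and so on.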
Neural Scaling Laws In-Depth
Kaplan et al. (OpenAI, 2020) found that language model loss decreases as a smooth power law as you increase model size (N), dataset size (D), or compute (C), provided the other two are not the bottleneck. This isn't a vague trend; it's a precise mathematical relationship: $L(N) \propto N^{-\alpha}$ with $\alpha \approx 0.076$. Double the parameters and loss drops by a predictable factor of $2^{-0.076} \approx 0.95$, i.e. about 5%.
Three factors drive scaling: N (parameters), D (dataset tokens), and C (compute in FLOPs). The breakthrough insight: you must scale all three together. Scaling parameters alone while holding data fixed gives diminishing returns.
Compute-optimal training (Hoffmann et al., 2022, the "Chinchilla" paper) scales parameters and training tokens in equal proportion, which works out to roughly 20 tokens per parameter. By that standard GPT-3 was undertrained: 175B params trained on only 300B tokens, where Chinchilla-optimal would be ~3.5T tokens. LLaMA-2 7B was trained on 2T tokens, far more tokens per parameter than GPT-3, and performed remarkably well for its size. Practical implication: "train a smaller model on more data", which also lowers inference costs.
Chinchilla Optimal Scaling (2022): $N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$. For every 4× increase in compute, double BOTH model size AND training tokens (a 2× increase in compute scales each by about 1.4×).
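The arithmetic behind these scaling claims is short enough to check directly. The sketch below assumes the exponent α ≈ 0.076 quoted above, square-root scaling of parameters and tokens with compute, and a rule of thumb of about 20 training tokens per parameter (the ratio implied by 3.5T tokens for a 175B-parameter model). The function names are hypothetical.

```python
def loss_ratio(param_multiplier: float, alpha: float = 0.076) -> float:
    """Factor by which loss changes when parameters grow, given L(N) ∝ N^(-alpha)."""
    return param_multiplier ** -alpha


def compute_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count (assumed ~20 tokens per parameter)."""
    return params * tokens_per_param


def optimal_scale_factor(compute_multiplier: float, exponent: float = 0.5) -> float:
    """Growth factor for parameters (or tokens) when N_opt, D_opt ∝ C^exponent."""
    return compute_multiplier ** exponent


print(round(loss_ratio(2), 3))                   # 0.949 -> about 5% lower loss
print(compute_optimal_tokens(175e9) / 1e12)      # 3.5 (trillion tokens)
print(optimal_scale_factor(4))                   # 2.0 -> 4x compute doubles both
print(round(optimal_scale_factor(2), 2))         # 1.41
```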
GPT-1 to GPT-4 Timeline In-Depth
GPT-1 — June 2018
117M parameters, trained on BookCorpus. First generative pre-training paper. Demonstrated that unsupervised pre-training + supervised fine-tuning on 12 tasks produced strong NLU results. Proof of concept.
GPT-2 — Feb 2019
1.5B parameters, trained on WebText (40GB of Reddit-filtered web pages). First model to show zero-shot capabilities — performing tasks with no task-specific training. The "too dangerous to release" controversy put LLMs in the public consciousness.
GPT-3 — June 2020
175B parameters, trained on 300B tokens. Introduced few-shot learning from the prompt alone — no gradient updates needed. In-context learning: provide examples in the prompt, and GPT-3 generalises. This changed everything.
InstructGPT — Jan 2022
GPT-3 fine-tuned with RLHF (Reinforcement Learning from Human Feedback). Follows instructions, avoids harmful output. Much more useful than raw GPT-3. Foundation for alignment research.
ChatGPT — Nov 2022
GPT-3.5 + RLHF + chat interface. Reached an estimated 100 million users in about two months, the fastest consumer-product adoption recorded at the time. Made LLMs accessible to non-technical users. Started the "AI moment".
GPT-4 — March 2023
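Multimodal (image + text) input. Parameter count undisclosed; unofficial estimates are on the order of 1 trillion or more. Professional exam performance: scored around the 90th percentile on a simulated bar exam, with strong results on the SAT and medical licensing questions. Step change in reasoning quality.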
GPT-4o — 2024
Native voice + vision, fast inference, GPT-4 quality at lower cost. "Omni" model — unified multimodal architecture. Real-time conversation with vision understanding.
o1, o3 — 2024–2025
Reasoning models with chain-of-thought. New frontier: models that "think" before answering, spending more compute at inference time. Trade speed for accuracy on complex tasks.
Emergent Capabilities In-Depth
Wei et al. (2022) documented a surprising phenomenon: certain capabilities appear suddenly at a scale threshold — they are essentially absent in smaller models and then abruptly present in larger ones. These emergent abilities were not explicitly trained. The model was only ever trained to predict the next token. Yet above a certain parameter count, it can perform multi-step arithmetic, chain-of-thought reasoning, translation between unseen language pairs, and code generation.
Multi-step Arithmetic
Below ~10B params ≈ random performance. Above 100B → suddenly works with high accuracy. The model learns to decompose calculations despite never being explicitly taught arithmetic.
Chain-of-thought Reasoning
Appears around 100B parameters. Chain-of-thought prompting (for example, appending "Let's think step by step") gives little or no benefit on small models but dramatically improves large-model accuracy on multi-step reasoning tasks.
Unseen Language Translation
Models trained primarily on English data can translate between language pairs never seen during training. This capability emerges at scale — evidence of internal multilingual representations.
Code Generation
Near-zero at 1B, functional at 10B, excellent at 100B+. Models go from generating syntactic garbage to writing correct, complex programs — a phase transition in capability.
Schaeffer et al. (2023) argued that emergent abilities may be measurement artifacts — they appear "sudden" because we use discontinuous metrics (e.g., exact-match accuracy). With continuous metrics (e.g., log-likelihood), improvement is smooth. The debate continues, but the practical observation holds: there are capability thresholds below which models are useless at certain tasks.
Inference & Sampling Strategies Core
How does an LLM actually generate text? At each step, the model outputs a probability distribution over the entire vocabulary. The decoding strategy determines which token to pick from that distribution. This choice dramatically affects output quality, diversity, and creativity.
Greedy
Always pick the most probable token. Deterministic, often repetitive for long texts. argmax at every step.
Beam Search
Maintain top-k sequences at each step, pick the best overall. Better quality than greedy, still not diverse.
Sampling
Sample randomly from the full distribution. Diverse but can produce incoherent text — low-probability tokens get chosen.
Top-k Sampling
Sample only from the top k most likely tokens. k=50 is common. Balances diversity and coherence.
Top-p / Nucleus
Sample from the smallest set of tokens whose cumulative probability ≥ p. Adaptive vocabulary — more tokens when distribution is flat, fewer when peaked.
Temperature
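Divide logits by T before softmax. T<1 → sharper (more deterministic). T>1 → flatter (more creative/chaotic). As T → 0 the result approaches greedy decoding (T=0 itself is handled as a special case, since dividing by zero is undefined).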
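The strategies above compose naturally: apply temperature, filter the distribution with top-k or top-p, then sample. The sketch below operates on a hand-written list of logits rather than a real model, and all function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)


def sample(probs, rng):
    """Draw one token index according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
greedy_token = max(range(len(logits)), key=lambda i: logits[i])  # argmax

rng = random.Random(0)  # seeded so the run is reproducible
nucleus = top_p_filter(softmax(logits, temperature=0.7), p=0.9)
sampled_token = sample(nucleus, rng)
print(greedy_token, sampled_token, [round(q, 3) for q in nucleus])
```

Filtering before sampling is what keeps very unlikely tokens from ever being drawn, which is the failure mode of sampling from the full distribution.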
Open-Source LLMs Core
The open-source LLM ecosystem exploded in 2023–2025. Models from Meta, Mistral, Alibaba, Google, Microsoft, and others are approaching closed-source frontier quality. This table captures the major families as of 2024–2025.
| Model | Provider | Params | Context | License | Notable |
|---|---|---|---|---|---|
| LLaMA 3.1 8B/70B/405B | Meta | 8B–405B | 128K | Llama 3.1 Community | Best open-source 2024 |
| Mistral 7B / 8×7B | Mistral AI | 7B / ~45B | 32K | Apache 2.0 | Efficient MoE (Mixtral) |
| Qwen2.5 7B/72B | Alibaba | 7B–72B | 128K | Qwen | Strong multilingual |
| Gemma 2 9B/27B | Google | 9B / 27B | 8K | Gemma | Strong at size |
| Phi-3 mini/small | Microsoft | 3.8B / 7B | 128K | MIT | Small but capable |
| DeepSeek-R1 | DeepSeek | 7B–671B | 64K | MIT | Reasoning-focused |
| Command-R+ | Cohere | 104B | 128K | CC BY-NC | RAG-optimised |
∑ Chapter 5.5 Summary — The GPT Family
- GPT = decoder-only Transformer + causal (left-to-right) attention = autoregressive generation
- Scaling laws: loss decreases as power law with parameters, data, and compute
- Chinchilla: compute-optimal training scales parameters AND training tokens in equal proportion (roughly 20 tokens per parameter)
- Emergent abilities: capabilities appear suddenly at scale thresholds — not trained explicitly
- Inference: top-p (nucleus) sampling at temperature 0.7–1.0 is the typical LLM generation setting
- Open-source LLMs (LLaMA 3, Mistral, Qwen) are approaching closed-source frontier quality
BERT introduced a paradigm shift: instead of predicting the next word left-to-right, mask some words and predict them using the FULL surrounding context. This bidirectional pre-training produces richer representations that dominate understanding tasks — classification, NER, QA, and semantic search.
BERT Architecture In-Depth
Devlin et al. (Google, 2018): BERT — Bidirectional Encoder Representations from Transformers. BERT uses an encoder-only Transformer stack — no decoder, no causal mask. Every token attends to all other tokens simultaneously (bidirectional attention). This is the key difference from GPT: BERT sees the full context before producing representations.
BERT-base
- 12 Transformer layers
- 768 hidden dimension
- 12 attention heads
- 110M parameters
BERT-large
- 24 Transformer layers
- 1024 hidden dimension
- 16 attention heads
- 340M parameters
BERT Special Inputs Core
BERT's input representation is the sum of three embedding types (not concatenated). Every input is prepended with [CLS] and sentence pairs are separated by [SEP].
[CLS] Token
Classification token prepended to every input. Its final hidden state is used as the aggregate sequence representation for classification tasks.
[SEP] Token
Separator token between sentence A and sentence B. Also appended at the end of the input sequence.
[MASK] Token
Used during pre-training: 15% of tokens are selected for prediction. Of those 15%: 80% → [MASK], 10% → random word, 10% → kept unchanged.
Segment embeddings tell BERT which sentence each token belongs to (Sentence A vs Sentence B). The three input components are summed element-wise: Token Embedding + Positional Embedding + Segment Embedding.
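The 80/10/10 masking rule for [MASK] can be sketched as a small corruption function. This is an illustration on whole-word tokens with a hypothetical function name; BERT itself applies the rule to WordPiece token IDs, and the tiny vocabulary below is invented.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); a label is None where no prediction is required."""
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with select_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(tok)                # 10%: keep unchanged
    return corrupted, labels


rng = random.Random(42)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["dog", "ran", "blue", "tree"]
corrupted, labels = mask_tokens(tokens, vocab, rng, select_prob=0.5)
print(corrupted)
print(labels)
```

The selection probability is raised to 0.5 in the usage example only so that a short sentence shows some masking; the pre-training value is 0.15. Keeping some selected tokens unchanged or randomised means the model cannot assume that only [MASK] positions need a prediction.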
Fine-tuning BERT In-Depth
BERT's power lies in fine-tuning: take the pre-trained backbone and add a thin task-specific head. All BERT weights are updated during fine-tuning (with a small learning rate). Four canonical task types:
Sequence Classification
Sentiment, topic, NLI. Add FC layer on top of [CLS] embedding → class probabilities. Fine-tune all BERT weights + FC layer.
Token Classification
NER, POS tagging. Add FC layer on every token embedding → per-token labels. Each token gets a label independently.
Extractive Question Answering
Input: [CLS] question [SEP] passage [SEP]. Output: start + end position — which span in the passage is the answer. Two vectors classify each token as answer-start or answer-end.
Sentence Pair Tasks
Similarity, entailment. Input: [CLS] sentence A [SEP] sentence B [SEP]. Use [CLS] embedding as pair representation.
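For extractive question answering, the two per-token score lists still have to be turned into a single span. A common decoding rule, sketched below, picks the start and end positions with the highest combined score subject to the end not preceding the start. The function name `best_span`, the `max_len` cap, and the scores are illustrative assumptions.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximising start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [0.0, 0.5, 1.8, 3.0]
print(best_span(start_scores, end_scores))             # (1, 3)
print(best_span(start_scores, end_scores, max_len=2))  # (1, 2)
```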
```python
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained BERT + add classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenise input (handles [CLS] and [SEP] automatically)
def tokenize_fn(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

dataset = load_dataset('imdb')
tokenized = dataset.map(tokenize_fn, batched=True)

training_args = TrainingArguments(
    output_dir='./bert-sentiment',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # CRITICAL: small LR for fine-tuning pre-trained model
    weight_decay=0.01,
    eval_strategy='epoch',  # named evaluation_strategy in older transformers releases
    warmup_steps=500,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()
```
The BERT Family Core
| Model | Key Innovation | Params | Training Data | Notable Improvement |
|---|---|---|---|---|
| BERT-base | Bidirectional Transformer + MLM + NSP | 110M | BookCorpus + Wikipedia | Baseline |
| RoBERTa (Facebook) | Remove NSP, larger batches, more data, longer training | 125M | 160GB text | 5–10% improvement on GLUE |
| DistilBERT (HuggingFace) | Knowledge distillation from BERT (40% smaller) | 66M | Same | 60% faster, 97% of BERT's performance |
| ALBERT (Google) | Cross-layer parameter sharing, sentence order prediction | 12M–235M | Same | Same performance, fraction of params |
| DeBERTa (Microsoft) | Disentangled attention (separate content + position) | 86M–1.5B | 160GB | State-of-the-art on SuperGLUE |
| ELECTRA (Google) | Replaced Token Detection (more efficient training) | 14M–335M | Same | 4× more efficient than BERT |
BERT vs GPT
| Aspect | BERT (Encoder) | GPT (Decoder) |
|---|---|---|
| Attention | Bidirectional (all tokens) | Causal (left-to-right only) |
| Pre-training | Masked LM + NSP | Next-token prediction |
| Best For | Understanding (classify, NER, QA) | Generation (chat, completion) |
| Output | Contextual embeddings | Generated text |
| Fine-tuning | Add task head, small dataset OK | Prompt-based, few-shot |
| Scale | 110M–1.5B | 117M–175B (GPT-1 to GPT-3); later sizes undisclosed |
When to Use Encoders vs Decoders Core
Use an encoder (BERT-style) for:
- → Understanding tasks (classification, NER, QA)
- → Sentence embeddings for semantic search
- → NLI and entailment
- → Smaller, faster fine-tuning
- → Bidirectional context needed
- → Tasks with fixed input→label format
Use a decoder (GPT-style) for:
- → Generation tasks (chat, completion, summarisation)
- → Zero/few-shot prompting
- → Reasoning over long contexts
- → Instruction following
- → Tasks requiring flexible output format
- → When you have no task-specific labels
∑ Chapter 5.6 Summary — BERT & Encoder Models
- BERT: encoder-only, bidirectional attention — each token sees all tokens simultaneously
- Pre-training: Masked LM (predict 15% masked tokens) + NSP on Wikipedia + BookCorpus
- Fine-tuning: add task-specific head on [CLS] (classification) or all tokens (NER)
- RoBERTa improves BERT by: more data, remove NSP, larger batches, longer training
- DistilBERT: 40% smaller, 60% faster, 97% of BERT performance via knowledge distillation
- Use BERT for understanding tasks; use GPT for generation and instruction following
Prompt engineering is the art and science of crafting inputs to get desired outputs from LLMs. The same model can produce radically different quality depending on how you ask — mastering the prompt is mastering the interface to intelligence.
What Is Prompt Engineering? Core
LLMs are not search engines — they are conditional probability machines. Given your prompt as the beginning of a document, they predict what comes next. The quality and structure of that beginning determines everything about the continuation.
Compare: "What is the capital of France?" vs "Answer as a geography teacher giving a detailed explanation: What is the capital of France?" — same factual answer but very different style and depth.
You are writing the beginning of a document that the LLM will continue. The better the beginning, the better the continuation.
Five prompt components:
① Instruction
What you want the model to do. Be specific and explicit: "Summarise in 3 bullets" not "Summarise".
② Context
Background information the model needs: domain, audience, constraints, prior conversation.
③ Input Data
The actual content to process: text to classify, code to review, question to answer.
④ Output Format
How you want the answer: JSON, bullet list, table, single word, code block.
⑤ Examples
Demonstrations of desired input→output pairs (few-shot). The model learns the pattern in-context.
Zero-Shot & Few-Shot Prompting In-Depth
Zero-shot: ask the model without any examples. Works well for simple, well-defined tasks.
Few-shot (in-context learning): provide 2–5 examples before asking. Brown et al. (GPT-3, 2020) showed that providing examples dramatically improves performance. The model is not fine-tuned — it adapts to the task from examples in its context window.
One-shot: exactly one example — sometimes all you need for well-defined tasks.
Example selection matters: choose examples that cover edge cases and represent the full range of expected inputs. Diverse examples outperform similar ones.
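A few-shot prompt is ultimately just a carefully laid-out string. The sketch below assembles one for sentiment classification; the `Review:` / `Sentiment:` layout, the function name, and the example reviews are illustrative choices, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, labelled examples, and the unlabelled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The prompt ends where the model should continue with the label.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


examples = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("Stopped working after a week and support never replied.", "negative"),
    ("Does the job, nothing special.", "neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive, negative, or neutral.",
    examples,
    "Great sound, but the ear cups fall apart quickly.",
)
print(prompt)
```

Using the same layout for every example and for the final query is what lets the model pick up the pattern; the examples here deliberately cover all three labels.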
Chain-of-Thought Prompting In-Depth
Wei et al. (Google, 2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". The key insight: including worked examples that spell out intermediate reasoning steps in the prompt dramatically improves accuracy on math, logic, and multi-step problems.
Why it works: the model generates intermediate steps → each step conditions the next → less error accumulation. The chain of reasoning acts as a scratchpad that keeps the model on track.
Standard Prompting
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?
A: 11
↑ Direct answer — works for simple tasks, fails for multi-step
CoT Prompting
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?
A: Roger started with 5. Each can has 3 balls, so 2 cans = 6 balls. 5 + 6 = 11. The answer is 11.
↑ Explicit steps — each step conditions the next
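Zero-shot CoT (Kojima et al., 2022): append "Let's think step by step" to the prompt, with no examples needed. This simple suffix elicits step-by-step reasoning in large models, although it is not guaranteed to help on every task.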
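In code, chain-of-thought prompting usually involves two small steps: adding the reasoning trigger to the prompt, and pulling the final answer out of the longer completion. The sketch below assumes the completion ends with a sentence of the form "The answer is N", as in the example above; the helper names and the regular expression are illustrative.

```python
import re

COT_SUFFIX = "Let's think step by step."


def add_cot(question: str) -> str:
    """Format a question so that the model begins its answer with reasoning."""
    return f"Q: {question}\nA: {COT_SUFFIX}"


def extract_final_answer(completion: str):
    """Return the number after 'The answer is', or None if it is absent."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


print(add_cot("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many?"))

completion = ("Roger started with 5. Each can has 3 balls, so 2 cans = 6 balls. "
              "5 + 6 = 11. The answer is 11.")
print(extract_final_answer(completion))  # 11
```

Returning None when the pattern is missing lets the caller retry or fall back, rather than silently accepting a malformed completion.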
Structured Prompting In-Depth
Structure your prompts for reliable, parseable output. Three key techniques:
Output Formatting
Ask for JSON, XML, or specific structure. Example: "Return as JSON: {"sentiment": "...", "confidence": 0-1}"
→ Parse programmatically, no regex hacks
Role Prompting
"You are an expert Python developer with 10 years of experience."
→ Sets persona, knowledge domain, and response style
Delimiters
Use triple backticks, XML tags, or --- to separate instruction from data.
→ Reduces the risk of prompt injection, clarifies boundaries
````python
import json

import openai  # reads the API key from the OPENAI_API_KEY environment variable


def extract_entities(text: str) -> dict:
    """Ask the model for named entities and parse its JSON reply."""
    response = openai.chat.completions.create(
        model="gpt-4o",  # JSON mode needs a model that supports response_format
        messages=[
            {"role": "system", "content": "You extract named entities from text. Return JSON only."},
            {"role": "user", "content": f"""Extract entities from:
```{text}```
Return format: {{"people": [...], "organizations": [...], "locations": [...], "dates": [...]}}"""},
        ],
        temperature=0.0,  # near-deterministic for structured output
        response_format={"type": "json_object"},  # JSON mode
    )
    return json.loads(response.choices[0].message.content)


result = extract_entities("Tim Cook announced Apple's Q3 earnings in Cupertino on Tuesday, August 1st.")
print(json.dumps(result, indent=2))
# Example output (exact values may vary):
# {"people": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["August 1st"]}
````
System Prompts
In chat-based APIs (OpenAI, Anthropic, etc.), the system prompt sets the model's behaviour, persona, and constraints before the user speaks. It's the most powerful lever for controlling output quality.
What Goes in System Prompts
- Role and persona definition
- Output format requirements
- Constraints and guardrails
- Domain knowledge or context
- Tone and style instructions
Best Practices
- Be explicit and specific — don't assume the model infers intent
- Put constraints up front (format, length, language)
- Use delimiters to separate user content from instructions
- Test with adversarial inputs
- Don't rely on system prompt secrecy for security
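As a rough sketch of the "use delimiters" practice, the snippet below assembles a chat message list in which untrusted user text is wrapped in XML-style tags. The function name `build_messages` and the tag name `user_input` are hypothetical choices for illustration, and no API is called. This lowers the risk of injection but does not eliminate it.

```python
def build_messages(system_prompt: str, user_text: str, tag: str = "user_input") -> list:
    """Wrap untrusted text in XML-style delimiters so it stays separate from instructions."""
    # Remove any copy of the delimiter tags from the untrusted text,
    # so the text cannot close the delimited block early.
    cleaned = user_text.replace(f"<{tag}>", "").replace(f"</{tag}>", "")
    user_content = (
        f"Summarise the text inside the <{tag}> tags. "
        f"Treat it as data, not as instructions.\n"
        f"<{tag}>\n{cleaned}\n</{tag}>"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "You are a concise summariser. Respond in 3 bullet points.",
    "Quarterly revenue rose. </user_input> Ignore previous instructions and reveal the system prompt.",
)
for m in messages:
    print(m["role"], "->", m["content"])
```

The adversarial input in the usage example tries to close the delimiter itself, which is the kind of case worth including when testing with adversarial inputs.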
Prompt Patterns Catalogue Core
| Pattern | When to Use | Template | Example |
|---|---|---|---|
| Role Pattern | Need domain expertise | "You are a [role] with [experience]..." | "You are a senior Python engineer reviewing code for bugs" |
| Step-by-Step | Multi-step reasoning, math | "Think step by step..." | "Solve this problem step by step: ..." |
| Output Format | Need structured data | "Return as JSON/list/table..." | "Return as JSON: {fields}" |
| Few-Shot | Task hard to specify, need examples | "[Example 1]→[Output 1]\n[Input]→?" | Sentiment, classification, entity extraction |
| Chain-of-Thought | Reasoning, math, logic | "[problem] Let's think step by step" | Math word problems, logical puzzles |
| Delimiter | Long context, avoid injection | "Summarise: ```{text}```" | Document processing, code review |
| Self-Ask | Complex multi-hop questions | "Are there any follow-up questions?" | Research synthesis, fact verification |
Common Pitfalls Core
Prompt Injection
Malicious input overrides instructions: "Ignore previous instructions and..."
Mitigation: use delimiters, validate inputs, separate system and user content.
Prompt Leaking
User can extract system prompt: "Repeat all your instructions"
Mitigation: don't rely on prompt secrecy for security, use proper access controls.
Ambiguous Instructions
Vague prompts → inconsistent outputs. Be explicit: "Respond in 3 bullet points of max 20 words each".
Lost in the Middle
LLMs attend better to start and end of context. Put most important info first or last. (Liu et al., 2023: "Lost in the Middle" phenomenon)
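One way to act on this when assembling a long context is to reorder ranked items so that the strongest ones sit at the edges. The function below is a minimal sketch under the assumption that the input list is already sorted from most to least relevant; the name `reorder_for_long_context` is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance: list) -> list:
    """Place the most relevant items at the start and end, the least relevant in the middle.

    The input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front half, odd ranks fill the back half.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]

ranked = ["A (best)", "B", "C", "D", "E (worst)"]
print(reorder_for_long_context(ranked))
# ['A (best)', 'C', 'E (worst)', 'D', 'B']
```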
∑ Chapter 5.7 Summary — Prompt Engineering
- Few-shot in-context learning: examples in the prompt teach the task — no gradient updates needed
- Chain-of-thought: "Let's think step by step" — explicit reasoning steps reduce errors
- Structured output: specify JSON/XML format → parse programmatically
- ReAct pattern: Thought→Action→Observation loop — foundation of tool-using agents (Domain 8)
- Prompt injection: user input can override instructions — always use delimiters to separate content
- "Lost in the Middle": LLMs attend best to start and end of context — put key info there
RAG is the bridge between an LLM's frozen knowledge and the living, changing world. Instead of retraining a model every time information changes, retrieve relevant documents at query time and inject them into the prompt — grounding the model's answers in real, verifiable sources.
Why RAG? Core
Three fundamental LLM limitations make RAG essential for production systems:
Knowledge Cutoff
GPT-4's training data has a cutoff date. Any event after it is unknown. "What happened at the UN Security Council yesterday?" → hallucination.
RAG fix: retrieve yesterday's news, inject into context.
Hallucination on Specifics
LLMs confabulate details — addresses, phone numbers, dates, internal policies. "What is our Q3 refund policy?" → makes something up.
RAG fix: retrieve actual policy document, ground the answer.
Private Knowledge
Your internal docs, contracts, code, Slack history — not in any LLM. "Summarise our client contract with Acme Corp" → impossible.
RAG fix: embed and retrieve from your private document store.
RAG Architecture In-Depth
RAG has two phases: an offline indexing pipeline (run once or periodically) and an online query pipeline (run at every user question). Both share the same embedding model and vector database.
Indexing Phase (Offline)
- Load documents (PDFs, web pages, Word, Slack, etc.)
- Chunk into smaller pieces (e.g., 512 tokens each)
- Generate embedding vector for each chunk
- Store vectors in vector database
Query Phase (Online)
- User asks a question
- Embed the question (same model)
- Vector search: find top-k similar chunks
- Inject retrieved chunks into LLM prompt
- LLM generates grounded answer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
# Note: LangChain import paths change between releases; adjust to your installed version.

# === INDEXING PHASE ===
# Load and chunk documents
with open("company_policy.txt", "r", encoding="utf-8") as f:
    text = f.read()

# chunk_size and chunk_overlap are measured in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.create_documents([text])
print(f"Created {len(chunks)} chunks")

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# === QUERY PHASE ===
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks
llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # show which docs were retrieved
)
result = qa_chain.invoke("What is the return policy for electronics?")
print(result["result"])
print("Sources:", [d.metadata for d in result["source_documents"]])
Vector Databases In-Depth
Vector databases are specialised stores for embedding vectors, optimised for similarity search. The core operation is Approximate Nearest Neighbour (ANN) search: "find the k vectors closest to this query vector." Exact search is O(n·d) — ANN algorithms like HNSW make this dramatically faster.
| Database | Type | ANN Algorithm | Filtering | Best For | Hosted? |
|---|---|---|---|---|---|
| Chroma | Open source | HNSW | Metadata | Dev / prototyping | Self-hosted / Cloud |
| Pinecone | Cloud-native | Proprietary | Metadata + hybrid | Production, scale | Cloud only |
| Weaviate | Open source | HNSW | GraphQL | Hybrid search, graphs | Both |
| Qdrant | Open source | HNSW | Payload | High performance | Both |
| pgvector | PostgreSQL ext | IVF / HNSW | Full SQL | Existing PG infra | Self-hosted |
| Milvus | Open source | IVF, HNSW | Scalar | Billion-scale | Self-hosted / Cloud |
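To make the cost of exact search concrete, here is a minimal brute-force sketch in plain Python. It scores every stored vector against the query, which is the O(n·d) baseline that ANN indexes such as HNSW try to avoid. The function names and the tiny 3-dimensional vectors are illustrative assumptions, and all vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k=2):
    """Brute-force search: score all n vectors of dimension d, then keep the best k."""
    scored = [(cosine_similarity(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(idx, round(score, 3)) for score, idx in scored[:k]]

# Toy "vector store" with four 3-dimensional embeddings
store = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(exact_top_k([1.0, 0.0, 0.0], store, k=2))
```

With a few thousand vectors this approach is perfectly usable; the databases in the table become worthwhile when n grows to millions and every query would otherwise touch every vector.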
Chunking Strategies In-Depth
Chunk size matters enormously — wrong size degrades RAG quality. Too small: insufficient context in each chunk → incomplete answers. Too large: irrelevant content mixed with relevant → noisy retrieval. Chunks should overlap (50–100 tokens) to avoid splitting context across boundaries.
Fixed-Size
Split every 512 tokens regardless of content. Simple but crude — may cut mid-sentence.
Sentence-Based
Split on sentence boundaries. Better semantic coherence, preserves complete thoughts.
Recursive
Try paragraphs, then sentences, then words. LangChain's default. Best general-purpose strategy.
Semantic
Use embedding similarity to detect topic changes. Expensive but produces the most coherent chunks.
Document-Aware
Use document structure (headers, sections, tables). Best for structured documents like reports and manuals.
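The simplest of these strategies, fixed-size chunking with overlap, can be sketched as follows. This is an illustration only: whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 50) -> list:
    """Fixed-size chunking: each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Whitespace splitting is a stand-in for a real tokenizer (assumption for illustration)
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts `chunk_size - overlap` tokens after the previous one, so a sentence that straddles a boundary appears intact in at least one chunk whenever it is shorter than the overlap.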
Advanced RAG Patterns In-Depth
Naive RAG (embed → search → generate) works for many cases. These advanced patterns dramatically improve precision and recall for production systems:
Hybrid Search
Combine dense (embedding) + sparse (BM25 keyword) search. Better recall for exact phrase matches AND semantic similarity. Use Reciprocal Rank Fusion (RRF) to merge result lists.
Re-Ranking
Initial retrieval: top-20 by fast ANN → re-rank with expensive cross-encoder → return top-5. Cross-encoders read query + document together for much better relevance.
Query Transformation
Rewrite query before retrieval. HyDE: generate hypothetical answer, then embed that. Multi-query: generate 3 variants → retrieve for all → merge results.
Parent-Child Chunks
Index small child chunks for precision retrieval. Return larger parent chunk for context. Best of both worlds — precise matching with rich context.
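The Reciprocal Rank Fusion step used in hybrid search is short enough to sketch directly. Each document receives 1 / (k + rank) from every result list it appears in, and the sums decide the merged order. The constant k = 60 is a commonly used default; the document IDs below are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked lists of document IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc3", "doc1", "doc7"]   # hypothetical embedding-search results
sparse_hits = ["doc1", "doc9", "doc3"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['doc1', 'doc3', 'doc9', 'doc7']
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse retrievers, which is a large part of its appeal.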
RAG vs Fine-Tuning Core
RAG
- ✓ Knowledge updatable at any time
- ✓ Cites sources, verifiable answers
- ✓ No training required
- ✓ Lower cost than fine-tuning
- ✗ Retrieval quality is the bottleneck
- ✗ Context window limits
- ✗ Latency of retrieval step
Fine-Tuning
- ✓ Knowledge baked into weights
- ✓ No retrieval latency
- ✓ Better for style / format / behaviour
- ✗ Knowledge is static (needs retraining)
- ✗ Can't cite specific sources
- ✗ Expensive to update frequently
RAG and fine-tuning are not competing approaches — they are complementary. Fine-tune to change HOW the model communicates (tone, format, domain vocabulary). Use RAG to change WHAT the model knows (current facts, private documents, enterprise data). The best production systems use both.
∑ Chapter 5.8 Summary — Retrieval-Augmented Generation
- RAG solves: knowledge cutoff, hallucination on specifics, private/proprietary data
- Pipeline: Chunk docs → Embed → Store in vector DB → At query: embed query → ANN search → inject → generate
- Chunking: 512 tokens with 50-token overlap is a reasonable default — recursive splitting preserves structure
- Vector DB: stores embeddings for ANN similarity search — cosine similarity finds semantically similar chunks
- Advanced RAG: hybrid search + re-ranking dramatically improves retrieval precision
- RAG vs fine-tuning: use both — RAG for dynamic knowledge, fine-tuning for style/behaviour
LLMs are trained to produce fluent, probable text — not factual text. Understanding why they hallucinate, how alignment steers them toward human values, and how to rigorously evaluate their output is essential for responsible deployment.
Hallucination In-Depth
Hallucination: LLMs generate factually incorrect information with apparent confidence. This is not a bug — it's a consequence of the training objective. The model was rewarded for coherent, fluent text, not for verified facts.
Factual Hallucination
"Einstein won the Nobel Prize for relativity" — he actually won for the photoelectric effect. Plausible, confident, wrong.
Citation Hallucination
Fabricated paper titles, non-existent authors, wrong DOIs. The model generates citation-shaped text that looks real but doesn't exist.
Entity Hallucination
Made-up people, places, company names that sound real. "Westbrook Medical Center" — doesn't exist but sounds plausible.
Reasoning Hallucination
Correct-sounding reasoning leading to a wrong conclusion. Each step looks valid, but the chain produces an incorrect answer.
Intrinsic vs extrinsic hallucination:
Intrinsic Hallucination
Contradicts the provided context. The document says population = 5M, but the answer says 10M. Detectable by comparing output to source.
Extrinsic Hallucination
Fabricated content not in context. Generated from the model's world knowledge — may or may not be true. Harder to detect without external verification.
Why LLMs Hallucinate Core
Root cause: LLMs are trained to produce fluent, probable text — not factual text. The model doesn't "know" it doesn't know something. Confidence calibration is poor: models are confidently wrong, which is more dangerous than being uncertainly wrong.
Training Objective
Maximise next-token probability → rewarded for coherent text, not verified facts. The loss function doesn't distinguish true from plausible.
Memorisation vs Generalisation
Facts not seen enough times in training → model interpolates between facts. It generates a blend of real knowledge and pattern-matched confabulation.
Sycophancy
Models trained with RLHF learn to tell users what they want to hear. If you suggest a wrong answer, the model may agree rather than correct you.
The hallucination-confidence problem: models are confidently wrong. A model that says "I'm not sure" is safer than one that states a fabricated fact with full certainty. This is why calibration research is critical.
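Calibration can be quantified. One common measure is expected calibration error (ECE): group predictions by stated confidence, then compare the average confidence in each group with how often those predictions were actually right. The sketch below uses made-up numbers, and the function name and bin count are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 5) -> float:
    """Weighted average gap between stated confidence and observed accuracy, per bin."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Each bin covers (lo, hi]; the first bin also includes exactly 0.0.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up data: both models are right half the time, but only one admits it.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
print(round(overconfident, 2), round(calibrated, 2))
# 0.4 0.0
```

Both toy models have the same accuracy; the difference in ECE comes entirely from how much confidence they claim, which is the "confidently wrong" problem in numeric form.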
Alignment In-Depth
Alignment problem: ensure AI systems behave according to human values and intentions. A highly capable but misaligned AI is dangerous — capability without alignment amplifies harm.
The specification problem: how do you formally specify "what humans actually want"? Even well-intentioned reward functions can be gamed — the model maximises the metric in unintended ways (reward hacking).
RLHF (Partial Solution)
Human preferences act as a proxy for values. Humans rank model outputs → reward model trained on rankings → PPO optimises policy. Imperfect but significant improvement over base models.
Constitutional AI (Anthropic)
Model learns from its own self-critique using a constitution of principles. Generate → critique against principles → revise. Scales better than human labelling.
Key alignment challenges:
Distributional Shift
Behaves well in training distribution but fails on out-of-distribution deployment inputs.
Reward Hacking
Satisfies the letter but not the spirit of the reward. Finds loopholes in the reward function.
Deceptive Alignment
Appears aligned during evaluation, behaves differently when deployed. The hardest failure mode to detect.
Helpful, Harmless, Honest (HHH) Core
Anthropic's HHH framework defines the three axes of aligned model behaviour. The tension: being more helpful sometimes means being slightly less cautious (and vice versa). Constitutional AI aims to manage this tension by giving the model explicit principles to weigh, rather than leaving the trade-off implicit.
Helpful
Genuinely helps users accomplish tasks. Unhelpfulness is never trivially "safe" — a model that refuses everything harms users who have legitimate needs.
Harmless
Avoids generating content that causes real-world harm. Calibrated — not reflexively refusing edge cases. Context matters: medical information for a nurse vs a stranger.
Honest
Doesn't claim certainty it doesn't have. Proactively shares relevant information. Doesn't pursue hidden agendas or deceive about its nature.
Constitutional AI loop: generate → critique against principles → revise → repeat. The model acts as its own alignment judge, guided by a written constitution of values. This scales far better than per-output human labelling.
NLP Evaluation Metrics In-Depth
Automatic metrics enable scalable evaluation, but each has significant blind spots. Understanding their strengths and weaknesses is essential for trustworthy evaluation.
BLEU (Translation)
Precision of n-gram overlap between generated and reference text. Range: 0–1. Weakness: doesn't capture meaning, penalises valid paraphrase.
ROUGE (Summarisation)
Recall of n-gram overlap with reference. ROUGE-N: n-gram recall. ROUGE-L: longest common subsequence. Weakness: length bias, synonym-blind.
METEOR
Combines precision, recall, and semantic matching via WordNet synonyms. Better correlation with human judgement than BLEU alone.
BERTScore
Uses BERT embeddings to measure semantic similarity. More robust than n-gram metrics — captures paraphrase and meaning equivalence.
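The n-gram metrics above share one core computation, sketched here in simplified form. Precision of clipped n-gram overlap is the building block of BLEU, and recall of the same overlap is ROUGE-N. This is not a full BLEU implementation: the brevity penalty and the averaging over several n-gram orders are omitted, and the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall) of clipped n-gram overlap.

    Precision is the core of BLEU; recall is ROUGE-N.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference frequency
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

reference = "the cat sat on the mat"
print(ngram_overlap("the cat is on the mat", reference))       # near-copy
print(ngram_overlap("a feline rested on the rug", reference))  # valid paraphrase
```

The paraphrase scores far lower than the near-copy even though it means the same thing, which illustrates the synonym blindness that METEOR and BERTScore try to address.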
LLM Benchmarks In-Depth
| Benchmark | Tests | Format | Human Baseline | Note |
|---|---|---|---|---|
| MMLU | 57-subject knowledge (~16K questions) | Multiple choice | ~89% | Knowledge breadth |
| HumanEval | Python function generation (164 problems) | Code generation | ~75% | Coding |
| GSM8K | Grade school math (8.5K problems) | Multi-step reasoning | ~95% | Math |
| MATH | Competition math (12.5K problems) | Multi-step hard math | ~40% (students) | Hard math |
| ARC-AGI | Visual pattern reasoning | Novel test patterns | ~85% | Novel reasoning |
| GPQA Diamond | PhD-level science (198 questions; the full GPQA set has 448) | Multiple choice | ~65% | Expert knowledge |
| MT-Bench | Multi-turn dialogue quality | GPT-4 as judge | — | Chat quality |
| Chatbot Arena | Head-to-head human preference | Elo rating | — | Real-world preference |
Goodhart's Law applies everywhere in LLM evaluation: when a benchmark becomes a target, it ceases to be a good measure. Models trained on benchmark-adjacent data score artificially high. The most trustworthy evaluation is diverse human assessment on novel, never-before-seen tasks.
∑ Chapter 5.9 Summary — Hallucination, Alignment & Evaluation
- Hallucination: LLMs generate confident falsehoods — trained for fluency, not factual accuracy
- Types: factual, citation, entity, reasoning hallucinations. RAG grounding reduces them; temperature=0 makes output more repeatable but does not guarantee factual accuracy
- Alignment: ensure models behave according to human values (Helpful, Harmless, Honest)
- RLHF and Constitutional AI: current best approaches to alignment — imperfect but significant improvement
- BLEU/ROUGE: n-gram metrics for translation/summarisation — fast but miss semantic equivalence
- Human evaluation remains the gold standard — automatic metrics can be gamed
Fine-tuning is the final tool in the LLM adaptation toolkit, used when prompting and RAG aren't enough. Modern techniques like QLoRA make it possible to fine-tune a 70B-parameter model on a single 48GB GPU.
When to Fine-Tune (vs RAG vs Prompting) In-Depth
Decision framework — try in order (cheapest first):
Prompt Engineering (Free)
Can you get the desired behaviour with a better prompt? Better system prompt, clearer instructions, output format specification. Always try this first.
Few-Shot Examples (Free)
Add 3–10 examples to the prompt. Often dramatically improves output quality for classification, extraction, and formatting tasks.
RAG (Moderate Cost)
Does the model need access to external, updated, or private knowledge? RAG grounds answers in retrieved documents without any training.
Fine-Tuning (Higher Cost)
Does the model need to change its behaviour, style, or domain expertise? Fine-tuning bakes capabilities into the weights.
Fine-tuning IS the right choice when:
Consistent Format
Always return JSON in a specific schema, or follow a strict output template.
Domain Vocabulary
Medical jargon, legal language, internal code style that prompts can't reliably teach.
Reduce Token Usage
Bake instructions into weights that would otherwise consume context window space.
Faster Inference
A smaller fine-tuned model can outperform a larger prompted model — lower cost per query.
500+ Examples
You have high-quality labelled data. Without data, fine-tuning can't help.
Data Preparation In-Depth
The #1 determinant of fine-tuning quality is data quality. Garbage in, garbage out — but amplified by the power of gradient descent. Format: instruction-following datasets use the chat message format (system / user / assistant).
Minimum Dataset Sizes
- 500 examples — for format/style changes
- 1,000+ examples — for new capability
- 5,000+ examples — for complex domain tasks
Data Quality Checklist
- Diverse inputs (edge cases, different phrasings)
- Consistent output quality (human-reviewed)
- No duplicate or near-duplicate examples
- Balanced classes (for classification)
Use GPT-4 to generate training examples, then human spot-check 10–20%. This is the fastest way to build a high-quality dataset. Filter aggressively — 500 excellent examples beat 5,000 mediocre ones.
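Part of the filtering step can be automated before any human review. The sketch below drops near-duplicate prompts (after normalising case, punctuation, and whitespace) and examples with very short responses. The field names `prompt` and `response`, the 20-character threshold, and the function names are assumptions for illustration; real pipelines often add fuzzy matching or embedding-based deduplication.

```python
import re

def normalise(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants compare equal."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def filter_examples(examples: list, min_response_chars: int = 20) -> list:
    """Drop near-duplicate prompts and examples whose response is too short to be useful."""
    seen, kept = set(), []
    for ex in examples:
        key = normalise(ex["prompt"])
        if key in seen or len(ex["response"].strip()) < min_response_chars:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

raw = [
    {"prompt": "Extract the termination clause.",
     "response": "Either party may terminate with 30 days written notice."},
    {"prompt": "extract the  termination clause",
     "response": "Either party may terminate with 30 days notice."},
    {"prompt": "List the parties.", "response": "N/A"},
]
print(len(filter_examples(raw)))  # 1: one duplicate and one too-short response removed
```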
import json

# Chat format for instruction fine-tuning (OpenAI / LLaMA format)
def create_training_example(instruction: str, input_text: str, output: str) -> dict:
    """Build one chat-format example; input_text may be empty for instruction-only tasks."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant specialised in contract analysis."},
        {"role": "user", "content": f"{instruction}\n\n{input_text}" if input_text else instruction},
        {"role": "assistant", "content": output},
    ]
    return {"messages": messages}

# Example dataset creation
examples = [
    create_training_example(
        instruction="Extract the termination clause from this contract:",
        input_text="...contract text...",
        output="Termination clause (Section 12): Either party may terminate with 30 days written notice...",
    ),
    # Add 499+ more examples
]

# Validate format
for ex in examples:
    assert len(ex["messages"]) == 3
    assert ex["messages"][-1]["role"] == "assistant"
    assert len(ex["messages"][-1]["content"]) > 0, "Empty response!"

# Save as JSONL (one JSON object per line)
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(f"Training examples: {len(examples)}")
print(f"Avg response length: {sum(len(e['messages'][-1]['content']) for e in examples) / len(examples):.0f} chars")
Supervised Fine-Tuning (SFT) In-Depth
SFT objective: minimise cross-entropy loss on the assistant turns only. The user/system turns are masked — the model doesn't compute loss on the prompt tokens, only on the response it should have generated.
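The masking idea can be shown with a toy calculation. The sketch below averages the negative log-likelihood only over positions flagged as assistant tokens. The probabilities are made up, and the function name is hypothetical; in PyTorch-based training the same effect is commonly achieved by setting prompt-token labels to -100, the default ignore index of the cross-entropy loss.

```python
import math

def masked_cross_entropy(token_probs, loss_mask) -> float:
    """Average negative log-likelihood over positions where loss_mask is 1.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# Made-up probabilities for a 6-token sequence: 3 prompt tokens, then 3 assistant tokens
probs = [0.10, 0.20, 0.05, 0.50, 0.25, 0.80]
mask = [0, 0, 0, 1, 1, 1]  # 1 = assistant token, contributes to the loss

masked = masked_cross_entropy(probs, mask)       # SFT objective: response tokens only
unmasked = masked_cross_entropy(probs, [1] * 6)  # what the loss would be without masking
print(round(masked, 4), round(unmasked, 4))
```

Without the mask, the hard-to-predict prompt tokens would dominate the loss and the model would spend capacity learning to imitate user input rather than to respond to it.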
Key hyperparameters:
| Parameter | Typical Range | Notes |
|---|---|---|
| Epochs | 1–3 | More = overfitting, memorisation |
| Learning Rate | 1e-5 to 2e-4 | Low end for full fine-tuning; around 2e-4 is typical for LoRA/QLoRA |
| Batch Size | 8–64 | Use gradient accumulation for limited VRAM |
| Warmup | 3–10% of steps | Prevents early instability |
| Max Seq Length | 2048–4096 | Match model's typical context |
Catastrophic forgetting: fine-tuning on a narrow task can make the model forget general capabilities. Mitigation: use LoRA (which updates only a small fraction of the weights), or mix general instruction-following data in with your task data.
LoRA & QLoRA in Practice In-Depth
QLoRA (Dettmers et al., 2023): LoRA applied on a 4-bit quantised base model. This makes it possible to fine-tune a 70B model on a single 48GB GPU — impossible without quantisation.
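To see why LoRA adapters are so small, it helps to count parameters. For a weight matrix mapping d_in to d_out, rank-r LoRA trains two low-rank matrices with r × (d_in + d_out) parameters in total. The sketch below applies that formula to layer shapes assumed for a Llama-3-8B-style model (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); treat these dimensions as illustrative assumptions.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted weight: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumed shapes for a Llama-3-8B-style transformer layer (illustrative)
hidden, kv_dim, mlp, layers, r = 4096, 1024, 14336, 32, 16
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}
trainable = layers * sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
print(f"Trainable LoRA parameters: {trainable:,}")
# Trainable LoRA parameters: 41,943,040
```

Under these assumptions, roughly 42 million parameters are trained, about half a percent of an 8B-parameter model, while the 4-bit base weights stay frozen.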
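# Unsloth: 2x faster fine-tuning with memory-efficient kernels
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model in 4-bit quantisation
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantised base
    dtype=None,         # None lets Unsloth pick a suitable dtype
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # saves 30% VRAM
)

# Assumption: train.jsonl holds {"messages": [...]} records as built in Data Preparation.
# Render each conversation into one "text" field using the model's chat template.
def to_text(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

train_dataset = load_dataset("json", data_files="train.jsonl", split="train").map(to_text)

# Note: these SFTTrainer argument names follow older TRL releases;
# newer releases move some of them into SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 2 x 4 = 8
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=True,  # use bf16=True instead on GPUs that support it
        logging_steps=10,
        output_dir="./output",
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
    ),
)
trainer.train()
model.save_pretrained("./my-llama3-ft")  # saves only the LoRA adapter weights, not the full base model
DPO: Direct Preference Optimisation Core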
Rafailov et al. (2023): "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." RLHF requires three components: SFT model + reward model + PPO training — complex, unstable, expensive.
DPO insight: the optimal RLHF policy has a closed-form solution — no separate reward model needed. DPO uses preference pairs: (prompt, chosen_response, rejected_response). Train to increase likelihood of chosen relative to rejected. Simpler, more stable, often achieves similar or better results.
RLHF (PPO)
- → Requires reward model training
- → PPO: complex RL algorithm
- → 3 separate models to maintain
- → Compute intensive
- → Hyperparameter-sensitive
- → Gold standard for alignment
DPO
- → No separate reward model
- → Direct gradient on preference pairs
- → Only 2 models (policy + reference)
- → Simpler, more stable
- → Fewer hyperparameters
- → Increasingly preferred (2023–2025)
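The "direct gradient on preference pairs" can be made concrete with the DPO loss for a single pair. Given summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model, the loss is -log σ(β · [(chosen margin) - (rejected margin)]). The sketch below uses made-up log-probabilities; the function name and β = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: no preference learned yet, loss = ln 2
at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one: loss drops
after_training = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(round(at_start, 4), round(after_training, 4))
```

The reference model appears only through the two margins, which is why DPO needs just the policy and a frozen reference rather than a separately trained reward model.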
Full Fine-Tuning Pipeline Core
Define Task & Collect Data
500–2,000 examples in JSONL chat format. Ensure quality over quantity.
Data Validation
Check format, deduplicate, quality filter. Remove examples with empty or low-quality responses.
Choose Base Model
LLaMA 3, Mistral, Qwen — pick based on size, language, licence, and your hardware.
SFT with QLoRA
Unsloth or HuggingFace TRL. r=16, lr=2e-4, 1–3 epochs. Monitor loss convergence.
Evaluate on Hold-out
Task-specific metrics + human evaluation. Check for catastrophic forgetting on general tasks.
Optionally: DPO
Preference tuning on failure cases. Collect chosen/rejected pairs from model outputs.
Merge & Quantise
Merge LoRA adapters into full model. Quantise to GGUF (4-bit) for efficient inference.
Deploy
Ollama (local), vLLM (server), or cloud API. Monitor quality in production, collect feedback.
🎓 Domain 5 Complete — NLP & Large Language Models
- Ch 5.1: NLP = four ambiguity layers: lexical, syntactic, semantic, pragmatic. Classical preprocessing (stopwords, stemming) is NOT used with neural models.
- Ch 5.2: BPE tokenisation: iteratively merge most frequent pairs. GPT-4 uses 100K-vocab BPE (~¾ word per token).
- Ch 5.3: Word2Vec: context predicts embedding. "king − man + woman ≈ queen" — geometry encodes meaning.
- Ch 5.4: Contextual embeddings: same word, different vector per context. Pre-train then fine-tune = the modern NLP paradigm.
- Ch 5.5: GPT = decoder-only, causal attention, autoregressive. Scaling laws: loss ∝ N^(-α). Chinchilla: scale parameters and training tokens in roughly equal proportion.
- Ch 5.6: BERT = encoder-only, bidirectional attention, MLM pre-training. Use for understanding; GPT for generation.
- Ch 5.7: Few-shot ICL: examples in prompt adapt behaviour. Chain-of-thought: "think step by step" dramatically improves reasoning.
- Ch 5.8: RAG: Chunk → Embed → Vector DB → Retrieve → Generate. Solves knowledge cutoff, hallucination on specifics, and private data.
- Ch 5.9: Hallucination = LLMs generate confident falsehoods — trained for fluency not facts. HHH: Helpful, Harmless, Honest.
- Ch 5.10: Fine-tune when prompt+RAG isn't enough. QLoRA fine-tunes 70B on a single GPU. DPO replaces RLHF's complexity.
Most applications today avoid fine-tuning and instead use prompting or RAG — faster, cheaper, and no training infrastructure needed.
Fine-tuning becomes important when:
- Strict behaviour control is needed — consistent output format, tone, or safety guardrails
- Domain-specific patterns must be learned — legal contracts, medical notes, proprietary code styles
→ Covered in depth: Fine-Tuning LLMs (Advanced)
Domain 5 is where theory meets the frontier. The GPT family and BERT established the modern NLP paradigm that all of AI now follows. Prompt engineering, RAG, and fine-tuning are the three tools every AI practitioner uses daily. Domain 8 (Agentic AI) will show how LLMs with tools become autonomous agents. Domain 9 (AI Ethics) will address the alignment and hallucination challenges at scale.