AI Foundation · Domain 07

Reinforcement Learning

MDPs, Bellman equations, Q-learning, policy gradients, RLHF — learning through reward signals from games to LLM alignment.

Chapter 7.1

RL Foundations — Agent, Environment & the MDP Framework

Reinforcement learning is the closest thing AI has to how animals actually learn. No labelled examples. No explicit rules. Just an agent making decisions, receiving rewards or penalties, and slowly figuring out what works. It is the framework behind AlphaGo, ChatGPT's alignment, autonomous robots, and game-playing AIs.

What Is Reinforcement Learning? Core

Reinforcement learning (RL) is the third major machine learning paradigm — alongside supervised and unsupervised learning. Instead of learning from labelled examples, an RL agent learns by interacting with an environment, receiving a reward signal, and adjusting its behaviour to maximise cumulative reward over time.

The learning signal is delayed and sparse. A chess program only receives +1 (win) or −1 (loss) at the very end of the game. It must figure out which of the 40–80 moves it played actually caused that outcome — the credit assignment problem, one of the central challenges in RL.

Three key differences from supervised learning: (1) there is no teacher providing correct action labels; (2) actions affect future states, so decisions are not independent; and (3) training data is generated by the agent's own behaviour — a non-stationary, self-influencing distribution.

🎮

Games & Simulations

Atari games (DQN)
Go (AlphaGo)
StarCraft II (AlphaStar)
Dota 2 (OpenAI Five)
Chess (AlphaZero)

🤖

Robotics & Control

Robot locomotion
Manipulation & grasping
Autonomous driving
Drone control
Industrial automation

🧠

Language & Alignment

RLHF for ChatGPT/Claude
Dialogue systems
Recommendation engines
Algorithmic trading
Drug discovery

Three ML Paradigms — RL learns from sparse, delayed reward signals

The Agent-Environment Loop In-depth

At every discrete time step t, the agent-environment interaction follows a four-step cycle: (1) agent observes state s_t; (2) selects action a_t via policy π; (3) environment transitions to s_{t+1}; (4) environment emits reward r_{t+1}. This repeats until a terminal state (episodic) or runs indefinitely (continuing).
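The four-step cycle can be sketched in a few lines of Python. `CorridorEnv` and `random_policy` below are hypothetical toys invented for this illustration (they come from no library); only the shape of the loop mirrors the text.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""
    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; walls clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0      # sparse reward: only at the goal
        return self.state, reward, done

def random_policy(state, rng):
    """Placeholder policy: ignores the state and acts uniformly at random."""
    return rng.choice([0, 1])

def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()                          # (1) observe s_t
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state, rng)              # (2) select a_t
        state, reward, done = env.step(action)   # (3) next state, (4) reward
        total_reward += reward
        steps += 1
        if done:                                 # terminal state ends the episode
            break
    return total_reward, steps

rng = random.Random(0)
total, steps = run_episode(CorridorEnv(), random_policy, rng)
print(total, steps)
```

A learning agent would replace `random_policy` with something that improves from the observed rewards; the loop itself stays the same.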

The most fundamental tension in RL is exploration vs exploitation. Exploitation means choosing the action believed best right now. Exploration means trying something new to discover potentially better strategies. Every RL algorithm must balance these two competing imperatives.

Agent-Environment Loop — observe → act → receive reward → repeat

Markov Decision Process (MDP) In-depth

RL problems are usually formalised as a Markov Decision Process, defined by the 5-tuple (S, A, P, R, γ): the state space S, action space A, transition probability P(s'|s,a), reward function R(s,a) (written R(s,a,s') when the reward also depends on the next state), and discount factor γ ∈ [0,1].

The cornerstone assumption is the Markov Property: the future depends only on the current state, not on the full history. Formally: P(s_{t+1} | s_t, a_t, …, s_0, a_0) = P(s_{t+1} | s_t, a_t). The current state encodes all information needed to act optimally.

When the agent cannot observe the full state, the problem becomes a Partially Observable MDP (POMDP) — the agent must maintain a belief state, a probability distribution over possible states, greatly increasing difficulty.

MDP State-Transition Diagram — states, actions, probabilities, rewards

Returns & Discounting In-depth

The agent's goal is not to maximise the next reward, but the cumulative discounted reward over the entire episode. The return G_t from time step t is the sum of all future rewards, each discounted by γᵏ where k is how many steps away it lies.

The discount factor γ ∈ [0, 1] controls how much the agent values future rewards. With γ = 0 the agent is purely myopic. With γ = 1 all future rewards are equally valued (only valid for finite-horizon tasks). Typical values are γ = 0.95–0.99. The recursive form G_t = r_{t+1} + γ·G_{t+1} is the key identity that makes Bellman equations tractable.

Return & Discount
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
γ = 0: myopic (only r_{t+1} matters) | γ = 1: fully far-sighted | γ = 0.99: typical production
Recursive form (the key identity for Bellman equations):
$$G_t = r_{t+1} + \gamma G_{t+1}$$

Discount Factor γ — controls how much the agent values future vs immediate reward
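To make the return concrete, here is a minimal sketch that computes G_0 for a finite reward sequence in two ways: with the explicit sum and with the recursive identity. The reward list is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_0 for a finite list of rewards [r_1, r_2, ...]."""
    g = 0.0
    # Work backwards using G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def discounted_return_sum(rewards, gamma):
    """Same quantity via the explicit sum of gamma^k * r_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]                 # hypothetical episode: reward only at the end
print(discounted_return(rewards, 0.9))    # approximately 0.81 (= 0.9^2 * 1)
print(discounted_return(rewards, 0.0))    # myopic: only the first reward counts
print(discounted_return(rewards, 1.0))    # undiscounted: plain sum of rewards
```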

Policy & Value Functions In-depth

A policy π maps states to actions. A deterministic policy π(s) = a always picks the same action; a stochastic policy π(a|s) outputs a probability distribution over actions. The goal: find the optimal policy π* that maximises expected return.

The state value function V^π(s) answers: "How good is state s under policy π?" The action-value function Q^π(s,a) is more granular: "How good is taking action a in state s, then following π?" Q-values are the direct targets of Q-learning and DQN. The advantage A^π(s,a) = Q − V measures how much better action a is compared to the average action from s — used in actor-critic methods to reduce variance.

Value Functions
$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
Relationship:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$$
Advantage:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

Value Functions on a Grid World — V(s) heat and optimal policy arrows
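A small numeric check of the relationship between V, Q and the advantage may help. The Q-values and policy probabilities below are invented for a single state with three actions; they are not taken from any real environment.

```python
import numpy as np

# Hypothetical numbers for one state s with three actions
q = np.array([1.0, 2.0, 4.0])         # Q^pi(s, a) for a = 0, 1, 2
pi = np.array([0.5, 0.25, 0.25])      # pi(a | s), sums to 1

v = float(pi @ q)                     # V^pi(s) = sum_a pi(a|s) * Q^pi(s,a)
advantage = q - v                     # A^pi(s,a) = Q^pi(s,a) - V^pi(s)

print(v)                              # 2.0
print(advantage)                      # [-1.  0.  2.]
print(float(pi @ advantage))          # policy-weighted advantage is zero
```

The last line illustrates why the advantage is a useful baseline-corrected signal: averaged under the policy itself, it is zero by construction.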

Bellman Equations In-depth

Richard Bellman's 1957 insight is the mathematical spine of all RL. The Bellman Expectation Equation expresses V^π(s) recursively: the value of state s equals the immediate reward plus the discounted value of the next state, averaged over all possible next states. This converts an infinite-horizon problem into a tractable recursive computation.

The Bellman Optimality Equation does the same for the optimal value function V*: the optimal value is obtained by always choosing the best possible action. Every major RL algorithm — Q-learning, DQN, PPO, SAC — is explicitly or implicitly solving a form of this equation.

The Bellman equation is why RL works at all: it converts predicting infinite-horizon returns into predicting one-step rewards plus bootstrapped future values. Without it, every policy gradient update would require complete Monte Carlo rollouts — far too slow for complex tasks.

Bellman Expectation (policy π)
$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V^\pi(s')\bigr]$$
Bellman Optimality Equations
$$V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V^*(s')\bigr]$$
$$Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\bigr]$$
Optimal policy:
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

Bellman Backup — value of s = best action × (reward + discounted next-state values)
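For a fixed policy on a small MDP, the Bellman expectation equation is just a linear system, which gives a quick way to see the recursion at work. The sketch below assumes the standard matrix rewriting V = R_π + γ P_π V, where `P_pi` and `R_pi` (names chosen here, not from the text) are the transition matrix and expected one-step reward already averaged over the policy. The two-state chain is a made-up example.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state chain under a fixed policy pi
P_pi = np.array([[0.5, 0.5],     # P_pi[s, s']: from state 0, stay or move on
                 [0.0, 1.0]])    # state 1 is absorbing
R_pi = np.array([1.0, 0.0])      # expected immediate reward in each state

# Solve (I - gamma * P_pi) V = R_pi, i.e. V = R_pi + gamma * P_pi V
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

print(V)                                           # V[0] = 1 / 0.55, V[1] = 0
print(np.allclose(V, R_pi + gamma * P_pi @ V))     # the Bellman equation holds
```

Solving the system directly costs roughly O(|S|³), which is one reason iterative sweeps are preferred as the state space grows.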

RL Algorithm Taxonomy Core

RL algorithms split along two major axes: (1) whether they use an explicit model of transition dynamics, and (2) whether they optimise a value function, directly a policy, or both via an actor-critic hybrid.

RL Algorithm Taxonomy — Model-Based vs Model-Free; Value vs Policy vs Actor-Critic

Algorithm	Type	Action Space	Needs Model?	Sample Efficiency	Stability	Chapter
Value Iteration	DP	Discrete	Yes (full)	N/A (planning)	High	7.2
Q-Learning	TD, off-policy	Discrete	No	Low	Medium	7.4
SARSA	TD, on-policy	Discrete	No	Low	High	7.4
DQN	Deep Q, off-policy	Discrete	No	Medium	Medium	7.5
REINFORCE	Policy Gradient	Cont/Disc	No	Very Low	Low	7.6
PPO	Actor-Critic, on-policy	Cont/Disc	No	Medium	High	7.7
SAC	Actor-Critic, off-policy	Continuous	No	High	High	7.7
MuZero	Model-Based	Discrete	Yes (learned)	High	High	7.8

∑ Chapter 7.1 — Key Takeaways

RL: agent maximises cumulative reward via trial-and-error interaction — no teacher, no labelled data
Credit assignment problem: which of 40–80 chess moves caused the win/loss 50 steps later?
MDP: (S, A, P, R, γ) — the formal framework; Markov property: future depends only on current state
Return: G_t = Σ_k γ^k r_{t+k+1}; the discount factor γ controls short- vs long-term thinking
V^π(s): "how good is this state?"; Q^π(s,a): "how good is this action in this state?"
Bellman equations: V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γV*(s')]; the recursive foundation of every RL algorithm
Algorithm families: Model-Based vs Model-Free; Value-Based vs Policy-Based vs Actor-Critic

Chapter 7.2

Dynamic Programming — Solving MDPs with a Perfect Model

DP Overview Core

Dynamic Programming (DP) is a family of algorithms that solve MDPs exactly — given perfect knowledge of the environment model: every transition probability P(s'|s,a) and every reward R(s,a). Because real environments rarely hand us this model, DP is not deployed in production RL — but it is the indispensable theoretical bedrock every modern algorithm is built on.

📐

Full Model Required

Knows P(s'|s,a) and R(s,a) for every (s,a,s') tuple — the "planning" setting.

🔁

Iterative Bellman Sweeps

Repeatedly apply the Bellman operator T across all states until V converges to V*.

✅

Guaranteed Convergence

The Bellman operator is a γ-contraction — V_k → V* is mathematically guaranteed.

Bellman Operator (applied each sweep):
$$V_{k+1}(s) \leftarrow (\mathcal{T} V_k)(s)$$
Under mild conditions: $V_k \to V^*$ as $k \to \infty$

Two main algorithms arise from DP: Policy Iteration (evaluate then improve a fixed policy in alternating steps) and Value Iteration (combine both steps into a single max-Bellman update). Both converge; they differ in computational trade-offs.

Policy Evaluation In-depth

Given a fixed policy π, Policy Evaluation computes V^π(s) for every state by repeatedly applying the Bellman expectation equation until the values stop changing (convergence threshold θ).

Policy Evaluation update rule:
$$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V_k(s')\bigr]$$
Convergence: $V_k \to V^\pi$ as $k \to \infty$ (Bellman operator is a contraction mapping)

Policy Evaluation — V values converge after repeated Bellman sweeps
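The update rule can be sketched with NumPy as follows. The array layout (`P[s, a, s']`, `R[s, a, s']`, `pi[s, a]`) and the three-state chain MDP are assumptions made for this example. The sketch also uses synchronous sweeps (a fresh copy of V each iteration) rather than in-place updates; both variants converge for γ < 1.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical 3-state chain: 0 -> 1 -> 2 (goal). Actions: 0 = stay, 1 = advance."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = 1.0      # stay in state 0
    P[0, 1, 1] = 1.0      # advance 0 -> 1
    P[1, 0, 1] = 1.0      # stay in state 1
    P[1, 1, 2] = 1.0      # advance 1 -> 2
    P[2, :, 2] = 1.0      # goal is absorbing with zero reward
    R[1, 1, 2] = 1.0      # +1 for reaching the goal
    return P, R

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman expectation update until values stop changing."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = (pi * Q).sum(axis=1)          # average over the policy's actions
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

P, R = make_chain_mdp()
uniform_pi = np.full((3, 2), 0.5)             # pick stay/advance with equal probability
V = policy_evaluation(P, R, uniform_pi)
print(V.round(4))
```

For this chain the fixed point can be checked by hand: V(1) = 0.5 / 0.55 and V(0) = 0.45 · V(1) / 0.55.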

Policy Improvement Core

Once we have V^π for a fixed policy π, the Policy Improvement Theorem guarantees that acting greedily with respect to V^π yields a policy that is at least as good, and strictly better in at least one state unless π is already optimal.

Greedy policy improvement:
$$\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V^\pi(s')\bigr]$$
Theorem: $V^{\pi'}(s) \ge V^\pi(s)$ for all s, so policy π' is at least as good as π

📊

Evaluate first

Run policy evaluation to get accurate V^π values for the current policy.

⬆️

Improve greedily

New policy π'(s) takes the action maximising Q^π(s,a) at each state.

🏁

Stop when stable

If π' = π (no action changed), the policy is already optimal: π = π*.

Policy Iteration In-depth

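Policy Iteration alternates between Policy Evaluation and Policy Improvement until the policy stabilises. Each evaluation may need many inner sweeps, but the outer loop usually converges in few iterations in practice, often only a handful. For a finite MDP it must terminate, because there are only finitely many deterministic policies and each improvement step never makes the policy worse.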

Policy Iteration — evaluate current policy then improve greedily

// Policy Iteration
Initialise π(s) = arbitrary policy for all s ∈ S
Initialise V(s) = 0 for all s ∈ S

LOOP (outer — policy improvement):
  // ── Step 1: Policy Evaluation (inner loop) ──
  LOOP:
    Δ = 0
    FOR each state s ∈ S:
      v = V(s)
      V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ·V(s')]
      Δ = max(Δ, |v - V(s)|)
    IF Δ < θ: BREAK   // inner convergence

  // ── Step 2: Policy Improvement ──
  policy_stable = TRUE
  FOR each state s ∈ S:
    old_action = π(s)
    π(s) ← argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ·V(s')]
    IF old_action ≠ π(s): policy_stable = FALSE

  IF policy_stable: RETURN π, V   // converged to π*, V*
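A runnable sketch of the pseudocode above, under stated assumptions: the MDP is stored as dense arrays `P[s, a, s']` and `R[s, a, s']`, the policy is deterministic (one action index per state), and the three-state chain is a made-up test case. Ties in the argmax are broken towards the lowest action index.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical 3-state chain: 0 -> 1 -> 2 (goal). Actions: 0 = stay, 1 = advance."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = 1.0
    P[1, 0, 1] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0              # absorbing goal, zero reward afterwards
    R[1, 1, 2] = 1.0              # +1 for reaching the goal
    return P, R

def policy_iteration(P, R, gamma=0.9, theta=1e-10):
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)    # arbitrary start: action 0 everywhere
    V = np.zeros(n_states)
    while True:
        # Step 1: policy evaluation with in-place sweeps
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                a = pi[s]
                V[s] = np.sum(P[s, a] * (R[s, a] + gamma * V))
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Step 2: greedy policy improvement
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):    # policy stable: done
            return pi, V
        pi = new_pi

P, R = make_chain_mdp()
pi, V = policy_iteration(P, R)
print(pi, V)
```

On this chain the outer loop stabilises after three improvement steps, with the policy advancing in both non-terminal states.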

Value Iteration In-depth

Value Iteration eliminates the inner evaluation loop by fusing policy evaluation and improvement into a single Bellman optimality update. Instead of summing over π(a|s), it takes the max over actions — implicitly acting greedily at every step.

Value Iteration update (repeat until convergence):
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V_k(s')\bigr]$$
After convergence, extract the policy:
$$\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\,\bigl[R(s,a,s') + \gamma V^*(s')\bigr]$$

// Value Iteration
Initialise V(s) = 0 for all s ∈ S
Set θ = 1e-6 (convergence threshold), γ ∈ [0,1)

LOOP:
  Δ = 0
  FOR each state s ∈ S:
    v = V(s)
    // Bellman optimality operator — single sweep, no inner loop
    V(s) ← max_a [ Σ_s' P(s'|s,a) * (R(s,a,s') + γ * V(s')) ]
    Δ = max(Δ, |v - V(s)|)
  IF Δ < θ: BREAK   // values converged

// Extract greedy policy from V*
FOR each state s ∈ S:
  π*(s) ← argmax_a [ Σ_s' P(s'|s,a) * (R(s,a,s') + γ * V(s')) ]

RETURN π*, V*
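The same algorithm as a vectorised NumPy sketch. As before, the dense `P[s, a, s']` / `R[s, a, s']` layout and the three-state chain are assumptions for illustration, and the sweep is synchronous rather than in-place.

```python
import numpy as np

def make_chain_mdp():
    """Hypothetical 3-state chain: 0 -> 1 -> 2 (goal). Actions: 0 = stay, 1 = advance."""
    P = np.zeros((3, 2, 3))
    R = np.zeros((3, 2, 3))
    P[0, 0, 0] = P[0, 1, 1] = 1.0
    P[1, 0, 1] = P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0              # absorbing goal
    R[1, 1, 2] = 1.0              # +1 for reaching the goal
    return P, R

def value_iteration(P, R, gamma=0.9, theta=1e-10):
    V = np.zeros(P.shape[0])
    while True:
        # One Bellman optimality sweep over all states at once
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            # Greedy policy read off from the converged action values
            return Q.argmax(axis=1), V_new
        V = V_new

P, R = make_chain_mdp()
policy, V = value_iteration(P, R)
print(policy, V)
```

The value of the goal reward propagates backwards one state per sweep: V(1) becomes 1 after the first sweep and V(0) becomes 0.9 after the second.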

Value Iteration — values propagate from goal outward until convergence

Limitations of DP Core

DP is mathematically elegant but practically limited. Three hard walls prevent it from scaling to real-world problems — and each wall motivated a different branch of modern RL.

🗺️

Requires Full Model

P(s'|s,a) and R(s,a) must be known exactly. Real environments (games, robots, markets) don't hand you their dynamics.

💥

Curse of Dimensionality

State space explodes exponentially. Atari screen: 160×210×128 colours. Chess: ~10⁴⁴ states. Full sweeps become impossible.

⏱️

Full-State Sweeps

Every iteration updates every state: O(|S|²|A|). Wastes compute on rarely-visited states. Sample-based methods only update visited states.

Property	Dynamic Programming	Model-Free RL (upcoming)
Environment model	Requires P(s'|s,a) and R	Learns from sampled experience
Computation per step	O(|S|²|A|) (sweeps all states)	O(1) per sample (updates visited states only)
Convergence	Guaranteed (exact)	Guaranteed under conditions
Value representation	Table — one entry per state	Neural network (deep RL) — generalises
Applicability	Small, known, discrete MDPs	Unknown, large, continuous environments

Despite its impracticality, DP is the theoretical target all RL algorithms approximate. Monte Carlo methods replace full sweeps with sampled episodes. TD learning (Chapter 7.3) bootstraps from partial trajectories. Deep RL (Chapter 7.5+) replaces the value table with a neural network — enabling Atari, Go, and robotic control at scale.

∑ Chapter 7.2 — Key Takeaways

DP requires the full MDP model (P and R) — not available in real problems but conceptually essential
Policy Evaluation: iterative Bellman expectation updates converge to V^π for a fixed policy
Policy Improvement Theorem: greedy w.r.t. V^π is always at least as good as π
Policy Iteration: alternate evaluation + improvement — guaranteed convergence in very few outer iterations
Value Iteration: single Bellman optimality sweep per iteration — simpler, no inner loop, equally guaranteed
Limitation: curse of dimensionality — model-free methods overcome this by learning from sampled experience

Chapter 7.3

Monte Carlo & TD Learning — Model-Free Prediction and Control

Monte Carlo Methods Core

Monte Carlo (MC) methods learn directly from complete episodes of experience — no model of P(s'|s,a) needed. The idea is disarmingly simple: visit state s many times, record the actual return G_t each time, and average them. Like estimating the probability of heads by flipping a coin many times and averaging the results.

🎲

No Model Required

Learn directly from sampled episodes — P(s'|s,a) never needed.

⏳

Wait for Episode End

Must complete a full episode before any value update. Episodic tasks only.

📊

Average Actual Returns

V(s) ≈ mean of all G_t observed when s was visited — unbiased estimate.

MC Value Update (incremental form):
$$V(s) \leftarrow V(s) + \alpha\,\bigl[G_t - V(s)\bigr]$$
G_t = actual return from this episode · α = learning rate · [G_t − V(s)] = error between actual return and current estimate

Two variants: First-visit MC averages returns from only the first visit to s per episode; Every-visit MC averages returns from every visit. Both converge to V^π with enough data.
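A minimal sketch of first-visit MC prediction. To keep it deterministic, the episodes are given as hand-written lists of `(state, reward)` pairs (the reward is the one received after leaving that state); this data format and the two episodes are invented for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Average the return following the first visit to each state in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so that G_t = r_{t+1} + gamma * G_{t+1}
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Later overwrites come from earlier time steps, so the value
            # that survives belongs to the first visit.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("A", 0.0), ("B", 0.0), ("C", 1.0)],   # A -> B -> C -> goal
    [("A", 0.0), ("A", 0.0), ("B", 1.0)],   # A visited twice; only the first visit counts
]
V = first_visit_mc(episodes)
print(V)
```

Every-visit MC would instead append the return at each occurrence of a state, so the second episode would contribute two samples for A.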

MC Prediction In-depth

The prediction problem: estimate V^π(s) for a given policy π. Algorithm: run many episodes following π, compute actual returns at each visit to s, average them. Classic example — Blackjack: state is (player sum, dealer card, usable ace?); policy is hit if sum < 20 else stick. After 10,000 episodes a clear value landscape emerges.

MC Prediction — average actual returns from many episodes to estimate V(s)

MC Control Core

Control means finding the optimal policy, not just evaluating a given one. MC Control applies the GPI (Generalised Policy Iteration) loop using MC evaluation — but requires special care around exploration.

🎯

Update Q(s,a)

Estimate Q(s,a) from actual returns — then improve: π(s) = argmax_a Q(s,a).

🔍

Exploration Problem

Must visit all (s,a) pairs. Exploring starts: begin episodes at random (s,a). ε-greedy: with prob ε take random action.

🔄

On vs Off-Policy

On-policy: improve policy being followed. Off-policy: follow behaviour policy b, evaluate target π via importance sampling.

Temporal-Difference (TD) Learning In-depth

TD learning is the central idea of modern RL — it combines the best of Monte Carlo and Dynamic Programming. Like MC: learns from raw experience, no model needed. Like DP: bootstraps — updates V(s) using the current estimate of V(s'), not the full actual return. The payoff: update after every single step, not after the whole episode.

📡

Online Updates

Update after every transition (s, a, r, s'). Works for continuing tasks — no episode boundary needed.

🔗

Bootstrapping

Uses V(s') to update V(s) — the "TD target" rₜ₊₁ + γV(sₜ₊₁) replaces the full return G_t.

⚡

TD Error δₜ

δₜ = rₜ₊₁ + γV(sₜ₊₁) − V(sₜ): the "surprise". Dopamine neuron activity in the brain is widely modelled as a reward-prediction (TD) error.

TD(0) Update:
$$V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$$
TD error:
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
r_{t+1} = observed reward · V(s_{t+1}) = bootstrapped estimate · α = learning rate

TD(0) Algorithm In-depth

TD(0) is the simplest TD algorithm — 1-step bootstrapping. It updates V(s_t) immediately after observing the next reward and state, using only one step of lookahead.

// TD(0) Policy Evaluation
Initialise V(s) = 0 for all s ∈ S
Set α = 0.1, γ = 0.9

FOR each episode:
  Initialise state s
  LOOP (each step t until terminal):
    a  ~ π(·|s)                    // sample action from policy
    r, s' ← env.step(a)            // observe reward and next state

    δ = r + γ * V(s') - V(s)       // TD error
    V(s) ← V(s) + α * δ            // online update — no need to wait!

    s ← s'
  END LOOP
END FOR

// Worked trace (chain: A → B → C → GOAL, γ=0.9, α=0.1):
// t=0: s=A, r=0, s'=B  → δ=0+0.9×0−0=0    → V(A)=0
// t=1: s=B, r=0, s'=C  → δ=0+0.9×0−0=0    → V(B)=0
// t=2: s=C, r=+1, s'=∅ → δ=1+0.9×0−0=+1   → V(C)=0.1
// (next episode, s=C again):
//   δ = 1+0−0.1 = +0.9  → V(C)=0.1+0.1×0.9=0.19
// After many episodes:  V(C)→1.0, V(B)→0.9, V(A)→0.81
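The worked trace can be reproduced with a few lines of Python. This sketch hard-codes the deterministic chain A → B → C → GOAL from the trace (so no policy or environment object is needed); the function name `td0_chain` is made up for the example.

```python
def td0_chain(n_episodes, alpha=0.1, gamma=0.9):
    """Run TD(0) on the deterministic chain A -> B -> C -> GOAL."""
    states = ["A", "B", "C"]
    V = {s: 0.0 for s in states}
    V["GOAL"] = 0.0                      # terminal value stays fixed at 0
    for _ in range(n_episodes):
        for i, s in enumerate(states):
            s_next = states[i + 1] if i + 1 < len(states) else "GOAL"
            r = 1.0 if s_next == "GOAL" else 0.0
            delta = r + gamma * V[s_next] - V[s]   # TD error
            V[s] += alpha * delta                  # online update after each step
    return V

print(td0_chain(1))      # after episode 1 only V(C) has moved, to 0.1
print(td0_chain(2))      # V(C) is about 0.19, and V(B) has started to move
print(td0_chain(2000))   # close to V(A)=0.81, V(B)=0.9, V(C)=1.0
```

Note that in the second episode V(B) also changes (to about 0.009), because it bootstraps from the V(C) = 0.1 learned in the first episode. This is how value propagates backwards one state per episode.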

TD vs MC vs DP — online bootstrap vs wait-for-return vs model sweep

TD vs MC vs DP Core

Property	Monte Carlo	TD(0)	Dynamic Programming
Needs model?	No — samples	No — samples	Yes — full P, R
Update timing	End of episode	Every step (online)	Every sweep (all states)
Bootstraps?	No — actual return G_t	Yes — uses V(s') estimate	Yes — from model
Bias / Variance	Zero bias, high variance	Some bias, lower variance	Zero bias, zero variance
Continuing tasks	No — needs terminal	Yes	Yes
Memory per update	Full episode stored	s_t, a_t, r_t+1, s_t+1 only	Full V(s) table
Convergence	To V^π (enough data)	To V^π (under conditions)	To V^π exactly

Eligibility Traces: TD(λ) Core

TD(0) only updates the immediately preceding state; MC updates all states in an episode. TD(λ) interpolates between them via a parameter λ ∈ [0,1]: at λ=0 it is pure TD(0); at λ=1 it is equivalent to MC. The mechanism is the eligibility trace eₜ(s) — a running score of how recently and frequently each state was visited.

Eligibility trace update (for every state s, every step):
$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbb{1}[s = s_t]$$
$$V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s) \quad \text{for all } s$$
λ=0 → only current state updated (TD(0)) · λ=1 → all states updated (≈ MC)

Eligibility Traces — decay proportional to recency and frequency of state visits
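A minimal sketch of the backward view with accumulating traces, run for a single episode on a deterministic chain A → B → C → GOAL with reward +1 on reaching the goal. The chain and the function name are assumptions for the example. It shows how λ controls how far the final reward is pushed back in one episode.

```python
def td_lambda_episode(lam, alpha=0.1, gamma=0.9):
    """One episode of TD(lambda) with accumulating traces, starting from V = 0."""
    states = ["A", "B", "C"]
    V = {s: 0.0 for s in states + ["GOAL"]}
    e = {s: 0.0 for s in states}               # eligibility traces
    for i, s in enumerate(states):
        s_next = states[i + 1] if i + 1 < len(states) else "GOAL"
        r = 1.0 if s_next == "GOAL" else 0.0
        for x in e:                            # decay every trace
            e[x] *= gamma * lam
        e[s] += 1.0                            # bump the trace of the current state
        delta = r + gamma * V[s_next] - V[s]   # TD error
        for x in e:                            # credit all states by their trace
            V[x] += alpha * delta * e[x]
    return V

print(td_lambda_episode(0.0))   # only C is updated, as in TD(0)
print(td_lambda_episode(1.0))   # A, B and C all receive credit
print(td_lambda_episode(0.5))   # in between
```

With λ = 1 the first episode already yields V(B) = 0.09 and V(A) = 0.081, whereas TD(0) needs several episodes to propagate the reward that far.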

∑ Chapter 7.3 — Key Takeaways

MC: learn from complete episodes — average actual returns — no model, zero bias, high variance
TD: learn online after every step via bootstrapping — combines strengths of MC and DP
TD error δ_t = r_t+1 + γV(s_t+1) − V(s_t) — the "surprise" that drives learning
TD(0): lower variance than MC but biased by bootstrapped V(s') estimate
TD vs MC: TD is more data-efficient and works on continuing tasks — MC needs full episodes
TD(λ): eligibility traces bridge TD(0) and MC — λ=0 is TD(0), λ=1 is equivalent to MC

Chapter 7.4

Q-Learning & SARSA — Tabular TD Control Algorithms

Q-Learning In-depth

Watkins (1989) introduced Q-Learning — still the most widely taught RL control algorithm. It learns the optimal action-value function Q*(s,a) directly, regardless of what policy is being followed during training. This makes it off-policy: the agent can explore freely, and the update still converges to the optimal values.

🎯

Learns Q*(s,a) Directly

No need to know the model. Converges to the optimal Q-function under mild tabular conditions.

📴

Off-Policy

Update uses max_a' Q(s',a') — the best possible next action, not the one actually taken.

🏆

Optimal Policy

After convergence: π*(s) = argmax_a Q*(s,a). Extract policy for free from Q-table.

Q-Learning Update:
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,\bigl[r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\bigr]$$
TD target: $r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a')$ (greedy next action) · TD error: δ = target − Q(s_t,a_t) · the max uses the best possible next action, not the one actually taken

// Q-Learning Algorithm
Initialise Q(s,a) = 0 for all s ∈ S, a ∈ A
Set α=0.1, γ=0.99, ε=1.0

FOR each episode:
  s = env.reset()
  LOOP until terminal:
    // ε-greedy action selection
    IF random() < ε:  a = random action       // explore
    ELSE:              a = argmax_a Q(s,a)     // exploit

    r, s' = env.step(a)

    // Off-policy update: always uses max over next actions
    best_next = max over a' of Q(s', a')
    Q(s,a) += α * (r + γ * best_next - Q(s,a))

    s = s'

  ε = max(ε_min, ε * ε_decay)    // decay exploration rate

Q-Table & Updates In-depth

In tabular RL the Q-function is stored as a 2-D table: rows are states, columns are actions, cells hold Q(s,a) estimates. Each transition updates one cell. Below: a step-by-step trace through a 4-state chain MDP, and a full Gymnasium implementation.

Q-Table Update — trace through one transition with actual numbers

import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states  = env.observation_space.n   # 16 states (4×4 grid)
n_actions = env.action_space.n        # 4 actions

Q = np.zeros((n_states, n_actions))

alpha, gamma  = 0.1, 0.99
epsilon       = 1.0
eps_min       = 0.01
eps_decay     = 0.995
n_episodes    = 5000

for ep in range(n_episodes):
    state, _ = env.reset()

    for _ in range(200):
        # ε-greedy action
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

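        # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)

        # Q-Learning update: bootstrap from next_state unless the episode terminated
        td_target        = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (td_target - Q[state, action])

        state = next_state
        if terminated or truncated: break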

    epsilon = max(eps_min, epsilon * eps_decay)

print("Learned Q-table:\n", Q.reshape(4, 4, 4).round(2))

ε-Greedy Exploration Core

The exploration-exploitation dilemma is central to all RL: to learn good values you must explore, but to perform well you must exploit. ε-greedy is the simplest solution — take a random action with probability ε, else take the best known action. ε is annealed from 1.0 (pure explore) down to a small floor (mostly exploit).

🎲

ε = 1.0 (start)

Fully random — discover the environment before exploiting any knowledge.

📉

ε decays each episode

ε ← max(ε_min, ε × 0.995) — learning happens in the transition zone.

🏆

ε = 0.01 (end)

Mostly exploit Q*; tiny ε keeps discovering state-action pairs not yet visited.

ε Decay — exploration gives way to exploitation as Q-table fills in
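A short sketch of ε-greedy selection and the multiplicative decay schedule described above. The helper names are invented for illustration; the numbers (start 1.0, floor 0.01, decay 0.995) are the ones used in this chapter.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                     # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])    # exploit

def episodes_until_floor(eps_start=1.0, eps_min=0.01, decay=0.995):
    """Count episodes of multiplicative decay until epsilon hits its floor."""
    eps, n = eps_start, 0
    while eps > eps_min:
        eps = max(eps_min, eps * decay)
        n += 1
    return n

rng = random.Random(0)
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))   # 1 (pure exploitation)
print(episodes_until_floor())                                   # 919
```

With these settings ε reaches its floor after 919 episodes, so in a 5,000-episode run roughly the first fifth is spent in the transition from exploring to exploiting.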

SARSA — On-Policy TD Control In-depth

SARSA (Rummery & Niranjan, 1994) is Q-learning's on-policy cousin. The name comes from the 5-tuple it uses per update: (S_t, A_t, R_t+1, S_t+1, A_t+1). The key difference: the update uses the actual next action taken — not the greedy max.

SARSA Update:
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,\bigl[r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\bigr]$$
a_{t+1} = actual next action chosen by the ε-greedy policy (on-policy) · Q-learning instead uses max_{a'} Q(s_{t+1},a') (off-policy)

// SARSA Algorithm
FOR each episode:
  s = env.reset()
  a = ε_greedy(Q, s)        // select FIRST action

  LOOP until terminal:
    r, s' = env.step(a)
    a' = ε_greedy(Q, s')    // select NEXT action BEFORE updating

    // SARSA: uses actual (s, a, r, s', a') — not max
    Q(s,a) += α * (r + γ * Q(s',a') - Q(s,a))

    s = s'
    a = a'                  // carry forward the chosen action
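The difference between the two update rules is easiest to see on a single transition. In the sketch below the tiny Q-table is made up: in the next state, action 0 is good and action 1 is disastrous (think of stepping off a cliff). The two targets agree when the next action is greedy and diverge when the ε-greedy policy happens to explore.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: assumes the greedy action will be taken next."""
    return r + gamma * np.max(Q[s_next])

def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: uses the action the agent actually chose next."""
    return r + gamma * Q[s_next, a_next]

# Hypothetical Q-table: rows are states, columns are actions
Q = np.array([[0.0, 0.0],
              [5.0, -100.0]])     # in state 1, action 1 is catastrophic
r, gamma, s_next = -1.0, 1.0, 1

print(q_learning_target(Q, r, s_next, gamma))          # 4.0
print(sarsa_target(Q, r, s_next, a_next=0, gamma=gamma))   # 4.0 (greedy next action)
print(sarsa_target(Q, r, s_next, a_next=1, gamma=gamma))   # -101.0 (exploratory next action)
```

Because SARSA occasionally sees targets like the last one, its estimates for states next to a hazard are pulled down, which is what pushes it towards more cautious behaviour.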

On-Policy vs Off-Policy In-depth

The on/off-policy distinction is subtle, but it has a concrete consequence near hazards. Q-learning sees the world through the lens of optimal behaviour and ignores its own exploration noise. SARSA accounts for the fact that it sometimes takes random actions, and learns a more cautious policy as a result.

Property	Q-Learning (Off-Policy)	SARSA (On-Policy)
Learns	Optimal Q*(s,a) directly	Q^π for policy being followed
Next-action in update	max_a' Q(s',a') — best possible	Q(s',a') — actual action taken
Behaviour = Target policy?	No — can use any behaviour policy	Yes — same policy for exploration & learning
Convergence speed	Faster to optimal	Slower — more conservative
Near hazards	Risky — ignores exploration falls	Safer — accounts for ε random steps
Used in	DQN, most deep RL	Safety-critical, tabular control

Cliff Walking — Q-Learning vs SARSA In-depth

The Cliff Walking benchmark (Sutton & Barto) makes the on/off-policy difference vivid. A 4×12 grid: start bottom-left, goal bottom-right. The entire bottom row between them is cliff — R=−100 and agent resets. Normal steps cost R=−1.

Cliff Walk — Q-Learning optimal but risky; SARSA safe but suboptimal

∑ Chapter 7.4 — Key Takeaways

Q-learning: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)] — off-policy
SARSA: Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)] — on-policy (actual next action)
ε-greedy: ε=1.0 → pure explore → anneal to ε_min as agent learns the Q-table
Q-learning: finds optimal path; SARSA: finds safe path accounting for ε exploration noise
Both converge — Q-learning to Q*, SARSA to Q^π — for tabular environments with sufficient data
Limitation: tabular — can't handle large/continuous state spaces → solved by DQN (Ch 7.5)

Chapter 7.5

Deep Q-Networks — Deep Learning Meets Reinforcement Learning

From Tabular to Function Approximation Core

Tabular Q-learning stores one number per (state, action) pair. That works fine for small, discrete environments, but consider Atari Pong: 160×210 pixels with 128 colours per pixel gives 128^33,600 ≈ 10^70,800 possible screens. A Q-table is not just impractical, it is cosmologically impossible. Continuous control (robot joints, autonomous driving) has an infinite state space.

The solution: approximate Q(s,a) ≈ Q(s,a;θ) using a parameterised function (neural network with weights θ). The network generalises — similar inputs produce similar outputs — so states never seen before can still receive reasonable Q-value estimates. A Deep Q-Network (DQN) is a CNN mapping raw pixels directly to Q-values.

DQN reached a wide audience with the 2015 Nature paper (an earlier version appeared as a 2013 workshop paper), which reported a striking result: one network, with one set of hyperparameters, learned to play 49 Atari games from raw pixels and reached at least 75% of a professional human game tester's score on 29 of them, with zero game-specific knowledge. This was the moment deep learning and RL merged into "deep reinforcement learning".

Mnih et al. — "Human-level control through deep reinforcement learning", Nature 2015

♾️

Infinite State Spaces

A network with millions of weights can represent Q-values over an astronomically large state space.

🧠

Generalisation

Similar pixel patterns → similar Q-values. The agent recognises "ball near paddle" across screen positions.

🕹️

Single Architecture

Same CNN, same loss function, same hyperparameters — 49 games, no task-specific tuning.

DQN Architecture In-depth

The DQN input is 4 stacked 84×84 grayscale frames — the stack gives the network a sense of motion (velocity of the ball, direction of movement) that a single frame cannot convey. Three convolutional layers extract spatial features; two FC layers decode those features into one Q-value per action.

DQN Architecture — raw Atari pixels → Q-values for each action

import torch
import torch.nn as nn

class DQN(nn.Module):
    """DQN as in the original Atari paper (Mnih et al. 2015)"""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions)   # one output per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) — normalise pixels to [0, 1]
        return self.fc(self.conv(x / 255.0))

net = DQN(n_actions=6)
x = torch.randint(0, 256, (32, 4, 84, 84), dtype=torch.float32)  # upper bound is exclusive, so values span 0..255
q_values     = net(x)                      # (32, 6) — Q-value per action
best_actions = q_values.argmax(dim=1)      # (32,)  — greedy action
print(f"Output shape: {q_values.shape}")   # torch.Size([32, 6])

Experience Replay & Target Networks In-depth

Naively applying gradient descent to TD errors with a neural network diverges. DQN introduces two stabilisation innovations that together tame the instability and make deep RL practical.

🗄️

Problem: Correlation

Sequential transitions (s₀→s₁→s₂…) are highly correlated. Neural nets need i.i.d. samples — not time-series data.

🔀

Replay Buffer

Store 1M transitions. Sample random mini-batches of 32. Each transition reused many times → better data efficiency.

🎯

Target Network

Frozen copy Q(s,a;θ⁻), synced to θ only every C steps (1,000 here; the Nature paper used 10,000). Stable TD target = stable training. Without it: chasing a moving target.

Experience Replay + Target Network — the two stabilisation innovations of DQN
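A replay buffer needs very little machinery. The sketch below is one possible minimal version built on the standard library; the class name, method names and the five-field transition layout (including a `done` flag) are choices made for this example, not the original DQN code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""
    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose a list of transitions into five parallel columns
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=5, seed=0)
for i in range(10):                            # integers stand in for real states
    buf.push(i, 0, 0.0, i + 1, False)

states, actions, rewards, next_states, dones = buf.sample(3)
print(len(buf), states, next_states)
```

In a real agent the columns would be stacked into tensors before being fed to the network, and the capacity would be far larger (the text above mentions 1M transitions).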

DQN Training Loop Core

The full DQN training loop combines all the pieces into a clean 7-step cycle that repeats millions of times until convergence.

① Observe s_t

Stack 4 frames, preprocess to 84×84 grayscale

② Select a_t

ε-greedy from Q-network output

③ Execute a_t

Observe r_t+1, s_t+1 from env

④ Store

Push (s_t, a_t, r_t+1, s_t+1) to replay buffer

⑤ Sample batch

32 random transitions from replay buffer

⑥ Compute loss

TD error using frozen target network θ⁻

⑦ Update θ

Backprop; sync θ⁻ ← θ every 1K steps

DQN Training Curve — characteristic rise from random to human-level

Double DQN Core

Van Hasselt et al. (2016) identified that standard DQN systematically overestimates Q-values. The culprit: max_a' Q(s',a';θ⁻) always selects the highest value — if any Q-value is overestimated (inevitable early in training), it picks that inflated value, and the bias propagates. Fix: decouple action selection from action evaluation.

Standard DQN target (prone to overestimation):
$$y = r + \gamma \max_{a'} Q(s',a';\theta^-)$$
Double DQN target (less biased):
$$a^* = \arg\max_a Q(s',a;\theta) \quad \text{(online network selects the action)}$$
$$y = r + \gamma\, Q(s',a^*;\theta^-) \quad \text{(target network evaluates it)}$$
One network selects, a different network evaluates → overestimation bias is reduced
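The two targets differ by a single indexing step, which a small NumPy sketch makes visible. The batch of Q-values below is made up, and the `done` mask (which zeroes the bootstrap term at terminal states) is a common implementation detail assumed here rather than taken from the formulas above.

```python
import numpy as np

def dqn_target(r, q_next_target, done, gamma=0.99):
    """Standard target: the target network both selects and evaluates."""
    return r + gamma * q_next_target.max(axis=1) * (1.0 - done)

def double_dqn_target(r, q_next_online, q_next_target, done, gamma=0.99):
    """Double DQN target: online network selects, target network evaluates."""
    a_star = q_next_online.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(r)), a_star]
    return r + gamma * q_eval * (1.0 - done)

# Hypothetical batch of two transitions with two actions each
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])                     # second transition is terminal
q_next_online = np.array([[1.0, 2.0], [0.5, 0.1]])
q_next_target = np.array([[3.0, 1.5], [0.2, 0.4]])

y_dqn = dqn_target(r, q_next_target, done)
y_ddqn = double_dqn_target(r, q_next_online, q_next_target, done)
print(y_dqn)     # [3.97 0.  ]
print(y_ddqn)    # [2.485 0.   ]
```

Since the evaluated value can never exceed the target network's own maximum, the Double DQN target is always less than or equal to the standard one for the same inputs.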

📈

Why Overestimation Happens

In expectation, the max over noisy Q-estimates is ≥ the true max. With neural nets, noise is everywhere early in training.

✂️

Decouple Select & Evaluate

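Online net (θ) picks the best action. Target net (θ⁻) scores it. Because the two networks' errors differ, an action picked for its inflated online estimate is less likely to be inflated by the target net too.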

🚀

Better Performance

More accurate value estimates → better policy decisions → higher scores on most Atari games.
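The overestimation effect can be reproduced with random numbers alone. In this sketch every action has a true value of zero and the estimates are corrupted by unit Gaussian noise; a second, independent set of noisy estimates plays the role of the evaluating network. Full independence is an idealisation: in Double DQN the online and target networks are related, so the bias is reduced rather than removed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_actions = 10_000, 5
true_q = np.zeros(n_actions)                     # every action is truly worth 0

# Estimator 1: noisy estimates used for both selection and evaluation
noisy_a = true_q + rng.normal(0.0, 1.0, size=(n_samples, n_actions))
single_estimate = noisy_a.max(axis=1).mean()

# Estimator 2: select with the first set, evaluate with an independent second set
noisy_b = true_q + rng.normal(0.0, 1.0, size=(n_samples, n_actions))
chosen = noisy_a.argmax(axis=1)
double_estimate = noisy_b[np.arange(n_samples), chosen].mean()

print(f"max of noisy estimates: {single_estimate:.3f}")   # clearly above the true value 0
print(f"decoupled estimate:     {double_estimate:.3f}")   # close to 0
```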

Dueling Network Architecture Core

Wang et al. (2016) observed that in many states the choice of action barely matters — an empty screen in Pong, a static field in Breakout. Only near the ball does the action have a significant effect. The Dueling DQN architecture exploits this by splitting Q(s,a) into two components: V(s) (how good is this state, regardless of action) and A(s,a) (how much better is this specific action than average).

Q decomposition:
Q(s,a) = V(s) + A(s,a) − mean_a' A(s,a')

Subtract the mean advantage for identifiability (otherwise V and A can trade off arbitrarily)

Dueling DQN — separate Value and Advantage streams combined for Q
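A minimal sketch of the aggregation step, assuming the two streams have already produced a scalar V(s) and a list of advantages (the function name is chosen for this example). It also shows why the mean is subtracted: adding a constant to every advantage leaves Q unchanged.

```python
def dueling_q_values(v, advantages):
    """Combine the value and advantage streams: Q = V + A - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]

q = dueling_q_values(v=2.0, advantages=[0.5, -0.5, 0.0])
# Shifting all advantages by +10 gives the same Q-values: the shift is absorbed.
q_shifted = dueling_q_values(v=2.0, advantages=[10.5, 9.5, 10.0])
```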

Rainbow & The DQN Family Reference

Hessel et al. (DeepMind, 2018) asked: what if we combine all the improvements to DQN into one agent? The result — Rainbow — obliterated all prior benchmarks on Atari while using far fewer environment interactions.

Extension	Key Innovation	Benefit	Year
DQN	CNN + replay buffer + target network	Stable deep RL on raw pixels	2015
Double DQN	Decouple selection/evaluation	Reduce Q-value overestimation	2016
Prioritised Replay	Sample high-TD-error transitions more	Better data efficiency	2016
Dueling DQN	V(s) + A(s,a) decomposition	Better state-value learning	2016
Multi-step Returns	n-step bootstrap target	Faster credit assignment	2016
Distributional RL (C51)	Predict full return distribution (not just its mean)	Richer learning signal, better risk modelling	2017
Noisy DQN	Learnable noise in FC weights replaces ε	State-dependent exploration	2017
Rainbow	All 6 improvements combined	State-of-the-art Atari (3× DQN data efficiency)	2018

∑ Chapter 7.5 — Key Takeaways

DQN: Q-learning with a CNN as function approximator — scales to pixel-level inputs that destroy tabular methods
Experience replay: random mini-batches from a large buffer break temporal correlations and reuse data
Target network: frozen copy updated every C steps stabilises the TD target; without it, training often oscillates or diverges
Double DQN: decouple action selection (online θ) from evaluation (target θ⁻); substantially reduces overestimation bias
Dueling DQN: Q(s,a) = V(s) + A(s,a) − mean_a' A(s,a'); separate streams for state value and action advantage, better generalisation
Rainbow: 6 improvements combined — state-of-the-art discrete-action deep RL, 3× more data-efficient than DQN

7.6

Chapter 7.6

Policy Gradient Methods & Actor-Critic — Direct Policy Optimisation

Why Policy Gradients? Core

Q-learning and DQN are powerful — but they require enumerating a Q-value for every possible action. That works for discrete actions (6 Atari buttons) but breaks immediately for continuous action spaces: robot joint torques, steering angles, portfolio weights. How do you take argmax over an infinite set? You can't.

Policy gradient methods sidestep this by directly parameterising the policy π(a|s;θ) as a neural network and performing gradient ascent on expected return J(θ). They also naturally produce stochastic policies — essential in adversarial games where determinism is exploitable (rock-paper-scissors, poker).

Property	Value-Based (DQN)	Policy-Based (Policy Gradient)
What is learned	Q-function, then extract π	Policy π directly
Action space	Discrete only	Continuous and discrete
Policy type	Deterministic (argmax)	Stochastic (probability dist.)
Optimisation	Indirect — Q → π	Direct gradient ascent on J(θ)
Variance	Lower (bootstrapping)	Higher (Monte Carlo returns)
Applicability	Atari, discrete games	Robotics, continuous control, NLP (RLHF)

The Policy Gradient Theorem In-depth

Williams (1992) showed with REINFORCE, and Sutton et al. (2000) later generalised as the policy gradient theorem, that the gradient of expected return with respect to policy parameters has a remarkably clean form: it can be estimated purely from sampled trajectories, without knowledge of the environment model.

Policy Gradient Theorem:
∇_θ J(θ) = E_π[∇_θ log π(a_t|s_t;θ) · G_t]

Gradient ascent update (maximise J):
θ ← θ + α · ∇_θ log π(a_t|s_t;θ) · G_t

G_t > 0 → push up probability of a_t in s_t · G_t < 0 → push down probability · ∇log π computable by backprop

📐

Log-Trick

∇ log π(a|s;θ) = ∇π/π — lets us compute the gradient from samples without differentiating through the environment.

🎯

Reinforce High-Return Actions

Actions followed by high G_t get their probability increased the most; actions followed by negative G_t get it decreased. Simple and elegant.

🔄

Model-Free & Differentiable

Only requires sampled (s,a,G) tuples. Works for any differentiable policy π(a|s;θ) — including neural networks.
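For a softmax policy over logits, the score function has a closed form: ∇ log π(a) with respect to the logits is one_hot(a) − π. The sketch below applies a single REINFORCE update by hand to show the push-up and push-down effect. It is a hand-rolled illustration with names chosen here, not library code.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log π(action) w.r.t. the logits: one_hot(action) - π."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_step(logits, action, G, lr):
    """θ ← θ + α · ∇ log π(a) · G, with the logits playing the role of θ."""
    grad = grad_log_softmax(logits, action)
    return [z + lr * G * g for z, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]
p_before = softmax(logits)[1]
p_up = softmax(reinforce_step(logits, action=1, G=+2.0, lr=0.1))[1]    # positive return
p_down = softmax(reinforce_step(logits, action=1, G=-2.0, lr=0.1))[1]  # negative return
```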

REINFORCE Algorithm In-depth

REINFORCE is the simplest instantiation of the policy gradient theorem — Monte Carlo style: collect a full episode, compute discounted returns G_t at every step, then update the policy. Unbiased but high-variance.

REINFORCE — collect full episode, compute returns, update policy

import torch, torch.nn as nn
import gymnasium as gym

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64),      nn.ReLU(),
            nn.Linear(64, action_dim)   # raw logits
        )
    def forward(self, x): return self.net(x)

env      = gym.make('CartPole-v1')
policy   = PolicyNetwork(4, 2)          # 4 obs dims, 2 actions
optim    = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma    = 0.99

for episode in range(1000):
    state, _   = env.reset()
    log_probs, rewards = [], []

    while True:                                         # collect full episode
        state_t = torch.tensor(state, dtype=torch.float32)
        logits  = policy(state_t)
        dist    = torch.distributions.Categorical(logits=logits)
        action  = dist.sample()                         # stochastic action
        log_probs.append(dist.log_prob(action))         # log π(a|s;θ)
        state, reward, done, trunc, _ = env.step(action.item())
        rewards.append(reward)
        if done or trunc: break

    # Compute discounted returns G_t by accumulating backwards through the episode
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)
    # Normalise returns: acts as a crude baseline and reduces variance
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient loss: minimising -Σ log π · G performs gradient ASCENT on J(θ)
    loss = -(torch.stack(log_probs) * returns).sum()
    optim.zero_grad(); loss.backward(); optim.step()

Baseline & Variance Reduction In-depth

REINFORCE works in theory but is slow in practice because G_t estimates from single trajectories have extremely high variance. The fix: subtract a baseline b(s_t) from the return. The baseline doesn't change the expected gradient — it just centres it, dramatically reducing variance.

REINFORCE with baseline:
∇_θ J(θ) = E[∇_θ log π(a_t|s_t;θ) · (G_t − b(s_t))]

Standard baseline = state value V(s_t) → Advantage:
A_t = G_t − V(s_t) ← "was this action better or worse than expected?"

A_t > 0: action was better than expected → increase probability · A_t < 0: worse than expected → decrease

Baseline Subtraction — centres return distribution, dramatically reduces variance
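The effect can be checked on a toy problem. The sketch below uses an assumed two-action bandit with returns of 10 and 11 and a uniform policy, and samples the score-function gradient for one logit with and without a baseline. Both versions have the same expected gradient, but the baseline removes almost all of the variance.

```python
import random
import statistics

def gradient_samples(baseline, n=20_000, seed=0):
    """Samples of ∇ log π(a) · (G - b) for logit 1 of a 2-action softmax policy."""
    rng = random.Random(seed)
    p1 = 0.5                                    # current π(a=1)
    samples = []
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        G = 11.0 if a == 1 else 10.0            # action 1 is slightly better
        score = (1.0 if a == 1 else 0.0) - p1   # d log π(a) / d logit_1
        samples.append(score * (G - baseline))
    return samples

no_baseline = gradient_samples(baseline=0.0)
with_baseline = gradient_samples(baseline=10.5)     # b = V(s), the average return

mean_no, var_no = statistics.mean(no_baseline), statistics.pvariance(no_baseline)
mean_b, var_b = statistics.mean(with_baseline), statistics.pvariance(with_baseline)
```

Both means estimate the same true gradient (0.25 here), which says "raise the logit of action 1". Only the noise around that value differs.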

Actor-Critic In-depth

Using V(s) as a baseline requires learning V(s). The natural solution: train a separate Critic network alongside the policy (Actor). The Critic computes the TD error δ_t — a single-step advantage estimate — and feeds it back to guide the Actor's gradient updates. This enables online, per-step learning without waiting for an episode to end.

TD error (Critic output → Actor advantage estimate):
δ_t = r_t+1 + γ V(s_t+1;w) − V(s_t;w)

Actor update (policy gradient with TD advantage):
θ ← θ + α_θ · ∇_θ log π(a_t|s_t;θ) · δ_t

Critic update (semi-gradient step that reduces the squared TD error):
w ← w + α_w · δ_t · ∇_w V(s_t;w)

Actor-Critic — Actor acts, Critic evaluates, TD error guides Actor updates
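A tabular version makes the update order concrete. In this sketch (an illustrative setup with a lookup-table critic and softmax action preferences, names chosen here), ∇_w V(s) is simply 1 for the visited state, so the critic step moves V(s) towards the TD target.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      gamma=0.99, alpha_w=0.1, alpha_theta=0.1):
    """One online update of a tabular critic V and softmax actor preferences."""
    v_next = 0.0 if done else V[s_next]
    delta = r + gamma * v_next - V[s]            # TD error
    V[s] += alpha_w * delta                      # critic: move V(s) towards the target
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):               # actor: θ += α · δ · ∇ log π(a|s)
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += alpha_theta * delta * grad
    return delta

V = [0.0, 0.0]                                   # two states
prefs = [[0.0, 0.0], [0.0, 0.0]]                 # two actions per state
delta = actor_critic_step(V, prefs, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive δ raises both V(s) and the preference for the action just taken, with no need to wait for the episode to finish.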

A2C and A3C — Parallel Actor-Critic Core

Mnih et al. (DeepMind, 2016) scaled Actor-Critic to deep networks with a key insight: parallelism solves the correlation problem. Instead of a replay buffer, run many independent workers simultaneously — each in its own environment copy. Their diverse experiences are naturally uncorrelated.

⚡

A3C (Async)

Workers push gradients to a global network asynchronously. Fast wall-clock time — no waiting for others.

🔄

A2C (Sync)

All workers step together; global update after each sync. Simpler, deterministic, often matches A3C performance.

🌍

Diversity Benefit

Workers start from different states, explore different regions — naturally uncorrelated gradients without a replay buffer.

A3C — Asynchronous parallel workers updating shared global network

Entropy Regularisation Reference

A policy trained purely to maximise return tends to collapse to deterministic — it finds one good action per state and stops exploring everything else. This is catastrophic in environments where exploration is critical or where the optimal policy is genuinely stochastic.

The fix: add an entropy bonus H(π) to the objective. Entropy measures how spread-out a probability distribution is — maximising it encourages the policy to remain uncertain and keep exploring. The coefficient β controls the exploration-exploitation trade-off. Entropy regularisation is used in A3C, PPO, and is a cornerstone of SAC (Soft Actor-Critic), where maximising entropy is part of the fundamental objective.

Entropy-regularised policy gradient loss:
L(θ) = −E[log π · A] − β · H(π(·|s))
H(π) = −Σ_a π(a|s) log π(a|s) ← entropy of policy

β = entropy coefficient (tune: large β → more exploration · small β → more exploitation) · Used in: A3C, PPO, SAC

🎲

Uniform policy: H = max

All actions equally likely — maximum randomness. High entropy = diverse exploration.

⚖️

β controls trade-off

Large β: explore broadly. Small β: converge to best policy. β=0: no entropy (pure exploitation).

🤖

SAC (Ch 7.7)

Soft Actor-Critic maximises expected return AND entropy simultaneously — state-of-the-art continuous control.
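Entropy and the regularised loss for a single sample can be written directly from the formulas. This is a minimal sketch for a discrete policy given as a list of probabilities; the function names are chosen for the example.

```python
import math

def entropy(probs):
    """H(π) = -Σ_a π(a) log π(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularised_loss(log_prob_a, advantage, probs, beta):
    """Single-sample loss: -log π(a|s) · A - β · H(π(·|s))."""
    return -log_prob_a * advantage - beta * entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
h_uniform = entropy(uniform)      # the maximum for 4 actions: log(4)
h_peaked = entropy(peaked)        # close to 0: nearly deterministic
```

Because the entropy enters the loss with a minus sign, minimising the loss pushes the policy towards the higher-entropy end unless the advantage term outweighs it.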

∑ Chapter 7.6 — Key Takeaways

Policy gradient: directly optimise ∇_θ J(θ) = E[∇log π · G_t] — works for continuous and stochastic actions
REINFORCE: full episode → compute returns → update — unbiased but high variance
Baseline: subtract V(s) from returns → advantage A = G−V reduces variance without adding bias
Actor-Critic: Actor acts, Critic provides TD error δ — online per-step updates, lower variance than REINFORCE
A3C / A2C: parallel workers with shared global network — diverse experience, fast training, no replay buffer
Entropy bonus: prevent policy collapse — encourage exploration throughout training (key in SAC)

7.7

Chapter 7.7

Advanced RL — PPO, SAC & Model-Based RL

Most research RL happens in textbooks. Most production RL runs PPO or SAC. These two algorithms dominate because they solved the core instability problems of earlier methods. Understanding why they work — not just how — is what separates an RL practitioner from someone who just copies hyperparameters from a blog post.

TRPO — The Problem PPO Solves Core

Schulman et al. (UC Berkeley, 2015) identified the root instability in policy gradient methods: step size. Too large a step and the new policy is so different that the old value estimates are invalid, so training can collapse and may never recover. Too small and learning is painfully slow. This is fundamentally different from supervised learning, where labels are fixed: in RL, a bad update corrupts the very data distribution used for future learning.

TRPO's solution: constrain each update so the new policy stays within a trust region, KL(π_old ‖ π_new) ≤ δ. Inside this region the surrogate objective (with a KL penalty) is a lower bound on true performance, which gives the theoretical monotonic-improvement guarantee. The catch: enforcing the hard constraint needs second-order information. Storing the Fisher information matrix explicitly would cost O(|θ|²) memory, so TRPO approximates the step with conjugate gradients and Fisher-vector products plus a backtracking line search, which is still far more expensive and intricate per update than plain SGD.

TRPO Objective:
maximise L(θ) = E[π(a|s;θ)/π_old(a|s) · A(s,a)]
subject to E[KL(π_old(·|s) ‖ π(·|s;θ))] ≤ δ

L(θ) = importance-weighted surrogate objective · δ = trust region radius ≈ 0.01 · Guaranteed monotonic improvement within the region

Property	TRPO	PPO (next section)
Constraint type	Hard KL constraint	Soft clip (approximate)
Improvement guarantee	Monotonic — provable	Approximate but reliable
Optimisation order	Second-order (Fisher matrix)	First-order (Adam/SGD)
Compute per update	Conjugate-gradient solve + line search (explicit Fisher matrix would be O(|θ|²))	O(|θ|), standard backprop
Implementation	Complex (CG solver, line search)	Simple (~50 lines of PyTorch)
Used in practice	Rarely — historical reference	Default algorithm for most tasks
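For discrete policies the trust-region test itself is only a few lines. The sketch below checks whether a proposed action distribution stays within KL ≤ δ of the old one. It illustrates the constraint only, not TRPO's conjugate-gradient solver, and the example distributions are made up.

```python
import math

def kl_categorical(p_old, p_new):
    """KL(p_old ‖ p_new) for two categorical distributions over the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)

def within_trust_region(p_old, p_new, delta=0.01):
    return kl_categorical(p_old, p_new) <= delta

p_old = [0.5, 0.3, 0.2]
small_step = [0.52, 0.29, 0.19]    # gentle update: KL ≈ 0.0008
large_step = [0.90, 0.05, 0.05]    # aggressive update: KL ≈ 0.52

accept_small = within_trust_region(p_old, small_step)
accept_large = within_trust_region(p_old, large_step)
```

In TRPO the backtracking line search shrinks the step until a check like this passes and the surrogate objective improves.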

PPO — Proximal Policy Optimisation In-depth

Schulman et al. (OpenAI, 2017) asked: can we get TRPO's stability with standard first-order optimisation? The answer is PPO, arguably the most widely used deep RL algorithm as of 2024. It was used for RLHF in InstructGPT/ChatGPT and for OpenAI Five, and it is a common default for simulated locomotion and continuous control in robotics.

PPO replaces TRPO's hard KL constraint with a clipped surrogate objective. Define the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t). Instead of constraining this ratio via Lagrange multipliers, simply clip it to [1−ε, 1+ε] and take the minimum of the clipped and unclipped terms. Once the ratio has moved past the clip boundary in the direction the advantage favours, the objective is flat and its gradient is zero, which removes the incentive for too-large updates.

PPO Probability Ratio & Clipped Objective:
r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t)
L^CLIP(θ) = E_t[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)]

Full PPO objective (maximised jointly; the training loss is its negative):
L(θ) = L^CLIP(θ) − c₁·L^VF(θ) + c₂·H[π_θ(·|s_t)]

ε=0.2 · c₁=0.5 (value coef) · c₂=0.01 (entropy coef) · K=10 epochs · T=2048 steps · B=64 mini-batch

PPO Clipping — positive and negative advantage cases with trust region

PPO Training Cycle — collect → advantage → batch update → repeat
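The clipped term for a single sample is small enough to evaluate by hand. This sketch (function name chosen here) covers the four cases from the clipping figure: the objective goes flat only when the ratio has already moved past the boundary in the direction the advantage favours.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r·A, clip(r, 1-eps, 1+eps)·A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
pos_capped = clipped_surrogate(ratio=1.5, advantage=1.0)    # 1.2, flat in ratio
pos_active = clipped_surrogate(ratio=0.5, advantage=1.0)    # 0.5, gradient still flows
# Negative advantage: gains are capped once the ratio drops below 1 - eps.
neg_capped = clipped_surrogate(ratio=0.5, advantage=-1.0)   # -0.8, flat in ratio
neg_active = clipped_surrogate(ratio=1.5, advantage=-1.0)   # -1.5, gradient still flows
```

The two "active" cases matter: when an update has made things worse, the min keeps the unclipped term so the gradient can still correct it.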

// PPO Algorithm (pseudocode)
Initialise policy π_θ, value function V_φ
Set ε=0.2, γ=0.99, λ=0.95, K=10, T=2048, B=64, c1=0.5, c2=0.01

LOOP:
  // Phase 1: Collect rollout with current policy (this is π_old for the update)
  FOR t = 0 to T-1 (across N envs):
    a_t  ~ π_θ(·|s_t)
    s_t1, r_t1, done_t = env.step(a_t)
    Store (s_t, a_t, r_t1, done_t, logp_old_t = log π_θ(a_t|s_t), V_φ(s_t))

  // Phase 2: Compute GAE advantages and value targets
  A_hat[T] = 0
  FOR t = T-1 down to 0:
    mask  = 0 if done_t else 1                  // no bootstrap across episode ends
    delta = r_t1 + γ·mask·V_φ(s_t1) - V_φ(s_t)
    A_hat[t] = delta + (γ·λ)·mask·A_hat[t+1]    // GAE
  V_target = A_hat + V_φ(s_t)                   // computed before normalising A_hat
  A_hat = (A_hat - mean(A_hat)) / (std(A_hat) + 1e-8)

  // Phase 3: K epochs of mini-batch updates on the same rollout
  FOR epoch = 1 to K:
    FOR each mini-batch of B samples:
      r_t  = exp(log π_θ(a_t|s_t) - logp_old_t)   // probability ratio
      L_CLIP = mean(min(r_t*A_hat, clip(r_t,1-ε,1+ε)*A_hat))
      L_VF   = mean((V_φ(s_t) - V_target)²)
      L_H    = mean(H(π_θ(·|s_t)))
      loss   = -(L_CLIP - c1*L_VF + c2*L_H)
      backprop(loss); clip_grad_norm_(0.5); optimizer.step()

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym

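# Needs the Box2D extra: pip install "gymnasium[box2d]".
# Gymnasium 1.0+ registers this environment as "LunarLander-v3".
vec_env = make_vec_env("LunarLander-v2", n_envs=8)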

model = PPO(
    policy="MlpPolicy", env=vec_env,
    clip_range=0.2,          # ε — trust region boundary
    n_steps=2048,            # T steps per env per rollout
    batch_size=64,           # mini-batch size B
    n_epochs=10,             # K update epochs per rollout
    gamma=0.99,              # discount factor
    gae_lambda=0.95,         # GAE λ
    ent_coef=0.01,           # entropy coefficient c2
    vf_coef=0.5,             # value loss coefficient c1
    max_grad_norm=0.5,       # gradient clipping
    learning_rate=3e-4,
    verbose=1, tensorboard_log="./ppo_tb/"
)

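# eval_freq counts callback calls, and each call advances all 8 envs:
# 10_000 calls correspond to roughly 80_000 environment timesteps.
eval_cb = EvalCallback(gym.make("LunarLander-v2"),
    best_model_save_path="./ppo_best/", eval_freq=10_000, n_eval_episodes=10)
model.learn(total_timesteps=1_000_000, callback=eval_cb)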

GAE — Generalised Advantage Estimation In-depth

Schulman et al. (2016) formalised a key trade-off: the advantage estimate A_t used in policy gradient updates can be computed at different "depths", ranging from a pure TD(0) estimate (low variance, high bias) to a full Monte Carlo return (zero bias, high variance). GAE smoothly interpolates between these extremes with parameter λ. Values around λ=0.95 work well on a wide range of tasks and are the usual default.

GAE Formula:
δ_t = r_t+1 + γV(s_t+1) − V(s_t) ← TD error
Â_t^GAE(γ,λ) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l} = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + …

λ=0 → Â_t = δ_t (pure TD) · λ=1 → Â_t = Σ_l γ^l r_{t+l+1} − V(s_t) (MC) · λ=0.95 = default PPO

GAE λ — Bias-Variance Tradeoff in Advantage Estimation
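The infinite sum is computed in practice with a single backward pass over a finite rollout. The sketch below is one common way to write it, assuming list inputs, a bootstrap value for the state after the last step, and a done flag per step. The function and argument names are chosen here, not taken from a library.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward recursion A_t = δ_t + γλ·A_{t+1}.

    values[t] is V(s_t); last_value is V of the state after the final step;
    dones[t] is True when the episode ended at step t (no bootstrapping past it).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # TD error
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

rewards, values = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]
no_dones = [False, False, False]
td_only = compute_gae(rewards, values, 0.5, no_dones, gamma=0.9, lam=0.0)  # each A_t = δ_t
mc_like = compute_gae(rewards, values, 0.5, no_dones, gamma=0.9, lam=1.0)  # discounted sum of δ
```

With λ=0 every advantage equals its own TD error (0.95 here); with λ=1 the first advantage accumulates all later TD errors, matching the Monte Carlo form in the formula.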

SAC — Soft Actor-Critic In-depth

Haarnoja et al. (Berkeley + Google, 2018) introduced SAC — the dominant algorithm for continuous control. While PPO is on-policy (needs fresh data every update), SAC is off-policy — it reuses all past experience from a replay buffer. Its fundamental innovation is maximum entropy RL: the agent is rewarded not just for getting high return, but for maintaining a stochastic (uncertain) policy.

🌡️

Temperature α

Controls entropy-reward balance. SAC auto-tunes α to hold the policy's entropy near a target H̄ = −|A| (minus the action dimension), removing the need to hand-tune its most sensitive hyperparameter.

👯

Twin Critics

Two Q-networks Q₁, Q₂ trained on the same targets from different initialisations. Taking min(Q₁, Q₂) in the target counteracts Q-value overestimation.

🔄

Off-Policy Replay

1M-transition replay buffer. Sample random mini-batches. Reuses past experience, so it typically needs far fewer environment steps than on-policy PPO.

SAC Maximum Entropy Objective:
J(π) = Σ_t E_π[r(s_t,a_t) + α·H(π(·|s_t))]

Critic Loss (for Q₁, Q₂ independently, both regressed onto the same target y):
y = r + γ·(min(Q̄₁(s',ã'), Q̄₂(s',ã')) − α·log π(ã'|s')), ã' ~ π(·|s')
L^Q_i = E[(Q_i(s,a) − y)²]

Actor Loss:
L^π = E_s[α·log π(a|s) − min(Q₁(s,a), Q₂(s,a))], a ~ π(·|s)

Temperature Loss (auto-tune):
L^α = E[−α·(log π(a|s) + H̄)], H̄ = −|A| (target entropy)

SAC Architecture — Actor + Twin Critics + Auto-tuned Temperature
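The critic target and the temperature loss reduce to simple arithmetic once the network outputs are known. The sketch below takes those outputs as plain numbers (all values are made up for illustration) and shows how the entropy term enters the target and in which direction α moves.

```python
def sac_critic_target(r, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """y = r + γ · (min(Q̄1, Q̄2) - α · log π(ã'|s')), with no bootstrap at terminal states."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * (0.0 if done else soft_value)

def temperature_loss(alpha, log_prob, target_entropy):
    """L^α = -α · (log π(a|s) + H̄) for one sample."""
    return -alpha * (log_prob + target_entropy)

y = sac_critic_target(r=1.0, done=False, q1_next=5.0, q2_next=4.5,
                      log_prob_next=-1.2, alpha=0.2)
# 2-D action space → H̄ = -2. A sample with log π = -1 suggests entropy above the
# target, so dL/dα = -(log π + H̄) = 3 > 0 and gradient descent lowers α.
loss_alpha = temperature_loss(alpha=0.2, log_prob=-1.0, target_entropy=-2.0)
grad_alpha = -(-1.0 + -2.0)
```

Implementations commonly optimise log α rather than α so that the temperature stays positive; that detail is omitted here.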

from stable_baselines3 import SAC
import gymnasium as gym

env = gym.make("HalfCheetah-v4")

model = SAC(
    policy="MlpPolicy", env=env,
    learning_rate=3e-4,
    buffer_size=1_000_000,     # replay buffer capacity
    batch_size=256,            # mini-batch size
    tau=0.005,                 # soft target network update τ
    gamma=0.99,
    ent_coef="auto",           # auto-tune α
    target_entropy="auto",     # H̄ = -|A|
    learning_starts=10_000,    # fill buffer before training
    train_freq=1,              # update every step
    gradient_steps=1,
    verbose=1
)
model.learn(total_timesteps=1_000_000)
model.save("sac_halfcheetah")

TD3 — Twin Delayed DDPG Core

Fujimoto et al. (2018) dissected three specific failure modes of DDPG (Deep Deterministic Policy Gradient) and engineered a targeted fix for each. TD3 is a lean, deterministic off-policy algorithm that is an important baseline for continuous control benchmarks.

TD3 Target (all three fixes combined):
ã' = clip(π_θ'(s') + clip(ε, −c, c), a_min, a_max) ← smoothed target action
y = r + γ · min(Q̄₁(s',ã'), Q̄₂(s',ã')) ← twin critics

Critic update: every step · Actor update: every 2 critic steps (delayed) · ε ~ N(0,σ), σ=0.2 · c = noise clip bound (0.5 in the original paper)

👯

Fix 1: Twin Critics

Q₁ and Q₂ trained in parallel. Target uses min(Q̄₁, Q̄₂), which curbs overestimation bias at the cost of slight underestimation.

⏳

Fix 2: Delayed Actor

Update actor every 2 critic steps. Prevents actor from over-fitting to a noisy, underfit critic.

🔇

Fix 3: Target Smoothing

Add clipped Gaussian noise to target actions. Smooths Q-landscape — reduces high-variance targets.
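All three fixes fit in one short sketch. The critics below are hypothetical stand-in functions (real ones are neural networks), and the default values σ=0.2, c=0.5 and a delay of 2 follow the numbers quoted in this section and the original paper.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(x, hi))

def td3_target(r, done, target_action, q1_target, q2_target, rng,
               gamma=0.99, sigma=0.2, noise_clip=0.5, a_min=-1.0, a_max=1.0):
    """Smoothed target action (fix 3) evaluated by the min of twin critics (fix 1)."""
    noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    smoothed = clip(target_action + noise, a_min, a_max)
    q_min = min(q1_target(smoothed), q2_target(smoothed))
    return r + gamma * (0.0 if done else q_min), smoothed

def should_update_actor(critic_step, policy_delay=2):
    """Delayed actor (fix 2): update the policy once every policy_delay critic steps."""
    return critic_step % policy_delay == 0

def q1(a):                      # hypothetical target critic 1 for a fixed next state
    return 1.0 - a * a

def q2(a):                      # hypothetical target critic 2 for a fixed next state
    return 0.9 - a * a

rng = random.Random(0)
y, a_smooth = td3_target(r=0.5, done=False, target_action=0.95,
                         q1_target=q1, q2_target=q2, rng=rng)
actor_schedule = [should_update_actor(step) for step in range(1, 5)]
```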

Property	SAC	TD3
Policy type	Stochastic Gaussian	Deterministic
Objective	Max-entropy (reward + H)	Standard reward only
Temperature	Auto-tuned α	No temperature — tune noise σ
Sample efficiency	Higher (entropy bonus aids exploration)	Good but less robust
Inference	Sample from Gaussian (stochastic)	Forward pass only (deterministic)
Preferred for	Most continuous tasks (default)	Specific benchmarks, ablations

Model-Based RL — Learning a World Model In-depth

Model-free RL needs millions of real environment interactions. Atari DQN requires 50M frames — equivalent to 38 days of continuous play. For real robots or industrial systems, that is prohibitively expensive and dangerous. Model-based RL addresses this by learning a world model M_ψ(s,a) → (s', r) — a differentiable simulator — and generating cheap synthetic experience to train the policy.

🗺️

Dyna-style

Interleave real steps with K=100 model-generated steps. Same policy, much more data per real interaction.

🔮

Imagination

Train policy entirely inside the model (DreamerV3). Real env only used to improve the world model itself.

🌳

Planning

Use model for lookahead search at test time (MCTS in MuZero). Improves decisions without extra training.

Model-Based vs Model-Free — same final quality, 10-20× fewer real interactions

Dyna Architecture Core

Sutton (1991) introduced the Dyna architecture: deceptively simple yet powerful. Every real environment step yields one real transition plus K model-generated transitions (K=100 in this example), so the value function or policy receives 101× more updates per real step. The extra cost is compute, not real interaction: the model (a lookup table in the original Dyna-Q, a neural network in deep variants) stands in for the environment.

Dyna-Q — 1 real step + K simulated steps per environment interaction
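A tabular Dyna-Q step is short enough to show in full. The sketch below assumes a deterministic toy chain and a dictionary as the learned model, with names chosen for the example. After only two real transitions, planning propagates the terminal reward back to the start state.

```python
import random

def q_update(Q, s, a, r, s_next, done, alpha=0.5, gamma=0.95):
    """Standard Q-learning update on a tabular Q."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def dyna_q_step(Q, model, transition, k, rng):
    """One real update followed by k planning updates replayed from the model."""
    s, a, r, s_next, done = transition
    q_update(Q, s, a, r, s_next, done)             # learn from the real transition
    model[(s, a)] = (r, s_next, done)              # deterministic tabular model
    for _ in range(k):                             # planning with simulated experience
        ps, pa = rng.choice(list(model))
        pr, pnext, pdone = model[(ps, pa)]
        q_update(Q, ps, pa, pr, pnext, pdone)

# Toy chain 0 → 1 → 2: action 1 moves right; reaching state 2 pays 1 and ends.
Q = [[0.0, 0.0] for _ in range(3)]
model = {}
rng = random.Random(0)
dyna_q_step(Q, model, (0, 1, 0.0, 1, False), k=0, rng=rng)   # nothing to learn yet
dyna_q_step(Q, model, (1, 1, 1.0, 2, True), k=20, rng=rng)   # reward found, then plan
```

Without planning, Q[0][1] would stay at 0 until the agent physically revisited state 0; with 20 simulated updates it already reflects the discounted reward.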

World Models & DreamerV3 In-depth

Ha & Schmidhuber (2018) showed that compressing observations into a compact latent space and predicting the future in that latent space enables policies trained purely in imagination. DreamerV3 (Hafner et al., 2023) is the apex of this line: a single model with one set of hyperparameters that handles Atari, DMControl, Crafter, and BSuite, and collects diamonds in Minecraft from scratch, which its authors report as a first for an RL agent trained without human data or curricula.

The core is the RSSM (Recurrent State Space Model) — a latent dynamics model with deterministic memory (GRU recurrent path) and stochastic uncertainty (discrete latent variables). The actor-critic trains entirely inside the RSSM by rolling out K=15 imagined steps from the current latent state — never touching the real environment during policy updates.

DreamerV3 — World model from real data; Policy trained entirely in imagination

MuZero — Planning with a Learned Model Reference

Schrittwieser et al. (DeepMind, 2020) built on AlphaZero (which required the full rules of Go/Chess to run MCTS) and asked: what if we learn the rules from scratch? MuZero learns three functions: a Representation network (observation → latent state h), a Dynamics network ((h, a) → next latent h' + predicted reward r), and a Prediction network (h → policy π + value v). MCTS runs entirely in the learned latent space — never invoking the real environment during planning.

The result: state-of-the-art scores on the 57-game Atari benchmark while matching AlphaZero's superhuman play in Go, chess, and shogi, all with one algorithm and no hand-coded rules. The dynamics model need not predict realistic pixels or observations; it only needs to produce accurate value and reward estimates for planning purposes.

MuZero Reanalyse further improves sample efficiency by replaying stored trajectories with the latest network and re-running MCTS to generate fresh training targets. EfficientZero (Ye et al., 2021), built on MuZero, exceeds human mean and median scores on the Atari 100k benchmark, which allows only about 2 hours of real-time play per game; its authors report performance close to DQN at 200M frames with roughly 500× less data.

MuZero Three Networks:
h₀ = r(o_t) ← Representation: observation → latent
h_{k+1}, r_k = g(h_k, a_k) ← Dynamics: latent transition + reward
π_k, v_k = f(h_k) ← Prediction: policy + value for MCTS

MCTS uses g and f inside the learned latent space: zero real environment calls during planning

RL Algorithm Landscape Core

RL Algorithm Landscape — sample efficiency vs final performance

Algorithm	Action Space	On/Off-Policy	Key Strength	Use When	Library
PPO	Disc + Cont	On-policy	Stable, simple, universal	Default choice, RLHF, games	SB3, RLlib, TRL
SAC	Continuous	Off-policy	Sample-efficient, robust, auto-α	Robotics, locomotion, control	SB3, RLlib
TD3	Continuous	Off-policy	Deterministic, stable baseline	Benchmark continuous control	SB3
DQN/Rainbow	Discrete	Off-policy	Proven on Atari	Atari-style discrete games	SB3, RLlib
A2C	Disc + Cont	On-policy	Fast, simple	Prototyping, education	SB3
DreamerV3	Disc + Cont	Off-policy (model)	Best sample efficiency	Pixel obs, scarce real data	Official (JAX)
MuZero	Discrete	Planning	Strongest board/video games	Chess, Go, Atari	DeepMind (JAX)
RLHF-PPO	Token (disc)	On-policy	LLM alignment	Align language models	TRL, OpenRLHF

The practical hierarchy for 2024: Start with PPO — it works everywhere and is easy to debug. Switch to SAC for continuous control — it's more sample-efficient. Consider DreamerV3 when real-world interaction is scarce or expensive. Use model-free methods unless you have a specific reason to use model-based.

∑ Chapter 7.7 — Key Takeaways

TRPO: KL-constrained policy update — guaranteed improvement but requires expensive second-order optimisation
PPO: clip ratio r_t to [1−ε, 1+ε]; close to TRPO's stability with first-order Adam; industry default
GAE(λ=0.95): exponentially-weighted TD errors; accepts a little bias for much lower variance in the advantage estimate
SAC: maximum entropy objective; off-policy, stochastic, auto-tuned α; a strong default for continuous control
TD3: twin critics + delayed actor + target smoothing — three targeted fixes for DDPG's failure modes
DreamerV3: learn world model from real data; train actor-critic entirely in imagination — 10-20× sample efficiency
MuZero: MCTS in learned latent space — no game rules needed; state of the art on board & video games

7.8

Chapter 7.8

RL in the Real World — AlphaGo, RLHF & Robotics

Reinforcement learning is not just theory and Atari games. It beat the world champion at Go, trained every major LLM to follow instructions, controls robots at Google and Boston Dynamics, and optimises data centre cooling. This chapter is where the equations become consequences.

AlphaGo & AlphaZero In-depth

Chess has ~10⁴⁷ positions: hard, but tractable for alpha-beta search engines like Stockfish, which combine deep search with a strong evaluation function. Go has ~2×10¹⁷⁰ legal positions: brute-force search is computationally infeasible, and classic handcrafted evaluation functions can't reliably assess Go positions. Go requires intuition more than calculation. That intuition was beyond computers until March 2016.

🧠

SL Policy Network π_SL

Trained on 30M human game positions (57% accuracy). Biases MCTS toward human-like moves — prior probability over actions.

⚔️

RL Policy Network π_RL

Initialised from π_SL, refined by self-play policy gradient. Wins 80% vs π_SL — discovers moves humans never conceived.

📊

Value Network V(s)

Predicts game outcome from any board position. In AlphaGo it is mixed with fast rollouts at MCTS leaves; AlphaZero drops rollouts and relies on the value network alone.

AlphaGo MCTS — four quantities per node guide the search

AlphaZero Self-Play Loop — tabula rasa mastery in hours

MuZero & MCTS Planning Core

AlphaZero required explicit game rules for MCTS expansion — it couldn't play Atari because the rules are embedded in emulator code. MuZero (DeepMind, 2020) solved this: it learns its own transition model from experience and runs MCTS entirely in the learned latent space, never calling the real environment during planning.

MuZero — replaces hardcoded rules with learned dynamics model

RLHF — Deep Dive In-depth

Ouyang et al. (OpenAI, 2022) scaled RLHF to GPT-3 (175B) and produced InstructGPT, the direct precursor of ChatGPT. Three stages, each with distinct datasets, objectives, and failure modes. Understanding each stage explains why alignment is hard.

📝

Stage 1 — SFT

13K prompts + expert human responses. Standard cross-entropy fine-tuning. Result: consistent instruction following, but no preference signal yet.

⚖️

Stage 2 — Reward Model

33K prompts, 4-9 model responses ranked by 40 contractors. Bradley-Terry training on preference pairs → scalar proxy score.

🤖

Stage 3 — PPO

Policy = SFT init. Episode = one response. Reward = RM score − β·KL(PPO‖SFT). KL prevents catastrophic forgetting of SFT capability.

RLHF Token-level Reward (KL-penalised):
r(s_t, a_t) = RM(x, y) · 𝟙[t=T] − β · KL(π_PPO(·|s_t) ‖ π_SFT(·|s_t))

Bradley-Terry Reward Model Loss:
L_RM = −E[log σ(RM(prompt, chosen) − RM(prompt, rejected))]

β = KL coefficient (0.02-0.05) · RM score given only at final token T · KL per token prevents drift from SFT

RLHF Full Pipeline — data requirements, training cost, and output per stage
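Both formulas can be exercised on toy numbers. In the sketch below the per-token KL is approximated by the log-ratio of the sampled token, a common single-sample estimate; that choice, the function names, and all numeric values are assumptions made for illustration.

```python
import math

def neg_log_sigmoid(x):
    """-log σ(x) = log(1 + e^(-x)), branched so exp() never overflows."""
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

def bradley_terry_loss(score_chosen, score_rejected):
    """Reward-model loss for one preference pair."""
    return neg_log_sigmoid(score_chosen - score_rejected)

def token_rewards(rm_score, logp_policy, logp_sft, beta=0.02):
    """Per-token KL penalty, plus the RM score added on the final token only."""
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    rewards[-1] += rm_score
    return rewards

loss_correct = bradley_terry_loss(2.0, 0.0)    # RM ranks the pair correctly: small loss
loss_wrong = bradley_terry_loss(0.0, 2.0)      # RM ranks it the wrong way: large loss
rewards = token_rewards(rm_score=1.5,
                        logp_policy=[-1.0, -2.0, -0.5],
                        logp_sft=[-1.2, -2.0, -0.9], beta=0.1)
```

Tokens where the policy is more confident than the SFT model receive a small negative reward; only the last token carries the reward-model score.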

Reward Hacking is the central failure mode of RLHF. PPO maximises the reward model score — but the RM is an imperfect proxy for human preferences. The model discovers RM weaknesses: verbosity (RM rewards long responses → pad with filler), sycophancy (RM rewards agreement → tell users what they want to hear), formatting exploitation (RM likes bullet points → overuse bullets everywhere). Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Reward Hacking — the model discovers RM weaknesses that annoy humans

DPO & GRPO — Beyond RLHF In-depth

RLHF is powerful but complex: three stages, a separate reward model, PPO instability, and massive compute cost. Two newer approaches simplify or improve it significantly.

🎯

DPO (Rafailov, 2023)

Shows that the optimal policy of the KL-regularised RLHF objective has a closed form. No RM, no RL: directly optimise on preference pairs. Stable, simpler, often comparable performance.

🏆

GRPO (DeepSeek, 2024)

Sample G responses, score each with a (preferably verifiable) reward, and use the group-relative advantage. No value function. Introduced in DeepSeekMath and used to train DeepSeek-R1.

🔬

Verifiable Rewards

GRPO works best when reward is binary and checkable: math (correct answer?), code (tests pass?). This sidesteps learned reward-model bias, though a flawed checker can still be gamed.

DPO Loss (no reward model needed):
L_DPO = −E[log σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)))]
y_w = preferred · y_l = rejected · π_ref = frozen SFT · β = divergence temperature

GRPO Group-Relative Advantage:
For prompt x, sample G responses y_1..y_G: A_i = (r(y_i) − mean(r)) / std(r)
No value function needed: the group provides the baseline · Works best with verifiable reward (math/code)

RLHF vs DPO vs GRPO — three LLM alignment approaches
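Both objectives can be computed from sequence log-probabilities and scalar rewards alone. The sketch below works on single examples with made-up numbers. Using the population standard deviation and a small ε in the GRPO denominator is an implementation assumption, not something specified in this section.

```python
import math
import statistics

def neg_log_sigmoid(x):
    """-log σ(x), branched so exp() never overflows."""
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, from sequence log-probs under policy and reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return neg_log_sigmoid(margin)

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards within one prompt's group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

loss_at_ref = dpo_loss(-10.0, -12.0, -10.0, -12.0)    # policy == reference → log 2
loss_improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # chosen up, rejected down
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])    # binary verifiable rewards
no_signal = grpo_advantages([1.0, 1.0, 1.0])          # all equal → zero advantage
```

When every sampled response gets the same reward, the group carries no learning signal, which is why GRPO benefits from prompts of intermediate difficulty.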

Robotic RL In-depth

Robotics exposes every limitation of RL simultaneously: expensive data, dangerous exploration, partial observability, sim-to-real transfer. Yet recent systems have overcome these barriers at scale — from dexterous hands to general-purpose manipulation.

🎲

OpenAI Rubik's Cube (2019)

Shadow Hand (24 DOF) solves Rubik's cube. Trained entirely in simulation using Automatic Domain Randomisation — progressively harder sim variations.

🤖

Google RT-2 (2023)

Fine-tuned VLMs (PaLI-X, PaLM-E) to predict robot actions from image + text. "Pick up the apple" → end-effector positions. Generalises to novel objects. Trained on demonstrations rather than by reward-driven RL.

🏭

DeepMind Data Centre (2016/22)

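Neural-network models trained on Google data centre sensor data. Reported cooling-energy savings of up to 40% as a recommendation system (2016) and about 30% under autonomous control (2018). Safety layer constrains actions within safe range.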

Sim-to-Real via Domain Randomisation — robust policy bridges the reality gap

Real-World RL Challenges In-depth

🎯

Reward Design

A cleaning robot rewarded for "no mess visible" learns to hide mess under the rug. A boat racer learns to spin collecting power-ups instead of racing. Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure. Solutions: reward shaping, inverse RL, RLHF.

📊

Sample Inefficiency

DQN needs 50M Atari frames = 38 days. Humans learn Pong in minutes. The human-to-RL sample efficiency gap is >100×. Active research: meta-learning, model-based RL, self-supervised auxiliary tasks.

🌊

Partial Observability (POMDPs)

Real environments are almost always partially observable — a robot can't see behind itself, an LLM can't see user intent. Agent must infer hidden state from observation history. Solutions: LSTM/GRU policies, Transformers with context, belief state tracking.

⚡

Non-Stationarity

The environment changes over time: a 2019 trading algorithm fails in 2020, and an RL recommender system faces drifting user preferences. Solutions: continual learning, online adaptation, periodic retraining, meta-learning for fast adaptation.

🔗

Sparse & Delayed Rewards

Chess: one reward after 40+ moves. Robotic grasping: reward only on success. Credit assignment over long horizons is fundamentally hard. Solutions: reward shaping, hindsight experience replay (HER), curiosity-driven exploration.

🔍

Exploration in High Dimensions

Random exploration is exponentially inefficient with 12+ DOF robots. Almost never discovers useful behaviours by chance. Solutions: intrinsic motivation (count-based, prediction error, empowerment), curriculum learning, population-based training.

Safe RL Core

Standard RL maximises expected cumulative reward — no constraint on how. In simulation, failure is cheap (just reset). In the real world, failure can be catastrophic: a self-driving car exploring randomly could kill someone; a medical AI exploring randomly could harm a patient. Safe RL adds explicit constraints to the optimisation.

Constrained RL (CMDP)
Objective: maximise J(π) = E[Σ r(s,a)] subject to C_k(π) ≤ d_k for all k
Lagrangian relaxation: L(π, λ) = J(π) − Σ_k λ_k · (C_k(π) − d_k)
Primal update: gradient ascent on π
Dual update: λ_k increases when constraint k is violated
CPO: TRPO-style update with a hard constraint
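The dual update for a single constraint can be sketched as projected gradient steps on λ: since ∂L/∂λ = −(C(π) − d), minimising L over λ ≥ 0 raises λ while the measured cost exceeds the limit and lowers it otherwise. The learning rate, the cost measurements, and the function names are illustrative assumptions.

```python
def dual_update(lam, cost_estimate, limit, lr=0.05):
    """lam <- max(0, lam + lr * (C - d)); the max keeps the multiplier non-negative."""
    return max(0.0, lam + lr * (cost_estimate - limit))


def penalised_reward(reward, cost, lam):
    """Per-step signal for the primal (policy) update: r - lam * c."""
    return reward - lam * cost


lam = 0.0
limit = 1.0  # d: the allowed expected cost
history = []
# Hypothetical cost estimates from successive policy evaluations.
for measured_cost in [3.0, 3.0, 2.0, 0.5, 0.5]:
    lam = dual_update(lam, measured_cost, limit, lr=0.5)
    history.append(lam)
```

Here λ climbs while the cost is above the limit (1.0, 2.0, 2.5) and relaxes once the policy becomes safe (2.25, 2.0), which is the automatic penalty adaptation the formulation relies on.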

⛓️

Constrained RL

CMDP: maximise reward subject to cost constraints. Lagrangian relaxation adapts penalty automatically when constraints are violated.

📦

Offline / Conservative RL

BCQ, CQL, IQL: train on logged data with no environment interaction. CQL penalises Q-values for out-of-distribution actions, which keeps the policy from exploiting overestimated values where the data has no coverage.
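For discrete actions, the CQL regulariser can be sketched as the log-sum-exp of the Q-values in a state minus the Q-value of the action actually logged in the dataset. Minimising it pushes down actions the data never took and pushes up the logged one. This is only the penalty term: in practice it is scaled by a coefficient and added to a standard TD loss, and the function names here are assumptions.

```python
import math


def logsumexp(values):
    # Numerically stable log(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, dataset_action):
    """Conservative penalty for one state: logsumexp_a Q(s, a) - Q(s, a_data)."""
    return logsumexp(q_values) - q_values[dataset_action]


# Action 0 is the one seen in the logged data for this state.
flat = cql_penalty([1.0, 1.0, 1.0], dataset_action=0)
optimistic = cql_penalty([1.0, 1.0, 5.0], dataset_action=0)
```

The penalty grows when an unseen action (index 2) is assigned a high value, which is exactly the overestimate an offline learner would otherwise exploit.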

🛡️

Safety Layer

Black-box filter: maps any proposed action to the nearest safe action. Used in autonomous driving (safety pilot) and robotic control (joint limits, force limits).
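In the simplest case, where the safe set is an axis-aligned box such as per-joint torque limits, the nearest safe action in Euclidean distance is obtained by clipping each coordinate. The sketch below assumes that box-shaped safe set. Richer constraints (collision avoidance, state-dependent limits) would need an optimisation-based projection instead.

```python
def project_to_safe_box(action, lower, upper):
    """Return the closest action inside the box [lower, upper], coordinate by coordinate."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


# Hypothetical 3-joint command with limits of +/-1.0 on every joint.
proposed = [0.4, -2.5, 1.7]
safe = project_to_safe_box(proposed, lower=[-1.0] * 3, upper=[1.0] * 3)
```

The filter sits between the policy and the actuators, so it works with any policy without retraining: in-range commands pass through unchanged and out-of-range ones are pulled back to the limit.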

Property	Standard RL	Safe RL
Objective	Maximise reward	Maximise reward subject to constraints
Exploration	Allowed anywhere	Bounded to certified-safe region
Failure treatment	Learning experience — reset and continue	Potentially unacceptable — must avoid
Deployment context	Simulation, games	Physical systems, medical, autonomous vehicles
Reward signal	r(s,a)	r(s,a) − λ·c(s,a) (Lagrangian penalty)

RL Milestones Timeline Core

1957 Bellman Equations — "A Markovian Decision Process", dynamic programming foundation

1988 TD Learning — Sutton, "Learning to Predict by the Methods of Temporal Differences"

1989 Q-Learning — Watkins, tabular model-free optimal control; convergence proof 1992

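1992 REINFORCE — Williams, likelihood-ratio policy gradient estimator for stochastic neural network policies (the policy gradient theorem itself followed from Sutton et al. in 1999)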

1994 TD-Gammon ★ Tesauro — self-play TD learning reaches backgammon play close to the level of the best human players

2013 DQN preprint — DeepMind: "Playing Atari with Deep Reinforcement Learning" (arXiv)

2015 DQN Nature ★ One network architecture learns 49 Atari games from raw pixels, reaching human-level scores on more than half of them

2015 TRPO — Schulman et al., trust region policy optimisation, monotonic improvement guarantee

2016 AlphaGo ★ Beats Lee Sedol 4-1 — Move 37 in game 2 astonishes professional commentators

2017 PPO — Schulman et al., proximal policy optimisation, becomes a widely used default for on-policy training

2017 AlphaZero ★ Tabula rasa Chess/Go/Shogi — self-play alone surpasses the strongest prior programs within hours of training

2018 SAC — Haarnoja et al., Soft Actor-Critic, maximum-entropy off-policy method with strong continuous-control results

2019 OpenAI Five & Rubik's Cube ★ Dota 2 world champion team defeated · Dexterous robotic hand via domain randomisation

2020 MuZero ★ Plans with a learned model without being given the rules — strong results on Atari and board games with one algorithm

2022 AlphaTensor + InstructGPT/ChatGPT ★ New matrix multiplication algorithms · RLHF aligns LLMs to human preferences at scale

2023 DreamerV3 + DPO — Single algorithm across 7 domains · Direct Preference Optimisation without RL

2024–25 GRPO / DeepSeek-R1 + o1/o3 ★ RL for verifiable reasoning — new SOTA on AIME, GPQA, coding benchmarks

🎓 Domain 7 Complete — Reinforcement Learning

Ch 7.1 — RL = trial-and-error via rewards. MDP = (S,A,P,R,γ). Bellman equations are the recursive foundation of every RL algorithm.
Ch 7.2 — DP solves MDPs exactly with a full model. Policy iteration: evaluate + improve. Value iteration: repeated Bellman optimality sweeps that fold evaluation and improvement into one backup. Both converge.
Ch 7.3 — MC: average actual returns, no model. TD: online bootstrap after every step. TD(λ) bridges both via eligibility traces.
Ch 7.4 — Q-learning: off-policy max bootstrap → Q*. SARSA: on-policy actual-action → safer conservative policy. Cliff Walking shows the difference.
Ch 7.5 — DQN: Q-learning + CNN + experience replay + target network. Double DQN, Dueling, Rainbow: progressively improve stability and accuracy.
Ch 7.6 — REINFORCE: ∇log π · G — unbiased, high variance. Baseline → advantage. Actor-Critic: online TD δ from Critic guides Actor every step.
Ch 7.7 — PPO: clip ratio to [1−ε, 1+ε] for a stable on-policy default. SAC: max-entropy off-policy, a strong choice for continuous control. DreamerV3: train in imagination.
Ch 7.8 — AlphaZero: self-play + MCTS from scratch. RLHF: SFT→RM→PPO = aligned LLMs. DPO: direct preference without RL. Real-world challenges: reward design, safety, sim-to-real.

Reinforcement learning is the only ML paradigm where learning happens through consequences. Every major breakthrough covered here — AlphaGo's Move 37, ChatGPT's instruction-following, robots that grasp objects — came from agents discovering strategies that humans never explicitly programmed.

The next frontier: RL for reasoning (o1/o3/DeepSeek-R1) and long-horizon autonomous agents (Domain 8). The agent-environment loop from Chapter 7.1 becomes the agentic loop from Domain 8 — but now the environment includes the real world, and the agent is an LLM.
