AI Foundation ยท Domain 07

Reinforcement Learning

MDPs, Bellman equations, Q-learning, policy gradients, RLHF โ€” learning through reward signals from games to LLM alignment.

7.1
Chapter 7.1
RL Foundations โ€” Agent, Environment & the MDP Framework

Reinforcement learning is the closest thing AI has to how animals actually learn. No labelled examples. No explicit rules. Just an agent making decisions, receiving rewards or penalties, and slowly figuring out what works. It is the framework behind AlphaGo, ChatGPT's alignment, autonomous robots, and game-playing AIs.

Reinforcement learning (RL) is the third major machine learning paradigm โ€” alongside supervised and unsupervised learning. Instead of learning from labelled examples, an RL agent learns by interacting with an environment, receiving a reward signal, and adjusting its behaviour to maximise cumulative reward over time.

The learning signal is delayed and sparse. A chess program only receives +1 (win) or โˆ’1 (loss) at the very end of the game. It must figure out which of the 40โ€“80 moves it played actually caused that outcome โ€” the credit assignment problem, one of the central challenges in RL.

Three key differences from supervised learning: (1) there is no teacher providing correct action labels; (2) actions affect future states, so decisions are not independent; and (3) training data is generated by the agent's own behaviour โ€” a non-stationary, self-influencing distribution.

๐ŸŽฎ

Games & Simulations

  • Atari games (DQN)
  • Go (AlphaGo)
  • StarCraft II (AlphaStar)
  • Dota 2 (OpenAI Five)
  • Chess + RL (Stockfish)
๐Ÿค–

Robotics & Control

  • Robot locomotion
  • Manipulation & grasping
  • Autonomous driving
  • Drone control
  • Industrial automation
๐Ÿง 

Language & Alignment

  • RLHF for ChatGPT/Claude
  • Dialogue systems
  • Recommendation engines
  • Algorithmic trading
  • Drug discovery
Three ML Paradigms โ€” RL learns from sparse, delayed reward signals
SUPERVISED Input Model Loss vs correct label Correct answer at every step UNSUPERVISED Data Model Clusters / Embeddings No labels โ€” find structure REINFORCEMENT Agent Env Reward (sparse!) Learn from consequences

At every discrete time step t, the agent-environment interaction follows a four-step cycle: (1) agent observes state st; (2) selects action at via policy ฯ€; (3) environment transitions to st+1; (4) environment emits reward rt+1. This repeats until a terminal state (episodic) or runs indefinitely (continuing).

The most fundamental tension in RL is exploration vs exploitation. Exploitation means choosing the action believed best right now. Exploration means trying something new to discover potentially better strategies. Every RL algorithm must balance these two competing imperatives.

Agent-Environment Loop โ€” observe โ†’ act โ†’ receive reward โ†’ repeat
AGENT Policy ฯ€(a|s) Chooses actions ENVIRONMENT Transition P(s'|s,a) Returns state + reward Action a‑t State s‑t+1 Reward r‑t+1 Episode: s0,a0 โ–ถ s1,a1 โ–ถ s2,a2 โ–ถ โ€ฆ โ–ถ s‑T (END) Rewards: r1=0   r2=0   r3=0   โ€ฆ   rT=+100

Every RL problem is formalised as a Markov Decision Process โ€” defined by the 5-tuple (S, A, P, R, ฮณ): the state space S, action space A, transition probability P(s'|s,a), reward function R(s,a), and discount factor ฮณ โˆˆ [0,1].

The cornerstone assumption is the Markov Property: the future depends only on the current state, not on the full history. Formally: P(st+1 | st, at, โ€ฆ, s0, a0) = P(st+1 | st, at). The current state encodes all information needed to act optimally.

When the agent cannot observe the full state, the problem becomes a Partially Observable MDP (POMDP) โ€” the agent must maintain a belief state, a probability distribution over possible states, greatly increasing difficulty.

MDP Formal Definition MDP = (S, A, P, R, ฮณ) S = state space  |  A = action space  |  P(s'|s,a) = transition probability R(s,a) = expected immediate reward  |  ฮณ โˆˆ [0,1] = discount factor Markov Property: P(s‑t+1 | s‑t, a‑t, โ€ฆ, s0, a0) = P(s‑t+1 | s‑t, a‑t)
MDP State-Transition Diagram โ€” states, actions, probabilities, rewards
S1 Start S2 S3 S4 โ˜… GOAL P=0.8, R=0 slip P=0.2 P=0.9, R=0 P=0.8, R=+10 Cliff P=0.2, R=โˆ’100 Stochastic transitions โ€” the world is uncertain

The agent's goal is not to maximise the next reward, but the cumulative discounted reward over the entire episode. The return Gt from time step t is the sum of all future rewards, each discounted by ฮณแต where k is how many steps away it lies.

The discount factor ฮณ โˆˆ [0, 1] controls how much the agent values future rewards. With ฮณ = 0 the agent is purely myopic. With ฮณ = 1 all future rewards are equally valued (only valid for finite-horizon tasks). Typical values are ฮณ = 0.95โ€“0.99. The recursive form Gt = rt+1 + ฮณยทGt+1 is the key identity that makes Bellman equations tractable.

Return & Discount G‑t = r‑t+1 + ฮณr‑t+2 + ฮณยฒr‑t+3 + โ€ฆ = ฮฃ‑(k=0โ†’โˆž) ฮณแต r‑(t+k+1) ฮณ=0: myopic (only r‑t+1 matters)  |  ฮณ=1: fully far-sighted  |  ฮณ=0.99: typical production Recursive: G‑t = r‑t+1 + ฮณ ยท G‑t+1 โ† KEY identity for Bellman equations
Discount Factor ฮณ โ€” controls how much the agent values future vs immediate reward
Reward sequence: r1=0, r2=0, r3=0, r4=0, r5=+100 ฮณ = 0 G0 = 0 Blind to future rewards ฮณ = 0.9 G0 = 65.6 r5 discounted: 100 ร— 0.9โด Future discounted appropriately ฮณ = 1.0 G0 = 100 All future rewards = present value

A policy ฯ€ maps states to actions. A deterministic policy ฯ€(s) = a always picks the same action; a stochastic policy ฯ€(a|s) outputs a probability distribution over actions. The goal: find the optimal policy ฯ€* that maximises expected return.

The state value function Vฯ€(s) answers: "How good is state s under policy ฯ€?" The action-value function Qฯ€(s,a) is more granular: "How good is taking action a in state s, then following ฯ€?" Q-values are the direct targets of Q-learning and DQN. The advantage Aฯ€(s,a) = Q โˆ’ V measures how much better action a is compared to the average action from s โ€” used in actor-critic methods to reduce variance.

Value Functions V‑ฯ€(s) = E‑ฯ€[G‑t | S‑t=s] Q‑ฯ€(s,a) = E‑ฯ€[G‑t | S‑t=s, A‑t=a] Relationship: V‑ฯ€(s) = ฮฃ‑a ฯ€(a|s) Q‑ฯ€(s,a) Advantage: A‑ฯ€(s,a) = Q‑ฯ€(s,a) โˆ’ V‑ฯ€(s)
Value Functions on a Grid World โ€” V(s) heat and optimal policy arrows
A 2.1 3.4 4.8 5.9 2.8 4.5 6.2 7.4 3.2 5.6 โœ— โˆ’8 8.5 2.4 4.1 6.8 โ˜… +10 Q-values at state (0,0) โ€” Agent start Action Q(s,a) โ†‘ Up 2.5 โ† Best โ†“ Down 1.8 โ† Left 0.3 โ†’ Right 2.1 ฯ€*(0,0) = argmax Q = โ†‘ Up   |   A=Agent โ˜…=Goal โœ—=Cliff

Richard Bellman's 1957 insight is the mathematical spine of all RL. The Bellman Expectation Equation expresses Vฯ€(s) recursively: the value of state s equals the immediate reward plus the discounted value of the next state, averaged over all possible next states. This converts an infinite-horizon problem into a tractable recursive computation.

The Bellman Optimality Equation does the same for the optimal value function V*: the optimal value is obtained by always choosing the best possible action. Every major RL algorithm โ€” Q-learning, DQN, PPO, SAC โ€” is explicitly or implicitly solving a form of this equation.

The Bellman equation is why RL works at all: it converts predicting infinite-horizon returns into predicting one-step rewards plus bootstrapped future values. Without it, every policy gradient update would require complete Monte Carlo rollouts โ€” far too slow for complex tasks.

Bellman Expectation (policy ฯ€) V‑ฯ€(s) = ฮฃ‑a ฯ€(a|s) ฮฃ‑s' P(s'|s,a) [R(s,a,s') + ฮณV‑ฯ€(s')] Bellman Optimality Equations V*(s) = max‑a ฮฃ‑s' P(s'|s,a) [R(s,a,s') + ฮณV*(s')] Q*(s,a) = ฮฃ‑s' P(s'|s,a) [R(s,a,s') + ฮณ max‑a' Q*(s',a')] Optimal policy: ฯ€*(s) = argmax‑a Q*(s,a)
Bellman Backup โ€” value of s = best action ร— (reward + discounted next-state values)
s V*(s) a1 a2 a1 a2 0.7 0.2 0.1 s'1 s'2 s'3 0.6 0.4 s'4 s'5 V*(s) = max[a1,a2][ R(s,ai) + ฮณ ฮฃ P(s'|s,ai) V*(s') ]

RL algorithms split along two major axes: (1) whether they use an explicit model of transition dynamics, and (2) whether they optimise a value function, directly a policy, or both via an actor-critic hybrid.

RL Algorithm Taxonomy โ€” Model-Based vs Model-Free; Value vs Policy vs Actor-Critic
Reinforcement Learning Model-Based Dyn. Prog. Dyna-Q World Models MuZero Model-Free Value-Based Q-Learning SARSA DQN Rainbow Policy-Based REINFORCE TRPO/PPO Actor-Critic A3C PPO SAC Ch 7.2 Ch 7.3โ€“7.5 Ch 7.6 Ch 7.7
AlgorithmTypeAction SpaceNeeds Model?Sample EfficiencyStabilityChapter
Value IterationDPDiscreteYes (full)N/A (planning)High7.2
Q-LearningTD, off-policyDiscreteNoLowMedium7.4
SARSATD, on-policyDiscreteNoLowHigh7.4
DQNDeep Q, off-policyDiscreteNoMediumMedium7.5
REINFORCEPolicy GradientCont/DiscNoVery LowLow7.6
PPOActor-Critic, on-policyCont/DiscNoMediumHigh7.7
SACActor-Critic, off-policyContinuousNoHighHigh7.7
MuZeroModel-BasedDiscreteYes (learned)HighHigh7.8

โˆ‘ Chapter 7.1 โ€” Key Takeaways

  • RL: agent maximises cumulative reward via trial-and-error interaction โ€” no teacher, no labelled data
  • Credit assignment problem: which of 40โ€“80 chess moves caused the win/loss 50 steps later?
  • MDP: (S, A, P, R, ฮณ) โ€” the formal framework; Markov property: future depends only on current state
  • Return: G‑t = ฮฃฮณแตr‑t+k+1 โ€” discount factor ฮณ controls short vs long-term thinking
  • V‑ฯ€(s): "how good is this state?" โ€” Q‑ฯ€(s,a): "how good is this action in this state?"
  • Bellman equations: V*(s) = max‑a[R + ฮณฮฃPยทV*(s')] โ€” recursive foundation of every RL algorithm
  • Algorithm families: Model-Based vs Model-Free; Value-Based vs Policy-Based vs Actor-Critic
7.2
Chapter 7.2
Dynamic Programming โ€” Solving MDPs with a Perfect Model

Dynamic Programming (DP) is a family of algorithms that solve MDPs exactly โ€” given perfect knowledge of the environment model: every transition probability P(s'|s,a) and every reward R(s,a). Because real environments rarely hand us this model, DP is not deployed in production RL โ€” but it is the indispensable theoretical bedrock every modern algorithm is built on.

๐Ÿ“
Full Model Required

Knows P(s'|s,a) and R(s,a) for every (s,a,s') tuple โ€” the "planning" setting.

๐Ÿ”
Iterative Bellman Sweeps

Repeatedly apply the Bellman operator T across all states until V converges to V*.

โœ…
Guaranteed Convergence

The Bellman operator is a ฮณ-contraction โ€” V_k โ†’ V* is mathematically guaranteed.

Bellman Operator (applied each sweep): Vk+1(s) โ† (๐’ฏVk)(s) Under mild conditions: Vk โ†’ V* as k โ†’ โˆž

Two main algorithms arise from DP: Policy Iteration (evaluate then improve a fixed policy in alternating steps) and Value Iteration (combine both steps into a single max-Bellman update). Both converge; they differ in computational trade-offs.

Given a fixed policy ฯ€, Policy Evaluation computes Vฯ€(s) for every state by repeatedly applying the Bellman expectation equation until the values stop changing (convergence threshold ฮธ).

Policy Evaluation update rule: Vk+1(s) โ† ฮฃa ฯ€(a|s) ฮฃs' P(s'|s,a) [R(s,a,s') + ฮณ Vk(s')] Convergence: Vk โ†’ Vฯ€ as kโ†’โˆž (Bellman operator is a contraction mapping)
Policy Evaluation โ€” V values converge after repeated Bellman sweeps
k = 0 (init) k = 1 (1 sweep) k = โˆž (converged) 0.0 0.0 +1 GOAL 0.0 0.0 0.0 โˆ’1 TRAP 0.0 0.0 0.0 +0.25 +1 GOAL โˆ’0.25 0.0 +0.25 โˆ’1 TRAP โˆ’0.25 0.0 +0.42 +0.73 +1 GOAL 0.0 +0.18 +0.42 โˆ’1 TRAP โˆ’0.42 0.0 ฮณ = 1.0 ยท Equiprobable random policy ฯ€(a|s) = 0.25 ยท Values converge from 0 toward gradient guided by terminal rewards

Once we have Vฯ€ for a fixed policy ฯ€, the Policy Improvement Theorem guarantees we can construct a strictly better (or equal) policy by acting greedily with respect to Vฯ€.

Greedy policy improvement: ฯ€'(s) = argmaxa ฮฃs' P(s'|s,a) [R(s,a,s') + ฮณ Vฯ€(s')] Theorem: Vฯ€'(s) โ‰ฅ Vฯ€(s) for all s โ€” policy ฯ€' is at least as good
๐Ÿ“Š
Evaluate first

Run policy evaluation to get accurate Vฯ€ values for the current policy.

โฌ†๏ธ
Improve greedily

New policy ฯ€'(s) takes the action maximising Qฯ€(s,a) at each state.

๐Ÿ
Stop when stable

If ฯ€' = ฯ€ (no action changed), the policy is already optimal: ฯ€ = ฯ€*.

Policy Iteration alternates between Policy Evaluation and Policy Improvement until the policy stabilises. Despite each evaluation requiring many inner iterations, the outer loop converges in very few steps โ€” typically 2โ€“10 even for large MDPs.

Policy Iteration โ€” evaluate current policy then improve greedily
Policy ฯ€ current estimate evaluate Policy Evaluation V^ฯ€ for current ฯ€ improve Policy Improvement ฯ€' = greedy(V^ฯ€) if ฯ€' โ‰  ฯ€ โ€” repeat (guaranteed improvement each iteration) ฯ€'=ฯ€ โ†’ ฯ€*
// Policy Iteration
Initialise ฯ€(s) = arbitrary policy for all s โˆˆ S
Initialise V(s) = 0 for all s โˆˆ S

LOOP (outer โ€” policy improvement):
  // โ”€โ”€ Step 1: Policy Evaluation (inner loop) โ”€โ”€
  LOOP:
    ฮ” = 0
    FOR each state s โˆˆ S:
      v = V(s)
      V(s) โ† ฮฃ_a ฯ€(a|s) ฮฃ_s' P(s'|s,a) [R(s,a,s') + ฮณยทV(s')]
      ฮ” = max(ฮ”, |v - V(s)|)
    IF ฮ” < ฮธ: BREAK   // inner convergence

  // โ”€โ”€ Step 2: Policy Improvement โ”€โ”€
  policy_stable = TRUE
  FOR each state s โˆˆ S:
    old_action = ฯ€(s)
    ฯ€(s) โ† argmax_a ฮฃ_s' P(s'|s,a) [R(s,a,s') + ฮณยทV(s')]
    IF old_action โ‰  ฯ€(s): policy_stable = FALSE

  IF policy_stable: RETURN ฯ€, V   // converged to ฯ€*, V*

Value Iteration eliminates the inner evaluation loop by fusing policy evaluation and improvement into a single Bellman optimality update. Instead of summing over ฯ€(a|s), it takes the max over actions โ€” implicitly acting greedily at every step.

Value Iteration update (repeat until convergence): Vk+1(s) โ† maxa ฮฃs' P(s'|s,a) [R(s,a,s') + ฮณ Vk(s')] After convergence โ†’ extract: ฯ€*(s) = argmaxa ฮฃs' P(s'|s,a) [R(s,a,s') + ฮณ V*(s')]
// Value Iteration
Initialise V(s) = 0 for all s โˆˆ S
Set ฮธ = 1e-6 (convergence threshold), ฮณ โˆˆ [0,1)

LOOP:
  ฮ” = 0
  FOR each state s โˆˆ S:
    v = V(s)
    // Bellman optimality operator โ€” single sweep, no inner loop
    V(s) โ† max_a [ ฮฃ_s' P(s'|s,a) * (R(s,a,s') + ฮณ * V(s')) ]
    ฮ” = max(ฮ”, |v - V(s)|)
  IF ฮ” < ฮธ: BREAK   // values converged

// Extract greedy policy from V*
FOR each state s โˆˆ S:
  ฯ€*(s) โ† argmax_a [ ฮฃ_s' P(s'|s,a) * (R(s,a,s') + ฮณ * V(s')) ]

RETURN ฯ€*, V*
Value Iteration โ€” values propagate from goal outward until convergence
k = 0 k = 1 k = 5 k = โˆž (ฯ€*) +1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.9 +1 0.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0.59 0.73 0.90 +1 0.41 0.59 0.73 0.90 0.26 0.41 0.59 0.73 0.15 0.26 0.41 0.59 0.73 0.81 0.90 +1 0.66 0.73 0.81 0.90 0.59 0.66 0.73 0.81 0.53 0.59 0.66 0.73 โ†’ โ†’ โ†’ โ†‘ โ†’ โ†‘ โ†‘ โ†‘ โ†‘ โ†‘ โ†‘ โ†‘ โ†‘ โ†’ โ†‘ Iteration k max |ฮ”V| ฮธ 0 5 10 20 2.0 0 ฮณ = 0.9 ยท Goal at top-right ยท Values propagate outward each sweep ยท Gold arrows = ฯ€* (greedy w.r.t. V*)

DP is mathematically elegant but practically limited. Three hard walls prevent it from scaling to real-world problems โ€” and each wall motivated a different branch of modern RL.

๐Ÿ—บ๏ธ
Requires Full Model

P(s'|s,a) and R(s,a) must be known exactly. Real environments (games, robots, markets) don't hand you their dynamics.

๐Ÿ’ฅ
Curse of Dimensionality

State space explodes exponentially. Atari screen: 160ร—210ร—128 colours. Chess: ~1044 states. Full sweeps become impossible.

โฑ๏ธ
Full-State Sweeps

Every iteration updates every state: O(|S|ยฒ|A|). Wastes compute on rarely-visited states. Sample-based methods only update visited states.

Property Dynamic Programming Model-Free RL (upcoming)
Environment model Requires P(s'|s,a) and R Learns from sampled experience
Computation per step O(|S|ยฒ|A|) โ€” sweeps all states O(1) per sample โ€” updates visited states only
Convergence Guaranteed (exact) Guaranteed under conditions
Value representation Table โ€” one entry per state Neural network (deep RL) โ€” generalises
Applicability Small, known, discrete MDPs Unknown, large, continuous environments

Despite its impracticality, DP is the theoretical target all RL algorithms approximate. Monte Carlo methods replace full sweeps with sampled episodes. TD learning (Chapter 7.3) bootstraps from partial trajectories. Deep RL (Chapter 7.5+) replaces the value table with a neural network โ€” enabling Atari, Go, and robotic control at scale.

โˆ‘ Chapter 7.2 โ€” Key Takeaways

  • DP requires the full MDP model (P and R) โ€” not available in real problems but conceptually essential
  • Policy Evaluation: iterative Bellman expectation updates converge to Vฯ€ for a fixed policy
  • Policy Improvement Theorem: greedy w.r.t. Vฯ€ is always at least as good as ฯ€
  • Policy Iteration: alternate evaluation + improvement โ€” guaranteed convergence in very few outer iterations
  • Value Iteration: single Bellman optimality sweep per iteration โ€” simpler, no inner loop, equally guaranteed
  • Limitation: curse of dimensionality โ€” model-free methods overcome this by learning from sampled experience
7.3
Chapter 7.3
Monte Carlo & TD Learning โ€” Model-Free Prediction and Control

Monte Carlo (MC) methods learn directly from complete episodes of experience โ€” no model of P(s'|s,a) needed. The idea is disarmingly simple: visit state s many times, record the actual return Gt each time, and average them. Like estimating the probability of heads by flipping a coin many times and averaging the results.

๐ŸŽฒ
No Model Required

Learn directly from sampled episodes โ€” P(s'|s,a) never needed.

โณ
Wait for Episode End

Must complete a full episode before any value update. Episodic tasks only.

๐Ÿ“Š
Average Actual Returns

V(s) โ‰ˆ mean of all Gt observed when s was visited โ€” unbiased estimate.

MC Value Update (incremental form): V(s) โ† V(s) + ฮฑ [Gt โˆ’ V(s)] Gt = actual return from this episode ยท ฮฑ = learning rate ยท [Gt โˆ’ V(s)] = error between actual return and current estimate

Two variants: First-visit MC averages returns from only the first visit to s per episode; Every-visit MC averages returns from every visit. Both converge to Vฯ€ with enough data.

The prediction problem: estimate Vฯ€(s) for a given policy ฯ€. Algorithm: run many episodes following ฯ€, compute actual returns at each visit to s, average them. Classic example โ€” Blackjack: state is (player sum, dealer card, usable ace?); policy is hit if sum < 20 else stick. After 10,000 episodes a clear value landscape emerges.

MC Prediction โ€” average actual returns from many episodes to estimate V(s)
Episode Trajectories through State s Ep 1 s +8 Ep 2 s โˆ’2 Ep 3 s +6 Ep 4 s +4 = state s visited = terminal (goal) = terminal (fail) Running Average V(s) Episode V(s) 8 4 0 โˆ’2 1 2 3 4 V*โ‰ˆ4 8.0 3.0 4.0 4.0 V(s) converges to true V^ฯ€ โ‰ˆ 4.0 as more episodes are averaged

Control means finding the optimal policy, not just evaluating a given one. MC Control applies the GPI (Generalised Policy Iteration) loop using MC evaluation โ€” but requires special care around exploration.

๐ŸŽฏ
Update Q(s,a)

Estimate Q(s,a) from actual returns โ€” then improve: ฯ€(s) = argmaxa Q(s,a).

๐Ÿ”
Exploration Problem

Must visit all (s,a) pairs. Exploring starts: begin episodes at random (s,a). ฮต-greedy: with prob ฮต take random action.

๐Ÿ”„
On vs Off-Policy

On-policy: improve policy being followed. Off-policy: follow behaviour policy b, evaluate target ฯ€ via importance sampling.

TD learning is the central idea of modern RL โ€” it combines the best of Monte Carlo and Dynamic Programming. Like MC: learns from raw experience, no model needed. Like DP: bootstraps โ€” updates V(s) using the current estimate of V(s'), not the full actual return. The payoff: update after every single step, not after the whole episode.

๐Ÿ“ก
Online Updates

Update after every transition (s, a, r, s'). Works for continuing tasks โ€” no episode boundary needed.

๐Ÿ”—
Bootstrapping

Uses V(s') to update V(s) โ€” the "TD target" rโ‚œโ‚Šโ‚ + ฮณV(sโ‚œโ‚Šโ‚) replaces the full return Gt.

โšก
TD Error ฮดโ‚œ

ฮดโ‚œ = rโ‚œโ‚Šโ‚ + ฮณV(sโ‚œโ‚Šโ‚) โˆ’ V(sโ‚œ) โ€” the "surprise". The brain's dopamine signal is a biological TD error.

TD(0) Update: V(st) โ† V(st) + ฮฑ ยท ฮดt ฮดt = rt+1 + ฮณ V(st+1) โˆ’ V(st) โ† TD error rt+1 = observed reward ยท V(st+1) = bootstrapped estimate ยท ฮฑ = learning rate

TD(0) is the simplest TD algorithm โ€” 1-step bootstrapping. It updates V(st) immediately after observing the next reward and state, using only one step of lookahead.

// TD(0) Policy Evaluation
Initialise V(s) = 0 for all s โˆˆ S
Set ฮฑ = 0.1, ฮณ = 0.9

FOR each episode:
  Initialise state s
  LOOP (each step t until terminal):
    a  โ† ฯ€(a|s)                    // follow policy
    r, s' โ† env.step(a)            // observe reward and next state

    ฮด = r + ฮณ * V(s') - V(s)       // TD error
    V(s) โ† V(s) + ฮฑ * ฮด            // online update โ€” no need to wait!

    s โ† s'
  END LOOP
END FOR

// Worked trace (chain: A โ†’ B โ†’ C โ†’ GOAL, ฮณ=0.9, ฮฑ=0.1):
// t=0: s=A, r=0, s'=B  โ†’ ฮด=0+0.9ร—0โˆ’0=0    โ†’ V(A)=0
// t=1: s=B, r=0, s'=C  โ†’ ฮด=0+0.9ร—0โˆ’0=0    โ†’ V(B)=0
// t=2: s=C, r=+1, s'=โˆ… โ†’ ฮด=1+0.9ร—0โˆ’0=+1   โ†’ V(C)=0.1
// (next episode, s=C again):
//   ฮด = 1+0โˆ’0.1 = +0.9  โ†’ V(C)=0.1+0.1ร—0.9=0.19
// After many episodes:  V(C)โ†’1.0, V(B)โ†’0.9, V(A)โ†’0.81
TD vs MC vs DP โ€” online bootstrap vs wait-for-return vs model sweep
Episode: s0 โ†’(r=0)โ†’ s1 โ†’(r=0)โ†’ s2 โ†’(r=+10)โ†’ TERMINAL MC s0 r=0 s1 r=0 s2 r=+10 END โณ Wait for episode end โ†’ G0=ฮณยฒร—10โ‰ˆ8.1 โ†’ update all states TD(0) s0 s1 s2 END โ†‘ update V(s0) โ†‘ update V(s1) โ†‘ update V(s2) โšก Online โ€” update each step DP Updates ALL states via full Bellman sweep โ€” requires model P(s'|s,a) ๐Ÿ—บ๏ธ Full model sweep ยท O(|S|ยฒ|A|) per iteration ยท Never visits env TD is the sweet spot: model-free like MC + online like DP
Property Monte Carlo TD(0) Dynamic Programming
Needs model? No โ€” samples No โ€” samples Yes โ€” full P, R
Update timing End of episode Every step (online) Every sweep (all states)
Bootstraps? No โ€” actual return Gt Yes โ€” uses V(s') estimate Yes โ€” from model
Bias / Variance Zero bias, high variance Some bias, lower variance Zero bias, zero variance
Continuing tasks No โ€” needs terminal Yes Yes
Memory per update Full episode stored st, at, rt+1, st+1 only Full V(s) table
Convergence To Vฯ€ (enough data) To Vฯ€ (under conditions) To Vฯ€ exactly

TD(0) only updates the immediately preceding state; MC updates all states in an episode. TD(ฮป) interpolates between them via a parameter ฮป โˆˆ [0,1]: at ฮป=0 it is pure TD(0); at ฮป=1 it is equivalent to MC. The mechanism is the eligibility trace eโ‚œ(s) โ€” a running score of how recently and frequently each state was visited.

Eligibility trace update (for every state s, every step): et(s) = ฮณฮป ยท et-1(s) + ๐Ÿ™[s = st] V(s) โ† V(s) + ฮฑ ยท ฮดt ยท et(s) for all s ฮป=0 โ†’ only current state updated (TD(0)) ยท ฮป=1 โ†’ all states updated (โ‰ˆ MC)
Eligibility Traces โ€” decay proportional to recency and frequency of state visits
Time step t Eligibility e(s) 1.0 0.5 0 2 5 7 12 17 visit visit visit e(A) visit e(B) reward At reward: ฮ”V(A) = ฮฑยทฮดยทe(A) (large โ€” recently & frequently visited) ยท ฮ”V(B) = ฮฑยทฮดยทe(B) (small โ€” faded)

โˆ‘ Chapter 7.3 โ€” Key Takeaways

  • MC: learn from complete episodes โ€” average actual returns โ€” no model, zero bias, high variance
  • TD: learn online after every step via bootstrapping โ€” combines strengths of MC and DP
  • TD error ฮดt = rt+1 + ฮณV(st+1) โˆ’ V(st) โ€” the "surprise" that drives learning
  • TD(0): lower variance than MC but biased by bootstrapped V(s') estimate
  • TD vs MC: TD is more data-efficient and works on continuing tasks โ€” MC needs full episodes
  • TD(ฮป): eligibility traces bridge TD(0) and MC โ€” ฮป=0 is TD(0), ฮป=1 is equivalent to MC
7.4
Chapter 7.4
Q-Learning & SARSA โ€” Tabular TD Control Algorithms

Watkins (1989) introduced Q-Learning โ€” still the most widely taught RL control algorithm. It learns the optimal action-value function Q*(s,a) directly, regardless of what policy is being followed during training. This makes it off-policy: the agent can explore freely, and the update still converges to the optimal values.

๐ŸŽฏ
Learns Q*(s,a) Directly

No need to know the model. Converges to the optimal Q-function under mild tabular conditions.

๐Ÿ“ด
Off-Policy

Update uses maxa' Q(s',a') โ€” the best possible next action, not the one actually taken.

๐Ÿ†
Optimal Policy

After convergence: ฯ€*(s) = argmaxa Q*(s,a). Extract policy for free from Q-table.

Q-Learning Update: Q(st,at) โ† Q(st,at) + ฮฑ ยท [rt+1 + ฮณ ยท maxa' Q(st+1,a') โˆ’ Q(st,at)] TD target: rt+1 + ฮณยทmaxa' Q(st+1,a') (greedy next) ยท TD error ฮด = target โˆ’ Q(st,at) ยท max uses BEST possible action, not the one actually taken
// Q-Learning Algorithm
Initialise Q(s,a) = 0 for all s โˆˆ S, a โˆˆ A
Set ฮฑ=0.1, ฮณ=0.99, ฮต=1.0

FOR each episode:
  s = env.reset()
  LOOP until terminal:
    // ฮต-greedy action selection
    IF random() < ฮต:  a = random action       // explore
    ELSE:              a = argmax_a Q(s,a)     // exploit

    r, s' = env.step(a)

    // Off-policy update: always uses max over next actions
    best_next = max over a' of Q(s', a')
    Q(s,a) += ฮฑ * (r + ฮณ * best_next - Q(s,a))

    s = s'

  ฮต = max(ฮต_min, ฮต * ฮต_decay)    // decay exploration rate

In tabular RL the Q-function is stored as a 2-D table: rows are states, columns are actions, cells hold Q(s,a) estimates. Each transition updates one cell. Below: a step-by-step trace through a 4-state chain MDP, and a full Gymnasium implementation.

Q-Table Update โ€” trace through one transition with actual numbers
4-State Chain MDP S0 S1 S2 โ† HERE S3 +10 Right Left Right Left Right (+10) Q-Table State Left Right S0 0.0 0.0 S1 0.0 2.7 S2 โ† 0.0 0.0โ†’1.0 S3 (terminal) โ€” โ€” Transition: s=S2, a=Right, r=+10, s'=S3 max Q(S3,ยท) = max(โ€”, โ€”) = 0 TD target = 10 + 0.9 ร— 0 = 10.0 TD error = 10.0 โˆ’ 0.0 = 10.0 Q(S2,Right) โ† 0 + 0.1ร—10 = 1.0 โœ“
import gymnasium as gym
import numpy as np

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states  = env.observation_space.n   # 16 states (4ร—4 grid)
n_actions = env.action_space.n        # 4 actions

Q = np.zeros((n_states, n_actions))

alpha, gamma  = 0.1, 0.99
epsilon       = 1.0
eps_min       = 0.01
eps_decay     = 0.995
n_episodes    = 5000

for ep in range(n_episodes):
    state, _ = env.reset()

    for _ in range(200):
        # ฮต-greedy action
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        next_state, reward, done, truncated, _ = env.step(action)

        # Q-Learning update
        td_target        = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])

        state = next_state
        if done or truncated: break

    epsilon = max(eps_min, epsilon * eps_decay)

print("Learned Q-table:\n", Q.reshape(4, 4, 4).round(2))

The exploration-exploitation dilemma is central to all RL: to learn good values you must explore, but to perform well you must exploit. ฮต-greedy is the simplest solution โ€” take a random action with probability ฮต, else take the best known action. ฮต is annealed from 1.0 (pure explore) down to a small floor (mostly exploit).

๐ŸŽฒ
ฮต = 1.0 (start)

Fully random โ€” discover the environment before exploiting any knowledge.

๐Ÿ“‰
ฮต decays each episode

ฮต โ† max(ฮต_min, ฮต ร— 0.995) โ€” learning happens in the transition zone.

๐Ÿ†
ฮต = 0.01 (end)

Mostly exploit Q*; tiny ฮต keeps discovering state-action pairs not yet visited.

ฮต Decay โ€” exploration gives way to exploitation as Q-table fills in
Episode 0 500 1000 1500 2000 1.0 0.5 0.0 Exploration Transition Exploitation random ฮต (exploration rate) Avg reward Learning peaks in transition zone โ€” agent knows enough to exploit but still explores

SARSA (Rummery & Niranjan, 1994) is Q-learning's on-policy cousin. The name comes from the 5-tuple it uses per update: (St, At, Rt+1, St+1, At+1). The key difference: the update uses the actual next action taken โ€” not the greedy max.

SARSA Update: Q(st,at) โ† Q(st,at) + ฮฑ ยท [rt+1 + ฮณ ยท Q(st+1,at+1) โˆ’ Q(st,at)] at+1 = actual next action from ฮต-greedy โ† ON-POLICY ยท vs Q-Learning: maxa' Q(st+1,a') โ† OFF-POLICY
// SARSA Algorithm
FOR each episode:
  s = env.reset()
  a = ฮต_greedy(Q, s)        // select FIRST action

  LOOP until terminal:
    r, s' = env.step(a)
    a' = ฮต_greedy(Q, s')    // select NEXT action BEFORE updating

    // SARSA: uses actual (s, a, r, s', a') โ€” not max
    Q(s,a) += ฮฑ * (r + ฮณ * Q(s',a') - Q(s,a))

    s = s'
    a = a'                  // carry forward the chosen action

The on/off-policy distinction is subtle but has a concrete consequence: near danger. Q-learning sees the world through the lens of optimal behaviour and ignores its own exploration noise. SARSA accounts for the fact that it sometimes takes random actions โ€” and learns a more cautious policy as a result.

Property Q-Learning (Off-Policy) SARSA (On-Policy)
Learns Optimal Q*(s,a) directly Qฯ€ for policy being followed
Next-action in update maxa' Q(s',a') โ€” best possible Q(s',a') โ€” actual action taken
Behaviour = Target policy? No โ€” can use any behaviour policy Yes โ€” same policy for exploration & learning
Convergence speed Faster to optimal Slower โ€” more conservative
Near hazards Risky โ€” ignores exploration falls Safer โ€” accounts for ฮต random steps
Used in DQN, most deep RL Safety-critical, tabular control

The Cliff Walking benchmark (Sutton & Barto) makes the on/off-policy difference vivid. A 4ร—12 grid: start bottom-left, goal bottom-right. The entire bottom row between them is cliff โ€” R=โˆ’100 and agent resets. Normal steps cost R=โˆ’1.

Cliff Walk โ€” Q-Learning optimal but risky; SARSA safe but suboptimal
Q-Learning Path (Optimal) SARSA Path (Safe) S CLIFF (โˆ’100) G โ†’ optimal (risky) S CLIFF (โˆ’100) G โ†’ safe path (avoids cliff) Training Performance Episodes (0โ€“500) โˆ’17 โˆ’13 Q-L (โˆ’13) SARSA (โˆ’17) SARSA: safer training โ€” rarely falls off cliff ยท Q-Learning: volatile but finds shorter path

โˆ‘ Chapter 7.4 โ€” Key Takeaways

  • Q-learning: Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmaxa' Q(s',a') โˆ’ Q(s,a)] โ€” off-policy
  • SARSA: Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทQ(s',a') โˆ’ Q(s,a)] โ€” on-policy (actual next action)
  • ฮต-greedy: ฮต=1.0 โ†’ pure explore โ†’ anneal to ฮต_min as agent learns the Q-table
  • Q-learning: finds optimal path; SARSA: finds safe path accounting for ฮต exploration noise
  • Both converge โ€” Q-learning to Q*, SARSA to Qฯ€ โ€” for tabular environments with sufficient data
  • Limitation: tabular โ€” can't handle large/continuous state spaces โ†’ solved by DQN (Ch 7.5)
7.5
Chapter 7.5
Deep Q-Networks โ€” Deep Learning Meets Reinforcement Learning

Tabular Q-learning stores one number per (state, action) pair. That works fine for small, discrete environments โ€” but consider Atari Pong: 160ร—210 pixels, 128 colours per pixel โ†’ ~10568,000 possible states. A Q-table is not just impractical, it is cosmologically impossible. Continuous control (robot joints, autonomous driving) has an infinite state space.

The solution: approximate Q(s,a) โ‰ˆ Q(s,a;ฮธ) using a parameterised function (neural network with weights ฮธ). The network generalises โ€” similar inputs produce similar outputs โ€” so states never seen before can still receive reasonable Q-value estimates. A Deep Q-Network (DQN) is a CNN mapping raw pixels directly to Q-values.

DQN was first published in Nature (2015) with a single striking result: one network, one set of hyperparameters, learned to play 49 Atari games from raw pixels better than a professional human game tester on 29 of them โ€” with zero game-specific knowledge. This was the moment deep learning and RL merged into "deep reinforcement learning".

Mnih et al. โ€” "Human-level control through deep reinforcement learning", Nature 2015
โ™พ๏ธ
Infinite State Spaces

A network with millions of weights can represent Q-values over an astronomically large state space.

๐Ÿง 
Generalisation

Similar pixel patterns โ†’ similar Q-values. The agent recognises "ball near paddle" across screen positions.

๐Ÿ•น๏ธ
Single Architecture

Same CNN, same loss function, same hyperparameters โ€” 49 games, no task-specific tuning.

The DQN input is 4 stacked 84ร—84 grayscale frames โ€” the stack gives the network a sense of motion (velocity of the ball, direction of movement) that a single frame cannot convey. Three convolutional layers extract spatial features; two FC layers decode those features into one Q-value per action.

DQN Architecture โ€” raw Atari pixels โ†’ Q-values for each action
(4,84,84) 4 frames Conv1 32ร—20ร—20 8ร—8, s=4 Conv2 64ร—9ร—9 4ร—4, s=2 Conv3 64ร—7ร—7 3ร—3, s=1 Flatten 3136 FC1 512 ReLU FC2 |A|=6 Q-values per action L R U D F NF 4.2 7.8 3.1 5.4 6.1 4.7 argmax RIGHT Input: 4 stacked grayscale frames capture motion ยท Conv layers: no pooling (preserves position) ยท Output: one Q-value per available action
import torch
import torch.nn as nn

class DQN(nn.Module):
    """DQN as in the original Atari paper (Mnih et al. 2015)"""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions)   # one output per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) โ€” normalise pixels to [0, 1]
        return self.fc(self.conv(x / 255.0))

net = DQN(n_actions=6)
x = torch.randint(0, 255, (32, 4, 84, 84), dtype=torch.float32)
q_values     = net(x)                      # (32, 6) โ€” Q-value per action
best_actions = q_values.argmax(dim=1)      # (32,)  โ€” greedy action
print(f"Output shape: {q_values.shape}")   # torch.Size([32, 6])

Naively applying gradient descent to TD errors with a neural network diverges. DQN introduces two stabilisation innovations that together tame the instability and make deep RL practical.

๐Ÿ—„๏ธ
Problem: Correlation

Sequential transitions (sโ‚€โ†’sโ‚โ†’sโ‚‚โ€ฆ) are highly correlated. Neural nets need i.i.d. samples โ€” not time-series data.

๐Ÿ”€
Replay Buffer

Store 1M transitions. Sample random mini-batches of 32. Each transition reused many times โ†’ better data efficiency.

๐ŸŽฏ
Target Network

Frozen copy Q(s,a;ฮธโป) updated every 1000 steps. Stable TD target = stable training. Without it: chasing a moving target.

Experience Replay + Target Network โ€” the two stabilisation innovations of DQN
Experience Replay Agent/Env (s,a,r,s') Replay Buffer N = 1M transitions oldest overwrite sample B=32 Online Net Q(s,a;ฮธ) updated every step Random sample breaks temporal correlations Target Network Online Q(ฮธ) updated every step โ† gradient descent Target Q(ฮธโป) frozen weights synced every 1000 steps copy ฮธโ†’ฮธโป every C=1000 Loss = ||r + ฮณยทmax Q(s',a';ฮธโป) โˆ’ Q(s,a;ฮธ)||ยฒ โ† stable target

The full DQN training loop combines all the pieces into a clean 7-step cycle that repeats millions of times until convergence.

โ‘  Observe st

Stack 4 frames, preprocess to 84ร—84 grayscale

โ‘ก Select at

ฮต-greedy from Q-network output

โ‘ข Execute at

Observe rt+1, st+1 from env

โ‘ฃ Store

Push (st, at, rt+1, st+1) to replay buffer

โ‘ค Sample batch

32 random transitions from replay buffer

โ‘ฅ Compute loss

TD error using frozen target network ฮธโป

โ‘ฆ Update ฮธ

Backprop; sync ฮธโป โ† ฮธ every 1K steps

DQN Training Curve โ€” characteristic rise from random to human-level
Training steps (millions) Avg episode reward 0 10 20 30 40 50 0 100 200 300 400 human random Buffer filling Rapid rise Near-optimal plateau Shaded band = ยฑ1 std across seeds ยท DQN on Space Invaders, ฮณ=0.99, replay=1M, C=1000

Van Hasselt et al. (2016) identified that standard DQN systematically overestimates Q-values. The culprit: maxa' Q(s',a';ฮธโป) always selects the highest value โ€” if any Q-value is overestimated (inevitable early in training), it picks that inflated value, and the bias propagates. Fix: decouple action selection from action evaluation.

Standard DQN target (biased): y = r + ฮณ ยท maxa' Q(s',a';ฮธโป) Double DQN target (unbiased): a* = argmaxa Q(s',a; ฮธ) โ† online network selects action y = r + ฮณ ยท Q(s', a*; ฮธโป) โ† target network evaluates it One network selects, a different network evaluates โ†’ overestimation bias cancelled
๐Ÿ“ˆ
Why Overestimation Happens

max of noisy Q-values is always โ‰ฅ true max. With neural nets, noise is everywhere early in training.

โœ‚๏ธ
Decouple Select & Evaluate

Online net (ฮธ) picks the best action. Target net (ฮธโป) scores it. Two different noise sources cancel.

๐Ÿš€
Better Performance

More accurate value estimates โ†’ better policy decisions โ†’ higher scores on most Atari games.

Wang et al. (2016) observed that in many states the choice of action barely matters โ€” an empty screen in Pong, a static field in Breakout. Only near the ball does the action have a significant effect. The Dueling DQN architecture exploits this by splitting Q(s,a) into two components: V(s) (how good is this state, regardless of action) and A(s,a) (how much better is this specific action than average).

Q decomposition: Q(s,a) = V(s) + A(s,a) โˆ’ meana A(s,a) Subtract mean for identifiability (otherwise V and A can trade off arbitrarily)
Dueling DQN โ€” separate Value and Advantage streams combined for Q
Standard DQN Conv layers FC 512 Q(s,a) [Qโ‚ Qโ‚‚ Qโ‚ƒ Qโ‚„] Dueling DQN Conv layers Shared FC 512 V(s) scalar A(s,a) ร—6 V + A โˆ’ mean [Qโ‚ Qโ‚‚ Qโ‚ƒ Qโ‚„] same format Key insight: many states V(s) matters more than which action โ€” dueling learns both separately Empty game screen: any action โ‰ˆ equivalent โ†’ V(s) dominates ยท Near ball/enemy: A(s,a) matters

Hessel et al. (DeepMind, 2018) asked: what if we combine all the improvements to DQN into one agent? The result โ€” Rainbow โ€” obliterated all prior benchmarks on Atari while using far fewer environment interactions.

Extension Key Innovation Benefit Year
DQN CNN + replay buffer + target network Stable deep RL on raw pixels 2015
Double DQN Decouple selection/evaluation Reduce Q-value overestimation 2016
Prioritised Replay Sample high-TD-error transitions more Better data efficiency 2016
Dueling DQN V(s) + A(s,a) decomposition Better state-value learning 2016
Multi-step Returns n-step bootstrap target Faster credit assignment 2016
Distributional RL (C51) Predict full reward distribution (not mean) Better risk modelling 2017
Noisy DQN Learnable noise in FC weights replaces ฮต State-dependent exploration 2017
Rainbow All 6 improvements combined State-of-the-art Atari (3ร— DQN data efficiency) 2018

โˆ‘ Chapter 7.5 โ€” Key Takeaways

  • DQN: Q-learning with a CNN as function approximator โ€” scales to pixel-level inputs that destroy tabular methods
  • Experience replay: random mini-batches from a large buffer break temporal correlations and reuse data
  • Target network: frozen copy updated every C steps stabilises the TD target โ€” without it, training diverges
  • Double DQN: decouple action selection (online ฮธ) from evaluation (target ฮธโป) โ€” eliminates overestimation bias
  • Dueling DQN: Q(s,a) = V(s) + A(s,a) โ€” separate heads for state value and action advantage, better generalisation
  • Rainbow: 6 improvements combined โ€” state-of-the-art discrete-action deep RL, 3ร— more data-efficient than DQN
7.6
Chapter 7.6
Policy Gradient Methods & Actor-Critic โ€” Direct Policy Optimisation

Q-learning and DQN are powerful โ€” but they require enumerating a Q-value for every possible action. That works for discrete actions (6 Atari buttons) but breaks immediately for continuous action spaces: robot joint torques, steering angles, portfolio weights. How do you take argmax over an infinite set? You can't.

Policy gradient methods sidestep this by directly parameterising the policy ฯ€(a|s;ฮธ) as a neural network and performing gradient ascent on expected return J(ฮธ). They also naturally produce stochastic policies โ€” essential in adversarial games where determinism is exploitable (rock-paper-scissors, poker).

PropertyValue-Based (DQN)Policy-Based (Policy Gradient)
What is learnedQ-function, then extract ฯ€Policy ฯ€ directly
Action spaceDiscrete onlyContinuous and discrete
Policy typeDeterministic (argmax)Stochastic (probability dist.)
OptimisationIndirect โ€” Q โ†’ ฯ€Direct gradient ascent on J(ฮธ)
VarianceLower (bootstrapping)Higher (Monte Carlo returns)
ApplicabilityAtari, discrete gamesRobotics, continuous control, NLP (RLHF)

Williams (1992) proved that the gradient of expected return with respect to policy parameters has a remarkably clean form โ€” computable purely from sampled trajectories, without knowledge of the environment model.

Policy Gradient Theorem: โˆ‡ฮธ J(ฮธ) = Eฯ€[โˆ‡ฮธ log ฯ€(at|st;ฮธ) ยท Gt] Gradient ascent update (maximise J): ฮธ โ† ฮธ + ฮฑ ยท โˆ‡ฮธ log ฯ€(at|st;ฮธ) ยท Gt Gt > 0 โ†’ push up probability of at in st ยท Gt < 0 โ†’ push down probability ยท โˆ‡log ฯ€ computable by backprop
๐Ÿ“
Log-Trick

โˆ‡ log ฯ€(a|s;ฮธ) = โˆ‡ฯ€/ฯ€ โ€” lets us compute the gradient from samples without differentiating through the environment.

๐ŸŽฏ
Reinforce High-Return Actions

Actions that led to high Gt get their probability increased. Low Gt actions get decreased. Simple and elegant.

๐Ÿ”„
Model-Free & Differentiable

Only requires sampled (s,a,G) tuples. Works for any differentiable policy ฯ€(a|s;ฮธ) โ€” including neural networks.

REINFORCE is the simplest instantiation of the policy gradient theorem โ€” Monte Carlo style: collect a full episode, compute discounted returns Gt at every step, then update the policy. Unbiased but high-variance.

REINFORCE โ€” collect full episode, compute returns, update policy
โ‘  Episode Collection โ‘ก Return Computation (backwards) โ‘ข Policy Update s0 a0,r=0 s1 a1,r=0 s2 a2,r=0 s3 a3,r=+10 DONE Wait for full episode to finish s0 s1 s2 s3 END G0=7.3 G1=8.1 G2=9.0 G3=10.0 G โ† r + ฮณG (ฮณ=0.9, backwards) Update magnitude โˆ G (higher = bigger update) s0 s1 s2 s3 โ†‘ increase log ฯ€(a|s) by ฮฑยทG High-return steps get larger update
import torch, torch.nn as nn
import gymnasium as gym

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 64),      nn.ReLU(),
            nn.Linear(64, action_dim)   # raw logits
        )
    def forward(self, x): return self.net(x)

env      = gym.make('CartPole-v1')
policy   = PolicyNetwork(4, 2)          # 4 obs dims, 2 actions
optim    = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma    = 0.99

for episode in range(1000):
    state, _   = env.reset()
    log_probs, rewards = [], []

    while True:                                         # collect full episode
        state_t = torch.tensor(state, dtype=torch.float32)
        logits  = policy(state_t)
        dist    = torch.distributions.Categorical(logits=logits)
        action  = dist.sample()                         # stochastic action
        log_probs.append(dist.log_prob(action))         # log ฯ€(a|s;ฮธ)
        state, reward, done, trunc, _ = env.step(action.item())
        rewards.append(reward)
        if done or trunc: break

    # Compute discounted returns G (backwards accumulation)
    G, returns = 0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalise

    # Policy gradient loss โ€” negative for gradient ASCENT
    loss = -sum(lp * g for lp, g in zip(log_probs, returns))
    optim.zero_grad(); loss.backward(); optim.step()

REINFORCE works in theory but is slow in practice because Gt estimates from single trajectories have extremely high variance. The fix: subtract a baseline b(st) from the return. The baseline doesn't change the expected gradient โ€” it just centres it, dramatically reducing variance.

REINFORCE with baseline: โˆ‡ฮธ J(ฮธ) = E[โˆ‡ฮธ log ฯ€(at|st;ฮธ) ยท (Gt โˆ’ b(st))] Best baseline = state value V(st) โ†’ Advantage: At = Gt โˆ’ V(st) โ† "was this action better or worse than expected?" At > 0: action was better than expected โ†’ increase probability ยท At < 0: worse than expected โ†’ decrease
Baseline Subtraction โ€” centres return distribution, dramatically reduces variance
Without Baseline โ€” Raw G High variance, wide spread -5 0 5 10 15 Return value G meanโ‰ˆ5 ฯƒ โ‰ˆ 8.2 โ€” noisy gradient signal โ†’ A = G โˆ’ V(s) With Baseline โ€” Advantage A Low variance, zero-centred -5 -2 0 +2 +5 Advantage value A meanโ‰ˆ0 ฯƒ โ‰ˆ 2.1 โ€” cleaner gradient signal

Using V(s) as a baseline requires learning V(s). The natural solution: train a separate Critic network alongside the policy (Actor). The Critic computes the TD error ฮดt โ€” a single-step advantage estimate โ€” and feeds it back to guide the Actor's gradient updates. This enables online, per-step learning without waiting for an episode to end.

TD error (Critic output โ†’ Actor advantage estimate): ฮดt = rt+1 + ฮณ V(st+1;w) โˆ’ V(st;w) Actor update (policy gradient with TD advantage): ฮธ โ† ฮธ + ฮฑฮธ ยท โˆ‡ฮธ log ฯ€(at|st;ฮธ) ยท ฮดt Critic update (minimise TD error): w โ† w โˆ’ ฮฑw ยท ฮดt ยท โˆ‡w V(st;w)
Actor-Critic โ€” Actor acts, Critic evaluates, TD error guides Actor updates
ENVIRONMENT outputs: st, rt+1, st+1 receives: action at ACTOR ฯ€(a|s;ฮธ) ฮธ โ† ฮธ + ฮฑยทฮดยทโˆ‡log ฯ€ input: st โ†’ output: at CRITIC V(s;w) w โ† min TD error ฮด = r+ฮณV(s')-V(s) s_t (state) a_t (action) s_t, r_t+1, s_t+1 delta_t (advantage) Critic tells Actor: "was that action better (+ฮด) or worse (โˆ’ฮด) than expected?" ยท Updates every step โ€” no need to wait for episode end

Mnih et al. (DeepMind, 2016) scaled Actor-Critic to deep networks with a key insight: parallelism solves the correlation problem. Instead of a replay buffer, run many independent workers simultaneously โ€” each in its own environment copy. Their diverse experiences are naturally uncorrelated.

โšก
A3C (Async)

Workers push gradients to a global network asynchronously. Fast wall-clock time โ€” no waiting for others.

๐Ÿ”„
A2C (Sync)

All workers step together; global update after each sync. Simpler, deterministic, often matches A3C performance.

๐ŸŒ
Diversity Benefit

Workers start from different states, explore different regions โ€” naturally uncorrelated gradients without a replay buffer.

A3C โ€” Asynchronous parallel workers updating shared global network
Global Network shared weights ฮธ updated by all workers Worker 1 local env + ฯ€ Worker 2 local env + ฯ€ Worker 3 local env + ฯ€ Worker 4 local env + ฯ€ Worker 5 local env + ฯ€ pull ฮธ (sync weights) push gradients (async)

A policy trained purely to maximise return tends to collapse to deterministic โ€” it finds one good action per state and stops exploring everything else. This is catastrophic in environments where exploration is critical or where the optimal policy is genuinely stochastic.

The fix: add an entropy bonus H(ฯ€) to the objective. Entropy measures how spread-out a probability distribution is โ€” maximising it encourages the policy to remain uncertain and keep exploring. The coefficient ฮฒ controls the exploration-exploitation trade-off. Entropy regularisation is used in A3C, PPO, and is a cornerstone of SAC (Soft Actor-Critic), where maximising entropy is part of the fundamental objective.

Entropy-regularised policy gradient loss: L(ฮธ) = โˆ’E[log ฯ€ ยท A] โˆ’ ฮฒ ยท H(ฯ€(ยท|s)) H(ฯ€) = โˆ’ฮฃa ฯ€(a|s) log ฯ€(a|s) โ† entropy of policy ฮฒ = entropy coefficient (tune: large ฮฒ โ†’ more exploration ยท small ฮฒ โ†’ more exploitation) ยท Used in: A3C, PPO, SAC
๐ŸŽฒ
Uniform policy: H = max

All actions equally likely โ€” maximum randomness. High entropy = diverse exploration.

โš–๏ธ
ฮฒ controls trade-off

Large ฮฒ: explore broadly. Small ฮฒ: converge to best policy. ฮฒ=0: no entropy (pure exploitation).

๐Ÿค–
SAC (Ch 7.7)

Soft Actor-Critic maximises expected return AND entropy simultaneously โ€” state-of-the-art continuous control.

โˆ‘ Chapter 7.6 โ€” Key Takeaways

  • Policy gradient: directly optimise โˆ‡ฮธ J(ฮธ) = E[โˆ‡log ฯ€ ยท Gt] โ€” works for continuous and stochastic actions
  • REINFORCE: full episode โ†’ compute returns โ†’ update โ€” unbiased but high variance
  • Baseline: subtract V(s) from returns โ†’ advantage A = Gโˆ’V reduces variance without adding bias
  • Actor-Critic: Actor acts, Critic provides TD error ฮด โ€” online per-step updates, lower variance than REINFORCE
  • A3C / A2C: parallel workers with shared global network โ€” diverse experience, fast training, no replay buffer
  • Entropy bonus: prevent policy collapse โ€” encourage exploration throughout training (key in SAC)
7.7
Chapter 7.7
Advanced RL โ€” PPO, SAC & Model-Based RL

Most research RL happens in textbooks. Most production RL runs PPO or SAC. These two algorithms dominate because they solved the core instability problems of earlier methods. Understanding why they work โ€” not just how โ€” is what separates an RL practitioner from someone who just copies hyperparameters from a blog post.

Schulman et al. (OpenAI/Berkeley, 2015) identified the root instability in policy gradient methods: step size. Too large a step and the new policy is so different that the old value estimates are invalid โ€” training collapses catastrophically and cannot recover. Too small and learning is painfully slow. This is fundamentally different from supervised learning where labels are fixed โ€” in RL, a bad update corrupts the very data distribution used for future learning.

TRPO's solution: constrain each update so the new policy stays within a trust region โ€” KL(ฯ€_old โ€– ฯ€_new) โ‰ค ฮด. The surrogate objective is a valid lower bound on true performance inside this region, guaranteeing monotonic improvement. The catch: enforcing this hard constraint requires second-order optimisation (Fisher information matrix, conjugate gradients) at O(|ฮธ|ยฒ) cost โ€” infeasible for large networks.

TRPO Objective: maximise L(ฮธ) = E[ฯ€(a|s;ฮธ)/ฯ€old(a|s) ยท A(s,a)] subject to E[KL(ฯ€old(ยท|s) โ€– ฯ€(ยท|s;ฮธ))] โ‰ค ฮด L(ฮธ) = importance-weighted surrogate objective ยท ฮด = trust region radius โ‰ˆ 0.01 ยท Guaranteed monotonic improvement within the region
PropertyTRPOPPO (next section)
Constraint typeHard KL constraintSoft clip (approximate)
Improvement guaranteeMonotonic โ€” provableApproximate but reliable
Optimisation orderSecond-order (Fisher matrix)First-order (Adam/SGD)
Compute per updateO(|ฮธ|ยฒ) โ€” infeasible at scaleO(|ฮธ|) โ€” standard backprop
ImplementationComplex (CG solver, line search)Simple (~50 lines of PyTorch)
Used in practiceRarely โ€” historical referenceDefault algorithm for most tasks

Schulman et al. (OpenAI, 2017) asked: can we get TRPO's stability with standard first-order optimisation? The answer is PPO โ€” the most widely used deep RL algorithm in 2024. It powers ChatGPT/Claude/Gemini alignment (RLHF), OpenAI Five, AlphaStar, Boston Dynamics locomotion, and continuous control across robotics.

PPO replaces TRPO's hard KL constraint with a clipped surrogate objective. Define the probability ratio rt(ฮธ) = ฯ€new(at|st) / ฯ€old(at|st). Instead of constraining this ratio via Lagrange multipliers, simply clip it to [1โˆ’ฮต, 1+ฮต] and take the minimum of clipped and unclipped โ€” creating a flat region where the gradient is zero, which naturally prevents too-large updates.

PPO Probability Ratio & Clipped Objective: rt(ฮธ) = ฯ€ฮธ(at|st) / ฯ€old(at|st) LCLIP(ฮธ) = Et[min(rt(ฮธ)ยทร‚t, clip(rt(ฮธ), 1โˆ’ฮต, 1+ฮต)ยทร‚t)] Full PPO loss (optimise jointly): L(ฮธ) = LCLIP(ฮธ) โˆ’ c1ยทLVF(ฮธ) + c2ยทH[ฯ€ฮธ(ยท|st)] ฮต=0.2 ยท c1=0.5 (value coef) ยท c2=0.01 (entropy coef) ยท K=10 epochs ยท T=2048 steps ยท B=64 mini-batch
PPO Clipping โ€” positive and negative advantage cases with trust region
ร‚ > 0 โ€” Good action Probability ratio r_t Objective L 0 0.5 0.8 1.0 1.2 1.5 trust region rยทร‚ (unclipped) clipped (PPO) flat โ†’ no gradient A>0: cap at 20% gain ร‚ < 0 โ€” Bad action Probability ratio r_t 0 0.5 0.8 1.0 1.2 1.5 trust region rยทร‚ (unclipped) flat โ†’ no gradient A<0: cap at 20% penalty Clipping creates a flat region โ†’ zero gradient โ†’ no update when policy has moved too far from old policy
PPO Training Cycle โ€” collect โ†’ advantage โ†’ batch update โ†’ repeat
โ‘  Collect Rollout Run ฯ€_old T=2048 steps across N parallel envs store (s,a,r,s',log ฯ€_old) ~100ms โ‘ก Compute Advantages GAE(ฮป=0.95) for each step Normalise advantages Compute V_target ~10ms โ‘ข Mini-batch Updates K=10 epochs over data B=64 random mini-batches L^CLIP + L^VF + H โ†’ Adam ~500ms โ‘ฃ Update Reference ฯ€_old โ† ฯ€_new Clear rollout buffer Decay ฮต (optional) ~1ms repeat until convergence
// PPO Algorithm (pseudocode)
Initialise policy ฯ€_ฮธ, value function V_ฯ†
Set ฮต=0.2, ฮณ=0.99, ฮป=0.95, K=10, T=2048, B=64, c1=0.5, c2=0.01

LOOP:
  // Phase 1: Collect rollout with current policy
  FOR t = 0 to T-1 (across N envs):
    a_t  ~ ฯ€_ฮธ(ยท|s_t)
    s_t1, r_t1 = env.step(a_t)
    Store (s_t, a_t, r_t1, s_t1, log ฯ€_ฮธ(a_t|s_t), V_ฯ†(s_t))

  // Phase 2: Compute GAE advantages
  FOR each step t (backwards):
    delta = r_t1 + ฮณยทV_ฯ†(s_t1) - V_ฯ†(s_t)
    A_hat[t] = delta + (ฮณยทฮป)ยทA_hat[t+1]      // GAE
  A_hat = (A_hat - mean(A_hat)) / std(A_hat)  // normalise

  // Phase 3: K epochs of mini-batch updates
  ฮธ_old โ† ฮธ                                    // freeze reference policy
  FOR epoch = 1 to K:
    FOR each mini-batch of B samples:
      r_t  = ฯ€_ฮธ(a_t|s_t) / ฯ€_old(a_t|s_t)   // probability ratio
      L_CLIP = mean(min(r_t*A_hat, clip(r_t,1-ฮต,1+ฮต)*A_hat))
      L_VF   = mean((V_ฯ†(s_t) - V_target)ยฒ)
      L_H    = mean(H(ฯ€_ฮธ(ยท|s_t)))
      loss   = -(L_CLIP - c1*L_VF + c2*L_H)
      backprop(loss); clip_grad_norm_(0.5); optimizer.step()
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym

vec_env = make_vec_env("LunarLander-v2", n_envs=8)

model = PPO(
    policy="MlpPolicy", env=vec_env,
    clip_range=0.2,          # ฮต โ€” trust region boundary
    n_steps=2048,            # T steps per env per rollout
    batch_size=64,           # mini-batch size B
    n_epochs=10,             # K update epochs per rollout
    gamma=0.99,              # discount factor
    gae_lambda=0.95,         # GAE ฮป
    ent_coef=0.01,           # entropy coefficient c2
    vf_coef=0.5,             # value loss coefficient c1
    max_grad_norm=0.5,       # gradient clipping
    learning_rate=3e-4,
    verbose=1, tensorboard_log="./ppo_tb/"
)

eval_cb = EvalCallback(gym.make("LunarLander-v2"),
    best_model_save_path="./ppo_best/", eval_freq=10_000, n_eval_episodes=10)
model.learn(total_timesteps=1_000_000, callback=eval_cb)

Schulman et al. (2016) formalised a key trade-off: the advantage estimate At used in policy gradient updates can be computed at different "depths" โ€” ranging from a pure TD(0) estimate (low variance, high bias) to a full Monte Carlo return (zero bias, high variance). GAE smoothly interpolates between these extremes with parameter ฮป, and ฮป=0.95 has proven the sweet spot across nearly every task.

GAE Formula: ฮดt = rt+1 + ฮณV(st+1) โˆ’ V(st) โ† TD error ร‚tGAE(ฮณ,ฮป) = ฮฃl=0โˆž (ฮณฮป)l ฮดt+l = ฮดt + (ฮณฮป)ฮดt+1 + (ฮณฮป)2ฮดt+2 + โ€ฆ ฮป=0 โ†’ ร‚t = ฮดt (pure TD) ยท ฮป=1 โ†’ ร‚t = ฮฃฮณlrt+l+1 โˆ’ V(st) (MC) ยท ฮป=0.95 = default PPO
GAE ฮป โ€” Bias-Variance Tradeoff in Advantage Estimation
GAE ฮป 0 0.2 0.5 0.9 0.95 1.0 common practice Bias Variance Total error TD(0) low var, high bias ฮป=0.95 PPO default MC zero bias, high var

Haarnoja et al. (Berkeley + Google, 2018) introduced SAC โ€” the dominant algorithm for continuous control. While PPO is on-policy (needs fresh data every update), SAC is off-policy โ€” it reuses all past experience from a replay buffer. Its fundamental innovation is maximum entropy RL: the agent is rewarded not just for getting high return, but for maintaining a stochastic (uncertain) policy.

๐ŸŒก๏ธ
Temperature ฮฑ

Controls entropy-reward balance. SAC auto-tunes ฮฑ to maintain target entropy Hฬ„ = โˆ’|A| โ€” eliminating the only sensitive hyperparameter.

๐Ÿ‘ฏ
Twin Critics

Two Q-networks Q1, Q2 trained independently. Take min(Q1, Q2) for targets โ€” prevents Q-value overestimation.

๐Ÿ”„
Off-Policy Replay

1M-transition replay buffer. Sample random mini-batches. Reuses all past experience โ€” 10ร— more sample efficient than PPO.

SAC Maximum Entropy Objective: J(ฯ€) = ฮฃt Eฯ€[r(st,at) + ฮฑยทH(ฯ€(ยท|st))] Critic Loss (for Q1, Q2 independently): y = r + ฮณยท(min(Qฬ„1(s',รฃ'), Qฬ„2(s',รฃ')) โˆ’ ฮฑยทlog ฯ€(รฃ'|s')) รฃ' ~ ฯ€ Actor Loss: Lฯ€ = Es[ฮฑยทlog ฯ€(a|s) โˆ’ min(Q1(s,a), Q2(s,a))] Temperature Loss (auto-tune): Lฮฑ = E[โˆ’ฮฑยท(log ฯ€(a|s) + Hฬ„)] Hฬ„ = โˆ’|A| (target entropy)
SAC Architecture โ€” Actor + Twin Critics + Auto-tuned Temperature
Replay Buffer D 1M transitions (s, a, r, s', done) sample B=256 (s,a,r,s') Actor ฯ€_ฮธ(a|s) s โ†’ (ฮผ, ฯƒ) โ†’ รฃ = ฮผ + ฯƒยทฮต (reparam.) L^ฯ€ = ฮฑยทlog ฯ€(รฃ|s) โˆ’ min(Q1,Q2) Critic Qโ‚(s,a;ฯ†โ‚) L^Qโ‚ = (Qโ‚(s,a) โˆ’ y)ยฒ Critic Qโ‚‚(s,a;ฯ†โ‚‚) L^Qโ‚‚ = (Qโ‚‚(s,a) โˆ’ y)ยฒ Temperature ฮฑ (auto) L^ฮฑ = โˆ’ฮฑยท(log ฯ€ + Hฬ„) Target Qฬ„โ‚(ฯ†ฬ„โ‚) soft update ฯ„=0.005 Target Qฬ„โ‚‚(ฯ†ฬ„โ‚‚) used to compute y min(Qฬ„โ‚,Qฬ„โ‚‚) โ†’ target y โ†’ reduces overestimation
from stable_baselines3 import SAC
import gymnasium as gym

env = gym.make("HalfCheetah-v4")

model = SAC(
    policy="MlpPolicy", env=env,
    learning_rate=3e-4,
    buffer_size=1_000_000,     # replay buffer capacity
    batch_size=256,            # mini-batch size
    tau=0.005,                 # soft target network update ฯ„
    gamma=0.99,
    ent_coef="auto",           # auto-tune ฮฑ
    target_entropy="auto",     # Hฬ„ = -|A|
    learning_starts=10_000,    # fill buffer before training
    train_freq=1,              # update every step
    gradient_steps=1,
    verbose=1
)
model.learn(total_timesteps=1_000_000)
model.save("sac_halfcheetah")

Fujimoto et al. (2018) dissected three specific failure modes of DDPG (Deep Deterministic Policy Gradient) and engineered a targeted fix for each. TD3 is a lean, deterministic off-policy algorithm that is an important baseline for continuous control benchmarks.

TD3 Target (all three fixes combined): รฃ' = clip(ฯ€ฮธ'(s') + clip(ฮต, โˆ’c, c), amin, amax) โ† smoothed target action y = r + ฮณ ยท min(Qฬ„1(s',รฃ'), Qฬ„2(s',รฃ')) โ† twin critics Critic update: every step ยท Actor update: every 2 critic steps (delayed) ยท ฮต ~ N(0,ฯƒ), ฯƒ=0.2
๐Ÿ‘ฏ
Fix 1: Twin Critics

Q1 and Q2 trained independently. Target uses min(Qฬ„1, Qฬ„2) โ€” eliminates overestimation bias.

โณ
Fix 2: Delayed Actor

Update actor every 2 critic steps. Prevents actor from over-fitting to a noisy, underfit critic.

๐Ÿ”‡
Fix 3: Target Smoothing

Add clipped Gaussian noise to target actions. Smooths Q-landscape โ€” reduces high-variance targets.

PropertySACTD3
Policy typeStochastic GaussianDeterministic
ObjectiveMax-entropy (reward + H)Standard reward only
TemperatureAuto-tuned ฮฑNo temperature โ€” tune noise ฯƒ
Sample efficiencyHigher (entropy bonus aids exploration)Good but less robust
InferenceSample from Gaussian (stochastic)Forward pass only (deterministic)
Preferred forMost continuous tasks (default)Specific benchmarks, ablations

Model-free RL needs millions of real environment interactions. Atari DQN requires 50M frames โ€” equivalent to 38 days of continuous play. For real robots or industrial systems, that is prohibitively expensive and dangerous. Model-based RL addresses this by learning a world model Mฯˆ(s,a) โ†’ (s', r) โ€” a differentiable simulator โ€” and generating cheap synthetic experience to train the policy.

๐Ÿ—บ๏ธ
Dyna-style

Interleave real steps with K=100 model-generated steps. Same policy, much more data per real interaction.

๐Ÿ”ฎ
Imagination

Train policy entirely inside the model (DreamerV3). Real env only used to improve the world model itself.

๐ŸŒณ
Planning

Use model for lookahead search at test time (MCTS in MuZero). Improves decisions without extra training.

Model-Based vs Model-Free โ€” same final quality, 10-20ร— fewer real interactions
Environment interactions (real steps, log scale) Task performance (%) 1K 10K 50K 100K 500K 1M 0% 50% 80% 100% 10-20ร— advantage PPO/SAC (model-free) DreamerV3/MBPO (model-based) ideal Model-based: each real step โ†’ K imagined steps ยท Model-free: each step = one gradient update

Sutton (1991) introduced the Dyna architecture โ€” deceptively simple yet powerful. Every real environment step generates one real transition and K=100 model-generated transitions. The policy receives 101ร— more gradient updates per real step at negligible extra compute cost (the model is a neural network, not the real environment).

Dyna-Q โ€” 1 real step + K simulated steps per environment interaction
Real Environment slow, accurate 1 step per interaction โ†“ action from policy 1 real (s,a,r,s') direct policy update World Model M_ฯˆ(s,a) โ†’ (s', r) trained on real data fast, approximate K=100 simulated (s,a,r,s') per step Policy / Q-function receives: real + simulated total updates: 1+K=101ร— per real step action selection

Ha & Schmidhuber (2018) showed that compressing observations into a compact latent space and predicting the future in that latent space enables policies trained purely in imagination. DreamerV3 (Hafner et al., 2023) is the apex of this line: a single model with one set of hyperparameters that masters Atari, DMControl, Crafter, BSuite, and Minecraft diamond collection (4/7 tasks โ€” the first RL agent to achieve this).

The core is the RSSM (Recurrent State Space Model) โ€” a latent dynamics model with deterministic memory (GRU recurrent path) and stochastic uncertainty (discrete latent variables). The actor-critic trains entirely inside the RSSM by rolling out K=15 imagined steps from the current latent state โ€” never touching the real environment during policy updates.

DreamerV3 โ€” World model from real data; Policy trained entirely in imagination
World model training โ†” Actor-critic training alternate every step Phase 1 โ€” World Model Training (real experience) Real Env x_t, r_t RSSM Encoder x_t โ†’ (h_t, z_t) h_t+1 = f(h_t, z_t, a_t) z_t ~ q(z_t|h_t, x_t) Decoder Heads (h,z) โ†’ xฬ‚_t (obs) (h,z) โ†’ rฬ‚_t (reward) (h,z) โ†’ ฮณฬ‚_t (continue) Train to minimise: โ€ข Reconstruction loss ||x-xฬ‚||ยฒ โ€ข Reward pred loss ||r-rฬ‚||ยฒ โ€ข Continue pred loss BCE(ฮณ,ฮณฬ‚) โ€ข KL divergence (latents) Phase 2 โ€” Actor-Critic Training (imagination ONLY) Start latent (h_0, z_0) from real data K=15 imagined steps (RSSM unrolled in latent space) h1,z1 a1~ฯ€ h2,z2 a2~ฯ€ h3,z3 โ€ฆ h15,z15 Actor + Critic Update Critic: V(h,z) estimates imagined returns Actor: maximise returns via BPTT through RSSM No real environment interaction during update! symlog reward normalisation ยท KL balancing DreamerV3 single hyperparameter set: Atari ยท DMControl ยท Minecraft ยท BSuite ยท Crafter

Schrittwieser et al. (DeepMind, 2020) built on AlphaZero (which required the full rules of Go/Chess to run MCTS) and asked: what if we learn the rules from scratch? MuZero learns three functions: a Representation network (observation โ†’ latent state h), a Dynamics network ((h, a) โ†’ next latent h' + predicted reward r), and a Prediction network (h โ†’ policy ฯ€ + value v). MCTS runs entirely in the learned latent space โ€” never invoking the real environment during planning.

The result: state of the art simultaneously on Atari (57 games), Go, Chess, and Shogi โ€” all with one algorithm, no domain knowledge, no hand-coded rules. The dynamics model need not predict realistic pixels or observations โ€” it only needs to produce accurate value and reward estimates for planning purposes.

MuZero Reanalyse (2021) further improves sample efficiency by replaying stored positions with the latest network, re-running MCTS to generate improved training targets. The successor EfficientZero (2021) achieves human-level Atari performance in just 2 hours of real game time โ€” a 20ร— improvement over MuZero's original sample efficiency.

MuZero Three Networks: h0 = r(ot) โ† Representation: observation โ†’ latent hk+1, rk = g(hk, ak) โ† Dynamics: latent transition + reward ฯ€k, vk = f(hk) โ† Prediction: policy + value for MCTS MCTS uses g and f inside learned latent space โ€” zero real environment calls during planning
RL Algorithm Landscape โ€” sample efficiency vs final performance
Sample Efficiency โ†’ (right = fewer real steps needed) Asymptotic Performance โ†’ low high low high On-policy Off-policy model-free Model-based Pareto frontier A2C PPO โœ“ DQN Rainbow DDPG TD3 SAC โœ“ MBPO DreamerV3 MuZero on-policy off-policy model-free model-based
AlgorithmAction SpaceOn/Off-PolicyKey StrengthUse WhenLibrary
PPODisc + Cont On-policy Stable, simple, universal Default choice, RLHF, gamesSB3, RLlib, TRL
SACContinuous Off-policy Sample-efficient, robust, auto-ฮฑ Robotics, locomotion, controlSB3, RLlib
TD3Continuous Off-policy Deterministic, stable baseline Benchmark continuous controlSB3
DQN/RainbowDiscrete Off-policy Proven on Atari Atari-style discrete gamesSB3, RLlib
A2CDisc + Cont On-policy Fast, simple Prototyping, educationSB3
DreamerV3Disc + Cont Off-policy (model) Best sample efficiency Pixel obs, scarce real dataOfficial (JAX)
MuZeroDiscrete Planning Strongest board/video games Chess, Go, AtariDeepMind (JAX)
RLHF-PPOToken (disc) On-policy LLM alignment Align language modelsTRL, OpenRLHF

The practical hierarchy for 2024: Start with PPO โ€” it works everywhere and is easy to debug. Switch to SAC for continuous control โ€” it's more sample-efficient. Consider DreamerV3 when real-world interaction is scarce or expensive. Use model-free methods unless you have a specific reason to use model-based.

โˆ‘ Chapter 7.7 โ€” Key Takeaways

  • TRPO: KL-constrained policy update โ€” guaranteed improvement but requires expensive second-order optimisation
  • PPO: clip ratio rt to [1โˆ’ฮต, 1+ฮต] โ€” same stability as TRPO with first-order Adam; industry default
  • GAE(ฮป=0.95): exponentially-weighted TD errors โ€” low bias AND low variance advantage estimate
  • SAC: maximum entropy objective โ€” off-policy, stochastic, auto-tuned ฮฑ โ€” best continuous control algorithm
  • TD3: twin critics + delayed actor + target smoothing โ€” three targeted fixes for DDPG's failure modes
  • DreamerV3: learn world model from real data; train actor-critic entirely in imagination โ€” 10-20ร— sample efficiency
  • MuZero: MCTS in learned latent space โ€” no game rules needed; state of the art on board & video games
7.8
Chapter 7.8
RL in the Real World โ€” AlphaGo, RLHF & Robotics

Reinforcement learning is not just theory and Atari games. It beat the world champion at Go, trained every major LLM to follow instructions, controls robots at Google and Boston Dynamics, and optimises data centre cooling. This chapter is where the equations become consequences.

Chess has ~1047 positions โ€” hard but solvable for minimax search engines like Stockfish. Go has ~2ร—10170 โ€” minimax is computationally infeasible, and classic evaluation functions can't reliably assess Go positions. Go requires intuition more than calculation. That intuition was beyond computers โ€” until March 2016.

๐Ÿง 
SL Policy Network ฯ€SL

Trained on 30M human game positions (57% accuracy). Biases MCTS toward human-like moves โ€” prior probability over actions.

โš”๏ธ
RL Policy Network ฯ€RL

Initialised from ฯ€SL, refined by self-play policy gradient. Wins 80% vs ฯ€SL โ€” discovers moves humans never conceived.

๐Ÿ“Š
Value Network V(s)

Predicts game outcome from any board position. Replaces expensive rollout simulations in MCTS leaf evaluation.

AlphaGo MCTS โ€” four quantities per node guide the search
Root N=1245 Current position P=0.35 P=0.28 P=0.18 P=0.12 P=0.07 N=523 Q=0.62 N=310 Q=0.48 N=205 Q=0.51 N=124 Q=0.44 N=83 Q=0.38 V=0.71 MCTS Quantities per Node: โ‘  P(a|s): prior from SL Policy Network โ‘ก N(s,a): visit count (exploration) โ‘ข Q(s,a): mean value from backups โ‘ฃ UCB = Q + cยทPยทโˆšN_parent/(1+N) Leaf Evaluation (no full rollout): Value Network: V(s) = 0.71 (position assessment) Mixed: 50% V(s) + 50% fast rollout with ฯ€_SL Final Move Selection: Choose move with highest N = Move A (N=523) โ† backup March 2016: AlphaGo beats Lee Sedol 4-1 ยท Move 37 (Game 2) declared "impossible" by top professionals September 2017: AlphaZero trained 36 hours, beats AlphaGo 100-0, Stockfish 28-0-72
AlphaZero Self-Play Loop โ€” tabula rasa mastery in hours
โ‘  Self-Play 25,000 games vs self Record: position, ฯ€_MCTS game outcome z No human data โ‘ก Data Buffer 500K positions stored Random mini-batches for training Replay buffer โ‘ข Update Network Policy head: CE(ฯ€_MCTS, ฯ€) Value head: MSE(V, z) Combined loss + SGD Single network โ‘ฃ Evaluate 400 games vs champion Win rate > 55%? โ†’ new champion Gating criterion New champion enters self-play โ€” ~7h: rediscovers all chess history ยท ~36h: superhuman across Chess, Go, Shogi

AlphaZero required explicit game rules for MCTS expansion โ€” it couldn't play Atari because the rules are embedded in emulator code. MuZero (DeepMind, 2020) solved this: it learns its own transition model from experience and runs MCTS entirely in the learned latent space, never calling the real environment during planning.

MuZero โ€” replaces hardcoded rules with learned dynamics model
AlphaZero Observation Encoder h REAL RULES Hardcoded Rules must be programmed Cannot handle Atari MuZero Observation Repr. h(obs) LEARNED g(h,a) h', r = dynamics + f(h) โ†’ ฯ€, V Learns rules from experience Works on Atari + Go + Chess AlphaZero Results: Chess: beats Stockfish 28-0-72 (28W, 72D, 0L) Go: beats AlphaGo 100-0 after 36h training MuZero Results (2020): All 57 Atari games: state of the art Go / Chess / Shogi: matches AlphaZero

Ouyang et al. (OpenAI, 2022) scaled RLHF to GPT-3 (175B) and produced InstructGPT โ€” the model that became ChatGPT. Three stages, each with distinct datasets, objectives, and failure modes. Understanding each stage explains why alignment is hard.

๐Ÿ“
Stage 1 โ€” SFT

13K prompts + expert human responses. Standard cross-entropy fine-tuning. Result: consistent instruction following, but no preference signal yet.

โš–๏ธ
Stage 2 โ€” Reward Model

33K prompts, 4-9 model responses ranked by 40 contractors. Bradley-Terry training on preference pairs โ†’ scalar proxy score.

๐Ÿค–
Stage 3 โ€” PPO

Policy = SFT init. Episode = one response. Reward = RM score โˆ’ ฮฒยทKL(PPOโ€–SFT). KL prevents catastrophic forgetting of SFT capability.

RLHF Token-level Reward (KL-penalised): r(st, at) = RM(x, y) ยท ๐Ÿ™[t=T] โˆ’ ฮฒ ยท KL(ฯ€PPO(ยท|st) โ€– ฯ€SFT(ยท|st)) Bradley-Terry Reward Model Loss: LRM = โˆ’E[log ฯƒ(RM(prompt, chosen) โˆ’ RM(prompt, rejected))] ฮฒ = KL coefficient (0.02-0.05) ยท RM score given only at final token T ยท KL per token prevents drift from SFT
RLHF Full Pipeline โ€” data requirements, training cost, and output per stage
Stage 1 โ€” SFT Dataset: 13K prompts + human-written responses Train: GPT-3 175B on assistant tokens only Loss: cross-entropy ~$50K GPU ยท ~1 week โ†’ SFT Model ฯ€_SFT Stage 2 โ€” Reward Model Dataset: 33K prompts, 4-9 responses ranked by humans Architecture: SFT + scalar head Loss: Bradley-Terry pref pairs Response A: concise โœ“ Human preferred Response B: verbose โœ— Ranked lower ~$100K annotation ยท 40 labellers โ†’ RM(prompt, response) โ†’ scalar Stage 3 โ€” PPO Policy: ฯ€_SFT init โ†’ ฯ€_PPO State: token context (growing) Action: next token (50K vocab) Reward: RM score at end KL penalty: -ฮฒยทKL(PPOโ€–SFT) 4 epochs, batch=512 ~2 weeks GPU training โ†’ Aligned LLM (ChatGPT/Claude) Total pipeline: ~$1M GPU + annotation cost ยท Christiano 2017 โ†’ Stiennon 2020 โ†’ InstructGPT 2022 โ†’ ChatGPT Dec 2022 Critical caveat: RM is an imperfect proxy โ†’ reward hacking. Goodhart's Law applies.

Reward Hacking is the central failure mode of RLHF. PPO maximises the reward model score โ€” but the RM is an imperfect proxy for human preferences. The model discovers RM weaknesses: verbosity (RM rewards long responses โ†’ pad with filler), sycophancy (RM rewards agreement โ†’ tell users what they want to hear), formatting exploitation (RM likes bullet points โ†’ overuse bullets everywhere). Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Reward Hacking โ€” the model discovers RM weaknesses that annoy humans
Prompt: "Explain why the sky is blue" Response A โ€” Ideal "The sky appears blue due to Rayleigh scattering. Sunlight contains all visible wavelengths. Shorter blue wavelengths scatter more โ€” making the sky appear blue to observers." RM score: 7.8/10 Human: 9.2/10 โœ“ Concise, accurate, complete โ‰  RM โ‰  Human Response B โ€” Reward-Hacked "What a fascinating question! I'm delighted to help! The sky's brilliant blue is truly spectacular... [4 more lines of padding and affirmation] ...I hope this was helpful! ๐Ÿ˜Š" RM score: 8.4/10 โœ“ Human: 5.1/10 โœ— RM tricked โ€” human annoyed PPO maximises RM score โ€” but RM is an imperfect proxy. The model finds the gap.

RLHF is powerful but complex: three stages, a separate reward model, PPO instability, and massive compute cost. Two newer approaches simplify or improve it significantly.

๐ŸŽฏ
DPO (Rafailov, 2023)

Proves the optimal RLHF policy has a closed form. No RM, no RL โ€” directly optimise on preference pairs. Stable, simpler, same performance.

๐Ÿ†
GRPO (DeepSeek, 2024)

Sample G responses, rank by verifiable reward, group-relative advantage. No value function. Trained DeepSeek-R1 and OpenAI o1/o3-class models.

๐Ÿ”ฌ
Verifiable Rewards

GRPO works best when reward is binary and checkable โ€” math (correct answer?), code (tests pass?). Eliminates reward model bias entirely.

DPO Loss (no reward model needed): LDPO = โˆ’E[log ฯƒ(ฮฒยทlog(ฯ€ฮธ(yw|x)/ฯ€ref(yw|x)) โˆ’ ฮฒยทlog(ฯ€ฮธ(yl|x)/ฯ€ref(yl|x)))] yw = preferred ยท yl = rejected ยท ฯ€ref = frozen SFT ยท ฮฒ = divergence temperature GRPO Group-Relative Advantage: For prompt x, sample G responses โ€” Ai = (r(yi) โˆ’ mean(r)) / std(r) No value function needed โ€” group provides the baseline ยท Works best with verifiable reward (math/code)
RLHF vs DPO vs GRPO โ€” three LLM alignment approaches
RLHF DPO GRPO SFT Reward Model Bradley-Terry pref pairs PPO RL Training KL-penalised reward Aligned LLM โ—โ—โ— complex RL instability, RM needed SFT Direct Preference Optimisation No RM, no RL โ€” log-ratio loss on pairs Aligned LLM โ—โ— simpler No RL, stable, same perf Base LLM (no SFT needed) Sample G responses โ†’ rank โ†’ group advantage โ†’ PPO update Reasoning LLM DeepSeek-R1, o1 โ—โ— simpler Verifiable reward, no RM All three produce aligned LLMs โ€” GRPO excels at reasoning with checkable rewards (math, code)

Robotics exposes every limitation of RL simultaneously: expensive data, dangerous exploration, partial observability, sim-to-real transfer. Yet recent systems have overcome these barriers at scale โ€” from dexterous hands to general-purpose manipulation.

๐ŸŽฒ
OpenAI Rubik's Cube (2019)

Shadow Hand (24 DOF) solves Rubik's cube. Trained entirely in simulation using Automatic Domain Randomisation โ€” progressively harder sim variations.

๐Ÿค–
Google RT-2 (2023)

Fine-tuned VLM (PaLI-X) to predict robot actions from image + text. "Pick up the apple" โ†’ end-effector positions. Generalises to novel objects.

๐Ÿญ
DeepMind Data Centre (2016/22)

DQN on Google data centre sensors. Reduces cooling energy 30-40% (~$10M/year). Safety layer constrains actions within safe range.

Sim-to-Real via Domain Randomisation โ€” robust policy bridges the reality gap
Simulation Training Millions of steps in minutes Friction: 0.3โ€“0.9 ADR randomised Mass: ยฑ20% ADR randomised Gravity: ยฑ10% ADR randomised Joint noise ADR added Lighting varied Action delays randomised Policy trained on ALL variants โ†’ robust across distribution Sim-to-Real Physics mismatch Sensor noise Actuator delays Contact physics Real Robot Deployment No retraining needed Base EEF Real world = one more randomised variant Wide sim distribution covers real physics Domain Randomisation: make sim distribution wide enough to contain the real-world distribution as just one sample Automatic Domain Randomisation (ADR): curriculum that widens the distribution as policy improves
๐ŸŽฏ
Reward Design

A cleaning robot rewarded for "no mess visible" learns to hide mess under the rug. A boat racer learns to spin collecting power-ups instead of racing. Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure. Solutions: reward shaping, inverse RL, RLHF.

๐Ÿ“Š
Sample Inefficiency

DQN needs 50M Atari frames = 38 days. Humans learn Pong in minutes. The human-to-RL sample efficiency gap is >100ร—. Active research: meta-learning, model-based RL, self-supervised auxiliary tasks.

๐ŸŒŠ
Partial Observability (POMDPs)

Real environments are almost always partially observable โ€” a robot can't see behind itself, an LLM can't see user intent. Agent must infer hidden state from observation history. Solutions: LSTM/GRU policies, Transformers with context, belief state tracking.

โšก
Non-Stationarity

The environment changes over time โ€” a 2019 trading algorithm fails in 2020. RL recommender system faces drifting user preferences. Solutions: continual learning, online adaptation, periodic retraining, meta-learning for fast adaptation.

๐Ÿ”—
Sparse & Delayed Rewards

Chess: one reward after 40+ moves. Robotic grasping: reward only on success. Credit assignment over long horizons is fundamentally hard. Solutions: reward shaping, hindsight experience replay (HER), curiosity-driven exploration.

๐Ÿ”
Exploration in High Dimensions

Random exploration is exponentially inefficient with 12+ DOF robots. Almost never discovers useful behaviours by chance. Solutions: intrinsic motivation (count-based, prediction error, empowerment), curriculum learning, population-based training.

Standard RL maximises expected cumulative reward โ€” no constraint on how. In simulation, failure is cheap (just reset). In the real world, failure can be catastrophic: a self-driving car exploring randomly could kill someone; a medical AI exploring randomly could harm a patient. Safe RL adds explicit constraints to the optimisation.

Constrained RL (CMDP) Objective: maximise J(ฯ€) = E[ฮฃ r(s,a)] subject to Ck(ฯ€) โ‰ค dk for all k Lagrangian relaxation: L(ฯ€, ฮป) = J(ฯ€) โˆ’ ฮฃk ฮปk ยท (Ck(ฯ€) โˆ’ dk) Primal update: gradient ascent on ฯ€ ยท Dual update: ฮปk increases when constraint violated ยท CPO: TRPO-style with hard constraint
โ›“๏ธ
Constrained RL

CMDP: maximise reward subject to cost constraints. Lagrangian relaxation adapts penalty automatically when constraints are violated.

๐Ÿ“ฆ
Offline / Conservative RL

BCQ, CQL, IQL: train on logged data with no env interaction. CQL penalises Q-values for out-of-distribution actions โ€” prevents exploiting gaps.

๐Ÿ›ก๏ธ
Safety Layer

Black-box filter: map any proposed action to nearest safe action. Used in autonomous driving (safety pilot) and robotic control (joint limits, force limits).

PropertyStandard RLSafe RL
ObjectiveMaximise rewardMaximise reward subject to constraints
ExplorationAllowed anywhereBounded to certified-safe region
Failure treatmentLearning experience โ€” reset and continuePotentially unacceptable โ€” must avoid
Deployment contextSimulation, gamesPhysical systems, medical, autonomous vehicles
Reward signalr(s,a)r(s,a) โˆ’ ฮปยทc(s,a) (Lagrangian penalty)
1957 Bellman Equations โ€” "A Markovian Decision Process", dynamic programming foundation
1988 TD Learning โ€” Sutton, "Learning to Predict by the Methods of Temporal Differences"
1989 Q-Learning โ€” Watkins, tabular model-free optimal control; convergence proof 1992
1992 REINFORCE โ€” Williams, policy gradient theorem, first neural policy gradient
1994 TD-Gammon โ˜… Tesauro โ€” RL plays backgammon at world champion level
2013 DQN preprint โ€” DeepMind: "Playing Atari with Deep Reinforcement Learning" (arXiv)
2015 DQN Nature โ˜… Human-level performance on 49 Atari games from raw pixels
2015 TRPO โ€” Schulman et al., trust region policy optimisation, monotonic improvement guarantee
2016 AlphaGo โ˜… Beats Lee Sedol 4-1 โ€” Move 37 declared "impossible" by professionals
2017 PPO โ€” Schulman et al., proximal policy optimisation, becomes the default algorithm
2017 AlphaZero โ˜… Tabula rasa Chess/Go/Shogi โ€” human knowledge obsolete after 36 hours
2018 SAC โ€” Haarnoja et al., Soft Actor-Critic, best continuous control algorithm
2019 OpenAI Five & Rubik's Cube โ˜… Dota 2 world champion team defeated ยท Dexterous robotic hand via domain randomisation
2020 MuZero โ˜… Learns game rules from data โ€” Atari + board games SOTA simultaneously
2022 AlphaTensor + InstructGPT/ChatGPT โ˜… New matrix multiplication algorithms ยท RLHF aligns LLMs to human preferences at scale
2023 DreamerV3 + DPO โ€” Single algorithm across 7 domains ยท Direct Preference Optimisation without RL
2024 GRPO / DeepSeek-R1 + o1/o3 โ˜… RL for verifiable reasoning โ€” new SOTA on AIME, GPQA, coding benchmarks

๐ŸŽ“ Domain 7 Complete โ€” Reinforcement Learning

  • Ch 7.1 โ€” RL = trial-and-error via rewards. MDP = (S,A,P,R,ฮณ). Bellman equations are the recursive foundation of every RL algorithm.
  • Ch 7.2 โ€” DP solves MDPs exactly with a full model. Policy iteration: evaluate + improve. Value iteration: single Bellman sweep. Both converge.
  • Ch 7.3 โ€” MC: average actual returns, no model. TD: online bootstrap after every step. TD(ฮป) bridges both via eligibility traces.
  • Ch 7.4 โ€” Q-learning: off-policy max bootstrap โ†’ Q*. SARSA: on-policy actual-action โ†’ safer conservative policy. Cliff Walking shows the difference.
  • Ch 7.5 โ€” DQN: Q-learning + CNN + experience replay + target network. Double DQN, Dueling, Rainbow: progressively improve stability and accuracy.
  • Ch 7.6 โ€” REINFORCE: โˆ‡log ฯ€ ยท G โ€” unbiased, high variance. Baseline โ†’ advantage. Actor-Critic: online TD ฮด from Critic guides Actor every step.
  • Ch 7.7 โ€” PPO: clip ratio to [1โˆ’ฮต, 1+ฮต] โ€” stable on-policy default. SAC: max-entropy off-policy โ€” best continuous control. DreamerV3: train in imagination.
  • Ch 7.8 โ€” AlphaZero: self-play + MCTS from scratch. RLHF: SFTโ†’RMโ†’PPO = aligned LLMs. DPO: direct preference without RL. Real-world challenges: reward design, safety, sim-to-real.

Reinforcement learning is the only ML paradigm where learning happens through consequences. Every major breakthrough covered here โ€” AlphaGo's Move 37, ChatGPT's instruction-following, robots that grasp objects โ€” came from agents discovering strategies that humans never explicitly programmed.

The next frontier: RL for reasoning (o1/o3/DeepSeek-R1) and long-horizon autonomous agents (Domain 8). The agent-environment loop from Chapter 7.1 becomes the agentic loop from Domain 8 โ€” but now the environment includes the real world, and the agent is an LLM.