Reinforcement Learning
MDPs, Bellman equations, Q-learning, policy gradients, RLHF โ learning through reward signals from games to LLM alignment.
Reinforcement learning is the closest thing AI has to how animals actually learn. No labelled examples. No explicit rules. Just an agent making decisions, receiving rewards or penalties, and slowly figuring out what works. It is the framework behind AlphaGo, ChatGPT's alignment, autonomous robots, and game-playing AIs.
Reinforcement learning (RL) is the third major machine learning paradigm โ alongside supervised and unsupervised learning. Instead of learning from labelled examples, an RL agent learns by interacting with an environment, receiving a reward signal, and adjusting its behaviour to maximise cumulative reward over time.
The learning signal is delayed and sparse. A chess program only receives +1 (win) or โ1 (loss) at the very end of the game. It must figure out which of the 40โ80 moves it played actually caused that outcome โ the credit assignment problem, one of the central challenges in RL.
Three key differences from supervised learning: (1) there is no teacher providing correct action labels; (2) actions affect future states, so decisions are not independent; and (3) training data is generated by the agent's own behaviour โ a non-stationary, self-influencing distribution.
Games & Simulations
- Atari games (DQN)
- Go (AlphaGo)
- StarCraft II (AlphaStar)
- Dota 2 (OpenAI Five)
- Chess + RL (Stockfish)
Robotics & Control
- Robot locomotion
- Manipulation & grasping
- Autonomous driving
- Drone control
- Industrial automation
Language & Alignment
- RLHF for ChatGPT/Claude
- Dialogue systems
- Recommendation engines
- Algorithmic trading
- Drug discovery
At every discrete time step t, the agent-environment interaction follows a four-step cycle: (1) agent observes state st; (2) selects action at via policy ฯ; (3) environment transitions to st+1; (4) environment emits reward rt+1. This repeats until a terminal state (episodic) or runs indefinitely (continuing).
The most fundamental tension in RL is exploration vs exploitation. Exploitation means choosing the action believed best right now. Exploration means trying something new to discover potentially better strategies. Every RL algorithm must balance these two competing imperatives.
Every RL problem is formalised as a Markov Decision Process โ defined by the 5-tuple (S, A, P, R, ฮณ): the state space S, action space A, transition probability P(s'|s,a), reward function R(s,a), and discount factor ฮณ โ [0,1].
The cornerstone assumption is the Markov Property: the future depends only on the current state, not on the full history. Formally: P(st+1 | st, at, โฆ, s0, a0) = P(st+1 | st, at). The current state encodes all information needed to act optimally.
When the agent cannot observe the full state, the problem becomes a Partially Observable MDP (POMDP) โ the agent must maintain a belief state, a probability distribution over possible states, greatly increasing difficulty.
The agent's goal is not to maximise the next reward, but the cumulative discounted reward over the entire episode. The return Gt from time step t is the sum of all future rewards, each discounted by ฮณแต where k is how many steps away it lies.
The discount factor ฮณ โ [0, 1] controls how much the agent values future rewards. With ฮณ = 0 the agent is purely myopic. With ฮณ = 1 all future rewards are equally valued (only valid for finite-horizon tasks). Typical values are ฮณ = 0.95โ0.99. The recursive form Gt = rt+1 + ฮณยทGt+1 is the key identity that makes Bellman equations tractable.
A policy ฯ maps states to actions. A deterministic policy ฯ(s) = a always picks the same action; a stochastic policy ฯ(a|s) outputs a probability distribution over actions. The goal: find the optimal policy ฯ* that maximises expected return.
The state value function Vฯ(s) answers: "How good is state s under policy ฯ?" The action-value function Qฯ(s,a) is more granular: "How good is taking action a in state s, then following ฯ?" Q-values are the direct targets of Q-learning and DQN. The advantage Aฯ(s,a) = Q โ V measures how much better action a is compared to the average action from s โ used in actor-critic methods to reduce variance.
Richard Bellman's 1957 insight is the mathematical spine of all RL. The Bellman Expectation Equation expresses Vฯ(s) recursively: the value of state s equals the immediate reward plus the discounted value of the next state, averaged over all possible next states. This converts an infinite-horizon problem into a tractable recursive computation.
The Bellman Optimality Equation does the same for the optimal value function V*: the optimal value is obtained by always choosing the best possible action. Every major RL algorithm โ Q-learning, DQN, PPO, SAC โ is explicitly or implicitly solving a form of this equation.
The Bellman equation is why RL works at all: it converts predicting infinite-horizon returns into predicting one-step rewards plus bootstrapped future values. Without it, every policy gradient update would require complete Monte Carlo rollouts โ far too slow for complex tasks.
RL algorithms split along two major axes: (1) whether they use an explicit model of transition dynamics, and (2) whether they optimise a value function, directly a policy, or both via an actor-critic hybrid.
| Algorithm | Type | Action Space | Needs Model? | Sample Efficiency | Stability | Chapter |
|---|---|---|---|---|---|---|
| Value Iteration | DP | Discrete | Yes (full) | N/A (planning) | High | 7.2 |
| Q-Learning | TD, off-policy | Discrete | No | Low | Medium | 7.4 |
| SARSA | TD, on-policy | Discrete | No | Low | High | 7.4 |
| DQN | Deep Q, off-policy | Discrete | No | Medium | Medium | 7.5 |
| REINFORCE | Policy Gradient | Cont/Disc | No | Very Low | Low | 7.6 |
| PPO | Actor-Critic, on-policy | Cont/Disc | No | Medium | High | 7.7 |
| SAC | Actor-Critic, off-policy | Continuous | No | High | High | 7.7 |
| MuZero | Model-Based | Discrete | Yes (learned) | High | High | 7.8 |
โ Chapter 7.1 โ Key Takeaways
- RL: agent maximises cumulative reward via trial-and-error interaction โ no teacher, no labelled data
- Credit assignment problem: which of 40โ80 chess moves caused the win/loss 50 steps later?
- MDP: (S, A, P, R, ฮณ) โ the formal framework; Markov property: future depends only on current state
- Return: G‑t = ฮฃฮณแตr‑t+k+1 โ discount factor ฮณ controls short vs long-term thinking
- V‑ฯ(s): "how good is this state?" โ Q‑ฯ(s,a): "how good is this action in this state?"
- Bellman equations: V*(s) = max‑a[R + ฮณฮฃPยทV*(s')] โ recursive foundation of every RL algorithm
- Algorithm families: Model-Based vs Model-Free; Value-Based vs Policy-Based vs Actor-Critic
Dynamic Programming (DP) is a family of algorithms that solve MDPs exactly โ given perfect knowledge of the environment model: every transition probability P(s'|s,a) and every reward R(s,a). Because real environments rarely hand us this model, DP is not deployed in production RL โ but it is the indispensable theoretical bedrock every modern algorithm is built on.
Knows P(s'|s,a) and R(s,a) for every (s,a,s') tuple โ the "planning" setting.
Repeatedly apply the Bellman operator T across all states until V converges to V*.
The Bellman operator is a ฮณ-contraction โ V_k โ V* is mathematically guaranteed.
Two main algorithms arise from DP: Policy Iteration (evaluate then improve a fixed policy in alternating steps) and Value Iteration (combine both steps into a single max-Bellman update). Both converge; they differ in computational trade-offs.
Given a fixed policy ฯ, Policy Evaluation computes Vฯ(s) for every state by repeatedly applying the Bellman expectation equation until the values stop changing (convergence threshold ฮธ).
Once we have Vฯ for a fixed policy ฯ, the Policy Improvement Theorem guarantees we can construct a strictly better (or equal) policy by acting greedily with respect to Vฯ.
Run policy evaluation to get accurate Vฯ values for the current policy.
New policy ฯ'(s) takes the action maximising Qฯ(s,a) at each state.
If ฯ' = ฯ (no action changed), the policy is already optimal: ฯ = ฯ*.
Policy Iteration alternates between Policy Evaluation and Policy Improvement until the policy stabilises. Despite each evaluation requiring many inner iterations, the outer loop converges in very few steps โ typically 2โ10 even for large MDPs.
// Policy Iteration
Initialise ฯ(s) = arbitrary policy for all s โ S
Initialise V(s) = 0 for all s โ S
LOOP (outer โ policy improvement):
// โโ Step 1: Policy Evaluation (inner loop) โโ
LOOP:
ฮ = 0
FOR each state s โ S:
v = V(s)
V(s) โ ฮฃ_a ฯ(a|s) ฮฃ_s' P(s'|s,a) [R(s,a,s') + ฮณยทV(s')]
ฮ = max(ฮ, |v - V(s)|)
IF ฮ < ฮธ: BREAK // inner convergence
// โโ Step 2: Policy Improvement โโ
policy_stable = TRUE
FOR each state s โ S:
old_action = ฯ(s)
ฯ(s) โ argmax_a ฮฃ_s' P(s'|s,a) [R(s,a,s') + ฮณยทV(s')]
IF old_action โ ฯ(s): policy_stable = FALSE
IF policy_stable: RETURN ฯ, V // converged to ฯ*, V* Value Iteration eliminates the inner evaluation loop by fusing policy evaluation and improvement into a single Bellman optimality update. Instead of summing over ฯ(a|s), it takes the max over actions โ implicitly acting greedily at every step.
// Value Iteration
Initialise V(s) = 0 for all s โ S
Set ฮธ = 1e-6 (convergence threshold), ฮณ โ [0,1)
LOOP:
ฮ = 0
FOR each state s โ S:
v = V(s)
// Bellman optimality operator โ single sweep, no inner loop
V(s) โ max_a [ ฮฃ_s' P(s'|s,a) * (R(s,a,s') + ฮณ * V(s')) ]
ฮ = max(ฮ, |v - V(s)|)
IF ฮ < ฮธ: BREAK // values converged
// Extract greedy policy from V*
FOR each state s โ S:
ฯ*(s) โ argmax_a [ ฮฃ_s' P(s'|s,a) * (R(s,a,s') + ฮณ * V(s')) ]
RETURN ฯ*, V* DP is mathematically elegant but practically limited. Three hard walls prevent it from scaling to real-world problems โ and each wall motivated a different branch of modern RL.
P(s'|s,a) and R(s,a) must be known exactly. Real environments (games, robots, markets) don't hand you their dynamics.
State space explodes exponentially. Atari screen: 160ร210ร128 colours. Chess: ~1044 states. Full sweeps become impossible.
Every iteration updates every state: O(|S|ยฒ|A|). Wastes compute on rarely-visited states. Sample-based methods only update visited states.
| Property | Dynamic Programming | Model-Free RL (upcoming) |
|---|---|---|
| Environment model | Requires P(s'|s,a) and R | Learns from sampled experience |
| Computation per step | O(|S|ยฒ|A|) โ sweeps all states | O(1) per sample โ updates visited states only |
| Convergence | Guaranteed (exact) | Guaranteed under conditions |
| Value representation | Table โ one entry per state | Neural network (deep RL) โ generalises |
| Applicability | Small, known, discrete MDPs | Unknown, large, continuous environments |
Despite its impracticality, DP is the theoretical target all RL algorithms approximate. Monte Carlo methods replace full sweeps with sampled episodes. TD learning (Chapter 7.3) bootstraps from partial trajectories. Deep RL (Chapter 7.5+) replaces the value table with a neural network โ enabling Atari, Go, and robotic control at scale.
โ Chapter 7.2 โ Key Takeaways
- DP requires the full MDP model (P and R) โ not available in real problems but conceptually essential
- Policy Evaluation: iterative Bellman expectation updates converge to Vฯ for a fixed policy
- Policy Improvement Theorem: greedy w.r.t. Vฯ is always at least as good as ฯ
- Policy Iteration: alternate evaluation + improvement โ guaranteed convergence in very few outer iterations
- Value Iteration: single Bellman optimality sweep per iteration โ simpler, no inner loop, equally guaranteed
- Limitation: curse of dimensionality โ model-free methods overcome this by learning from sampled experience
Monte Carlo (MC) methods learn directly from complete episodes of experience โ no model of P(s'|s,a) needed. The idea is disarmingly simple: visit state s many times, record the actual return Gt each time, and average them. Like estimating the probability of heads by flipping a coin many times and averaging the results.
Learn directly from sampled episodes โ P(s'|s,a) never needed.
Must complete a full episode before any value update. Episodic tasks only.
V(s) โ mean of all Gt observed when s was visited โ unbiased estimate.
Two variants: First-visit MC averages returns from only the first visit to s per episode; Every-visit MC averages returns from every visit. Both converge to Vฯ with enough data.
The prediction problem: estimate Vฯ(s) for a given policy ฯ. Algorithm: run many episodes following ฯ, compute actual returns at each visit to s, average them. Classic example โ Blackjack: state is (player sum, dealer card, usable ace?); policy is hit if sum < 20 else stick. After 10,000 episodes a clear value landscape emerges.
Control means finding the optimal policy, not just evaluating a given one. MC Control applies the GPI (Generalised Policy Iteration) loop using MC evaluation โ but requires special care around exploration.
Estimate Q(s,a) from actual returns โ then improve: ฯ(s) = argmaxa Q(s,a).
Must visit all (s,a) pairs. Exploring starts: begin episodes at random (s,a). ฮต-greedy: with prob ฮต take random action.
On-policy: improve policy being followed. Off-policy: follow behaviour policy b, evaluate target ฯ via importance sampling.
TD learning is the central idea of modern RL โ it combines the best of Monte Carlo and Dynamic Programming. Like MC: learns from raw experience, no model needed. Like DP: bootstraps โ updates V(s) using the current estimate of V(s'), not the full actual return. The payoff: update after every single step, not after the whole episode.
Update after every transition (s, a, r, s'). Works for continuing tasks โ no episode boundary needed.
Uses V(s') to update V(s) โ the "TD target" rโโโ + ฮณV(sโโโ) replaces the full return Gt.
ฮดโ = rโโโ + ฮณV(sโโโ) โ V(sโ) โ the "surprise". The brain's dopamine signal is a biological TD error.
TD(0) is the simplest TD algorithm โ 1-step bootstrapping. It updates V(st) immediately after observing the next reward and state, using only one step of lookahead.
// TD(0) Policy Evaluation
Initialise V(s) = 0 for all s โ S
Set ฮฑ = 0.1, ฮณ = 0.9
FOR each episode:
Initialise state s
LOOP (each step t until terminal):
a โ ฯ(a|s) // follow policy
r, s' โ env.step(a) // observe reward and next state
ฮด = r + ฮณ * V(s') - V(s) // TD error
V(s) โ V(s) + ฮฑ * ฮด // online update โ no need to wait!
s โ s'
END LOOP
END FOR
// Worked trace (chain: A โ B โ C โ GOAL, ฮณ=0.9, ฮฑ=0.1):
// t=0: s=A, r=0, s'=B โ ฮด=0+0.9ร0โ0=0 โ V(A)=0
// t=1: s=B, r=0, s'=C โ ฮด=0+0.9ร0โ0=0 โ V(B)=0
// t=2: s=C, r=+1, s'=โ
โ ฮด=1+0.9ร0โ0=+1 โ V(C)=0.1
// (next episode, s=C again):
// ฮด = 1+0โ0.1 = +0.9 โ V(C)=0.1+0.1ร0.9=0.19
// After many episodes: V(C)โ1.0, V(B)โ0.9, V(A)โ0.81 | Property | Monte Carlo | TD(0) | Dynamic Programming |
|---|---|---|---|
| Needs model? | No โ samples | No โ samples | Yes โ full P, R |
| Update timing | End of episode | Every step (online) | Every sweep (all states) |
| Bootstraps? | No โ actual return Gt | Yes โ uses V(s') estimate | Yes โ from model |
| Bias / Variance | Zero bias, high variance | Some bias, lower variance | Zero bias, zero variance |
| Continuing tasks | No โ needs terminal | Yes | Yes |
| Memory per update | Full episode stored | st, at, rt+1, st+1 only | Full V(s) table |
| Convergence | To Vฯ (enough data) | To Vฯ (under conditions) | To Vฯ exactly |
TD(0) only updates the immediately preceding state; MC updates all states in an episode. TD(ฮป) interpolates between them via a parameter ฮป โ [0,1]: at ฮป=0 it is pure TD(0); at ฮป=1 it is equivalent to MC. The mechanism is the eligibility trace eโ(s) โ a running score of how recently and frequently each state was visited.
โ Chapter 7.3 โ Key Takeaways
- MC: learn from complete episodes โ average actual returns โ no model, zero bias, high variance
- TD: learn online after every step via bootstrapping โ combines strengths of MC and DP
- TD error ฮดt = rt+1 + ฮณV(st+1) โ V(st) โ the "surprise" that drives learning
- TD(0): lower variance than MC but biased by bootstrapped V(s') estimate
- TD vs MC: TD is more data-efficient and works on continuing tasks โ MC needs full episodes
- TD(ฮป): eligibility traces bridge TD(0) and MC โ ฮป=0 is TD(0), ฮป=1 is equivalent to MC
Watkins (1989) introduced Q-Learning โ still the most widely taught RL control algorithm. It learns the optimal action-value function Q*(s,a) directly, regardless of what policy is being followed during training. This makes it off-policy: the agent can explore freely, and the update still converges to the optimal values.
No need to know the model. Converges to the optimal Q-function under mild tabular conditions.
Update uses maxa' Q(s',a') โ the best possible next action, not the one actually taken.
After convergence: ฯ*(s) = argmaxa Q*(s,a). Extract policy for free from Q-table.
// Q-Learning Algorithm
Initialise Q(s,a) = 0 for all s โ S, a โ A
Set ฮฑ=0.1, ฮณ=0.99, ฮต=1.0
FOR each episode:
s = env.reset()
LOOP until terminal:
// ฮต-greedy action selection
IF random() < ฮต: a = random action // explore
ELSE: a = argmax_a Q(s,a) // exploit
r, s' = env.step(a)
// Off-policy update: always uses max over next actions
best_next = max over a' of Q(s', a')
Q(s,a) += ฮฑ * (r + ฮณ * best_next - Q(s,a))
s = s'
ฮต = max(ฮต_min, ฮต * ฮต_decay) // decay exploration rate In tabular RL the Q-function is stored as a 2-D table: rows are states, columns are actions, cells hold Q(s,a) estimates. Each transition updates one cell. Below: a step-by-step trace through a 4-state chain MDP, and a full Gymnasium implementation.
import gymnasium as gym
import numpy as np
env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n # 16 states (4ร4 grid)
n_actions = env.action_space.n # 4 actions
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
epsilon = 1.0
eps_min = 0.01
eps_decay = 0.995
n_episodes = 5000
for ep in range(n_episodes):
state, _ = env.reset()
for _ in range(200):
# ฮต-greedy action
if np.random.random() < epsilon:
action = env.action_space.sample()
else:
action = np.argmax(Q[state])
next_state, reward, done, truncated, _ = env.step(action)
# Q-Learning update
td_target = reward + gamma * np.max(Q[next_state]) * (not done)
Q[state, action] += alpha * (td_target - Q[state, action])
state = next_state
if done or truncated: break
epsilon = max(eps_min, epsilon * eps_decay)
print("Learned Q-table:\n", Q.reshape(4, 4, 4).round(2)) The exploration-exploitation dilemma is central to all RL: to learn good values you must explore, but to perform well you must exploit. ฮต-greedy is the simplest solution โ take a random action with probability ฮต, else take the best known action. ฮต is annealed from 1.0 (pure explore) down to a small floor (mostly exploit).
Fully random โ discover the environment before exploiting any knowledge.
ฮต โ max(ฮต_min, ฮต ร 0.995) โ learning happens in the transition zone.
Mostly exploit Q*; tiny ฮต keeps discovering state-action pairs not yet visited.
SARSA (Rummery & Niranjan, 1994) is Q-learning's on-policy cousin. The name comes from the 5-tuple it uses per update: (St, At, Rt+1, St+1, At+1). The key difference: the update uses the actual next action taken โ not the greedy max.
// SARSA Algorithm
FOR each episode:
s = env.reset()
a = ฮต_greedy(Q, s) // select FIRST action
LOOP until terminal:
r, s' = env.step(a)
a' = ฮต_greedy(Q, s') // select NEXT action BEFORE updating
// SARSA: uses actual (s, a, r, s', a') โ not max
Q(s,a) += ฮฑ * (r + ฮณ * Q(s',a') - Q(s,a))
s = s'
a = a' // carry forward the chosen action The on/off-policy distinction is subtle but has a concrete consequence: near danger. Q-learning sees the world through the lens of optimal behaviour and ignores its own exploration noise. SARSA accounts for the fact that it sometimes takes random actions โ and learns a more cautious policy as a result.
| Property | Q-Learning (Off-Policy) | SARSA (On-Policy) |
|---|---|---|
| Learns | Optimal Q*(s,a) directly | Qฯ for policy being followed |
| Next-action in update | maxa' Q(s',a') โ best possible | Q(s',a') โ actual action taken |
| Behaviour = Target policy? | No โ can use any behaviour policy | Yes โ same policy for exploration & learning |
| Convergence speed | Faster to optimal | Slower โ more conservative |
| Near hazards | Risky โ ignores exploration falls | Safer โ accounts for ฮต random steps |
| Used in | DQN, most deep RL | Safety-critical, tabular control |
The Cliff Walking benchmark (Sutton & Barto) makes the on/off-policy difference vivid. A 4ร12 grid: start bottom-left, goal bottom-right. The entire bottom row between them is cliff โ R=โ100 and agent resets. Normal steps cost R=โ1.
โ Chapter 7.4 โ Key Takeaways
- Q-learning: Q(s,a) โ Q(s,a) + ฮฑ[r + ฮณยทmaxa' Q(s',a') โ Q(s,a)] โ off-policy
- SARSA: Q(s,a) โ Q(s,a) + ฮฑ[r + ฮณยทQ(s',a') โ Q(s,a)] โ on-policy (actual next action)
- ฮต-greedy: ฮต=1.0 โ pure explore โ anneal to ฮต_min as agent learns the Q-table
- Q-learning: finds optimal path; SARSA: finds safe path accounting for ฮต exploration noise
- Both converge โ Q-learning to Q*, SARSA to Qฯ โ for tabular environments with sufficient data
- Limitation: tabular โ can't handle large/continuous state spaces โ solved by DQN (Ch 7.5)
Tabular Q-learning stores one number per (state, action) pair. That works fine for small, discrete environments โ but consider Atari Pong: 160ร210 pixels, 128 colours per pixel โ ~10568,000 possible states. A Q-table is not just impractical, it is cosmologically impossible. Continuous control (robot joints, autonomous driving) has an infinite state space.
The solution: approximate Q(s,a) โ Q(s,a;ฮธ) using a parameterised function (neural network with weights ฮธ). The network generalises โ similar inputs produce similar outputs โ so states never seen before can still receive reasonable Q-value estimates. A Deep Q-Network (DQN) is a CNN mapping raw pixels directly to Q-values.
DQN was first published in Nature (2015) with a single striking result: one network, one set of hyperparameters, learned to play 49 Atari games from raw pixels better than a professional human game tester on 29 of them โ with zero game-specific knowledge. This was the moment deep learning and RL merged into "deep reinforcement learning".
Mnih et al. โ "Human-level control through deep reinforcement learning", Nature 2015A network with millions of weights can represent Q-values over an astronomically large state space.
Similar pixel patterns โ similar Q-values. The agent recognises "ball near paddle" across screen positions.
Same CNN, same loss function, same hyperparameters โ 49 games, no task-specific tuning.
The DQN input is 4 stacked 84ร84 grayscale frames โ the stack gives the network a sense of motion (velocity of the ball, direction of movement) that a single frame cannot convey. Three convolutional layers extract spatial features; two FC layers decode those features into one Q-value per action.
import torch
import torch.nn as nn
class DQN(nn.Module):
"""DQN as in the original Atari paper (Mnih et al. 2015)"""
def __init__(self, n_actions: int):
super().__init__()
self.conv = nn.Sequential(
nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(), # 4 stacked frames
nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
)
self.fc = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
nn.Linear(512, n_actions) # one output per action
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# x: (batch, 4, 84, 84) โ normalise pixels to [0, 1]
return self.fc(self.conv(x / 255.0))
net = DQN(n_actions=6)
x = torch.randint(0, 255, (32, 4, 84, 84), dtype=torch.float32)
q_values = net(x) # (32, 6) โ Q-value per action
best_actions = q_values.argmax(dim=1) # (32,) โ greedy action
print(f"Output shape: {q_values.shape}") # torch.Size([32, 6]) Naively applying gradient descent to TD errors with a neural network diverges. DQN introduces two stabilisation innovations that together tame the instability and make deep RL practical.
Sequential transitions (sโโsโโsโโฆ) are highly correlated. Neural nets need i.i.d. samples โ not time-series data.
Store 1M transitions. Sample random mini-batches of 32. Each transition reused many times โ better data efficiency.
Frozen copy Q(s,a;ฮธโป) updated every 1000 steps. Stable TD target = stable training. Without it: chasing a moving target.
The full DQN training loop combines all the pieces into a clean 7-step cycle that repeats millions of times until convergence.
Stack 4 frames, preprocess to 84ร84 grayscale
ฮต-greedy from Q-network output
Observe rt+1, st+1 from env
Push (st, at, rt+1, st+1) to replay buffer
32 random transitions from replay buffer
TD error using frozen target network ฮธโป
Backprop; sync ฮธโป โ ฮธ every 1K steps
Van Hasselt et al. (2016) identified that standard DQN systematically
overestimates Q-values. The culprit: maxa' Q(s',a';ฮธโป)
always selects the highest value โ if any Q-value is overestimated (inevitable early in
training), it picks that inflated value, and the bias propagates. Fix: decouple
action selection from action evaluation.
max of noisy Q-values is always โฅ true max. With neural nets, noise is everywhere early in training.
Online net (ฮธ) picks the best action. Target net (ฮธโป) scores it. Two different noise sources cancel.
More accurate value estimates โ better policy decisions โ higher scores on most Atari games.
Wang et al. (2016) observed that in many states the choice of action barely matters โ an empty screen in Pong, a static field in Breakout. Only near the ball does the action have a significant effect. The Dueling DQN architecture exploits this by splitting Q(s,a) into two components: V(s) (how good is this state, regardless of action) and A(s,a) (how much better is this specific action than average).
Hessel et al. (DeepMind, 2018) asked: what if we combine all the improvements to DQN into one agent? The result โ Rainbow โ obliterated all prior benchmarks on Atari while using far fewer environment interactions.
| Extension | Key Innovation | Benefit | Year |
|---|---|---|---|
| DQN | CNN + replay buffer + target network | Stable deep RL on raw pixels | 2015 |
| Double DQN | Decouple selection/evaluation | Reduce Q-value overestimation | 2016 |
| Prioritised Replay | Sample high-TD-error transitions more | Better data efficiency | 2016 |
| Dueling DQN | V(s) + A(s,a) decomposition | Better state-value learning | 2016 |
| Multi-step Returns | n-step bootstrap target | Faster credit assignment | 2016 |
| Distributional RL (C51) | Predict full reward distribution (not mean) | Better risk modelling | 2017 |
| Noisy DQN | Learnable noise in FC weights replaces ฮต | State-dependent exploration | 2017 |
| Rainbow | All 6 improvements combined | State-of-the-art Atari (3ร DQN data efficiency) | 2018 |
โ Chapter 7.5 โ Key Takeaways
- DQN: Q-learning with a CNN as function approximator โ scales to pixel-level inputs that destroy tabular methods
- Experience replay: random mini-batches from a large buffer break temporal correlations and reuse data
- Target network: frozen copy updated every C steps stabilises the TD target โ without it, training diverges
- Double DQN: decouple action selection (online ฮธ) from evaluation (target ฮธโป) โ eliminates overestimation bias
- Dueling DQN: Q(s,a) = V(s) + A(s,a) โ separate heads for state value and action advantage, better generalisation
- Rainbow: 6 improvements combined โ state-of-the-art discrete-action deep RL, 3ร more data-efficient than DQN
Q-learning and DQN are powerful โ but they require enumerating a Q-value for every possible action. That works for discrete actions (6 Atari buttons) but breaks immediately for continuous action spaces: robot joint torques, steering angles, portfolio weights. How do you take argmax over an infinite set? You can't.
Policy gradient methods sidestep this by directly parameterising the policy ฯ(a|s;ฮธ) as a neural network and performing gradient ascent on expected return J(ฮธ). They also naturally produce stochastic policies โ essential in adversarial games where determinism is exploitable (rock-paper-scissors, poker).
| Property | Value-Based (DQN) | Policy-Based (Policy Gradient) |
|---|---|---|
| What is learned | Q-function, then extract ฯ | Policy ฯ directly |
| Action space | Discrete only | Continuous and discrete |
| Policy type | Deterministic (argmax) | Stochastic (probability dist.) |
| Optimisation | Indirect โ Q โ ฯ | Direct gradient ascent on J(ฮธ) |
| Variance | Lower (bootstrapping) | Higher (Monte Carlo returns) |
| Applicability | Atari, discrete games | Robotics, continuous control, NLP (RLHF) |
Williams (1992) proved that the gradient of expected return with respect to policy parameters has a remarkably clean form โ computable purely from sampled trajectories, without knowledge of the environment model.
โ log ฯ(a|s;ฮธ) = โฯ/ฯ โ lets us compute the gradient from samples without differentiating through the environment.
Actions that led to high Gt get their probability increased. Low Gt actions get decreased. Simple and elegant.
Only requires sampled (s,a,G) tuples. Works for any differentiable policy ฯ(a|s;ฮธ) โ including neural networks.
REINFORCE is the simplest instantiation of the policy gradient theorem โ Monte Carlo style: collect a full episode, compute discounted returns Gt at every step, then update the policy. Unbiased but high-variance.
import torch, torch.nn as nn
import gymnasium as gym
class PolicyNetwork(nn.Module):
def __init__(self, obs_dim, action_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(obs_dim, 128), nn.ReLU(),
nn.Linear(128, 64), nn.ReLU(),
nn.Linear(64, action_dim) # raw logits
)
def forward(self, x): return self.net(x)
env = gym.make('CartPole-v1')
policy = PolicyNetwork(4, 2) # 4 obs dims, 2 actions
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma = 0.99
for episode in range(1000):
state, _ = env.reset()
log_probs, rewards = [], []
while True: # collect full episode
state_t = torch.tensor(state, dtype=torch.float32)
logits = policy(state_t)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample() # stochastic action
log_probs.append(dist.log_prob(action)) # log ฯ(a|s;ฮธ)
state, reward, done, trunc, _ = env.step(action.item())
rewards.append(reward)
if done or trunc: break
# Compute discounted returns G (backwards accumulation)
G, returns = 0, []
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-8) # normalise
# Policy gradient loss โ negative for gradient ASCENT
loss = -sum(lp * g for lp, g in zip(log_probs, returns))
optim.zero_grad(); loss.backward(); optim.step() REINFORCE works in theory but is slow in practice because Gt estimates from single trajectories have extremely high variance. The fix: subtract a baseline b(st) from the return. The baseline doesn't change the expected gradient โ it just centres it, dramatically reducing variance.
Using V(s) as a baseline requires learning V(s). The natural solution: train a separate Critic network alongside the policy (Actor). The Critic computes the TD error ฮดt โ a single-step advantage estimate โ and feeds it back to guide the Actor's gradient updates. This enables online, per-step learning without waiting for an episode to end.
Mnih et al. (DeepMind, 2016) scaled Actor-Critic to deep networks with a key insight: parallelism solves the correlation problem. Instead of a replay buffer, run many independent workers simultaneously โ each in its own environment copy. Their diverse experiences are naturally uncorrelated.
Workers push gradients to a global network asynchronously. Fast wall-clock time โ no waiting for others.
All workers step together; global update after each sync. Simpler, deterministic, often matches A3C performance.
Workers start from different states, explore different regions โ naturally uncorrelated gradients without a replay buffer.
A policy trained purely to maximise return tends to collapse to deterministic โ it finds one good action per state and stops exploring everything else. This is catastrophic in environments where exploration is critical or where the optimal policy is genuinely stochastic.
The fix: add an entropy bonus H(ฯ) to the objective. Entropy measures how spread-out a probability distribution is โ maximising it encourages the policy to remain uncertain and keep exploring. The coefficient ฮฒ controls the exploration-exploitation trade-off. Entropy regularisation is used in A3C, PPO, and is a cornerstone of SAC (Soft Actor-Critic), where maximising entropy is part of the fundamental objective.
All actions equally likely โ maximum randomness. High entropy = diverse exploration.
Large ฮฒ: explore broadly. Small ฮฒ: converge to best policy. ฮฒ=0: no entropy (pure exploitation).
Soft Actor-Critic maximises expected return AND entropy simultaneously โ state-of-the-art continuous control.
โ Chapter 7.6 โ Key Takeaways
- Policy gradient: directly optimise โฮธ J(ฮธ) = E[โlog ฯ ยท Gt] โ works for continuous and stochastic actions
- REINFORCE: full episode โ compute returns โ update โ unbiased but high variance
- Baseline: subtract V(s) from returns โ advantage A = GโV reduces variance without adding bias
- Actor-Critic: Actor acts, Critic provides TD error ฮด โ online per-step updates, lower variance than REINFORCE
- A3C / A2C: parallel workers with shared global network โ diverse experience, fast training, no replay buffer
- Entropy bonus: prevent policy collapse โ encourage exploration throughout training (key in SAC)
Most research RL happens in textbooks. Most production RL runs PPO or SAC. These two algorithms dominate because they solved the core instability problems of earlier methods. Understanding why they work โ not just how โ is what separates an RL practitioner from someone who just copies hyperparameters from a blog post.
Schulman et al. (OpenAI/Berkeley, 2015) identified the root instability in policy gradient methods: step size. Too large a step and the new policy is so different that the old value estimates are invalid โ training collapses catastrophically and cannot recover. Too small and learning is painfully slow. This is fundamentally different from supervised learning where labels are fixed โ in RL, a bad update corrupts the very data distribution used for future learning.
TRPO's solution: constrain each update so the new policy stays within a trust region โ KL(ฯ_old โ ฯ_new) โค ฮด. The surrogate objective is a valid lower bound on true performance inside this region, guaranteeing monotonic improvement. The catch: enforcing this hard constraint requires second-order optimisation (Fisher information matrix, conjugate gradients) at O(|ฮธ|ยฒ) cost โ infeasible for large networks.
| Property | TRPO | PPO (next section) |
|---|---|---|
| Constraint type | Hard KL constraint | Soft clip (approximate) |
| Improvement guarantee | Monotonic โ provable | Approximate but reliable |
| Optimisation order | Second-order (Fisher matrix) | First-order (Adam/SGD) |
| Compute per update | O(|ฮธ|ยฒ) โ infeasible at scale | O(|ฮธ|) โ standard backprop |
| Implementation | Complex (CG solver, line search) | Simple (~50 lines of PyTorch) |
| Used in practice | Rarely โ historical reference | Default algorithm for most tasks |
Schulman et al. (OpenAI, 2017) asked: can we get TRPO's stability with standard first-order optimisation? The answer is PPO โ the most widely used deep RL algorithm in 2024. It powers ChatGPT/Claude/Gemini alignment (RLHF), OpenAI Five, AlphaStar, Boston Dynamics locomotion, and continuous control across robotics.
PPO replaces TRPO's hard KL constraint with a clipped surrogate objective. Define the probability ratio rt(ฮธ) = ฯnew(at|st) / ฯold(at|st). Instead of constraining this ratio via Lagrange multipliers, simply clip it to [1โฮต, 1+ฮต] and take the minimum of clipped and unclipped โ creating a flat region where the gradient is zero, which naturally prevents too-large updates.
// PPO Algorithm (pseudocode)
Initialise policy ฯ_ฮธ, value function V_ฯ
Set ฮต=0.2, ฮณ=0.99, ฮป=0.95, K=10, T=2048, B=64, c1=0.5, c2=0.01
LOOP:
// Phase 1: Collect rollout with current policy
FOR t = 0 to T-1 (across N envs):
a_t ~ ฯ_ฮธ(ยท|s_t)
s_t1, r_t1 = env.step(a_t)
Store (s_t, a_t, r_t1, s_t1, log ฯ_ฮธ(a_t|s_t), V_ฯ(s_t))
// Phase 2: Compute GAE advantages
FOR each step t (backwards):
delta = r_t1 + ฮณยทV_ฯ(s_t1) - V_ฯ(s_t)
A_hat[t] = delta + (ฮณยทฮป)ยทA_hat[t+1] // GAE
A_hat = (A_hat - mean(A_hat)) / std(A_hat) // normalise
// Phase 3: K epochs of mini-batch updates
ฮธ_old โ ฮธ // freeze reference policy
FOR epoch = 1 to K:
FOR each mini-batch of B samples:
r_t = ฯ_ฮธ(a_t|s_t) / ฯ_old(a_t|s_t) // probability ratio
L_CLIP = mean(min(r_t*A_hat, clip(r_t,1-ฮต,1+ฮต)*A_hat))
L_VF = mean((V_ฯ(s_t) - V_target)ยฒ)
L_H = mean(H(ฯ_ฮธ(ยท|s_t)))
loss = -(L_CLIP - c1*L_VF + c2*L_H)
backprop(loss); clip_grad_norm_(0.5); optimizer.step() from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym
vec_env = make_vec_env("LunarLander-v2", n_envs=8)
model = PPO(
policy="MlpPolicy", env=vec_env,
clip_range=0.2, # ฮต โ trust region boundary
n_steps=2048, # T steps per env per rollout
batch_size=64, # mini-batch size B
n_epochs=10, # K update epochs per rollout
gamma=0.99, # discount factor
gae_lambda=0.95, # GAE ฮป
ent_coef=0.01, # entropy coefficient c2
vf_coef=0.5, # value loss coefficient c1
max_grad_norm=0.5, # gradient clipping
learning_rate=3e-4,
verbose=1, tensorboard_log="./ppo_tb/"
)
eval_cb = EvalCallback(gym.make("LunarLander-v2"),
best_model_save_path="./ppo_best/", eval_freq=10_000, n_eval_episodes=10)
model.learn(total_timesteps=1_000_000, callback=eval_cb) Schulman et al. (2016) formalised a key trade-off: the advantage estimate At used in policy gradient updates can be computed at different "depths" โ ranging from a pure TD(0) estimate (low variance, high bias) to a full Monte Carlo return (zero bias, high variance). GAE smoothly interpolates between these extremes with parameter ฮป, and ฮป=0.95 has proven the sweet spot across nearly every task.
Haarnoja et al. (Berkeley + Google, 2018) introduced SAC โ the dominant algorithm for continuous control. While PPO is on-policy (needs fresh data every update), SAC is off-policy โ it reuses all past experience from a replay buffer. Its fundamental innovation is maximum entropy RL: the agent is rewarded not just for getting high return, but for maintaining a stochastic (uncertain) policy.
Controls entropy-reward balance. SAC auto-tunes ฮฑ to maintain target entropy Hฬ = โ|A| โ eliminating the only sensitive hyperparameter.
Two Q-networks Q1, Q2 trained independently. Take min(Q1, Q2) for targets โ prevents Q-value overestimation.
1M-transition replay buffer. Sample random mini-batches. Reuses all past experience โ 10ร more sample efficient than PPO.
from stable_baselines3 import SAC
import gymnasium as gym
env = gym.make("HalfCheetah-v4")
model = SAC(
policy="MlpPolicy", env=env,
learning_rate=3e-4,
buffer_size=1_000_000, # replay buffer capacity
batch_size=256, # mini-batch size
tau=0.005, # soft target network update ฯ
gamma=0.99,
ent_coef="auto", # auto-tune ฮฑ
target_entropy="auto", # Hฬ = -|A|
learning_starts=10_000, # fill buffer before training
train_freq=1, # update every step
gradient_steps=1,
verbose=1
)
model.learn(total_timesteps=1_000_000)
model.save("sac_halfcheetah") Fujimoto et al. (2018) dissected three specific failure modes of DDPG (Deep Deterministic Policy Gradient) and engineered a targeted fix for each. TD3 is a lean, deterministic off-policy algorithm that is an important baseline for continuous control benchmarks.
Q1 and Q2 trained independently. Target uses min(Qฬ1, Qฬ2) โ eliminates overestimation bias.
Update actor every 2 critic steps. Prevents actor from over-fitting to a noisy, underfit critic.
Add clipped Gaussian noise to target actions. Smooths Q-landscape โ reduces high-variance targets.
| Property | SAC | TD3 |
|---|---|---|
| Policy type | Stochastic Gaussian | Deterministic |
| Objective | Max-entropy (reward + H) | Standard reward only |
| Temperature | Auto-tuned ฮฑ | No temperature โ tune noise ฯ |
| Sample efficiency | Higher (entropy bonus aids exploration) | Good but less robust |
| Inference | Sample from Gaussian (stochastic) | Forward pass only (deterministic) |
| Preferred for | Most continuous tasks (default) | Specific benchmarks, ablations |
Model-free RL needs millions of real environment interactions. Atari DQN requires 50M frames โ equivalent to 38 days of continuous play. For real robots or industrial systems, that is prohibitively expensive and dangerous. Model-based RL addresses this by learning a world model Mฯ(s,a) โ (s', r) โ a differentiable simulator โ and generating cheap synthetic experience to train the policy.
Interleave real steps with K=100 model-generated steps. Same policy, much more data per real interaction.
Train policy entirely inside the model (DreamerV3). Real env only used to improve the world model itself.
Use model for lookahead search at test time (MCTS in MuZero). Improves decisions without extra training.
Sutton (1991) introduced the Dyna architecture โ deceptively simple yet powerful. Every real environment step generates one real transition and K=100 model-generated transitions. The policy receives 101ร more gradient updates per real step at negligible extra compute cost (the model is a neural network, not the real environment).
Ha & Schmidhuber (2018) showed that compressing observations into a compact latent space and predicting the future in that latent space enables policies trained purely in imagination. DreamerV3 (Hafner et al., 2023) is the apex of this line: a single model with one set of hyperparameters that masters Atari, DMControl, Crafter, BSuite, and Minecraft diamond collection (4/7 tasks โ the first RL agent to achieve this).
The core is the RSSM (Recurrent State Space Model) โ a latent dynamics model with deterministic memory (GRU recurrent path) and stochastic uncertainty (discrete latent variables). The actor-critic trains entirely inside the RSSM by rolling out K=15 imagined steps from the current latent state โ never touching the real environment during policy updates.
Schrittwieser et al. (DeepMind, 2020) built on AlphaZero (which required the full rules of Go/Chess to run MCTS) and asked: what if we learn the rules from scratch? MuZero learns three functions: a Representation network (observation โ latent state h), a Dynamics network ((h, a) โ next latent h' + predicted reward r), and a Prediction network (h โ policy ฯ + value v). MCTS runs entirely in the learned latent space โ never invoking the real environment during planning.
The result: state of the art simultaneously on Atari (57 games), Go, Chess, and Shogi โ all with one algorithm, no domain knowledge, no hand-coded rules. The dynamics model need not predict realistic pixels or observations โ it only needs to produce accurate value and reward estimates for planning purposes.
MuZero Reanalyse (2021) further improves sample efficiency by replaying stored positions with the latest network, re-running MCTS to generate improved training targets. The successor EfficientZero (2021) achieves human-level Atari performance in just 2 hours of real game time โ a 20ร improvement over MuZero's original sample efficiency.
| Algorithm | Action Space | On/Off-Policy | Key Strength | Use When | Library |
|---|---|---|---|---|---|
| PPO | Disc + Cont | On-policy | Stable, simple, universal | Default choice, RLHF, games | SB3, RLlib, TRL |
| SAC | Continuous | Off-policy | Sample-efficient, robust, auto-ฮฑ | Robotics, locomotion, control | SB3, RLlib |
| TD3 | Continuous | Off-policy | Deterministic, stable baseline | Benchmark continuous control | SB3 |
| DQN/Rainbow | Discrete | Off-policy | Proven on Atari | Atari-style discrete games | SB3, RLlib |
| A2C | Disc + Cont | On-policy | Fast, simple | Prototyping, education | SB3 |
| DreamerV3 | Disc + Cont | Off-policy (model) | Best sample efficiency | Pixel obs, scarce real data | Official (JAX) |
| MuZero | Discrete | Planning | Strongest board/video games | Chess, Go, Atari | DeepMind (JAX) |
| RLHF-PPO | Token (disc) | On-policy | LLM alignment | Align language models | TRL, OpenRLHF |
The practical hierarchy for 2024: Start with PPO โ it works everywhere and is easy to debug. Switch to SAC for continuous control โ it's more sample-efficient. Consider DreamerV3 when real-world interaction is scarce or expensive. Use model-free methods unless you have a specific reason to use model-based.
โ Chapter 7.7 โ Key Takeaways
- TRPO: KL-constrained policy update โ guaranteed improvement but requires expensive second-order optimisation
- PPO: clip ratio rt to [1โฮต, 1+ฮต] โ same stability as TRPO with first-order Adam; industry default
- GAE(ฮป=0.95): exponentially-weighted TD errors โ low bias AND low variance advantage estimate
- SAC: maximum entropy objective โ off-policy, stochastic, auto-tuned ฮฑ โ best continuous control algorithm
- TD3: twin critics + delayed actor + target smoothing โ three targeted fixes for DDPG's failure modes
- DreamerV3: learn world model from real data; train actor-critic entirely in imagination โ 10-20ร sample efficiency
- MuZero: MCTS in learned latent space โ no game rules needed; state of the art on board & video games
Reinforcement learning is not just theory and Atari games. It beat the world champion at Go, trained every major LLM to follow instructions, controls robots at Google and Boston Dynamics, and optimises data centre cooling. This chapter is where the equations become consequences.
Chess has ~1047 positions โ hard but solvable for minimax search engines like Stockfish. Go has ~2ร10170 โ minimax is computationally infeasible, and classic evaluation functions can't reliably assess Go positions. Go requires intuition more than calculation. That intuition was beyond computers โ until March 2016.
Trained on 30M human game positions (57% accuracy). Biases MCTS toward human-like moves โ prior probability over actions.
Initialised from ฯSL, refined by self-play policy gradient. Wins 80% vs ฯSL โ discovers moves humans never conceived.
Predicts game outcome from any board position. Replaces expensive rollout simulations in MCTS leaf evaluation.
AlphaZero required explicit game rules for MCTS expansion โ it couldn't play Atari because the rules are embedded in emulator code. MuZero (DeepMind, 2020) solved this: it learns its own transition model from experience and runs MCTS entirely in the learned latent space, never calling the real environment during planning.
Ouyang et al. (OpenAI, 2022) scaled RLHF to GPT-3 (175B) and produced InstructGPT โ the model that became ChatGPT. Three stages, each with distinct datasets, objectives, and failure modes. Understanding each stage explains why alignment is hard.
13K prompts + expert human responses. Standard cross-entropy fine-tuning. Result: consistent instruction following, but no preference signal yet.
33K prompts, 4-9 model responses ranked by 40 contractors. Bradley-Terry training on preference pairs โ scalar proxy score.
Policy = SFT init. Episode = one response. Reward = RM score โ ฮฒยทKL(PPOโSFT). KL prevents catastrophic forgetting of SFT capability.
Reward Hacking is the central failure mode of RLHF. PPO maximises the reward model score โ but the RM is an imperfect proxy for human preferences. The model discovers RM weaknesses: verbosity (RM rewards long responses โ pad with filler), sycophancy (RM rewards agreement โ tell users what they want to hear), formatting exploitation (RM likes bullet points โ overuse bullets everywhere). Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
RLHF is powerful but complex: three stages, a separate reward model, PPO instability, and massive compute cost. Two newer approaches simplify or improve it significantly.
Proves the optimal RLHF policy has a closed form. No RM, no RL โ directly optimise on preference pairs. Stable, simpler, same performance.
Sample G responses, rank by verifiable reward, group-relative advantage. No value function. Trained DeepSeek-R1 and OpenAI o1/o3-class models.
GRPO works best when reward is binary and checkable โ math (correct answer?), code (tests pass?). Eliminates reward model bias entirely.
Robotics exposes every limitation of RL simultaneously: expensive data, dangerous exploration, partial observability, sim-to-real transfer. Yet recent systems have overcome these barriers at scale โ from dexterous hands to general-purpose manipulation.
Shadow Hand (24 DOF) solves Rubik's cube. Trained entirely in simulation using Automatic Domain Randomisation โ progressively harder sim variations.
Fine-tuned VLM (PaLI-X) to predict robot actions from image + text. "Pick up the apple" โ end-effector positions. Generalises to novel objects.
DQN on Google data centre sensors. Reduces cooling energy 30-40% (~$10M/year). Safety layer constrains actions within safe range.
A cleaning robot rewarded for "no mess visible" learns to hide mess under the rug. A boat racer learns to spin collecting power-ups instead of racing. Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure. Solutions: reward shaping, inverse RL, RLHF.
DQN needs 50M Atari frames = 38 days. Humans learn Pong in minutes. The human-to-RL sample efficiency gap is >100ร. Active research: meta-learning, model-based RL, self-supervised auxiliary tasks.
Real environments are almost always partially observable โ a robot can't see behind itself, an LLM can't see user intent. Agent must infer hidden state from observation history. Solutions: LSTM/GRU policies, Transformers with context, belief state tracking.
The environment changes over time โ a 2019 trading algorithm fails in 2020. RL recommender system faces drifting user preferences. Solutions: continual learning, online adaptation, periodic retraining, meta-learning for fast adaptation.
Chess: one reward after 40+ moves. Robotic grasping: reward only on success. Credit assignment over long horizons is fundamentally hard. Solutions: reward shaping, hindsight experience replay (HER), curiosity-driven exploration.
Random exploration is exponentially inefficient with 12+ DOF robots. Almost never discovers useful behaviours by chance. Solutions: intrinsic motivation (count-based, prediction error, empowerment), curriculum learning, population-based training.
Standard RL maximises expected cumulative reward โ no constraint on how. In simulation, failure is cheap (just reset). In the real world, failure can be catastrophic: a self-driving car exploring randomly could kill someone; a medical AI exploring randomly could harm a patient. Safe RL adds explicit constraints to the optimisation.
CMDP: maximise reward subject to cost constraints. Lagrangian relaxation adapts penalty automatically when constraints are violated.
BCQ, CQL, IQL: train on logged data with no env interaction. CQL penalises Q-values for out-of-distribution actions โ prevents exploiting gaps.
Black-box filter: map any proposed action to nearest safe action. Used in autonomous driving (safety pilot) and robotic control (joint limits, force limits).
| Property | Standard RL | Safe RL |
|---|---|---|
| Objective | Maximise reward | Maximise reward subject to constraints |
| Exploration | Allowed anywhere | Bounded to certified-safe region |
| Failure treatment | Learning experience โ reset and continue | Potentially unacceptable โ must avoid |
| Deployment context | Simulation, games | Physical systems, medical, autonomous vehicles |
| Reward signal | r(s,a) | r(s,a) โ ฮปยทc(s,a) (Lagrangian penalty) |
๐ Domain 7 Complete โ Reinforcement Learning
- Ch 7.1 โ RL = trial-and-error via rewards. MDP = (S,A,P,R,ฮณ). Bellman equations are the recursive foundation of every RL algorithm.
- Ch 7.2 โ DP solves MDPs exactly with a full model. Policy iteration: evaluate + improve. Value iteration: single Bellman sweep. Both converge.
- Ch 7.3 โ MC: average actual returns, no model. TD: online bootstrap after every step. TD(ฮป) bridges both via eligibility traces.
- Ch 7.4 โ Q-learning: off-policy max bootstrap โ Q*. SARSA: on-policy actual-action โ safer conservative policy. Cliff Walking shows the difference.
- Ch 7.5 โ DQN: Q-learning + CNN + experience replay + target network. Double DQN, Dueling, Rainbow: progressively improve stability and accuracy.
- Ch 7.6 โ REINFORCE: โlog ฯ ยท G โ unbiased, high variance. Baseline โ advantage. Actor-Critic: online TD ฮด from Critic guides Actor every step.
- Ch 7.7 โ PPO: clip ratio to [1โฮต, 1+ฮต] โ stable on-policy default. SAC: max-entropy off-policy โ best continuous control. DreamerV3: train in imagination.
- Ch 7.8 โ AlphaZero: self-play + MCTS from scratch. RLHF: SFTโRMโPPO = aligned LLMs. DPO: direct preference without RL. Real-world challenges: reward design, safety, sim-to-real.
Reinforcement learning is the only ML paradigm where learning happens through consequences. Every major breakthrough covered here โ AlphaGo's Move 37, ChatGPT's instruction-following, robots that grasp objects โ came from agents discovering strategies that humans never explicitly programmed.
The next frontier: RL for reasoning (o1/o3/DeepSeek-R1) and long-horizon autonomous agents (Domain 8). The agent-environment loop from Chapter 7.1 becomes the agentic loop from Domain 8 โ but now the environment includes the real world, and the agent is an LLM.