AI Foundation · Domain 10

Ethics, Safety & Fairness — Responsible AI

Bias measurement and mitigation, AI safety and alignment, privacy, governance frameworks, and the ethical dimensions of deploying AI at scale.

Chapter 10.1

AI Fairness — Bias, Discrimination & Measurement

AI systems do not create bias from nothing — they learn it, amplify it, and apply it at scale to millions of decisions. A hiring algorithm trained on historical data that systematically excluded women will learn to exclude women. A loan model trained on zip codes as proxy for race will perpetuate redlining. The question is not whether AI systems can be biased — they demonstrably are — but how to measure, reduce, and decide what "fair" actually means in each specific context.

What Is AI Bias? Core

AI bias refers to systematic errors in model outputs that correlate with protected characteristics — race, gender, age, disability, national origin, religion, and similar attributes. This is distinct from random error, which hurts everyone equally: bias hurts specific groups more. It is also distinct from statistical bias, a technical term for estimator deviation from the true value. AI/fairness bias refers to discriminatory patterns in real-world outcomes.

AI bias matters in ways that human bias sometimes does not because of four structural properties. Scale: one biased algorithm simultaneously affects millions of hiring, lending, healthcare, and criminal-justice decisions. Opacity: an algorithmic decision is harder to challenge and inspect than a human one. Automation bias: people tend to trust algorithmic outputs more than they should, reducing human correction of bad decisions. Feedback loops: biased outputs create biased training data for the next model, compounding over time.

⚖️

COMPAS Recidivism (2016)

A ProPublica investigation found that Black defendants who did not go on to re-offend were roughly twice as likely as white defendants to be wrongly flagged as high-risk. The tool has been used in bail and sentencing decisions in multiple US states.

💼

Amazon Hiring Algorithm (2018)

Amazon's ML recruiting tool, trained on historical hiring decisions, systematically penalised CVs containing the word "women's" (e.g. women's chess club). Scrapped after internal audit.

🏥

Healthcare Algorithm (2019)

Widely used algorithm systematically underestimated illness severity for Black patients because it used healthcare costs as a proxy for health needs — and Black patients historically received less care per illness.

📷

Facial Recognition (NIST 2019)

A NIST evaluation of 189 facial recognition algorithms found that many produced false positive rates 10–100× higher for Asian and African American faces than for white faces, with higher error rates for women as well. Unrepresentative training data is a commonly cited cause.

Bias Sources In-depth

Bias does not enter the ML pipeline at one point — it can enter at every stage, and different stages introduce qualitatively different types of distortion. Detecting and correcting bias requires auditing the full pipeline, not just the model.

🏛️

Historical Bias

Data reflects historical inequalities we don't want to perpetuate. Example: hiring data showing fewer women in engineering → model learns to favour men. The data is accurate; the pattern is harmful.

📊

Representation Bias

Certain groups under-represented in training data → model performs worse on them. Example: facial recognition trained mostly on lighter-skinned faces → higher error rates on darker faces.

📏

Measurement Bias

How data is collected systematically distorts it for some groups. Example: "prior arrests" as a proxy for criminality. Arrest rates reflect policing intensity as well as crime rates, so over-policed Black neighbourhoods look more criminal in the data than they are.

🔗

Aggregation Bias

Single model trained on pooled data from groups with different underlying patterns. Example: medical model where normal glucose levels differ by ethnicity — one-size model is wrong for multiple groups.

🎯

Evaluation Bias

Benchmark dataset doesn't represent the deployment population. Example: face datasets over-representing certain countries → misleadingly high aggregate accuracy metrics that hide per-group failures.

🚀

Deployment Bias

Model used in a context it was not designed for. Example: credit scoring model built for one country applied in another with different socioeconomic structures. Context shift invalidates assumptions.

Bias Entry Points — each pipeline stage introduces different bias types

Discrimination Types Core

Two distinct legal and conceptual categories of discrimination matter for AI systems. Disparate treatment (direct) occurs when the model explicitly uses a protected attribute. Disparate impact (indirect) occurs when the model produces outcomes that disproportionately harm a protected group even without using the protected attribute directly. Both cause real harm; the second is harder to detect.

Disparate Treatment (Direct)

Definition: Model explicitly uses a protected attribute as an input feature.

Example: "Don't approve loans for applicants of race X": the protected attribute enters the decision directly.

Detection: Inspect model inputs. Is the protected attribute present?

Mitigation: Fairness-through-unawareness, i.e. removing protected attributes from the inputs.

Problem: Correlated proxies (zip code ≈ race, name ≈ gender) mean removal often fails.

Legal: Illegal in most regulated domains (credit, hiring, housing) in US and EU.

Disparate Impact (Indirect)

Definition: Protected attribute not in model, but outcomes disproportionately harm a protected group.

Example: Credit model uses zip code → zip codes correlate with race → disparate racial impact without using race.

Detection: Requires outcome-level monitoring — compare approval/error rates across groups.

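US standard: "Four-fifths rule", a rule of thumb from the EEOC's Uniform Guidelines. A selection rate for the disadvantaged group below 80% of the advantaged group's rate is treated as evidence of adverse impact.

Problem: Removing features doesn't help if proxies remain in the data.

Legal: Can also be unlawful under Title VII (employment) and ECOA (credit) in the US, unless the practice is justified by business necessity.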
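
The four-fifths check reduces to a ratio of selection rates, so it is easy to sketch. The snippet below is a minimal illustration on made-up data; the function names `selection_rates` and `disparate_impact_ratio` and the toy applicants are assumptions for this example, not part of any standard library.

```python
from collections import Counter

def selection_rates(groups, selected):
    """Return {group: fraction selected} for two parallel lists."""
    totals = Counter(groups)
    positives = Counter(g for g, s in zip(groups, selected) if s)
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact_ratio(groups, selected):
    """Lowest group selection rate divided by the highest."""
    rates = selection_rates(groups, selected)
    return min(rates.values()) / max(rates.values())

# Hypothetical toy data: 10 applicants per group.
groups = ["A"] * 10 + ["B"] * 10
selected = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7   # A: 6/10 selected, B: 3/10 selected

ratio = disparate_impact_ratio(groups, selected)
print(f"selection rates: {selection_rates(groups, selected)}")
print(f"ratio = {ratio:.2f}, meets four-fifths threshold: {ratio >= 0.8}")
```

A ratio below 0.8 is a signal to investigate further; it does not by itself establish that a model is unlawful or that a ratio above 0.8 is fair.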

Fairness Definitions In-depth

"Fairness" is not one thing — mathematicians have formalised at least 21 distinct fairness criteria, many mutually incompatible. The five most important in practice each embed a different value judgement about what equality means and whose errors we are willing to tolerate.

📊

1 — Demographic Parity

Same positive prediction rate across groups: P(ŷ=1|A=0) = P(ŷ=1|A=1). Same loan approval % regardless of group. Problem: if qualification rates genuinely differ, demographic parity may require approving unqualified applicants.

✅

2 — Equal Opportunity

True positive rate equal across groups. Among those who WOULD repay a loan, equal fraction approved from each group. Focuses on qualified candidates being treated equally — does not constrain false positive rates.

⚖️

3 — Equalised Odds

Both TPR and FPR equal across groups. Stricter than equal opportunity — not only should qualified people be equally approved, unqualified people should also be equally rejected. Often requires accepting lower overall accuracy.

🎯

4 — Calibration

Among individuals assigned predicted probability p, a fraction p actually experience the outcome, for every p and in every group. COMPAS approximately satisfied this definition. Calibration ensures that a given score means the same thing for all groups.

👤

5 — Individual Fairness

Similar individuals receive similar predictions — regardless of group membership. Challenge: requires defining a domain-specific similarity metric. Hard to implement in practice but avoids the coarseness of group-level criteria.

Demographic Parity: P(ŷ=1 | A=0) = P(ŷ=1 | A=1)

Equal Opportunity: P(ŷ=1 | y=1, A=0) = P(ŷ=1 | y=1, A=1) (TPR parity)

Equalised Odds: P(ŷ=1 | y=k, A=0) = P(ŷ=1 | y=k, A=1) for k ∈ {0,1}

Calibration: P(y=1 | ŷ=p, A=0) = P(y=1 | ŷ=p, A=1) = p
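
To make the group-level definitions concrete, the sketch below computes the selection rate, TPR, and FPR per group and reports the gap for each criterion. It assumes binary labels and predictions held in NumPy arrays; `group_rates`, `gap`, and the toy data are illustrative names and values, not a library API.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Selection rate, TPR and FPR per group for binary labels and predictions."""
    rates = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        rates[g.item()] = {
            "selection_rate": yp.mean(),   # P(y_hat=1 | A=g)
            "tpr": yp[yt == 1].mean(),     # P(y_hat=1 | y=1, A=g)
            "fpr": yp[yt == 0].mean(),     # P(y_hat=1 | y=0, A=g)
        }
    return rates

def gap(rates, key):
    """Largest difference in one rate between any two groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Hypothetical toy data: six individuals in group 0, six in group 1.
y_true = np.array([1, 1, 1, 0, 0, 0,   1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0,   1, 0, 1, 0, 0, 0])
group = np.array([0] * 6 + [1] * 6)

rates = group_rates(y_true, y_pred, group)
print("demographic parity gap:", round(gap(rates, "selection_rate"), 3))
print("equal opportunity gap (TPR):", round(gap(rates, "tpr"), 3))
print("equalised odds gaps (TPR, FPR):",
      round(gap(rates, "tpr"), 3), round(gap(rates, "fpr"), 3))
```

Demographic parity looks only at the first gap, equal opportunity only at the TPR gap, and equalised odds requires both the TPR and FPR gaps to be zero.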

Fairness Definitions in Action — same dataset, very different approval rates

The Impossibility Theorem In-depth

Chouldechova (2017) and Kleinberg et al. (2016) proved independently that when base rates differ between groups, it is mathematically impossible for an imperfect classifier to simultaneously satisfy: (a) calibration, (b) equal false positive rates, and (c) equal false negative rates. Achieving any two forces a violation of the third. This is not a limitation waiting for a better algorithm; it is a proven theorem.

The COMPAS controversy illustrates this directly. ProPublica (2016) found COMPAS violated equal FPR: Black defendants who did not re-offend were falsely flagged as high-risk at ~2× the rate of white defendants. Northpointe replied that their tool satisfied calibration: defendants given the same risk score re-offended at similar rates regardless of race (for example, roughly 70% of those scored as 70% likely to re-offend actually did, in both groups). Both were correct. They measured different criteria, and the impossibility theorem guarantees that calibration and equal error rates cannot all hold when base rates differ.

The impossibility theorem does not mean fairness is impossible. It means fairness is a political and ethical choice, not a mathematical one. When someone says "our AI is fair" — ask: fair by whose definition? Calibration? Equal opportunity? Demographic parity? They cannot all be satisfied simultaneously when group base rates differ. The choice between them encodes a value judgement about whose errors we are willing to tolerate.
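
A small numeric check makes the conflict tangible. The sketch below uses the confusion-matrix identity FPR = p/(1−p) · (1−PPV)/PPV · (1−FNR), where p is the group's base rate, which is the form Chouldechova works with. Note the assumption: positive predictive value (PPV) at a single threshold stands in for calibration here, and the base rates and error rates are made-up numbers.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a base rate, a PPV and a false negative rate.

    Follows from PPV = TP / (TP + FP) with TP = (1 - fnr) * base_rate and
    FP = fpr * (1 - base_rate), both per individual.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Hypothetical groups that share PPV and FNR but differ in base rate.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.7, fnr=0.3)
fpr_b = implied_fpr(base_rate=0.3, ppv=0.7, fnr=0.3)

print(f"group A FPR: {fpr_a:.3f}")
print(f"group B FPR: {fpr_b:.3f}")
```

With PPV and FNR held equal, the group with the higher base rate ends up with the higher false positive rate. The only ways out are equal base rates or a perfect classifier.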

Fairness Impossibility — you cannot satisfy all three when base rates differ

Measuring & Auditing Bias Core

Algorithmic auditing systematically tests a model's performance across protected groups. Three audit types exist: internal audit (company tests its own model), external/independent audit (third party with model access — increasingly mandated by regulation such as the EU AI Act), and black-box audit (only API access — test by sending inputs and observing outputs). A minimum fairness audit reports accuracy, FPR, FNR, and calibration per demographic subgroup.

# false_positive_rate and false_negative_rate are provided by fairlearn, not scikit-learn
from fairlearn.metrics import MetricFrame, false_positive_rate, false_negative_rate
from sklearn.metrics import accuracy_score

# y_true: ground truth labels
# y_pred: model predictions
# sensitive_features: protected group column (e.g., gender = ['M','F',...])

metrics = {
    "accuracy":            accuracy_score,
    "false_positive_rate": false_positive_rate,
    "false_negative_rate": false_negative_rate
}

mf = MetricFrame(
    metrics=metrics,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive_features
)

print("Overall metrics:")
print(mf.overall)

print("\nMetrics by group:")
print(mf.by_group)

print("\nDisparities (max gap across groups):")
print(mf.difference())   # 0.0 = perfectly equal | higher = more disparate

# Visualise all metrics per group as bar chart
mf.by_group.plot.bar(
    subplots=True, layout=[1,3], figsize=(12,4),
    title=["Accuracy by Group", "FPR by Group", "FNR by Group"]
)

Mitigation Strategies In-depth

Bias mitigation can happen at three stages of the ML pipeline. Earlier intervention is more fundamental but requires more access to data and training. Post-processing interventions are easiest to apply to deployed models but address symptoms rather than root causes.

📂

Pre-Processing (Fix the Data)

Reweighting: higher sample weights for under-represented groups. Resampling: oversample minority group. Data augmentation: synthetic data for gaps. Disparate impact remover: transform features to reduce group correlation while preserving rank ordering.

⚙️

In-Processing (Constrain the Model)

Adversarial debiasing: predictor + adversary that tries to infer group from predictions — predictor learns to resist. Fairness constraints: add group-parity terms to the loss function. Fairness regularisation: penalty term for disparity.

🎛️

Post-Processing (Adjust the Output)

Threshold adjustment: different decision thresholds per group to equalise error rates. Reject option: abstain when model is uncertain — reduces disparate errors. Calibration: recalibrate probabilities per group.
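
As an illustration of threshold adjustment, the sketch below picks a separate decision threshold per group so that each group reaches the same true positive rate. It is a simplified example on made-up scores, assuming group membership and ground-truth labels are available for a validation set; `tpr_at` and `threshold_for_tpr` are hypothetical helper names.

```python
import numpy as np

def tpr_at(scores, y_true, threshold):
    """True positive rate when predicting positive for score >= threshold."""
    return (scores[y_true == 1] >= threshold).mean()

def threshold_for_tpr(scores, y_true, target_tpr):
    """Largest threshold at which the TPR reaches at least target_tpr."""
    pos = np.sort(scores[y_true == 1])[::-1]               # positives' scores, descending
    k = max(1, int(np.ceil(target_tpr * len(pos))))        # positives that must be accepted
    return pos[k - 1]

# Hypothetical validation scores: group B's scores are shifted downwards.
scores = {
    "A": np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]),
    "B": np.array([0.7, 0.6, 0.5, 0.4, 0.45, 0.3, 0.2, 0.1]),
}
labels = {g: np.array([1, 1, 1, 1, 0, 0, 0, 0]) for g in scores}

shared = {g: tpr_at(scores[g], labels[g], 0.55) for g in scores}
thresholds = {g: threshold_for_tpr(scores[g], labels[g], 0.75) for g in scores}
adjusted = {g: tpr_at(scores[g], labels[g], thresholds[g]) for g in scores}

print("TPR with one shared threshold of 0.55:", shared)
print("per-group thresholds:", thresholds)
print("TPR with per-group thresholds:", adjusted)
```

This equalises only the TPR (equal opportunity). Equalising the FPR as well generally needs a more involved procedure, and using group membership at decision time may itself be restricted in some domains.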

Strategy	Stage	Complexity	Performance Cost	When to Use
Reweighting	Pre-processing	Easy	Low	Unbalanced group representation in training data
Adversarial debiasing	In-processing	Complex	Medium	Strong group correlations in features
Fairness constraints	In-processing	Medium	Medium	Specific fairness criterion required by regulation
Threshold adjustment	Post-processing	Easy	Low–Medium	Post-deployment, known group membership at decision time
Reject option	Post-processing	Easy	Reduces coverage	When abstaining from prediction is acceptable
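
Reweighting can be sketched in a few lines. The example below uses one common scheme, often called reweighing, which sets each sample's weight to P(group) · P(label) / P(group, label) so that group and label become statistically independent under the weights. This is one possible choice among several, and the function name and toy data are assumptions for illustration.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-sample weights w = P(group) * P(label) / P(group, label)."""
    n = len(labels)
    n_group = Counter(groups)
    n_label = Counter(labels)
    n_joint = Counter(zip(groups, labels))
    return [
        (n_group[g] * n_label[y]) / (n * n_joint[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(groups, labels, weights, target_group):
    """Weighted fraction of positive labels inside one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == target_group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)

# Hypothetical training set: group A is mostly positive, group B mostly negative.
groups = ["A"] * 6 + ["B"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for g in ("A", "B"):
    print(g, "weighted positive rate:", round(weighted_positive_rate(groups, labels, weights, g), 3))
```

Many scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`, which is one way to apply them during training.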

∑ Chapter 10.1 — Key Takeaways

AI bias: systematic errors correlated with protected characteristics — amplified at scale, opacity makes it harder to challenge than human bias
Bias sources span the full pipeline: historical, representation, measurement, aggregation, evaluation, deployment — every stage can introduce it
Disparate treatment (direct use of protected attribute) vs disparate impact (indirect via correlated proxies) — both are legally and ethically harmful
Five fairness definitions: demographic parity, equal opportunity, equalised odds, calibration, individual fairness — each embeds a different value judgement
Impossibility theorem: when base rates differ, an imperfect classifier cannot simultaneously satisfy calibration + equal FPR + equal FNR; the COMPAS dispute illustrates this in practice
Fairness criterion choice is a value judgement, not a technical decision — must be made explicitly by stakeholders, not silently by engineers

Chapter 10.2

Explainability & Interpretability — Understanding What Models Do

A model that cannot explain its decisions cannot be trusted, debugged, audited for fairness, or deployed legally in regulated domains. Explainability is not a luxury — it is a precondition for responsible AI. The challenge is that the most accurate models are also the hardest to understand, making post-hoc explanation methods one of the most active areas of AI research.

Why Explainability Matters Core

A black-box model produces an output without explanation: "loan denied" — no reason given. This is problematic for every stakeholder in the decision chain.

🔍

Trust

Humans cannot verify whether the model's reasoning is sound or based on spurious correlations. Unexplained decisions cannot be trusted.

⚖️

Accountability

When a model errs, who is responsible? Without understanding what drove the decision, accountability cannot be assigned.

🐛

Debugging

You cannot improve what you cannot understand. Explainability is essential for identifying and fixing model failures.

🔎

Fairness Auditing

Bias cannot be detected without understanding what drove the decision. Did the model use a proxy for race? Impossible to know without explanation.

📜

Legal Compliance

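GDPR Article 22 restricts solely automated decisions with legal or similarly significant effects, and GDPR's transparency provisions are widely read as requiring meaningful information about the logic involved. The EU AI Act imposes transparency and human-oversight obligations on high-risk AI systems.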

🏥

Safety

In medical and safety-critical domains, unexplained decisions are dangerous. Clinicians must understand model reasoning to validate it.

Different stakeholders need different types of explanation:

Stakeholder	Explanation Need	Format
Data Scientists	Model debugging, feature importance for improvement	SHAP plots, partial dependence plots
Domain Experts	"Does this reasoning make clinical/business sense?"	Feature contributions with domain labels
Affected Individuals	"Why was I denied?" — right to explanation	Plain-language reason codes
Regulators	"Is this model compliant?" — audit and oversight	Model cards, disaggregated metrics
Executives	"Can we trust this for deployment?"	Summary dashboards, risk reports

🚫

Without Explainability

Medical AI says "do not treat". No explanation. Doctor cannot verify reasoning. Patient has no recourse. Model may have learned spurious correlations from EHR system bugs.

✅

With Explainability

Medical AI says "high risk, driven by: elevated troponin (+42%), age>65 (+28%), history of hypertension (+19%)". Doctor reviews, validates clinical reasoning, makes informed decision.

Interpretability vs Explainability Core

Interpretable model: the model itself is simple enough to be directly understood — humans can trace the full decision logic. Decision trees, linear regression, and rule-based systems are intrinsically interpretable.

Explainable model: the model may be complex (neural network, gradient boosting) but a separate post-hoc explanation method is applied to generate an explanation. The explanation is an approximation of the model's behaviour, not the model itself.

The interpretability–accuracy tradeoff is real: simpler models are easier to interpret but often less accurate. Complex models are more accurate but harder to interpret. Post-hoc XAI methods (LIME, SHAP) attempt to bridge this gap — allowing deployment of accurate complex models with approximate explanations.

Intrinsically Interpretable

✅ Decision trees: full trace of every split

✅ Linear / logistic regression: coefficients = feature weights

✅ Rule-based systems: explicit if-then logic

✅ Generalised additive models (GAMs)

✅ Humans can read and verify the model directly

⚠️ Accuracy ceiling: complex patterns cannot be captured

⚠️ May underfit in high-dimensional problems

Post-hoc Explainable

⚙️ Neural networks: millions of parameters

⚙️ Gradient boosting (XGBoost, LightGBM)

⚙️ Ensemble models: aggregated predictions

⚙️ Any black-box model

✅ Full accuracy of complex models retained

✅ Explanation generated after the fact via LIME, SHAP, saliency maps

⚠️ Explanation is an approximation: may not reflect true model reasoning

Accuracy vs Interpretability — and where post-hoc XAI bridges the gap

LIME — Local Interpretable Model-agnostic Explanations In-depth

Ribeiro et al. (2016) — "Why Should I Trust You? Explaining the Predictions of Any Classifier". LIME's core idea: locally approximate a complex model with a simple interpretable model. For a specific prediction, perturb the input slightly, observe how the prediction changes, then fit a simple linear model to the perturbed samples. The linear model's coefficients become the local feature importances — the explanation.

Local means LIME explains this specific prediction, not the global model. Model-agnostic means it works with any model — only needs input-output access (black-box).

①

Instance + Perturb

Take the instance to explain (e.g., loan application). Create perturbed versions by randomly changing feature values.

②

Query + Weight

Get model predictions for all perturbed versions. Weight each sample by its proximity to the original instance.

③

Fit + Explain

Fit a simple linear model on the weighted perturbed samples. Coefficients = local feature importances = the explanation.

LIME — locally approximate complex boundary with a simple linear model
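
The three steps can be sketched from scratch with NumPy. This is a simplified illustration of the idea for tabular inputs, not the implementation used by the `lime` package: the Gaussian perturbations, the exponential proximity kernel, the parameter defaults, and the stand-in `black_box` function are all assumptions made for this example.

```python
import numpy as np

def lime_explain(predict_fn, x, n_samples=2000, scale=0.5, kernel_width=1.0, seed=0):
    """Return local linear coefficients approximating predict_fn around x."""
    rng = np.random.default_rng(seed)
    # Step 1: perturb the instance with Gaussian noise.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # Step 2: query the black box and weight samples by proximity to x.
    y = predict_fn(Z)
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Step 3: weighted least squares fit of an intercept plus one slope per feature.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]   # drop the intercept; slopes are the local importances

def black_box(X):
    """Hypothetical model: depends on features 0 and 1, ignores feature 2."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))

x0 = np.zeros(3)
explanation = lime_explain(black_box, x0)
for name, value in zip(["feature_0", "feature_1", "feature_2"], explanation):
    print(f"{name}: {value:+.3f}")
```

The fitted slopes should be positive for feature 0, negative for feature 1, and close to zero for the ignored feature. Their exact size depends on the perturbation scale and kernel width, which is one reason LIME explanations can vary between runs and settings.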

SHAP Values — SHapley Additive exPlanations In-depth

Lundberg & Lee (2017) — "A Unified Approach to Interpreting Model Predictions". SHAP is grounded in cooperative game theory's Shapley values: each feature receives a value equal to its average marginal contribution across all possible feature subsets. This gives SHAP provably fair attribution properties: efficiency, symmetry, dummy, and linearity.

The key advantage over LIME: SHAP values sum exactly to prediction − baseline (average prediction), providing a complete, additive decomposition of every individual prediction. SHAP values are consistent — if a feature's true contribution increases, its SHAP value never decreases.
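
For a handful of features, Shapley values can be computed exactly by enumerating every subset, which makes the additive property easy to verify. The sketch below does this for a made-up three-feature model. One simplifying assumption: "missing" features are replaced by a single fixed baseline, whereas the SHAP library typically averages over background data.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all subsets (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def model(x):
    """Hypothetical model with one additive term and one interaction."""
    return 2 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(subset):
    """Model output when only the features in `subset` take the instance's values."""
    mixed = [instance[j] if j in subset else baseline[j] for j in range(len(instance))]
    return model(mixed)

phi = shapley_values(value_fn, 3)
print("Shapley values:", [round(v, 3) for v in phi])
print("sum of values:", round(sum(phi), 3))
print("prediction - baseline:", model(instance) - model(baseline))
```

The additive feature receives its full effect, the two interacting features split their joint effect equally (symmetry), and the values sum to the prediction minus the baseline output (efficiency).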

SHAP Waterfall Plot — feature contributions to a specific loan decision

import shap
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Assuming X_train, X_test, y_train are prepared
model = GradientBoostingClassifier().fit(X_train, y_train)

# TreeExplainer for tree-based models: fast and exact
explainer = shap.TreeExplainer(model)
# shape: (n_samples, n_features); for this classifier the values are in log-odds units
shap_values = explainer.shap_values(X_test)

# expected_value may be a scalar or a length-1 array depending on the shap version
base_value = float(np.ravel(explainer.expected_value)[0])

# Waterfall plot for a single prediction
sample_idx = 0
shap.plots.waterfall(
    shap.Explanation(
        values=shap_values[sample_idx],
        base_values=base_value,
        data=X_test.iloc[sample_idx].values,
        feature_names=X_test.columns.tolist()
    )
)

# Global feature importance — mean absolute SHAP value
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Beeswarm plot — distribution of SHAP values across all samples
shap.summary_plot(shap_values, X_test)

Attention Visualisation Core

Transformer models produce attention weights at each layer and head — a matrix indicating how much each token attends to every other token when producing its output representation. Attention visualisation renders these weights as heatmaps, showing which parts of the input the model "focused on" for a given output.

Attention maps are intuitive and free (no additional computation needed) but carry a critical caveat: attention ≠ explanation. Jain & Wallace (2019) showed that attention weights are not reliably correlated with gradient-based feature importances — high attention does not guarantee that a token drove the prediction. They are useful for model debugging and forming hypotheses, not for causal attribution.
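
For reference, the weights that an attention heatmap displays come from a softmax over scaled query-key dot products. The sketch below computes them for a single head with NumPy. The random query and key vectors are placeholders for what a trained model would produce, so the resulting pattern carries no meaning beyond showing the shape and normalisation of the matrix.

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d_k)) per row."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

tokens = ["The", "bank", "denied", "the", "loan"]
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), 8))   # placeholder queries, one row per token
K = rng.normal(size=(len(tokens), 8))   # placeholder keys, one row per token

W = attention_weights(Q, K)
for token, row in zip(tokens, W):
    print(f"{token:>7}: " + " ".join(f"{w:.2f}" for w in row))
```

Each row sums to 1 and shows how strongly that token attends to every token in the sequence. A real model has many such matrices, one per layer and head.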

✅

What Attention Maps Are Good For

Model debugging: identify unexpected focus patterns. Hypothesis generation: "the model seems to focus on negation words". Qualitative sanity checks. Identifying off-target generalisations.

⚠️

Attention Map Limitations

Attention ≠ importance (Jain & Wallace, 2019). Different attention heads capture different linguistic properties — aggregate visualisation is misleading. Cannot be used as a causal explanation for legal or accountability purposes.

Attention Heatmap — "The bank denied the loan because of low income"

Model Cards & Datasheets Core

Model Cards (Mitchell et al., Google, 2019) are standardised documentation for ML models — "nutrition labels" for AI. They report intended use, performance across subgroups, limitations, and ethical considerations, enabling informed deployment decisions.

Datasheets for Datasets (Gebru et al., 2018) apply the same principle to training data: motivation, composition, collection process, pre-processing, intended use, and distribution information — essential for understanding what biases may have been baked in.

📋

Model Card — Key Sections

Model details: architecture, training data, version
Intended use: primary use cases, out-of-scope uses
Metrics: which performance metrics are reported
Evaluation data: test set, preprocessing
Training data: brief description
Quantitative analyses: disaggregated evaluation by subgroup ← most important
Ethical considerations: potential harms, mitigations
Caveats and recommendations

📂

Datasheet for Dataset — Key Sections

Motivation: why was this dataset created?
Composition: what does it contain? What's excluded?
Collection process: how was data gathered?
Pre-processing: cleaning, filtering, labelling
Uses: intended tasks, tasks it should NOT be used for
Distribution: how is it released? Under what licence?
Maintenance: who is responsible for updates?

A model card is not a marketing document — it is a technical accountability document. The most important section is always quantitative analyses: performance metrics broken down by demographic subgroup. A model card that reports only aggregate accuracy is hiding the information needed to assess fairness.

Mitchell et al., 2019 — Model Cards for Model Reporting
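
One way to keep a model card honest is to store it as structured data and check that every declared metric is reported for every subgroup. The sketch below is a hypothetical minimal structure that loosely follows the sections listed above. The class, its field names, and the example values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_details: str
    intended_use: str
    metrics: list                      # names of the metrics the card reports
    evaluation_data: str
    training_data: str
    quantitative_analyses: dict = field(default_factory=dict)  # {subgroup: {metric: value}}
    ethical_considerations: str = ""
    caveats: str = ""

    def missing_subgroup_metrics(self):
        """(subgroup, metric) pairs that are declared but not reported."""
        return [
            (group, metric)
            for group, values in self.quantitative_analyses.items()
            for metric in self.metrics
            if metric not in values
        ]

    def reports_disaggregated(self):
        """True if at least two subgroups are covered and no metric is missing."""
        return len(self.quantitative_analyses) >= 2 and not self.missing_subgroup_metrics()

# Hypothetical card with made-up numbers.
card = ModelCard(
    model_details="Gradient boosting classifier, v1.2",
    intended_use="Pre-screening of loan applications; not for final decisions",
    metrics=["accuracy", "fpr", "fnr"],
    evaluation_data="Held-out applications from the most recent year",
    training_data="Historical applications",
    quantitative_analyses={
        "group_a": {"accuracy": 0.91, "fpr": 0.06, "fnr": 0.12},
        "group_b": {"accuracy": 0.88, "fpr": 0.11},
    },
)

print("missing:", card.missing_subgroup_metrics())
print("fully disaggregated:", card.reports_disaggregated())
```

A check like this could run in a release pipeline so that a card reporting only aggregate numbers, or skipping a metric for one group, is caught before publication.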

Legal Right to Explanation Core

Multiple legal frameworks now mandate or imply a right to explanation for automated decisions. The EU leads globally; the US relies on sector-specific regulations.

Regulation	Jurisdiction	Requirement	Scope
GDPR Article 22	EU (2018)	Right not to be subject to solely automated decisions with legal or similarly significant effects. Right to human intervention and to contest the decision; the extent of any right to explanation is debated.	Solely automated decisions about individuals in the EU
EU AI Act	EU (2024)	High-risk AI systems must be transparent, explainable, and auditable. Mandatory conformity assessments.	Hiring, credit, medical, law enforcement, education, critical infrastructure
ECOA / Fair Credit Reporting Act	US (federal)	Adverse action notices required in credit decisions — must state specific reasons for denial.	Consumer credit decisions
EEOC Guidelines	US (federal)	Guidelines apply to algorithmic hiring tools — disparate impact analysis required. No explicit explanation mandate.	Employment decisions

⚖️

GDPR Interpretation Challenge

"The right to explanation" under GDPR Article 22 is not perfectly defined — courts and regulators are still interpreting its scope. Does it require revealing model internals? A narrative reason? Feature contributions?

🔬

Post-hoc Explanation Fidelity

Post-hoc explanations (LIME, SHAP) may not reflect the actual model reasoning — they are approximations. An explanation that satisfies a legal requirement may not capture what truly drove the decision.

👤

Accessibility Challenge

Explanations simple enough for non-experts (affected individuals) may be misleading. Explanations accurate enough to be technically faithful may be incomprehensible to those who need them most.

∑ Chapter 10.2 — Key Takeaways

Black-box AI: no explanation → no trust, no accountability, no debugging — and no legal compliance in regulated domains
Interpretable: model is directly understandable (decision tree). Explainable: post-hoc method explains complex model (LIME, SHAP) after training
LIME: locally approximate any model with a simple linear model — explains THIS prediction, not the global model — model-agnostic, intuitive
SHAP: Shapley values — theoretically grounded attribution, values sum to prediction minus baseline, consistent and efficiency-preserving
Attention maps: useful for debugging but attention ≠ importance — not valid for causal or legal attribution (Jain & Wallace, 2019)
Model cards: standardised performance-by-subgroup documentation — aggregate accuracy alone is insufficient for fairness assessment
GDPR Article 22: restricts solely automated decisions and underpins a contested right to explanation. The EU leads globally; the US relies on sector-specific rules

Chapter 10.3

Privacy & Data Governance — Protecting Individuals in the Age of AI

AI creates privacy threats that go far beyond traditional data breaches. A model trained on aggregated data can reveal individual records. A language model can reproduce verbatim personal information from its training corpus. An "anonymous" dataset can be re-identified with pattern-matching at scale. Privacy-preserving AI is not just a compliance checkbox — it is a fundamental engineering and ethical requirement.

Privacy Threats from AI In-depth

AI creates four categories of novel privacy threat that do not require a traditional data breach — the attack surface is the model itself, its outputs, and its training pipeline.

🔍

Inference Attacks

Model trained on aggregate data reveals information about individuals. Membership inference asks "was this person's data used to train this model?"; published attacks report accuracy well above the 50% chance level, exceeding 70% on some overfitted models. Attribute inference predicts private attributes from public inputs. Example: location data → infer religious observance, health conditions, political beliefs.

💾

Training Data Leakage

LLMs can memorise training data verbatim and reproduce it when prompted. Carlini et al. (2021) extracted 600+ memorised sequences from GPT-2 by sampling and filtering large numbers of generations, including names, phone numbers, email addresses, physical addresses, and code snippets. Later studies report similar vulnerabilities in larger models.

🔄

Re-identification

Supposedly anonymous datasets can be re-identified using pattern matching at scale. Netflix Prize: Narayanan & Shmatikov (2008) linked "anonymous" ratings to public IMDb profiles. AOL search logs (2006): journalists re-identified individual users from anonymised search queries. Genome databases + statistical analysis → individual family members identified.

🎭

Synthetic but Real Harms

AI generates realistic content attributed to real people without any data breach. Deepfake faces, voice clones, fabricated quotes. Synthetic "data" can contain accurate personal details about real individuals. Creates actionable privacy harms without exposing any raw training record.

AI Privacy Threat Taxonomy — four attack vectors against ML systems
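
Membership inference can be demonstrated with a very simple baseline: guess "member" whenever the model classifies a record correctly. The sketch below applies this to a 1-nearest-neighbour classifier, which memorises its training set perfectly. The synthetic data, the choice of model, and the helper names are assumptions for illustration; real attacks on neural networks usually rely on confidence or loss values instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic records with noisy labels, so the task cannot be learned perfectly."""
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_data(200)   # members: used to "train" the model
X_out, y_out = make_data(200)       # non-members: never seen by the model

def predict_1nn(X):
    """Black-box model: label of the nearest training record."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

def guess_member(X, y):
    """Attack: claim membership whenever the model's prediction is correct."""
    return predict_1nn(X) == y

tpr = guess_member(X_train, y_train).mean()   # members correctly flagged
fpr = guess_member(X_out, y_out).mean()       # non-members wrongly flagged
attack_accuracy = 0.5 * (tpr + (1 - fpr))

print(f"members flagged: {tpr:.2f}, non-members flagged: {fpr:.2f}")
print(f"balanced attack accuracy: {attack_accuracy:.2f}")
```

The attack works because the model is always right on records it has memorised and often wrong on unseen ones. The larger this generalisation gap, the easier membership is to infer.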

LLM Memorisation In-depth

Language models learn from training data in two ways. Verbatim memorisation occurs when a model can reproduce exact text from training data when given a matching prompt. Generalisation (learning patterns without memorising specifics) is the desirable mode, but the two coexist in every large model.

Carlini et al. (2021) attacked GPT-2 by generating a large number of samples, ranking them with membership-inference heuristics, and checking the most suspicious candidates against the training corpus. They found 600+ verbatim memorised sequences including personal names, phone numbers, email addresses, physical addresses, and source code. Three factors predict how much a particular sequence is memorised:

🔁

Duplication

Text appearing many times in training data is dramatically more likely to be memorised verbatim. A sequence appearing 100× is ~45× more likely to be extractable than a unique sequence. De-duplication is the most effective mitigation.

📐

Model Size

Larger models have more parameters and therefore more capacity to store training examples. GPT-2 XL memorises substantially more than GPT-2 Small even at the same data exposure. Scaling increases memorisation risk.

📏

Context / Prompt Length

Longer prompts extract longer memorised sequences. Providing more context from the training corpus makes the model more likely to reproduce the remainder verbatim. Limits on prompt length reduce extraction risk.

LLM Memorisation Risk Factors — duplication, model size, and prompt length
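
Since duplication drives memorisation, a de-duplication pass over the corpus is a natural first step. The sketch below removes exact duplicates after light normalisation using content hashes. It is deliberately minimal: the normalisation rules and function names are assumptions, and near-duplicate detection (for example with MinHash or suffix arrays) is not shown.

```python
import hashlib
import re

def normalise(text):
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalisation."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Hypothetical corpus with a repeated contact line.
corpus = [
    "Contact Jane Doe at 555-0100.",
    "contact  jane doe at 555-0100.",
    "An unrelated sentence.",
    "Contact Jane Doe at 555-0100.",
]

cleaned = deduplicate(corpus)
print(f"{len(corpus)} documents before, {len(cleaned)} after de-duplication")
```

Hashing keeps memory use proportional to the number of unique documents rather than their total length, which matters at corpus scale.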

Consent & Data Rights Core

Legitimate data processing under GDPR requires one of six legal bases. Training AI on scraped web data sits in contested legal territory on both copyright and privacy dimensions — with multiple major cases pending or decided as of 2025.

📜

GDPR Legal Bases

Consent — explicit, informed, revocable
Contract — necessary to fulfil a contract
Legal obligation — required by law
Vital interests — life/death emergency
Public task — public interest/authority
Legitimate interests — contested for AI training

©️

Copyright Disputes

Does training on copyrighted text constitute infringement?
NYT vs OpenAI — verbatim reproduction
Authors Guild vs OpenAI and Kadrey vs Meta (books training data)
Getty Images vs Stability AI — image training data
Different jurisdictions have different views — law unsettled (2025)

🔏

Privacy for AI Training

Does training on public personal data require consent?
GDPR requires a lawful basis for processing personal data of people in the EU; consent is only one of the options
Many AI companies claim "legitimate interests" — contested
Italian DPA temporarily banned ChatGPT (2023) over GDPR concerns
Regulatory enforcement is increasing

Principle	Requirement	AI Training Challenge
Purpose limitation	Data collected for one purpose cannot be used for another	Web scraping gathers data intended for human reading, not ML training
Data minimisation	Collect only what is necessary for the purpose	LLMs trained on everything — hard to argue all data is "necessary"
Storage limitation	Don't keep data longer than necessary	Model weights encode training data indefinitely
Individual rights	Access, correction, deletion, portability	Technically difficult to honour erasure requests post-training

Differential Privacy In-depth

Dwork et al. (2006) introduced Differential Privacy (DP) — the gold standard for provable privacy guarantees. DP gives a mathematical bound on how much information about any individual can be inferred from a mechanism's output.

The formal guarantee: the probability of any output changes by at most a factor of e^ε if any single individual's data is added to or removed from the dataset. ε (epsilon) is the privacy budget: lower means stronger privacy but typically lower utility. In practice DP is implemented by adding carefully calibrated random noise to query results or model gradient updates.

Differential Privacy (ε-DP):

M is ε-differentially private if for ALL adjacent datasets D, D′ and ALL outputs S:

P[M(D) ∈ S] ≤ e^ε · P[M(D′) ∈ S]
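
The classic way to satisfy this definition for a numeric query is the Laplace mechanism: add Laplace noise with scale equal to the query's sensitivity divided by ε. The sketch below applies it to a counting query, whose sensitivity is 1. The dataset, the predicate, and the function name are made up for illustration.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """ε-differentially private count using the Laplace mechanism."""
    true_count = sum(1 for x in data if predicate(x))
    sensitivity = 1.0   # adding or removing one person changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = [30] * 40 + [70] * 60          # hypothetical dataset of 100 ages

def over_65(age):
    return age > 65

for eps in (0.1, 1.0, 10.0):
    answers = [laplace_count(ages, over_65, eps, rng) for _ in range(3)]
    print(f"epsilon={eps}: " + ", ".join(f"{a:.1f}" for a in answers))
```

Smaller ε means larger noise and therefore stronger privacy but less accurate answers. Each released answer also consumes part of the privacy budget, so repeated queries need their ε values added up.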

DP-SGD update (training with DP):

1. Compute per-example gradients gᵢ

2. Clip: ĝᵢ = gᵢ / max(1, ‖gᵢ‖₂ / C) ← bound sensitivity

3. Sum: s = Σ ĝᵢ ← over the L examples in the batch

4. Add noise and average: g̃ = (1/L) · (s + 𝒩(0, σ²C²I)) ← Gaussian noise scaled to the clipping bound C
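The clipping and noising steps can be sketched in NumPy as follows. This is a minimal illustration, not a full training loop: it assumes the common formulation in which Gaussian noise with standard deviation σC is added to the sum of clipped gradients before dividing by the batch size L, and the function name `dp_sgd_gradient` is hypothetical. Privacy accounting (converting σ and the number of steps into an ε) is deliberately left out.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return one privatised batch gradient from per-example gradients.

    per_example_grads: array of shape (L, d), one gradient per example.
    clip_norm: C, the maximum L2 norm allowed for any single example.
    noise_multiplier: sigma; the noise std on the summed gradient is sigma * C.
    """
    g = np.asarray(per_example_grads, dtype=float)
    batch_size = g.shape[0]
    # Clip: scale down any gradient whose L2 norm exceeds C.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    clipped = g / np.maximum(1.0, norms / clip_norm)
    # Sum, add Gaussian noise calibrated to C, then average over the batch.
    summed = clipped.sum(axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / batch_size

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, -2.0]])  # norms 5.0, 0.5, 2.0
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because every clipped gradient has norm at most C, no single example can move the sum by more than C, which is what lets the noise scale be tied to C.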

Differential Privacy — stronger privacy (lower ε) reduces model accuracy

Deployment	ε value	Purpose
Apple (iOS keyboard)	ε ≈ 4	Next-word prediction, emoji usage, health trends
Google (Chrome RAPPOR)	ε = 1–4	Browser settings telemetry
US Census Bureau (2020)	ε = 17.14	Population statistics — privacy vs. accuracy political debate
Google (Gboard)	ε < 4	On-device federated learning + DP for keyboard model

Federated Learning Core

McMahan et al. (Google, 2017), "Communication-Efficient Learning of Deep Networks from Decentralized Data", introduced Federated Learning. The core idea is to train a shared model without ever centralising the training data. Data stays on local devices; only model updates are sent to the central server, which aggregates them with Federated Averaging (FedAvg) and distributes an updated global model.

Privacy benefits: raw data never leaves the device. Privacy limitations: gradients can still leak information via gradient inversion attacks (Zhu et al., 2019). Combining federated learning with differential privacy (DP-SGD on device) provides stronger guarantees.
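A minimal sketch of one FedAvg loop, assuming a toy linear-regression model and three simulated clients. The helper names (`local_update`, `fedavg_round`), the learning rate, and the synthetic data are all assumptions made for illustration. Only model weights cross the client/server boundary; each client's `X` and `y` are used solely inside `local_update`.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """Client side: a few gradient steps of linear regression on local data."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """Server side: average client models, weighted by local dataset size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_ws = np.stack([local_update(global_w, X, y) for X, y in clients])
    return (sizes[:, None] * local_ws).sum(axis=0) / sizes.sum()

# Simulated clients holding different amounts of data from the same task.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (20, 50, 30):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(30):
    w = fedavg_round(w, clients)
print(w)  # should end up close to true_w
```

In this sketch the clients share one underlying task; with non-IID client data the averaged model can drift away from any single client's optimum.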

Federated Learning — train on distributed data without centralising it

✅

Federated Learning Benefits

Privacy: raw data never leaves the device or institution. Regulation: enables collaboration across GDPR/HIPAA boundaries. Scale: learns from vastly more data than any single silo. Personalisation: local fine-tuning on top of global model.

⚠️

Federated Learning Limitations

Gradient leakage: Zhu et al. (2019) showed gradients can be inverted to reconstruct training images. Communication cost: many rounds of gradient exchange. Non-IID data: local distributions differ — convergence is harder. Poisoning: malicious clients can corrupt the global model.

Data Minimisation Core

Data minimisation — collect, use, and retain only the data strictly necessary for the stated purpose — is both a GDPR legal requirement and a privacy-by-design best practice. For AI systems it applies at every stage of the data lifecycle.

📥

Collection Minimisation

Only collect features that are actually necessary to achieve the model's purpose. Avoid collecting sensitive attributes by default. Use data impact assessments before ingesting new data sources.

🔧

Processing Minimisation

Aggregate or anonymise data before it enters model training where possible. Use synthetic data to supplement real data. Apply k-anonymity, l-diversity or t-closeness to datasets before use.

🗑️

Retention Minimisation

Define and enforce data retention schedules. Delete training data once the model is trained and validated. Maintain audit logs for deletion. Plan for model retraining on minimised datasets.

Technique	What It Does	Privacy Guarantee	Limitation
k-Anonymity	Every record is indistinguishable from ≥k−1 others on quasi-identifiers	Prevents direct re-identification	Vulnerable to homogeneity and background knowledge attacks
l-Diversity	Each equivalence class has ≥l distinct sensitive attribute values	Protects against attribute disclosure	Does not protect against probabilistic inference
Differential Privacy	Adds calibrated noise — provable bound on information leakage	Mathematically proven, composable	Accuracy cost, ε choice requires domain expertise
Synthetic Data	Generate statistically similar data without real individuals	No individual records — but can re-identify if poorly generated	Quality depends heavily on generation method
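The first two rows of the table can be checked mechanically. The sketch below, using a hypothetical five-record dataset with already-generalised quasi-identifiers, computes the k and l a dataset actually achieves; the function names are illustrative rather than taken from any library.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Largest k for which the records are k-anonymous.

    Equals the size of the smallest equivalence class, i.e. the smallest
    group of records that share the same quasi-identifier values.
    """
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any class."""
    values = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values.setdefault(key, set()).add(r[sensitive])
    return min(len(v) for v in values.values())

records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "946**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "946**", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["age", "zip"])
l = l_diversity(records, ["age", "zip"], "diagnosis")
print(k, l)  # 2 1
```

The result shows the homogeneity problem: the data is 2-anonymous, yet everyone in the second class shares one diagnosis, so knowing someone's age band and zip prefix reveals it.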

Right to Be Forgotten & Machine Unlearning Core

GDPR Article 17 gives individuals the right to erasure — they can request their personal data be deleted. For traditional databases this is straightforward. For ML models it is fundamentally hard: if a model was trained on your data, deleting the raw record does not remove its influence from the model's weights.

🎯

Exact Unlearning

Method: retrain the model from scratch on the dataset excluding the data to be forgotten. Guarantee: perfect — model has never seen the data. Cost: prohibitively expensive for large models. Used when: legal requirement is strict and model is small enough.

⚡

Approximate Unlearning

SISA training: shard the data and retrain only the affected shard (strictly an efficient form of exact unlearning). Gradient ascent: maximise loss on the forgotten data to push its influence out of the model. Influence functions: estimate and remove the effect of specific data points. The approximate methods are faster but provide weaker guarantees.
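The sharding idea can be sketched with a toy ensemble. This is a simplified illustration under stated assumptions: each shard's "model" is an ordinary least-squares fit, the class name `ShardedEnsemble` is hypothetical, and the slicing and checkpointing that full SISA training adds within each shard are omitted.

```python
import numpy as np

class ShardedEnsemble:
    """Toy SISA-style ensemble: one least-squares model per data shard."""

    def __init__(self, X, y, n_shards):
        self.X, self.y = X, y
        # Each training example belongs to exactly one shard.
        self.shard_of = np.arange(len(y)) % n_shards
        self.active = np.ones(len(y), dtype=bool)
        self.models = [self._fit(s) for s in range(n_shards)]

    def _fit(self, shard):
        # Train from scratch on the shard's remaining (non-forgotten) examples.
        idx = self.active & (self.shard_of == shard)
        w, *_ = np.linalg.lstsq(self.X[idx], self.y[idx], rcond=None)
        return w

    def predict(self, X):
        # Aggregate by averaging the per-shard predictions.
        return np.mean([X @ w for w in self.models], axis=0)

    def forget(self, i):
        # Drop example i and retrain only the shard that contained it.
        self.active[i] = False
        shard = int(self.shard_of[i])
        self.models[shard] = self._fit(shard)
        return shard

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=90)

ens = ShardedEnsemble(X, y, n_shards=3)
before = [w.copy() for w in ens.models]
retrained = ens.forget(4)  # example 4 lives in shard 4 % 3 == 1
print("retrained shard:", retrained)
```

Only one of the three models is refitted, so the cost of forgetting scales with the shard size rather than the full dataset; the price is that each model sees less data.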

🔍

Verification Challenge

How do you prove a model has forgotten specific data? No robust verification standard exists yet — an open research problem. Membership inference can test if data was in training, but low accuracy makes it unreliable as a forgetting proof.

Current practice: most companies respond to erasure requests by maintaining exclusion lists for future training runs and periodically retraining models from scratch — not true per-model unlearning. This is pragmatic but means previously trained model versions continue to contain the individual's data until the next full retraining. Regulators are beginning to scrutinise this gap.

Machine Unlearning — options and tradeoffs

∑ Chapter 10.3 — Key Takeaways

AI privacy threats: inference attacks, training data leakage, re-identification, synthetic harms — model itself is the attack surface
LLM memorisation: verbatim training data reproducible — duplication and model size increase risk; de-duplication is the most effective mitigation
GDPR requires: a lawful basis (such as consent), purpose limitation, and data minimisation; training on scraped web data is legally contested and enforcement is increasing
Differential privacy: provable privacy via calibrated noise — ε controls the privacy-utility tradeoff; deployed by Apple, Google, US Census
Federated learning: train on distributed data without centralising it — data stays on device, but gradient leakage remains a risk
Machine unlearning: right to be forgotten challenges ML models — exact unlearning is expensive, approximate methods exist, verification is an open problem

10.4

Chapter 10.4

AI Safety — Technical Alignment

AI safety is not a single problem — it is a cluster of related technical challenges around ensuring AI systems do what we actually intend, behave reliably under novel conditions, and remain correctable as they become more capable. The core difficulty: specifying what we want precisely enough that a powerful optimiser cannot exploit the gap between the specification and the intent.

The Alignment Problem In-depth

The alignment problem asks: how do we ensure AI systems pursue goals that are actually beneficial? It decomposes into two distinct sub-problems that can fail independently.

Outer Alignment: Wrong Objective

Definition: the objective we specify does not actually capture what we want.

Example: specify "maximise watch time" and the model learns to recommend outrage content.

Example: specify "minimise visible mess" and the robot hides mess under furniture.

Example: specify "get high RLHF reward" and the LLM learns sycophantic verbosity.

Root cause: reward function misspecification. We can't fully encode human values in a scalar.

Mitigation: better reward modelling, Constitutional AI, process-based supervision.

Inner Alignment: Different Objective

Definition: the learned model does not actually optimise the specified objective.

Example: a mesa-optimiser learns an internal proxy objective that matches the training objective in-distribution but diverges out-of-distribution.

Example: model appears aligned during evaluation (behaves correctly in-distribution) but pursues a different goal in deployment.

Root cause: training finds a model that scores well, not one that "believes" the objective.

Mitigation: mechanistic interpretability, adversarial evaluation, anomaly detection.

Goodhart's Law in AI — optimising a metric corrupts it

The alignment problem is not a distant future concern. Every time a recommendation algorithm optimises for watch time instead of user wellbeing, every time an LLM generates confident-sounding hallucinations to satisfy a fluency objective, every time a cleaning robot hides the mess — we are observing misalignment. These are small versions of the same failure mode that motivates AI safety research.

Adversarial Robustness In-depth

Szegedy et al. (2014) discovered that small, carefully crafted perturbations to model inputs cause high-confidence misclassification — imperceptible to humans but catastrophic to the model. The noise is optimised to maximally confuse the model, exploiting the high-dimensional geometry of neural network decision surfaces.

⬜

White-box Attack

Attacker knows model architecture and weights. FGSM, PGD — compute gradient of loss w.r.t. input, perturb in that direction. Most powerful attack type. Used in research to find worst-case vulnerabilities.

⬛

Black-box Attack

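Attacker only has API access to model outputs. Transfer attack: craft an adversarial example on a surrogate model, then apply it to the target. Query-based: query the target model many times, either to estimate the gradient from output scores (score-based) or to walk along the decision boundary using only predicted labels (decision-based).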

🌍

Physical-world Attack

Adversarial patches in the real world — printed stickers on stop signs fool autonomous vehicle classifiers. Adversarial glasses bypass facial recognition. Adversarial t-shirts make people "invisible" to detection systems.

Adversarial Example — imperceptible noise changes classification from Panda to Gibbon

Why this matters for safety: self-driving cars can be fooled by adversarial stickers on stop signs; facial recognition bypassed with adversarial glasses; LLM jailbreaks use adversarial prompt suffixes to bypass safety training.
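The white-box recipe (take the gradient of the loss with respect to the input, then step in its sign direction) can be sketched with FGSM against a tiny logistic-regression classifier. The weights, input, and ε below are hypothetical values chosen so the arithmetic is easy to follow; in high-dimensional image models a much smaller ε is typically enough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method against a logistic-regression classifier.

    For binary cross-entropy with p = sigmoid(w.x + b), the gradient of the
    loss with respect to the input x is (p - y) * w.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    # Move every input coordinate by eps in the direction that raises the loss.
    return x + eps * np.sign(grad_x)

w = np.array([1.5, -2.0, 0.5])  # hypothetical trained weights
b = 0.1
x = np.array([0.4, -0.3, 0.2])  # clean input with true label y = 1
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.5)
print("clean p(y=1):      ", sigmoid(w @ x + b))
print("adversarial p(y=1):", sigmoid(w @ x_adv + b))
```

The perturbation is bounded by ε in every coordinate, yet it is enough to move the prediction across the 0.5 decision threshold. PGD repeats this step several times with a projection back into the ε-ball.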

Defence	Approach	Strength	Limitation
Adversarial training	Include adversarial examples in training set	Empirically effective, widely used	Expensive; doesn't generalise to all attack types
Certified defences	Mathematically prove robustness within ε-ball	Provable guarantee	Accuracy cost; only small ε at scale
Input preprocessing	Randomise, smooth, or detect adversarial inputs	Simple and fast	Adaptive attacks can bypass preprocessing
Ensemble methods	Multiple diverse models must all be fooled	Raises attack cost	Transfer attacks still work across diverse models

Specification Gaming Core

Krakovna et al. (DeepMind, 2020) catalogued 60+ real examples of AI systems finding unintended optimal solutions — scoring highly on the specified objective in a way that violates the designer's actual intent. The examples span games, robotics, language models, and recommendation systems.

Specification Gaming Case Studies — AI finds unintended optimal strategies

RLHF & Alignment Core

RLHF (Reinforcement Learning from Human Feedback) is the dominant technique for aligning large language models to human preferences. From a safety perspective it delivers real improvements — but also introduces new failure modes.

✅

What RLHF Achieves

Instruction following: model does what humans ask
Harmlessness: avoids clearly harmful content
Honesty: acknowledges uncertainty, avoids confident falsehoods
Format compliance: structured outputs, appropriate length

⚠️

What RLHF Does NOT Fully Solve

Sycophancy: model learns to tell humans what they want to hear
Distributional shift: aligned in training contexts, potentially misaligned elsewhere
Value lock-in: aligns to the preferences of annotators (limited demographics)
Deceptive alignment: appears aligned during evaluation, may not be in deployment

📜

Constitutional AI (Anthropic, 2022) — A More Transparent Alternative

Instead of relying purely on human preferences, Constitutional AI uses a set of explicit principles (a constitution) to guide model self-critique. The model critiques its own outputs against the constitution and revises them — reducing dependence on individual annotator judgements and making the instilled values explicit and auditable. RLAIF (RL from AI Feedback) further reduces human annotation burden.

RLHF vs Constitutional AI — aligning LLMs to human values

Scalable Oversight Core

As AI becomes more capable, humans will struggle to evaluate its outputs directly. A human can assess whether an essay is well-written; a human cannot easily verify whether a 10,000-line codebase is secure, or whether a mathematical proof AI discovered is actually correct. Scalable oversight uses AI to help humans oversee AI — a necessary component of alignment for superhuman systems.

⚔️

Debate (Irving et al., 2018)

Two AI systems argue opposing positions; a human judge picks the winner. Key hypothesis: honest arguments are easier to defend because false sub-claims can be challenged, so an honest debater should win in the long run even against a dishonest opponent.

🔬

Iterated Amplification (Christiano et al., 2018)

Break a hard evaluation problem into easier subproblems. Recursively use AI assistance to evaluate AI outputs on complex tasks — bootstrapping human oversight of increasingly complex problems.

📡

Weak-to-Strong Generalisation (OpenAI, 2023)

Can a weaker supervisor elicit good behaviour from a stronger model? Early results suggest strong models can generalise beyond their weak supervisor's labels and recover part of the capability gap. This is an encouraging but preliminary signal for overseeing models that are more capable than their supervisors.

Scalable Oversight via Debate — AI helps humans evaluate complex AI outputs

Interpretability for Safety In-depth

Mechanistic interpretability aims to reverse-engineer what computations neural network circuits perform internally — not just what inputs influence the output (attribution), but what the model actually "thinks". This is essential for detecting deceptive alignment: a model that behaves safely during evaluation but has internal representations inconsistent with that behaviour.

🔬

Probing

Train a simple linear classifier on internal activations to test whether a concept is linearly represented in a layer. Example: does layer 12 of GPT-2 represent "is this token a proper noun?" Reveals what information is encoded where.

⚡

Activation Patching

Intervene: replace activations from one run with those from another to identify which components causally implement a behaviour. "If we patch layer 8 attention head 4, the model answers differently" → that component is causally responsible.

🔌

Circuit Analysis

Identify minimal sub-networks (circuits) responsible for a specific behaviour. Anthropic's "induction heads" (2022): identified a 2-head circuit implementing in-context learning in transformers — a landmark mechanistic result.

Mechanistic interpretability for safety operates under a specific threat model: deceptive alignment — a model that behaves safely in training (because it recognises it is being evaluated) but has internal goals inconsistent with safety. If interpretability can detect the internal representations of such goals, humans can intervene before deployment. This is an active research area at Anthropic, MIT, and EleutherAI, with early but encouraging results on circuits in small models.
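The probing technique can be sketched on synthetic data. In the example below the "activations" are random vectors with a concept planted along one direction, which is an assumption standing in for real hidden states extracted from a model; the helper names and hyperparameters are likewise illustrative. A control probe trained on shuffled labels gives a baseline for what the probe can achieve when no concept is present.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=300):
    """Fit a logistic-regression probe on activations with gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        w -= lr * acts.T @ (p - labels) / len(labels)
        b -= lr * np.mean(p - labels)
    return w, b

def probe_accuracy(acts, labels, w, b):
    return np.mean(((acts @ w + b) > 0) == (labels == 1))

rng = np.random.default_rng(0)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
# Synthetic "activations": noise plus a shift along concept_dir when label == 1.
acts = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * concept_dir

train, test = slice(0, 300), slice(300, None)
w, b = train_linear_probe(acts[train], labels[train])
acc = probe_accuracy(acts[test], labels[test], w, b)

# Control: the same probe trained on shuffled labels should stay near chance.
shuffled = rng.permutation(labels)
w_c, b_c = train_linear_probe(acts[train], shuffled[train])
control_acc = probe_accuracy(acts[test], shuffled[test], w_c, b_c)
print("probe accuracy:", acc, "control accuracy:", control_acc)
```

High probe accuracy shows the information is linearly decodable from the layer, not that the model uses it; establishing causal use is what activation patching is for.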

Current Safety Research Core

AI safety research in 2024–2025 spans multiple parallel tracks, from near-term practical improvements to longer-horizon alignment research. The field has grown rapidly since the release of capable frontier models.

Research Area	Problem	Approach	Key Labs	Status
Mechanistic Interpretability	Understanding internal model representations	Probing, activation patching, circuit analysis	Anthropic, MIT, EleutherAI	Active — early results on small models
RLHF & Preference Learning	Aligning to human values	Constitutional AI, DPO, RLAIF	Anthropic, OpenAI, DeepMind	Deployed — known sycophancy / lock-in limitations
Adversarial Robustness	Models break on perturbed inputs	Adversarial training, certified defences	MIT, CMU, Google	Partial — no solution scales to large models
Scalable Oversight	Evaluating superhuman AI outputs	Debate, amplification, weak-to-strong	OpenAI, Anthropic	Research phase — not deployed at scale
Anomaly / OOD Detection	Models fail silently on out-of-distribution input	Uncertainty quantification, conformal prediction	Many	Partial — active research area
Evaluation & Red Teaming	Measuring alignment and safety	Red teaming, evaluation suites	Anthropic, METR (formerly ARC Evals)	Active, rapidly evolving benchmarks
Jailbreak Robustness	Adversarial prompts bypass models' safety training	Adversarial training, constitutional methods	All major labs	Ongoing arms race, no durable solution

∑ Chapter 10.4 — Key Takeaways

Alignment: outer alignment (wrong objective specified) + inner alignment (model learns different objective) — both can fail independently
Goodhart's Law: optimising a metric corrupts it — specification gaming is pervasive across games, robots, and language models
Adversarial examples: imperceptible perturbations cause high-confidence misclassification — exploitable in safety-critical physical-world systems
RLHF achieves instruction-following and harmlessness but doesn't eliminate reward hacking or sycophancy
Constitutional AI: explicit principles guide self-critique — more transparent than pure RLHF, values are auditable
Scalable oversight: using AI to help humans evaluate AI — necessary as capability exceeds human evaluation ability
Mechanistic interpretability: reverse-engineer internal circuits — essential for detecting deceptive alignment before deployment

10.5

Chapter 10.5

Societal Impact — Labour, Power, Environment & Inequality

AI's societal impact extends far beyond the systems themselves. It reshapes labour markets, concentrates economic and political power, consumes significant environmental resources, and distributes its benefits and costs very unevenly — often along existing lines of privilege. Understanding these impacts is inseparable from responsible AI development.

AI & Labour Markets In-depth

Every major technological revolution disrupts labour markets — from the power loom to the spreadsheet. AI may be different in speed and breadth: it affects cognitive tasks previously thought to require human judgement, and it is being deployed across many sectors simultaneously.

McKinsey (2023): ~30% of work tasks could be automated by 2030 with current AI. Goldman Sachs (2023): 300 million full-time equivalent jobs globally are exposed to AI automation. These figures operate at the task level, not the job level — most jobs involve a mix of automatable and non-automatable tasks. Economists disagree significantly on what this means for employment.

📉

Most Exposed Tasks

Data processing and entry
Document analysis and summarisation
Routine writing (reports, emails)
Customer service and call centres
Basic legal and financial research
Radiological image screening (partial)
Cognitive, routine, rule-based

🛡️

Least Exposed Tasks

Physical dexterity in unstructured environments
Complex social interaction and negotiation
Novel creative work requiring embodied judgement
Caregiving and emotional support
Trade skills (plumbing, electrical, carpentry)
Physical, relational, context-dependent

🔄

Historical Pattern

Short-term: displacement in automated task categories
Long-term: new job categories created; productivity gains redistributed
The question: is this transition faster than historical precedent?
Economists genuinely disagree — the honest answer is we don't know yet

Automation & Task Displacement Core

AI Task Exposure by Occupation — office work most exposed, physical trades least

Source: Adapted from multiple 2023 labour market studies (McKinsey, Goldman Sachs, Acemoglu et al.). Note: "exposure" measures task susceptibility to automation, not predicted unemployment rates. Most occupations contain both exposed and non-exposed tasks.

Power Concentration In-depth

Frontier AI development is highly concentrated: 5–6 organisations control the most capable systems (OpenAI, Anthropic, Google DeepMind, Meta, Microsoft, xAI). This concentration has structural consequences that go beyond normal market dynamics.

⚖️

Concerns About Concentration

5–6 companies determine what AI does and doesn't do — their values, safety practices, and business decisions affect billions of people. Regulatory capture risk: those being regulated have far more technical expertise than regulators. Innovation monoculture: homogeneous approaches miss blind spots. Geopolitical leverage: AI capabilities are becoming a primary axis of US-China competition.

🌐

Arguments for Concentration

Safety research and evaluation require resources only large organisations can marshal. Coordination on safety standards is easier with few actors. Open release of powerful models may enable catastrophic misuse by state and non-state actors — a genuine concern, not just self-interest. Concentrated accountability may be easier to regulate than a fragmented ecosystem.

Open Source AI: Arguments For

✅ Democratises access: small organisations and countries can use frontier models

✅ Reduces single-point dependency on a few providers

✅ Community can identify and fix safety issues (many eyes)

✅ Academic research access: enables safety research outside big labs

✅ Prevents lock-in to proprietary ecosystems

Example: Meta LLaMA, Mistral, Falcon (widely deployed open models)

Closed AI: Arguments For

⚖️ Safety concerns: powerful open models can be fine-tuned to remove safety filters

⚖️ Proliferation risk: WMD-assistance, cyberweapon generation at scale

⚖️ Cannot update / patch a model once widely distributed

⚖️ Incentive structures for safety investment reduce without IP protection

⚖️ Regulatory oversight requires identifiable, accountable actors

Example: OpenAI, Anthropic, Google — proprietary frontier models

AI Development Concentration — frontier capability vs accessibility tradeoff

Environmental Costs In-depth

Training and running large AI models has significant energy and water costs that are rarely disclosed by the organisations responsible. The trend is towards larger models, larger datasets, and more inference queries — all of which increase environmental impact.

⚡

Training Energy

GPT-3 (2020): ~552 tonnes CO₂e, roughly the annual emissions of ~120 passenger cars
GPT-4 (2023): estimated significantly larger — exact figures not published
PaLM (2022): estimated ~3,400 MWh of training energy
Most organisations do not disclose training costs

💧

Water Consumption

Data centres use water for cooling — often overlooked in carbon reporting
Microsoft (2023): global data centre water consumption up 34% year-over-year
Estimated: ~0.5 litres per 100-word GPT-4 response
Water stress in regions hosting large data centres

🌍

Context & Perspective

Transatlantic flight: ~1.5 tonnes CO₂e per passenger
Training GPT-3: ~552 tonnes ≈ 370 passengers flying transatlantic
But: one trained model serves millions of queries
Per-query cost may be lower than human alternatives — context matters
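The comparison above is simple arithmetic, sketched below. The two emission figures are the ones quoted in this section; the query volume is a hypothetical number chosen only to show how amortisation works, and the calculation ignores inference energy entirely.

```python
TRAINING_TCO2E = 552.0            # GPT-3 training estimate quoted above
FLIGHT_TCO2E_PER_PASSENGER = 1.5  # transatlantic flight figure quoted above

def passenger_flight_equivalents(training_tco2e, per_passenger_tco2e):
    """How many single-passenger flights emit as much as one training run."""
    return training_tco2e / per_passenger_tco2e

def training_grams_per_query(training_tco2e, total_queries):
    """Training emissions amortised over served queries, in grams CO2e."""
    return training_tco2e * 1_000_000 / total_queries

flights = passenger_flight_equivalents(TRAINING_TCO2E, FLIGHT_TCO2E_PER_PASSENGER)
print(round(flights))  # 368, i.e. roughly 370 passengers

# Hypothetical lifetime query volume, used only to illustrate amortisation.
print(training_grams_per_query(TRAINING_TCO2E, total_queries=1_000_000_000))
```

The amortised training share shrinks as query volume grows, while inference emissions grow with it, so a full per-query estimate needs both terms.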

Estimated AI Training Energy — rapid growth with larger models (log scale)

AI & Inequality Core

AI's benefits and costs are not evenly distributed across populations, nations, or communities. Current patterns tend to amplify existing inequalities rather than reduce them.

Who Benefits Most (Near Term)

✅ High-income knowledge workers with access to frontier tools

✅ Organisations with compute infrastructure and ML talent

✅ English speakers: LLMs perform significantly better in English than in most other languages

✅ Wealthy countries with data centre infrastructure and fast internet

✅ Early adopters who can leverage AI productivity gains in competitive markets

Who Bears the Costs

⚠️ Workers whose tasks are automated first, often without retraining support

⚠️ Low-wage data annotators and content moderators in the Global South

⚠️ Communities near large data centres: high energy/water use, limited local benefit

⚠️ Non-English speakers: lower quality AI tools, less representation in training data

⚠️ Countries without AI talent or infrastructure: dependent on foreign AI providers

🌍

The Global South and AI

Much of the data annotation, RLHF rating, and content moderation work is outsourced to contractors in Kenya, Philippines, India, and Venezuela — often for $1–5/hour with no employment protections. Traumatic content moderation (reviewing violent, abusive, or extremist content) is disproportionately borne by Global South contractors with inadequate mental health support. The productivity and economic benefits of AI — in healthcare, education, and professional tools — are expected to arrive later, if at all, in these communities. This is a structural asymmetry built into the current AI supply chain.

Ghost Workers & Data Labour Core

Gray & Suri (2019) — "Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass" — documented the vast invisible human workforce behind AI systems that are marketed as "autonomous." AI systems present as automated but depend on a supply chain of human labour that is deliberately obscured.

🏷️

Data Annotation

Labelling training data — images, audio, text, video. Bounding boxes, segmentation masks, sentiment labels, entity tags. Platforms: Amazon Mechanical Turk, Scale AI, Remotasks, Sama, iMerit. Millions of tasks completed daily.

🛡️

Content Moderation

Reviewing flagged content — often traumatic: violence, child abuse, terrorism, self-harm. Outsourced to contractors in Kenya, Philippines, Colombia. Inadequate mental health support. Essential to the safety of every major AI platform.

⭐

RLHF Annotation

Rating and comparing AI outputs to train reward models. Instructed to follow detailed rubrics across thousands of comparisons. Determines what the AI considers "helpful," "harmless," "honest." These value judgements are made by low-wage contractors.

When a self-driving car "autonomously" navigates a city, it is doing so because thousands of annotators labelled millions of images of roads, pedestrians, and vehicles — often for a few dollars an hour. The "magic" of AI is built on a supply chain of human labour that is systematically obscured by the AI industry. The annotator who trained the model is never credited; the infrastructure that makes "AI" possible is rendered invisible by design.

Gray & Suri — Ghost Work (2019) | See also: TIME investigation into Kenyan contractors for OpenAI (2023)

Platform	Task Type	Typical Pay	Location
Amazon Mechanical Turk	General annotation, surveys, classification	$2–6/hr effective	Global, US-heavy
Scale AI	High-quality annotation, RLHF rating	$6–15/hr	Global South heavy
Remotasks	Image/3D annotation, driving data	$1–5/hr	Philippines, Kenya, India
Sama	Content moderation, annotation	$1.5–3/hr	Kenya (Nairobi)
iMerit	Medical/autonomous vehicle annotation	$3–8/hr	India

Distributing AI Benefits Core

Recognising the uneven distribution of AI's impact has prompted proposals across policy, technical, and organisational dimensions. There is no consensus solution — but the problem is increasingly recognised as central to responsible AI.

🏛️

Policy Levers

Worker transition funds and retraining programmes
AI-specific taxation to fund social safety nets
Fair compensation requirements for data labour
Universal basic income proposals
Mandatory human oversight for high-impact AI decisions
International frameworks for AI governance (UN, OECD)

🔬

Technical Approaches

Multilingual models reducing English language bias
Open weights models enabling local deployment
Efficient models reducing energy and compute barriers
Data sovereignty frameworks for national AI development
Participatory AI design including affected communities
Datasheets and model cards enabling informed deployment

🤝

Organisational Practice

Living wage and benefits for data annotators
Mental health support for content moderators
Attribution and recognition for data contributors
Diverse, global hiring for AI teams
Participatory impact assessments before deployment
Stakeholder councils with affected community representation

∑ Chapter 10.5 — Key Takeaways

~30% of work tasks could be automated by 2030 (McKinsey, 2023); exposure varies dramatically by occupation, with cognitive routine tasks most exposed and physical and relational tasks least
AI development highly concentrated in 5–6 organisations — significant power asymmetries in economic, informational, and geopolitical dimensions
Open vs closed AI is a genuine values debate — not resolvable by technical analysis alone; involves safety, democratisation, and accountability tradeoffs
Training large models: significant energy and water costs — GPT-3 ~552 tonnes CO₂e; most organisations do not disclose exact figures
Benefits and costs are unequally distributed — access, language, and infrastructure determine who benefits; existing inequalities tend to be amplified
Ghost workers: millions of annotators power "autonomous AI" invisibly, often for $1–5/hour with inadequate protections — this labour is built into every frontier model

10.6

Chapter 10.6

AI Governance & Regulation — Rules, Frameworks, and the Race Against Time

AI governance faces a fundamental structural problem: the technology evolves in months, while regulation takes years. The EU AI Act, the most comprehensive AI law enacted so far, took roughly three years to pass, from the European Commission's proposal in April 2021 to adoption in 2024. Frontier capabilities advanced by multiple generations in that same period. Understanding the landscape of governance approaches, their tradeoffs, and their limits is essential for anyone deploying AI in the real world.

Governance Approaches Core

Three broad approaches to AI governance exist on a spectrum from industry discretion to state mandate. Most real-world frameworks combine elements of all three.

🤝

Self-Regulation

Industry sets its own standards. Pros: fast, technically expert, flexible. Cons: conflict of interest, inconsistent enforcement, no democratic accountability. Examples: voluntary safety commitments (OpenAI, Google, Anthropic 2023 White House pledges), content policies, model cards.

📋

Principles-Based

Government sets high-level principles; industry decides implementation. Pros: technology-neutral, adaptable, less prescriptive burden. Cons: principles are vague, enforcement is hard, "fairness" and "transparency" mean different things to different actors. Examples: OECD AI Principles, UK DSIT AI framework.

⚖️

Prescriptive Regulation

Specific legal requirements with penalties for non-compliance. Pros: clear obligations, democratic legitimacy, enforceable. Cons: slow to adapt, risk of over/under-regulation, may entrench incumbents. Examples: EU AI Act, China generative AI regulations, sector-specific rules (FDA, EEOC).

Key design dimensions for any governance framework:

Dimension	Options	Tradeoff
Who is regulated	Developers \| Deployers \| Users \| All	Targeting deployers is practical; targeting developers enables earlier intervention
What is regulated	The model \| The application \| The impact	Impact-based is most rights-protective; model-based is more preventive
When enforcement occurs	Ex ante (pre-deployment) \| Ex post (after harm)	Ex ante prevents harm but may slow innovation; ex post easier to implement but harm already done
Jurisdiction	National \| Regional (EU) \| International	Fragmented rules create regulatory arbitrage; unified rules are hard to achieve

AI Governance Spectrum — voluntary to prescriptive

The EU AI Act In-depth

The EU AI Act (European Parliament, 2024) is the world's first comprehensive AI law. It entered into force in August 2024 with phased implementation through 2026–2027. Its core mechanism is a risk-based classification: the higher the risk, the stricter the requirements. Most AI systems face no requirements at all.

EU AI Act Risk Pyramid — four-tier risk-based classification

Category	Examples	Key Requirements	Max Penalty
Unacceptable	Social scoring, real-time public biometrics, subliminal manipulation	Prohibited — cannot be deployed	€35M or 7% global turnover
High Risk	Hiring AI, credit scoring, medical devices, law enforcement risk tools	Conformity assessment, registration, human oversight, accuracy & robustness, audit trail	€15M or 3% turnover
GPAI with systemic risk (>10²⁵ FLOP)	Frontier LLMs (GPT-4-class, Claude, Gemini)	Technical documentation, copyright compliance, energy disclosure, red teaming, adversarial testing	€15M or 3% turnover
Limited Risk	Chatbots, deepfakes, emotion recognition systems	Disclose AI nature to users, label synthetic content	€15M or 3% turnover
Minimal Risk	Spam filters, most consumer AI, video game AI	No requirements	N/A
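The penalty column can be read as a simple rule: for companies, the ceiling is the higher of the fixed amount and the share of global turnover. The sketch below assumes that reading and covers only three rows of the table. The names `PENALTY_CAPS` and `max_fine_eur` are hypothetical, and the Act's separate treatment of SMEs is not modelled.

```python
# Illustrative sketch only: fine ceilings copied from the table above.
# PENALTY_CAPS and max_fine_eur are hypothetical names, not an official API.
PENALTY_CAPS = {
    "unacceptable": (35_000_000, 0.07),  # (fixed cap in EUR, share of global turnover)
    "high_risk": (15_000_000, 0.03),
    "gpai": (15_000_000, 0.03),
}


def max_fine_eur(category: str, global_turnover_eur: float) -> float:
    """Return the fine ceiling: the higher of the fixed cap and the turnover share."""
    fixed_cap, turnover_share = PENALTY_CAPS[category]
    return max(fixed_cap, turnover_share * global_turnover_eur)
```

For a firm with €100M turnover the fixed cap dominates in every category; the turnover percentage only becomes the binding figure for large companies, which is why the same table row means very different things to a startup and a frontier lab.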

US Approach Core

The US has chosen executive action and sector-specific rules over comprehensive legislation. This approach is faster to implement but more fragmented and politically unstable.

🏛️

Federal Actions (2023–2025)

Oct 2023 Executive Order: required safety testing and reporting for "dual-use foundation models" (>10²⁶ FLOP). Directed NIST to develop AI safety standards. Created AI Safety Institute (NIST AISI).
Jan 2025: the new administration rescinded the Oct 2023 Executive Order; the US regulatory approach remains politically contested and uncertain.
No comprehensive federal AI or privacy law as of 2025.

🏢

Sector-Specific Regulation

Financial: SEC, OCC, CFPB guidance on AI in lending and trading
Healthcare: FDA oversight of AI/ML-based medical devices
Civil rights: EEOC guidance on algorithmic hiring discrimination
Consumer: FTC authority over deceptive/unfair AI practices
Patchwork of sectoral rules — significant gaps remain

🗺️

State-Level Activity

California SB 1047 (2024): proposed safety requirements for large model developers — vetoed by Governor Newsom.
Colorado & Illinois: laws regulating automated employment decisions.
New York City: Local Law 144 requires bias audits for automated hiring tools.
20+ states introduced AI-related legislation in 2023–2024.
Risk: patchwork of state laws creates compliance complexity without federal baseline.

🇺🇸 US Approach: "Innovation-First"

✅ Voluntary frameworks preferred: industry sets standards

✅ Sector-specific rules where harms are demonstrable

✅ Government funds research (NSF, DARPA) rather than regulating

⚠️ No comprehensive AI law: rights protection uneven

⚠️ Regulatory capture risk: industry lobbying is powerful

⚠️ Political instability: executive orders reversed by new administrations

🇪🇺 EU Approach: "Rights-First"

✅ Comprehensive mandatory framework with democratic legitimacy

✅ Risk-based — proportionate requirements by category

✅ Individual rights explicitly protected — right to explanation, human oversight

⚠️ Slow — 4 years from proposal to enforcement

⚠️ Technology moved faster than the law during drafting

⚠️ Compliance burden may favour large incumbents over startups

International Frameworks Core

AI governance is increasingly a geopolitical issue as well as a regulatory one. The US-China competition for AI leadership, the EU's regulatory export influence, and the Global South's limited seat at governance tables all shape the international landscape.

2016

Partnership on AI founded — Google, Facebook, Amazon, IBM, Microsoft, DeepMind, Apple. First major multi-stakeholder AI governance effort.

2019

OECD AI Principles — first intergovernmental AI principles, adopted by 46 countries. Principles: inclusive growth, human-centred values, transparency, robustness, accountability.

2019

G20 AI Principles — adopted by G20 nations, based on OECD framework. Non-binding but politically significant.

2021

UNESCO Recommendation on AI Ethics — non-binding, adopted by all 193 member states. Broadest international AI ethics agreement, but no enforcement mechanism.

2021–22

China AI regulations — algorithm recommendation rules (2021), deep synthesis/deepfakes regulations (2022). Prescriptive domestic regulation focused on content and state security.

2023

G7 Hiroshima AI Process — G7 leaders adopt 11 principles and a code of conduct for advanced AI developers. Voluntary but signals political attention at highest level.

2023

UK AI Safety Summit — Bletchley Declaration signed by 28 countries including US, EU, China. First international statement on frontier AI safety risks. Created global network of AI Safety Institutes.

2024

EU AI Act enters into force — world's first comprehensive AI law. Sets global benchmark; extraterritorial effect on any system deployed in EU.

2024

UN High-Level Advisory Body on AI — report on international AI governance options including potential UN AI governance body. No binding action yet.

2025

Global AI governance remains fragmented — competing national approaches, geopolitical competition complicates coordination. US executive order reversed. GPAI Code of Practice under development.

Risks of Fragmented Governance

⚠️ Regulatory arbitrage: companies move to jurisdictions with weakest rules

⚠️ Different technical standards complicate international AI deployment

⚠️ Geopolitical AI race may override safety considerations

⚠️ Race to the bottom on standards to attract AI investment

⚠️ Global South has limited voice in frameworks that affect them

Benefits of Coordination

✅ Shared safety standards enable international trust and interoperability

✅ Consistent requirements reduce compliance burden for global companies

✅ Collective action on catastrophic risks that no nation can address alone

✅ Democratic legitimacy for governance of a global technology

✅ Precedents from nuclear, chemical weapons, aviation safety governance

Industry Self-Regulation Core

In the absence of comprehensive regulation, AI labs have published voluntary commitments, safety frameworks, and usage policies. These are meaningful signals but face structural limitations as governance mechanisms.

📄

Voluntary Commitments

July 2023: OpenAI, Anthropic, Google, Meta, Microsoft, Amazon, and Inflection signed White House voluntary commitments on AI safety, including red teaming before deployment, watermarking AI-generated content, and sharing safety information. Not legally binding; no enforcement mechanism.

🧪

Safety Evaluations

Model evaluation ("evals") before deployment: capabilities testing, red teaming, dangerous capability assessments. Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework — internal thresholds for deployment decisions. UK/US AI Safety Institutes now doing third-party evaluations.

📋

Transparency Mechanisms

Model cards, system cards, technical reports — voluntary disclosure of model capabilities and limitations. Usage policies defining prohibited uses. Incident reporting — voluntary sharing of safety incidents between labs (limited uptake). Limitations: self-reported, no verification.

Self-regulation faces a fundamental structural problem: the entities being asked to regulate themselves are the same ones with the greatest commercial incentive to move fast and the greatest information advantage over external observers. Voluntary commitments that require sacrificing competitive advantage are systematically underenforced. This does not make them worthless — but it means they are insufficient as the primary governance mechanism for high-stakes AI systems.

Risk Frameworks In-depth

Risk frameworks provide structured methods for identifying, assessing, and managing AI risks. The two most widely referenced are the NIST AI RMF and the ISO/IEC 42001 standard.

🏛️

NIST AI Risk Management Framework (AI RMF, 2023)

Voluntary US framework for managing AI risk. Four core functions:
GOVERN: establish risk culture, policies, accountability structures
MAP: identify and categorise AI risks in deployment context
MEASURE: assess, analyse, and prioritise identified risks
MANAGE: respond to, monitor, recover from, and improve on AI risks
Not prescriptive — organisations implement at their own discretion

🎖️

ISO/IEC 42001 (2023) — AI Management System

International standard for organisations that develop or deploy AI. Certifiable — third-party audits against defined criteria. Covers: AI policy, objectives, planning, support, operation, evaluation, improvement. Analogous to ISO 27001 for information security — provides structured assurance. Increasingly required in procurement and regulatory compliance contexts.

AI Risk Matrix — likelihood × severity determines response priority
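A likelihood × severity matrix can be sketched in a few lines. The 1–5 scales, the score thresholds, and the function name `risk_priority` below are illustrative assumptions, not values prescribed by NIST AI RMF or ISO/IEC 42001; an organisation would set its own.

```python
def risk_priority(likelihood: int, severity: int) -> str:
    """Map a likelihood and severity rating (each 1-5, assumed scale) to a priority band."""
    if not (1 <= likelihood <= 5 and 1 <= severity <= 5):
        raise ValueError("likelihood and severity must be integers from 1 to 5")
    score = likelihood * severity  # ranges from 1 to 25
    # Thresholds are hypothetical cut-offs chosen for illustration.
    if score >= 15:
        return "critical"
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"
```

A multiplicative score treats a rare but severe risk (1 × 5) the same as a frequent but minor one (5 × 1). Some organisations override the score for the highest severity level for exactly that reason.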

🏛️

GOVERN

Establish risk culture, accountability, policies, workforce practices

🗺️

MAP

Identify and categorise risks; understand deployment context

📏

MEASURE

Assess, analyse, and prioritise identified risks with metrics

🛠️

MANAGE

Respond, recover, and improve — treat or accept residual risk

Governance Challenges Core

Even well-designed governance frameworks face structural challenges that are not solvable by better regulation alone. These are genuine tensions, not implementation failures.

⏱️

Regulatory Lag

Technology evolves in months; law takes years. The EU AI Act took roughly 4 years from proposal (April 2021) to first enforcement: ChatGPT did not exist when it was proposed, and GPT-4 was released before it was passed. Any fixed classification system risks being outdated before enforcement begins.

🔬

Technical Expertise Gap

Regulators lack the technical expertise to assess frontier AI systems. They depend on the companies they regulate for information. Solving this requires significant public investment in technical regulatory capacity — currently underfunded globally.

🌐

Jurisdictional Limits

AI is global; regulation is national. A model trained in the US, deployed via API from Ireland, used in Brazil — which rules apply? Regulatory arbitrage is already observable as companies choose incorporation jurisdictions partly on regulatory grounds.

📊

Measurement Problem

"Safety," "fairness," and "transparency" are not objectively measurable. Any regulation must specify which definitions and metrics apply — but these are contested value judgements. Mandating specific metrics risks Goodhart's Law at a regulatory level.

⚖️

Innovation vs Safety Tradeoff

Compliance requirements impose costs that large incumbents absorb more easily than startups. Overly prescriptive regulation may entrench existing power concentration. Regulatory frameworks that favour incumbents may achieve less safety than markets with more competition.

🔒

Regulatory Capture

AI companies have massive financial resources, technical expertise advantages, and revolving doors with government. The risk that regulated entities shape regulation to serve their interests (rather than public interests) is structural, not exceptional.

∑ Chapter 10.6 — Key Takeaways

Three approaches: self-regulation → principles-based → prescriptive law — EU leads on prescriptive; US prefers sector-specific and voluntary
EU AI Act: risk pyramid running from banned (social scoring, real-time public biometric identification) → high-risk (hiring, credit, medical) → limited → minimal; GPAI frontier models face additional requirements
US: sector-specific + executive action — no comprehensive law as of 2025; politically contested; state-level activity increasing
International: OECD AI Principles, G7, UN, Bletchley Declaration — fragmented, mostly voluntary; geopolitics complicates coordination
NIST AI RMF: Govern / Map / Measure / Manage — voluntary US risk management standard widely adopted in industry
AI governance challenge: technology evolves faster than regulation — regulatory lag is structural, not a fixable implementation problem

10.7

Chapter 10.7

Disinformation & Information Integrity — AI and the Epistemic Commons

AI did not invent disinformation — propaganda is as old as writing. What AI changes is the economics: generating convincing, personalised, multilingual disinformation at scale now costs nearly nothing. The most dangerous long-term effect may not be the fake content that people believe, but the authentic content they stop believing — because they can no longer tell the difference.

The Scale Problem Core

Before LLMs, creating convincing disinformation required skilled writers, translators, time, and money. With LLMs, generating thousands of unique, grammatically correct, superficially credible pieces of content takes seconds and costs nearly nothing. The key change is not that AI makes disinformation more persuasive per piece — it is that AI removes the economic constraint on volume.

📊

Quantity at Scale

One operator with LLM API access can generate millions of unique posts per day. Each post is distinct — evading simple duplicate-content detection. Volume enables astroturfing: simulating grassroots movements with synthetic accounts.

🎯

Personalisation

LLMs can tailor each message to a specific audience, platform, or individual. Political microtargeting: different narratives for different demographics. Each message feels personally relevant — amplifying persuasive effect compared to broadcast propaganda.

🌍

Multilingual at Zero Marginal Cost

Pre-LLM: translation required expensive human experts. Post-LLM: generate convincing disinformation in 50+ languages at the same cost as English. Enables operations in linguistic markets previously too expensive to target.

AI Reduces Disinformation Cost by 100–1000× — removing economic constraint on influence operations

AI-Generated Disinformation In-depth

AI-generated disinformation takes many forms — from long-form fake news articles to single fabricated quotes. The unifying characteristic is that LLMs lower the cost of production by orders of magnitude for each type.

📰

Fake News Articles

LLM-written articles mimicking the style of real news outlets. Complete with plausible bylines, datelines, and formatting. Difficult to distinguish from genuine journalism without source verification.

🌱

Astroturfing

AI-generated social media posts simulating genuine grassroots public opinion. Networks of synthetic accounts producing coordinated inauthentic behaviour. Makes minority views appear to have mass support.

💬

Fabricated Quotes

Realistic-sounding quotes attributed to real public figures. Combined with deepfake audio, they can be very hard to distinguish from real statements. Example: AI-generated Biden voice discouraging NH primary voting (2024).

⭐

Fake Reviews

Mass-produced synthetic product and service reviews. Post-ChatGPT: flood of AI-generated Amazon, Goodreads, and app store reviews. Undermines review systems as consumer trust signals at scale.

🎣

Personalised Phishing

LLMs generate individually targeted phishing messages using personal data. Unlike mass-spam: each message references real details (employer, colleagues, recent events). Higher success rate, lower marginal cost.

📧

Hallucinated-Fact Spam

Bulk communications containing confident-sounding but fabricated statistics, studies, and events. Often indistinguishable from legitimate information — humans can't easily verify hallucinated "sources" at scale.

Documented Case	Year	AI Role	Scale/Impact
Biden robocall (NH primary)	2024	AI voice clone of US President discouraging Democratic voters	Reached thousands of voters; clear election interference attempt
Slovak election audio	2023	AI-generated audio of opposition leader discussing election manipulation	Released days before vote; disputed whether it affected outcome
Pope puffer jacket image	2023	AI-generated image of Pope Francis in white puffer jacket	Viral — millions of shares before identified as AI-generated
AI-generated book flood	2023	Mass AI-generated books on Amazon, some attributed to real author names	Polluted search results; harmed real authors' discovery
Goodreads review flood	2023–24	AI-generated reviews across book review platforms	Undermined review authenticity signals

🛡️

Current LLM Safeguards — and Their Limits

Most frontier models refuse to generate explicit disinformation when asked directly. Limitations: easily circumvented with indirect framing ("write a fictional news story about...", "roleplay as a journalist who..."). Fine-tuned models with safety training removed ("uncensored" models) are widely available for disinformation operations. The safeguards provide friction, not barriers.

Deepfakes & Synthetic Media In-depth

Deepfakes are AI-generated synthetic media — video, audio, or images — depicting real people in fabricated situations. The technology has advanced from research curiosity in 2017 to real-time video capability in 2023–2024, dramatically lowering the barrier for harmful use.

📅

2017

First widely accessible face-swap tools emerge as the term "deepfake" appears online (DeepFaceLab follows in 2018). Requires significant computing time. Quality low but functional.

📅

2019–22

Progressive quality improvement. Audio deepfakes emerge — voice cloning with minutes of sample audio. Commercial services appear.

📅

2023

3 seconds of audio → convincing voice clone. Image deepfakes go viral. First major documented election interference attempt.

📅

2024

Real-time deepfake video — usable in live video calls. $25M stolen in Hong Kong via deepfake video conference fraud.

Deepfake Content Distribution — NCII dominates but political impact is disproportionate

Detection Methods Core

Detection of AI-generated content is an active arms race. Every improvement in detection creates an incentive to improve generation to evade it, and generation techniques tend to advance faster than detection. The honest assessment: current detection is too unreliable for high-stakes decisions about individual pieces of content.

Detection Approaches

Statistical text analysis: measure perplexity and "burstiness"; LLM text tends to have lower perplexity and less sentence-to-sentence variation than human text

AI text classifiers: models trained on human vs AI text, such as GPTZero, Originality.ai, and the OpenAI Classifier (retired)

Zero-shot detection (DetectGPT): uses the model's own log probabilities, so no training data is needed; checks whether the text sits near a local maximum of the source model's log-probability

Biological signals (video): irregular blinking patterns, pulse signals from subtle skin colour changes, eye reflection consistency

Geometric analysis (video): facial lighting inconsistencies, facial hair, earrings, glasses frames; deepfakes struggle with fine details

Temporal consistency (video): frame-to-frame inconsistencies in complex regions (hair, background edges)

Known Limitations

Short text failure: very low accuracy for texts under 150 words — social media posts, headlines, comments cannot be reliably detected

70–80% accuracy ceiling: reported state-of-the-art detectors achieve roughly 70–80% on GPT-4 text, which is not reliable enough for high-stakes decisions

False positive harm: incorrectly flagging humans as AI generators causes real harm — students accused, writers discredited

New generation methods: detectors trained on old generation fail on new architectures — requires continuous retraining

Adversarial deepfakes: generation can be optimised to fool detectors — adding noise that defeats biological signal analysis

Watermark removal: post-processing (compression, cropping, resaving) removes most watermarks
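The perplexity and burstiness signals used in statistical text analysis can be sketched as follows. The code assumes per-token natural-log probabilities have already been obtained from some language model; how they are obtained is outside this sketch. The function names are hypothetical, and defining burstiness as the spread of per-sentence perplexities is one common simplification, not the only one.

```python
import math
from statistics import pstdev


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean log-probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def burstiness(sentence_logprobs: list[list[float]]) -> float:
    """Spread (population std dev) of per-sentence perplexities across a text."""
    return pstdev([perplexity(sentence) for sentence in sentence_logprobs])
```

Low perplexity combined with low burstiness is treated as a weak hint of machine generation. Short texts give too few tokens and sentences for either number to be stable, which is consistent with the short-text failure listed above.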

Method	Target	Accuracy	False Positive Rate	Deployment Status
Perplexity analysis	Text	60–70%	High (20–30%)	Research / limited tools
Trained text classifier	Text	70–80%	10–20%	Deployed (GPTZero etc.)
DetectGPT (zero-shot)	Text	~80% on source model	~10%	Research / tool
Biological signal (video)	Video	75–85% (2022 deepfakes)	Medium	Fails on 2024 methods
Deep learning detector (video)	Video	85–95% on training distribution	5–15%	Fails on new generators
C2PA provenance	Any	Near-100% for signed content	Near-zero	Adoption still limited
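The false positive rates in the table matter more than they first appear, because most text checked in practice is human-written. The sketch below is a plain Bayes-rule calculation. The 10% false positive rate is taken from the low end of the trained-classifier row, while the 80% true positive rate and the 5% share of AI-written text are hypothetical inputs chosen for illustration.

```python
def flagged_precision(true_positive_rate: float,
                      false_positive_rate: float,
                      ai_share: float) -> float:
    """Fraction of flagged texts that are actually AI-written (Bayes' rule)."""
    true_flags = true_positive_rate * ai_share
    false_flags = false_positive_rate * (1.0 - ai_share)
    return true_flags / (true_flags + false_flags)


# Hypothetical scenario: 5% of submitted texts are AI-written.
share_correct = flagged_precision(0.80, 0.10, 0.05)
print(f"{share_correct:.0%} of flags point at AI-written text")
```

Under these assumptions fewer than one in three flags is correct, so most accusations would land on human writers. The result is driven by the base rate, not just by the detector's headline accuracy.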

Provenance & Watermarking Core

Rather than trying to detect AI content after the fact (reactive), provenance systems establish the origin and history of content at creation (proactive). Cryptographic signatures are fundamentally harder to defeat than statistical detection.

🔏

C2PA — Coalition for Content Provenance and Authenticity

Open standard for embedding cryptographically signed content credentials into media files. Supported by: Adobe, Microsoft, Google, Intel, BBC, Sony, Leica. How it works: device/tool signs content at creation with a certificate. Chain of custody survives editing — each step adds a signed manifest entry.

💧

Watermarking AI Outputs

Visible: overlay "AI-generated" label — easily removed. Invisible (SynthID): Google DeepMind's steganographic watermark embedded in pixel/audio patterns — more robust, survives some transformations. Cryptographic: unforgeable provenance — but requires tool compliance. 2023 White House commitments: major AI labs pledged to watermark AI-generated content.

⚠️

Watermarking Limitations

Processing removes watermarks: screenshot, compress, crop → most invisible watermarks removed. Optional adoption: voluntary watermarking is insufficient — requires industry-wide or regulatory mandate. Attribution gap: absence of watermark does not mean content is human-made — older content predates watermarking. Adversarial removal: targeted attacks can remove even robust watermarks.

C2PA Content Provenance Chain — cryptographic trust from creation to consumption
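The chain-of-custody idea can be illustrated with a toy manifest chain. This is not the C2PA format: real content credentials use certificate-based public-key signatures and a defined manifest structure, whereas the sketch below substitutes a single shared HMAC key from Python's standard library so that it stays self-contained. The field names and the functions `add_entry` and `verify` are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(entry: dict) -> bytes:
    # Canonical byte form of an entry, so signer and verifier hash the same thing.
    return json.dumps(entry, sort_keys=True).encode("utf-8")


def add_entry(chain: list, key: bytes, action: str, content: bytes) -> list:
    """Append a signed entry that binds the content hash to the previous signature."""
    entry = {
        "action": action,
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "prev_sig": chain[-1]["sig"] if chain else None,
    }
    entry["sig"] = hmac.new(key, _payload(entry), hashlib.sha256).hexdigest()
    return chain + [entry]


def verify(chain: list, key: bytes, content: bytes) -> bool:
    """Check every signature and link, then check the final content hash."""
    prev_sig = None
    for entry in chain:
        unsigned = {k: v for k, v in entry.items() if k != "sig"}
        expected = hmac.new(key, _payload(unsigned), hashlib.sha256).hexdigest()
        if unsigned["prev_sig"] != prev_sig:
            return False
        if not hmac.compare_digest(expected, entry["sig"]):
            return False
        prev_sig = entry["sig"]
    final_hash = hashlib.sha256(content).hexdigest()
    return bool(chain) and chain[-1]["content_sha256"] == final_hash
```

Because each entry signs the previous signature, altering any earlier step or the final file invalidates verification. The sketch also shows the scheme's limit: it proves what signed content is, and says nothing about content that was never signed.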

Platform Responses Core

Social media platforms are the primary distribution channels for AI-generated disinformation. Their content policies and enforcement capabilities largely determine whether AI disinformation scales or remains contained.

Platform	AI Content Policy	Political Ads	Enforcement
Meta (Facebook/Instagram)	Require labels for AI-generated content in political and social issue ads; "Made with AI" labels for realistic synthetic content	Disclosure required for AI-generated political ad content	Inconsistently enforced; organic content largely unaddressed
Google/YouTube	Disclose AI-generated content in election ads; YouTube labels AI-generated realistic content	AI disclosure required in election ads	Limited to paid content; organic spread not covered
TikTok	AI-generated content disclosure labels; ban on AI-generated political content during elections	Stronger restrictions on political AI content	Enforcement limited by scale of content moderation challenge
X (formerly Twitter)	Reduced content moderation staff; limited AI content policy; community notes fact-checking model	Inconsistent	Significantly reduced moderation capacity since 2022

⚠️

Platform Response Limitations

Voluntary only: platform policies are not externally enforceable. Paid content only: most policies apply to paid advertising — organic viral content is largely unaddressed. Scale: billions of posts per day cannot be individually reviewed. Cross-platform: content removed from one platform re-appears on others within hours.

🔬

Technical Counter-Measures

Hash matching: known deepfake hashes can be blocked — but slight modifications evade detection. Classifier deployment: ML-based detection at scale — accuracy limitations apply. Provenance integration: some platforms beginning to surface C2PA content credentials where available. Behavioural signals: detect coordinated inauthentic behaviour patterns (account age, posting speed).
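The weakness of hash matching is easy to demonstrate with an exact cryptographic hash. The sketch below uses SHA-256 from the standard library; the function names and the byte strings standing in for media files are hypothetical. Platforms often use perceptual hashes that tolerate small changes, which this example does not model.

```python
import hashlib


def build_blocklist(known_files: list[bytes]) -> set[str]:
    """Store the SHA-256 digest of each known harmful file."""
    return {hashlib.sha256(data).hexdigest() for data in known_files}


def is_blocked(upload: bytes, blocklist: set[str]) -> bool:
    """Exact match only: any change to the bytes produces a different digest."""
    return hashlib.sha256(upload).hexdigest() in blocklist


blocklist = build_blocklist([b"known deepfake video bytes"])
exact_copy = is_blocked(b"known deepfake video bytes", blocklist)
one_byte_changed = is_blocked(b"known deepfake video bytes.", blocklist)
```

Here `exact_copy` is `True` and `one_byte_changed` is `False`: re-encoding, cropping, or appending a single byte is enough to slip past an exact-match blocklist.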

AI & Elections In-depth

2024 was the first major election year of the LLM era — over 50 countries held significant elections. It provided the first real-world evidence base for AI's effect on democratic processes. The findings are more nuanced than either catastrophists or minimisers predicted.

📢

Documented AI Election Incidents (2023–2024)

US (2024): AI voice clone of Biden discouraging NH primary voting (robocall)
Slovakia (2023): AI audio of opposition leader discussing election manipulation, released days before vote
Multiple countries: AI-generated images of candidates in false contexts
Bangladesh, Pakistan, India: AI-generated campaign content and disinformation
Global: mass-produced AI text in social media influence campaigns

🔬

Research Findings (Contested)

Most AI-generated election disinformation in 2024 had limited direct viral spread
Experts disagree on whether AI materially changed voter behaviour
AI was more widely used for legitimate campaign purposes (ad targeting, content generation) than disinformation
The 2024 evidence does not support either extreme prediction

🎯

Legitimate AI Use in Elections

AI-assisted voter targeting and message optimisation
AI translation for multilingual outreach
AI-generated ad creative (disclosed)
AI chatbots for voter information
The line between sophisticated campaigning and manipulation is contested — and not new

The most dangerous effect of AI disinformation may not be the fake content that people believe; it may be the authentic content that people stop believing because they can no longer tell the difference. This "liar's dividend" (the benefit a wrongdoer gains from being able to dismiss genuine evidence as fake) erodes the epistemic commons: when any video, audio, or text can plausibly be dismissed as "probably AI," the shared factual foundation that democratic deliberation requires begins to fracture. A population that trusts nothing is as ungovernable as a population that believes everything.

The Liar's Dividend — AI's threat to epistemic trust over time

∑ Chapter 10.7 — Key Takeaways

AI reduces disinformation cost by 100–1000× — removing the economic constraint on scale; quantity, personalisation, and multilingual reach all improve simultaneously
Deepfakes: an estimated 96% are non-consensual intimate imagery (per a 2019 Deeptrace count of online deepfake videos), primarily targeting women; political deepfakes are small in number but disproportionate in potential impact
Detection is unreliable: 70–80% accuracy for text, ongoing arms race for video; false positive rates harm real humans; short texts cannot be reliably detected
C2PA and cryptographic provenance: most promising technical solution — establishes chain of custody at creation; adoption remains limited and voluntary
AI in 2024 elections: incidents documented, direct impact contested — "liar's dividend" may be the more durable and dangerous effect
The core threat: AI degrades the epistemic commons — a population that dismisses all content as "probably AI" is as vulnerable as one that believes everything

10.8

Chapter 10.8

Long-Term AI Safety & Existential Risk

This is the most contested chapter in this entire documentation. Reasonable, highly informed experts disagree substantially — not just on the probability of catastrophic outcomes from advanced AI, but on what "catastrophic" even means, which scenarios deserve attention, and what responses are appropriate. This chapter aims to present the debate fairly, not to resolve it.

The Expert Debate In-depth

The discourse on long-term AI risk is characterised by genuine disagreement among well-credentialled researchers — this is not a mainstream-versus-fringe divide. The disagreement operates on multiple dimensions simultaneously.

⚠️

Case for Concern (summarised)

Current trajectory toward increasingly capable AI systems + alignment is unsolved + systems may become harder to oversee as capabilities increase = reasonable basis for concern. Not certainty — a risk that deserves serious attention given the potential magnitude of consequences if the concern is correct.

🔍

Sceptical Perspectives (summarised)

Current AI systems are narrow tools, not goal-directed agents. Human-level general AI is speculative and may never arrive. Present harms (bias, privacy, labour) are concrete and currently neglected. X-risk framing may reflect Silicon Valley ideology more than rigorous, evidence-based risk assessment.

Dimension of Disagreement	Concerned Perspective	Sceptical Perspective
Empirical (likelihood)	Transformative AI may arrive within 10–30 years given current trajectory	Current systems are narrow; human-level AI is highly speculative
Technical (alignment)	Alignment is unsolved; small misalignment × high capability = large harm	Incremental improvements in safety techniques are keeping pace
Political (whose interests)	Only strong safety governance prevents catastrophic misuse	X-risk framing benefits incumbents; crowds out present-harm advocacy
Strategic (attention allocation)	Magnitude justifies diverting resources even at low probability	Speculative future concerns distract from concrete current harms

📜

Notable Expert Positions

2023 Statement on AI Risk (Center for AI Safety): "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." Signed by Geoffrey Hinton, Yoshua Bengio, Sam Altman, Demis Hassabis, and hundreds of researchers.

Andrew Ng: "Fearing a rise of killer robots is like worrying about overpopulation on Mars."
Yann LeCun: The focus on x-risk distracts from concrete, present harms and reflects a fundamental misunderstanding of how current systems work.
Timnit Gebru, Emily Bender et al.: X-risk framing benefits powerful incumbents and obscures ongoing harms to marginalised communities.

Catastrophic Risk Scenarios Core

The following scenarios are discussed in AI safety literature. Description does not equal endorsement. The probability of each is highly contested — they are presented as scenarios, not predictions.

🎯

Misaligned Objectives

A sufficiently capable system optimising the wrong objective causes catastrophic harm, not through malice but through relentless optimisation for a proxy metric. The "paperclip maximiser" thought experiment (Bostrom) illustrates how a trivially stated goal could be catastrophic if pursued by a sufficiently capable, resource-acquiring system.
Contested: requires capabilities that don't exist and may never exist.

👑

Concentration of Power

AI capabilities allow a small group — a corporation, government, or individual — to gain unprecedented economic or political control. Either a corporation monopolises key AI-dependent resources, or a nation-state uses AI for total surveillance and population control. Broader consensus on concern than misalignment — less dependent on speculative capabilities.

🧬

Bioweapons Uplift

AI systems that can design novel biological threats, lowering the barrier for state and non-state actors. Near-term and concrete — several governments and labs are actively working on safeguards. Already subject to access restrictions by frontier AI labs. The most widely agreed near-term risk among safety researchers.

💻

Cyberattack Amplification

AI-assisted offensive cyber capabilities at scale — automated vulnerability discovery, code generation for exploits, and personalised phishing at volume. More near-term and concrete than misalignment. Already being operationalised by state actors. Asymmetric: offence is easier than defence.

Scenario	Time Horizon	Concreteness	Expert Consensus	Primary Response
Bioweapons uplift	Near-term (2–5yr)	High: specific mechanisms clear	Medium: genuine concern, not certainty	Technical safeguards, policy, access controls
Cyber amplification	Near-term	High: already occurring	Medium-high	Cyber defences, technical safeguards, policy
Power concentration	Medium-term	Medium: structural trends visible	Medium	Governance, antitrust, open source
Misaligned AI	Long-term (10–30yr?)	Low: requires unverified capabilities	Highly contested (probability estimates of 5%–50% in surveys)	Alignment research, interpretability
Recursive self-improvement	Speculative	Very low: theoretical	Highly contested	Theoretical alignment research

Arguments for Concern — Strongest Versions Core

The following are the strongest, most charitably stated versions of the case for taking long-term AI risk seriously. Presenting them carefully does not mean endorsing them.

🔓

1 — Alignment Is Currently Unsolved

We do not know how to formally ensure systems pursue intended goals at high capability levels. RLHF and Constitutional AI improve behaviour but do not provide mathematical guarantees. Small misalignments at low capability may become large absolute problems at high capability — the error magnitude scales with power, not just with misspecification magnitude.

Counterargument: incremental safety work may be sufficient; systems may not reach capability levels where this matters.

📈

2 — Faster-Than-Expected Progress

The last decade repeatedly saw capabilities predicted "10–20 years away" achieved sooner. If the trajectory of rapid progress continues, when does human oversight become impossible? Argument from trajectory: safety research may not keep pace if capabilities advance faster than governance.

Counterargument: past trajectories don't guarantee future ones; scaling laws may hit walls.

⚖️

3 — Asymmetric Risk Argument

Even at low probability, consequences at civilisational scale produce enormous expected harm. Standard risk management: resource allocation should reflect probability × magnitude. If magnitude is extreme, even small probability justifies serious investment in mitigation.

Counterargument: Pascal's mugging — probability estimates are themselves highly uncertain; the argument proves too much.
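Both the argument and its counterargument can be seen in the same arithmetic. The sketch below computes expected harm as probability × magnitude and shows how uncertainty in the probability carries straight through to the result. Every number is a hypothetical placeholder in arbitrary units, not an estimate from any survey, and the function names are illustrative.

```python
def expected_harm(probability: float, magnitude: float) -> float:
    """Expected harm = probability of the outcome x size of the harm."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * magnitude


def expected_harm_range(p_low: float, p_high: float, magnitude: float) -> tuple[float, float]:
    """Expected harm at the low and high ends of an uncertain probability."""
    return expected_harm(p_low, magnitude), expected_harm(p_high, magnitude)


# Hypothetical inputs: harm of 1,000,000 units, probability somewhere in 0.01%-10%.
low, high = expected_harm_range(0.0001, 0.1, 1_000_000)
spread = high / low  # how many times larger the high estimate is
```

With these inputs the low and high estimates differ by a factor of about 1,000. The asymmetric-risk argument stresses that even the low end is non-trivial when magnitude is extreme; the Pascal's mugging objection stresses that a result this sensitive to an unmeasurable input gives little guidance on how much to invest.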

⚛️

4 — Precedent from Hard Take-Offs

Technologies have had catastrophic unintended consequences before: nuclear weapons were developed faster than the governance needed to control them; leaded gasoline spread for decades before its health harms were acknowledged. AI may be more widely accessible and harder to contain than nuclear technology, because physical scarcity does not limit its distribution.

Counterargument: nuclear analogy may not transfer; governance eventually worked for nuclear.

Sceptical Arguments — Strongest Versions Core

The following are the strongest, most charitably stated versions of the sceptical position. These deserve equal care and consideration.

🔬

1 — Systems Are Fundamentally Different from Imagined Scenarios

Current LLMs are text predictors — they do not have goals, values, intentions, or agency in any meaningful sense. The "goal-pursuing AI" of risk scenarios requires capabilities we don't have and cannot verify are achievable. Reasoning from science fiction tropes about "wanting" AI misrepresents what these systems actually are computationally.

Counterargument: this may be true of current systems but not future ones; the question is trajectory.

👁️

2 — Present Harms Are Concrete and Neglected

Algorithmic bias in hiring, lending, and criminal justice affects real people right now. AI-enabled surveillance, deepfakes, and disinformation are already causing measurable harm. Redirecting researcher attention and funding toward speculative future risks may allow preventable present harms to worsen while we wait for speculative scenarios to materialise.

Counterargument: both can be worked on simultaneously; they are not necessarily in competition.

💰

3 — Political Economy Critique

X-risk framing systematically benefits frontier AI labs: it positions them as responsible gatekeepers, justifies moving slowly in the name of safety, concentrates development in a few "responsible" actors, and creates barriers to entry for competitors. The framing may reflect Silicon Valley ideology and incumbents' interests rather than rigorous, independent risk assessment.

Counterargument: self-interest doesn't make the concern wrong; ad hominem cuts both ways.

🌍

4 — Alternative Causes of Catastrophe Are More Concrete

Climate change, nuclear weapons, and pandemic risk are concrete, well-evidenced catastrophic risks with clearer intervention pathways. AI may exacerbate these risks (e.g., energy use, AI-assisted weapons) rather than constituting a separate existential category. The counterfactual cost of AI safety investment is resources not directed at these clearer threats.

Counterargument: magnitude of AI risk may be large enough to warrant separate attention; portfolio approach is possible.

What Safety Researchers Actually Do Core

Regardless of where one stands on the long-term risk debate, the concrete research agenda of AI safety is largely agreed upon and produces useful results.

🔬

Mechanistic Interpretability

Understanding what computations happen inside neural networks — not just which inputs matter, but what circuits implement which behaviours. Anthropic (2022+): identified "features" in language models corresponding to interpretable concepts. Goal: detect deceptive circuits, power-seeking representations, misaligned internal goals.

🧪

Evaluation & Red Teaming

Systematically probing models for dangerous capabilities before deployment: biological uplift testing, cyberattack assistance, deception. METR (formerly ARC Evals), NIST AISI, and the major frontier labs conduct pre-deployment evaluations against defined capability thresholds. This provides empirical grounding for deployment decisions.
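Gating deployment on defined capability thresholds might look like the following sketch. The capability labels, scores, and thresholds are hypothetical and exist only for illustration; real evaluation suites are far richer than one score per capability.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    capability: str   # hypothetical label, e.g. "cyberattack_assistance"
    score: float      # measured score in [0, 1]; higher means more capable
    threshold: float  # level at which deployment needs further review

def exceeded_thresholds(results: list[EvalResult]) -> list[str]:
    """Return the capabilities whose score meets or exceeds its threshold."""
    return [r.capability for r in results if r.score >= r.threshold]

def deployment_decision(results: list[EvalResult]) -> str:
    """Hold deployment if any dangerous-capability threshold is reached."""
    flagged = exceeded_thresholds(results)
    if flagged:
        return "hold: review " + ", ".join(flagged)
    return "proceed"

# Hypothetical scores and thresholds, for illustration only.
results = [
    EvalResult("biological_uplift", score=0.12, threshold=0.30),
    EvalResult("cyberattack_assistance", score=0.41, threshold=0.35),
    EvalResult("deception", score=0.20, threshold=0.50),
]
print(deployment_decision(results))
```

The design choice worth noting is that thresholds are fixed before the scores are seen, so the decision rule cannot be adjusted after the fact to let a model through.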

📡

Scalable Oversight

Developing techniques for humans to maintain meaningful oversight of systems that may exceed human capabilities in specific domains. Debate, iterated amplification, weak-to-strong generalisation (Ch 9.4). Produces useful near-term tools regardless of long-term risk views.

📐

Theoretical Alignment Research

Formal frameworks for specifying human values. Agent foundations: decision theory and logical uncertainty for AI systems. Corrigibility research: ensuring systems remain correctable and don't resist shutdown. MIRI, Anthropic, DeepMind. More speculative but foundational if transformative AI arrives.

🏛️

Governance Research

Compute governance: tracking and regulating large training runs. International coordination mechanisms: how to build trust and verification between AI powers. Racing dynamics: understanding incentive structures that lead labs to sacrifice safety for speed. Policy design for AI regulation.

🛠️

Robustness & Reliability

Robustness against distributional shift, adversarial examples, and out-of-distribution inputs. Uncertainty quantification: models that know when they don't know. Formal verification: provable guarantees on model behaviour within specified bounds. Near-term, concrete, deployable.
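One simple way to approximate "models that know when they don't know" is to abstain when the predicted class distribution is too flat. The sketch below uses predictive entropy with an arbitrary cutoff; the function names and the 0.5 default are assumptions for illustration, and the signal is only as trustworthy as the model's calibration.

```python
import math

def predictive_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_or_abstain(probs: list[float], max_entropy_fraction: float = 0.5):
    """Return the predicted class index, or None to defer to a human.

    Abstains when entropy exceeds a fraction of the maximum possible
    entropy (log of the number of classes). The 0.5 default is an
    arbitrary illustrative cutoff, not a recommended value.
    """
    limit = max_entropy_fraction * math.log(len(probs))
    if predictive_entropy(probs) > limit:
        return None
    return max(range(len(probs)), key=lambda i: probs[i])

confident = predict_or_abstain([0.96, 0.02, 0.02])  # peaked distribution
uncertain = predict_or_abstain([0.40, 0.35, 0.25])  # near-uniform distribution
print(confident, uncertain)
```

Here the peaked distribution yields a class index, while the near-uniform one returns `None`, which a surrounding system could route to human review.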

Safety Institutions Core

Institution	Type	Primary Focus	Scale
Anthropic	For-profit (safety-focused)	Interpretability, alignment, Constitutional AI, evaluations	~2,000 employees
OpenAI	For-profit (capped)	Alignment, safety evals, superalignment team (announced 2023, disbanded 2024)	1,000+ employees
Google DeepMind Safety	Corporate research	Specifications, robustness, scalable oversight	~100+ researchers
METR (formerly ARC Evals)	Non-profit	Model evaluation and threat research: pre-deployment autonomous capability evals against dangerous capability thresholds	~50 people
Redwood Research	Non-profit	Adversarial robustness, interpretability, alignment	~30 people
MIRI	Non-profit	Theoretical alignment — decision theory, logical uncertainty	~25 people
Center for AI Safety (CAIS)	Non-profit	Research + field building + policy + the 2023 extinction risk statement	~20 people
NIST AI Safety Institute	Government (US)	AI evaluation standards, risk frameworks, third-party testing	Growing; ~50+ staff (2024)
UK AI Safety Institute	Government (UK)	Frontier model evaluations, international coordination	~100 staff (2024)

Responsible Development Core

Regardless of one's position on long-term risk, a set of responsible development practices is broadly agreed upon across the debate. These are not contingent on believing x-risk scenarios are likely — they are good practices for current systems too.

🧪

Evaluate Before Deploying

Do not deploy systems before adequate evaluation for the specific use case and population. Internal red teaming, external independent evaluation, staged rollout. The bar should scale with the stakes of the application.

👁️

Maintain Human Oversight

Preserve meaningful human ability to monitor, correct, and shut down AI systems at current capability levels. Design for corrigibility — systems that support, not resist, human correction. Do not automate away human accountability.

📤

Share Safety Information

Publish findings about dangerous capabilities, safety incidents, and failure modes. The research community cannot solve problems it doesn't know about. Pre-competitive safety research sharing is a public good even between competing labs.

🐢

Resist Racing Dynamics

Avoid competitive pressures that lead to cutting safety evaluation for speed. Racing dynamics are a collective action problem — individual labs may lose competitive advantage by being safe, but all lose if racing degrades safety industry-wide. Governance can help internalise these costs.
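The collective action problem described above has the shape of a two-player game in which racing is individually rational but collectively worse. The payoff numbers below are hypothetical and on an arbitrary scale; the `with_penalty` helper is an illustrative stand-in for governance that makes racing costly.

```python
# Payoffs are (lab_a, lab_b) on an arbitrary scale; all values hypothetical.
PAYOFFS = {
    ("careful", "careful"): (3, 3),  # both keep full safety evaluation
    ("careful", "race"): (0, 4),     # the careful lab loses market position
    ("race", "careful"): (4, 0),
    ("race", "race"): (1, 1),        # safety degrades industry-wide
}
STRATEGIES = ("careful", "race")

def best_response(other_choice: str) -> str:
    """Lab A's payoff-maximising strategy given lab B's choice."""
    return max(STRATEGIES, key=lambda s: PAYOFFS[(s, other_choice)][0])

# Racing is the best response whatever the other lab does...
dominant = {other: best_response(other) for other in STRATEGIES}

# ...yet mutual racing leaves both labs worse off than mutual care.
both_race = PAYOFFS[("race", "race")]
both_careful = PAYOFFS[("careful", "careful")]

def with_penalty(penalty: int) -> str:
    """Lab A's best response to a careful rival when racing costs `penalty`."""
    return max(
        STRATEGIES,
        key=lambda s: PAYOFFS[(s, "careful")][0] - (penalty if s == "race" else 0),
    )

print(dominant, both_race, both_careful, with_penalty(2))
```

With these placeholder payoffs, a penalty of 2 on racing is enough to make care the best response to a careful rival, which is one way to read "internalising the costs".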

🔍

Support Independent Evaluation

External evaluation by parties without commercial stake in the outcome provides credibility that self-assessment cannot. Support and fund third-party evaluation capacity. Welcome access by government AI Safety Institutes to conduct evaluations.

🤝

Engage Critics Seriously

Take concrete present-harm critiques as seriously as long-term risk concerns. Engage with fairness, privacy, and labour researchers — not just x-risk researchers. Diverse perspectives improve the quality of safety thinking and build broader legitimacy for safety culture.

📋

2023 White House Voluntary Commitments — All Major US Labs

Anthropic, OpenAI, Google, Meta, Microsoft, Amazon, and Inflection signed voluntary commitments including:
✅ Safety testing before deployment of new frontier models
✅ Information sharing about AI safety risks with governments and the research community
✅ Watermarking AI-generated content
✅ Reporting dangerous capabilities and misuse incidents to governments
✅ Investing in cybersecurity and insider threat safeguards
Voluntary — not legally binding, no external enforcement mechanism.

∑ Chapter 10.8 — Key Takeaways

Long-term AI risk is genuinely contested among serious experts — not a mainstream vs fringe debate; disagreement spans empirical, technical, political, and strategic dimensions
Near-term risks (bioweapons uplift, cyberattack) have broader consensus than speculative long-horizon scenarios (misaligned AI, recursive self-improvement)
The case for concern: alignment is unsolved + capability trajectory may outpace safety research
The sceptical case: current systems lack agency + present harms are concrete + x-risk framing may serve incumbent interests
Safety research (interpretability, evaluations, scalable oversight) is valuable regardless of position on long-term risk — it addresses near-term concerns too
Responsible development: evaluate before deploying, maintain oversight, share safety information, resist racing dynamics — broadly agreed across the debate

🎓 Domain 10 Complete: AI Ethics, Safety & Responsible AI

Ch 10.1: AI bias = systematic errors correlated with protected characteristics. Fairness is a value judgement — multiple definitions exist and the impossibility theorem proves they cannot all be satisfied simultaneously.
Ch 10.2: Black-box AI undermines trust and accountability. LIME and SHAP provide post-hoc explanations of complex models; model cards document subgroup performance — the most important transparency tool.
Ch 10.3: LLMs can memorise training data verbatim. Differential privacy provides formal guarantees; federated learning keeps raw data local but gives no formal guarantee on its own. The right to be forgotten creates ML unlearning challenges that remain technically unsolved.
Ch 10.4: Alignment = ensuring systems pursue intended goals. Goodhart's Law: optimising metrics corrupts them. RLHF helps but doesn't solve reward hacking. Adversarial robustness remains an ongoing arms race.
Ch 10.5: 30–60% of tasks are automatable — with uneven impact by occupation. AI development is concentrated in 5–6 firms. Energy and water costs are significant, growing, and largely undisclosed.
Ch 10.6: EU AI Act: world's first comprehensive AI law — risk pyramid from banned to minimal. US: sector-specific approach, no federal law as of 2025. International governance: fragmented, voluntary, geopolitically contested.
Ch 10.7: AI reduces disinformation cost 100–1000×. Deepfakes: 96% are NCII, primarily targeting women. C2PA provenance and watermarking are the most promising technical responses; the "liar's dividend" is the deepest long-term threat.
Ch 10.8: Long-term AI risk is genuinely contested among serious experts. Near-term concrete risks coexist with speculative long-horizon concerns. Responsible development practices are broadly agreed regardless of x-risk position.

Ethics is not the brake that slows AI down; it is the steering wheel.

The history of technology is full of innovations that were transformatively beneficial when well-governed and catastrophically harmful when not. Nuclear energy. The internet. Social media. What Domain 10 makes clear is that AI ethics is not a checklist to complete before deployment; it is an ongoing practice of asking who benefits, who is harmed, who decides, and whether the answers to those questions are acceptable.

You have now covered the full AI Foundation curriculum. The most important thing you can take from Domain 10 is not any specific framework or regulation; it is the habit of asking these questions about every system you build and deploy.
