Ethics, Safety & Fairness — Responsible AI
Bias measurement and mitigation, AI safety and alignment, privacy, governance frameworks, and the ethical dimensions of deploying AI at scale.
AI systems do not create bias from nothing — they learn it, amplify it, and apply it at scale to millions of decisions. A hiring algorithm trained on historical data that systematically excluded women will learn to exclude women. A loan model trained on zip codes as proxy for race will perpetuate redlining. The question is not whether AI systems can be biased — they demonstrably are — but how to measure, reduce, and decide what "fair" actually means in each specific context.
AI bias refers to systematic errors in model outputs that correlate with protected characteristics — race, gender, age, disability, national origin, religion, and similar attributes. This is distinct from random error, which hurts everyone equally: bias hurts specific groups more. It is also distinct from statistical bias, a technical term for estimator deviation from the true value. AI/fairness bias refers to discriminatory patterns in real-world outcomes.
AI bias matters in ways that human bias sometimes does not because of four structural properties. Scale: one biased algorithm simultaneously affects millions of hiring, lending, healthcare, and criminal-justice decisions. Opacity: an algorithmic decision is harder to challenge and inspect than a human one. Automation bias: people tend to trust algorithmic outputs more than they should, reducing human correction of bad decisions. Feedback loops: biased outputs create biased training data for the next model, compounding over time.
ProPublica investigation found the tool was ~2× more likely to falsely flag Black defendants as high-risk compared to white defendants with equivalent criminal histories. Used in sentencing decisions across the US.
Amazon's ML recruiting tool, trained on historical hiring decisions, systematically penalised CVs containing the word "women's" (e.g. women's chess club). Scrapped after internal audit.
Widely used algorithm systematically underestimated illness severity for Black patients because it used healthcare costs as a proxy for health needs — and Black patients historically received less care per illness.
NIST audit of 189 facial recognition algorithms found false positive rates 10–100× higher for darker-skinned faces and women. Systems trained predominantly on lighter-skinned male faces.
Bias does not enter the ML pipeline at one point — it can enter at every stage, and different stages introduce qualitatively different types of distortion. Detecting and correcting bias requires auditing the full pipeline, not just the model.
Data reflects historical inequalities we don't want to perpetuate. Example: hiring data showing fewer women in engineering → model learns to favour men. The data is accurate; the pattern is harmful.
Certain groups under-represented in training data → model performs worse on them. Example: facial recognition trained mostly on lighter-skinned faces → higher error rates on darker faces.
How data is collected systematically distorts it for some groups. Example: "prior arrests" as proxy for criminality — arrest rates reflect policing intensity, not crime rates, over-policing Black neighbourhoods.
Single model trained on pooled data from groups with different underlying patterns. Example: medical model where normal glucose levels differ by ethnicity — one-size model is wrong for multiple groups.
Benchmark dataset doesn't represent the deployment population. Example: face datasets over-representing certain countries → misleadingly high aggregate accuracy metrics that hide per-group failures.
Model used in a context it was not designed for. Example: credit scoring model built for one country applied in another with different socioeconomic structures. Context shift invalidates assumptions.
Two distinct legal and conceptual categories of discrimination matter for AI systems. Disparate treatment (direct) occurs when the model explicitly uses a protected attribute. Disparate impact (indirect) occurs when the model produces outcomes that disproportionately harm a protected group even without using the protected attribute directly. Both cause real harm; the second is harder to detect.
Definition: Model explicitly uses a protected attribute as an input feature.
Example: "Don't approve loans for applicants of race X" — protected attribute in decision directly.
Detection: Inspect model inputs — is the protected attribute present?
Mitigation: Fairness-through-unawareness — remove protected attributes.
Problem: Correlated proxies (zip code ≈ race, name ≈ gender) mean removal often fails.
Legal: Illegal in most regulated domains (credit, hiring, housing) in US and EU.
Definition: Protected attribute not in model, but outcomes disproportionately harm a protected group.
Example: Credit model uses zip code → zip codes correlate with race → disparate racial impact without using race.
Detection: Requires outcome-level monitoring — compare approval/error rates across groups.
US standard: "Four-fifths rule" — selection rate for disadvantaged group must be ≥80% of advantaged group's rate.
Problem: Removing features doesn't help if proxies remain in the data.
Legal: Also illegal under Title VII (employment) and ECOA (credit) in the US.
"Fairness" is not one thing — mathematicians have formalised at least 21 distinct fairness criteria, many mutually incompatible. The five most important in practice each embed a different value judgement about what equality means and whose errors we are willing to tolerate.
Same positive prediction rate across groups: P(ŷ=1|A=0) = P(ŷ=1|A=1). Same loan approval % regardless of group. Problem: if qualification rates genuinely differ, demographic parity may require approving unqualified applicants.
True positive rate equal across groups. Among those who WOULD repay a loan, equal fraction approved from each group. Focuses on qualified candidates being treated equally — does not constrain false positive rates.
Both TPR and FPR equal across groups. Stricter than equal opportunity — not only should qualified people be equally approved, unqualified people should also be equally rejected. Often requires accepting lower overall accuracy.
Among those predicted probability p of an outcome, p fraction actually experience it — for every p, across all groups. COMPAS satisfied this definition. Ensures predictions are equally meaningful for all groups.
Similar individuals receive similar predictions — regardless of group membership. Challenge: requires defining a domain-specific similarity metric. Hard to implement in practice but avoids the coarseness of group-level criteria.
Chouldechova (2017) and Kleinberg et al. (2016) proved independently that when base rates differ between groups, it is mathematically impossible to simultaneously satisfy: (a) calibration, (b) equal false positive rates, and (c) equal false negative rates. Achieving any two forces a violation of the third. This is not a limitation waiting for a better algorithm — it is a proven theorem.
The COMPAS controversy illustrates this directly. ProPublica (2016) found COMPAS violated equal FPR: Black defendants were falsely flagged as high-risk at ~2× the rate of white defendants. Northpointe replied that their tool satisfied calibration: among those predicted as 70% likely to re-offend, 70% actually did, consistently across races. Both were correct — they measured different criteria, and the impossibility theorem guarantees both cannot hold simultaneously when base rates differ.
The impossibility theorem does not mean fairness is impossible. It means fairness is a political and ethical choice, not a mathematical one. When someone says "our AI is fair" — ask: fair by whose definition? Calibration? Equal opportunity? Demographic parity? They cannot all be satisfied simultaneously when group base rates differ. The choice between them encodes a value judgement about whose errors we are willing to tolerate.
Algorithmic auditing systematically tests a model's performance across protected groups. Three audit types exist: internal audit (company tests its own model), external/independent audit (third party with model access — increasingly mandated by regulation such as the EU AI Act), and black-box audit (only API access — test by sending inputs and observing outputs). A minimum fairness audit reports accuracy, FPR, FNR, and calibration per demographic subgroup.
from fairlearn.metrics import MetricFrame from sklearn.metrics import accuracy_score, false_positive_rate, false_negative_rate # y_true: ground truth labels # y_pred: model predictions # sensitive_features: protected group column (e.g., gender = ['M','F',...]) metrics = { "accuracy": accuracy_score, "false_positive_rate": false_positive_rate, "false_negative_rate": false_negative_rate } mf = MetricFrame( metrics=metrics, y_true=y_true, y_pred=y_pred, sensitive_features=sensitive_features ) print("Overall metrics:") print(mf.overall) print("\nMetrics by group:") print(mf.by_group) print("\nDisparities (max gap across groups):") print(mf.difference()) # 0.0 = perfectly equal | higher = more disparate # Visualise all metrics per group as bar chart mf.by_group.plot.bar( subplots=True, layout=[1,3], figsize=(12,4), title=["Accuracy by Group", "FPR by Group", "FNR by Group"] )
Bias mitigation can happen at three stages of the ML pipeline. Earlier intervention is more fundamental but requires more access to data and training. Post-processing interventions are easiest to apply to deployed models but address symptoms rather than root causes.
Reweighting: higher sample weights for under-represented groups. Resampling: oversample minority group. Data augmentation: synthetic data for gaps. Disparate impact remover: transform features to reduce group correlation while preserving rank ordering.
Adversarial debiasing: predictor + adversary that tries to infer group from predictions — predictor learns to resist. Fairness constraints: add group-parity terms to the loss function. Fairness regularisation: penalty term for disparity.
Threshold adjustment: different decision thresholds per group to equalise error rates. Reject option: abstain when model is uncertain — reduces disparate errors. Calibration: recalibrate probabilities per group.
| Strategy | Stage | Complexity | Performance Cost | When to Use |
|---|---|---|---|---|
| Reweighting | Pre-processing | Easy | Low | Unbalanced group representation in training data |
| Adversarial debiasing | In-processing | Complex | Medium | Strong group correlations in features |
| Fairness constraints | In-processing | Medium | Medium | Specific fairness criterion required by regulation |
| Threshold adjustment | Post-processing | Easy | Low–Medium | Post-deployment, known group membership at decision time |
| Reject option | Post-processing | Easy | Reduces coverage | When abstaining from prediction is acceptable |
∑ Chapter 10.1 — Key Takeaways
- AI bias: systematic errors correlated with protected characteristics — amplified at scale, opacity makes it harder to challenge than human bias
- Bias sources span the full pipeline: historical, representation, measurement, aggregation, evaluation, deployment — every stage can introduce it
- Disparate treatment (direct use of protected attribute) vs disparate impact (indirect via correlated proxies) — both are legally and ethically harmful
- Five fairness definitions: demographic parity, equal opportunity, equalised odds, calibration, individual fairness — each embeds a different value judgement
- Impossibility theorem: when base rates differ, cannot simultaneously satisfy calibration + equal FPR + equal FNR — COMPAS proves this in practice
- Fairness criterion choice is a value judgement, not a technical decision — must be made explicitly by stakeholders, not silently by engineers
A model that cannot explain its decisions cannot be trusted, debugged, audited for fairness, or deployed legally in regulated domains. Explainability is not a luxury — it is a precondition for responsible AI. The challenge is that the most accurate models are also the hardest to understand, making post-hoc explanation methods one of the most active areas of AI research.
A black-box model produces an output without explanation: "loan denied" — no reason given. This is problematic for every stakeholder in the decision chain.
Humans cannot verify whether the model's reasoning is sound or based on spurious correlations. Unexplained decisions cannot be trusted.
When a model errs, who is responsible? Without understanding what drove the decision, accountability cannot be assigned.
You cannot improve what you cannot understand. Explainability is essential for identifying and fixing model failures.
Bias cannot be detected without understanding what drove the decision. Did the model use a proxy for race? Impossible to know without explanation.
GDPR Article 22 requires explanations for automated decisions with legal effects. EU AI Act mandates explainability for high-risk AI systems.
In medical and safety-critical domains, unexplained decisions are dangerous. Clinicians must understand model reasoning to validate it.
Different stakeholders need different types of explanation:
| Stakeholder | Explanation Need | Format |
|---|---|---|
| Data Scientists | Model debugging, feature importance for improvement | SHAP plots, partial dependence plots |
| Domain Experts | "Does this reasoning make clinical/business sense?" | Feature contributions with domain labels |
| Affected Individuals | "Why was I denied?" — right to explanation | Plain-language reason codes |
| Regulators | "Is this model compliant?" — audit and oversight | Model cards, disaggregated metrics |
| Executives | "Can we trust this for deployment?" | Summary dashboards, risk reports |
Medical AI says "do not treat". No explanation. Doctor cannot verify reasoning. Patient has no recourse. Model may have learned spurious correlations from EHR system bugs.
Medical AI says "high risk — driven by: elevated troponin (+42%), age>65 (+28%), history of hypertension (+19%). Doctor reviews, validates clinical reasoning, makes informed decision.
Interpretable model: the model itself is simple enough to be directly understood — humans can trace the full decision logic. Decision trees, linear regression, and rule-based systems are intrinsically interpretable.
Explainable model: the model may be complex (neural network, gradient boosting) but a separate post-hoc explanation method is applied to generate an explanation. The explanation is an approximation of the model's behaviour, not the model itself.
The interpretability–accuracy tradeoff is real: simpler models are easier to interpret but often less accurate. Complex models are more accurate but harder to interpret. Post-hoc XAI methods (LIME, SHAP) attempt to bridge this gap — allowing deployment of accurate complex models with approximate explanations.
✅ Decision trees — full trace of every split
✅ Linear / logistic regression — coefficients = feature weights
✅ Rule-based systems — explicit if-then logic
✅ Generalised additive models (GAMs)
✅ Humans can read and verify the model directly
⚠️ Accuracy ceiling — complex patterns cannot be captured
⚠️ May underfit in high-dimensional problems
⚙️ Neural networks — millions of parameters
⚙️ Gradient boosting (XGBoost, LightGBM)
⚙️ Ensemble models — aggregated predictions
⚙️ Any black-box model
✅ Full accuracy of complex models retained
✅ Explanation generated after the fact via LIME, SHAP, saliency maps
⚠️ Explanation is an approximation — may not reflect true model reasoning
Ribeiro et al. (2016) — "Why Should I Trust You? Explaining the Predictions of Any Classifier". LIME's core idea: locally approximate a complex model with a simple interpretable model. For a specific prediction, perturb the input slightly, observe how the prediction changes, then fit a simple linear model to the perturbed samples. The linear model's coefficients become the local feature importances — the explanation.
Local means LIME explains this specific prediction, not the global model. Model-agnostic means it works with any model — only needs input-output access (black-box).
Take the instance to explain (e.g., loan application). Create perturbed versions by randomly changing feature values.
Get model predictions for all perturbed versions. Weight each sample by its proximity to the original instance.
Fit a simple linear model on the weighted perturbed samples. Coefficients = local feature importances = the explanation.
Lundberg & Lee (2017) — "A Unified Approach to Interpreting Model Predictions". SHAP is grounded in cooperative game theory's Shapley values: each feature receives a value equal to its average marginal contribution across all possible feature subsets. This gives SHAP provably fair attribution properties: efficiency, symmetry, dummy, and linearity.
The key advantage over LIME: SHAP values sum exactly to prediction − baseline (average prediction),
providing a complete, additive decomposition of every individual prediction. SHAP values are
consistent — if a feature's true contribution increases, its SHAP value never decreases.
import shap import numpy as np from sklearn.ensemble import GradientBoostingClassifier # Assuming X_train, X_test, y_train are prepared model = GradientBoostingClassifier().fit(X_train, y_train) # TreeExplainer for tree-based models — fast and exact explainer = shap.TreeExplainer(model) shap_values = explainer.shap_values(X_test) # shape: (n_samples, n_features) # Waterfall plot for a single prediction sample_idx = 0 shap.waterfall_plot( shap.Explanation( values=shap_values[sample_idx], base_values=explainer.expected_value, data=X_test.iloc[sample_idx], feature_names=X_test.columns.tolist() ) ) # Global feature importance — mean absolute SHAP value shap.summary_plot(shap_values, X_test, plot_type="bar") # Beeswarm plot — distribution of SHAP values across all samples shap.summary_plot(shap_values, X_test)
Transformer models produce attention weights at each layer and head — a matrix indicating how much each token attends to every other token when producing its output representation. Attention visualisation renders these weights as heatmaps, showing which parts of the input the model "focused on" for a given output.
Attention maps are intuitive and free (no additional computation needed) but carry a critical caveat: attention ≠ explanation. Jain & Wallace (2019) showed that attention weights are not reliably correlated with gradient-based feature importances — high attention does not guarantee that a token drove the prediction. They are useful for model debugging and forming hypotheses, not for causal attribution.
Model debugging: identify unexpected focus patterns. Hypothesis generation: "the model seems to focus on negation words". Qualitative sanity checks. Identifying off-target generalisations.
Attention ≠ importance (Jain & Wallace, 2019). Different attention heads capture different linguistic properties — aggregate visualisation is misleading. Cannot be used as a causal explanation for legal or accountability purposes.
Model Cards (Mitchell et al., Google, 2019) are standardised documentation for ML models — "nutrition labels" for AI. They report intended use, performance across subgroups, limitations, and ethical considerations, enabling informed deployment decisions.
Datasheets for Datasets (Gebru et al., 2018) apply the same principle to training data: motivation, composition, collection process, pre-processing, intended use, and distribution information — essential for understanding what biases may have been baked in.
Model details: architecture, training data, version
Intended use: primary use cases, out-of-scope uses
Metrics: which performance metrics are reported
Evaluation data: test set, preprocessing
Training data: brief description
Quantitative analyses: disaggregated evaluation by subgroup ← most important
Ethical considerations: potential harms, mitigations
Caveats and recommendations
Motivation: why was this dataset created?
Composition: what does it contain? What's excluded?
Collection process: how was data gathered?
Pre-processing: cleaning, filtering, labelling
Uses: intended tasks, tasks it should NOT be used for
Distribution: how is it released? Under what licence?
Maintenance: who is responsible for updates?
A model card is not a marketing document — it is a technical accountability document. The most important section is always quantitative analyses: performance metrics broken down by demographic subgroup. A model card that reports only aggregate accuracy is hiding the information needed to assess fairness.
Mitchell et al., 2019 — Model Cards for Model Reporting
Multiple legal frameworks now mandate or imply a right to explanation for automated decisions. The EU leads globally; the US relies on sector-specific regulations.
| Regulation | Jurisdiction | Requirement | Scope |
|---|---|---|---|
| GDPR Article 22 | EU (2018) | Right not to be subject to purely automated decisions with legal/significant effects. Right to request explanation and human review. | Any automated decision affecting EU residents |
| EU AI Act | EU (2024) | High-risk AI systems must be transparent, explainable, and auditable. Mandatory conformity assessments. | Hiring, credit, medical, law enforcement, education, critical infrastructure |
| ECOA / Fair Credit Reporting Act | US (federal) | Adverse action notices required in credit decisions — must state specific reasons for denial. | Consumer credit decisions |
| EEOC Guidelines | US (federal) | Guidelines apply to algorithmic hiring tools — disparate impact analysis required. No explicit explanation mandate. | Employment decisions |
"The right to explanation" under GDPR Article 22 is not perfectly defined — courts and regulators are still interpreting its scope. Does it require revealing model internals? A narrative reason? Feature contributions?
Post-hoc explanations (LIME, SHAP) may not reflect the actual model reasoning — they are approximations. An explanation that satisfies a legal requirement may not capture what truly drove the decision.
Explanations simple enough for non-experts (affected individuals) may be misleading. Explanations accurate enough to be technically faithful may be incomprehensible to those who need them most.
∑ Chapter 10.2 — Key Takeaways
- Black-box AI: no explanation → no trust, no accountability, no debugging — and no legal compliance in regulated domains
- Interpretable: model is directly understandable (decision tree). Explainable: post-hoc method explains complex model (LIME, SHAP) after training
- LIME: locally approximate any model with a simple linear model — explains THIS prediction, not the global model — model-agnostic, intuitive
- SHAP: Shapley values — theoretically grounded attribution, values sum to prediction minus baseline, consistent and efficiency-preserving
- Attention maps: useful for debugging but attention ≠ importance — not valid for causal or legal attribution (Jain & Wallace, 2019)
- Model cards: standardised performance-by-subgroup documentation — aggregate accuracy alone is insufficient for fairness assessment
- GDPR Article 22: legal right to explanation for automated decisions — EU leads globally; US relies on sector-specific rules
AI creates privacy threats that go far beyond traditional data breaches. A model trained on aggregated data can reveal individual records. A language model can reproduce verbatim personal information from its training corpus. An "anonymous" dataset can be re-identified with pattern-matching at scale. Privacy-preserving AI is not just a compliance checkbox — it is a fundamental engineering and ethical requirement.
AI creates four categories of novel privacy threat that do not require a traditional data breach — the attack surface is the model itself, its outputs, and its training pipeline.
Model trained on aggregate data reveals information about individuals. Membership inference: "was this person's data used to train this model?" — achieves >70% accuracy on many models. Attribute inference: predict private attributes from public inputs. Example: location data → infer religious observance, health conditions, political beliefs.
LLMs memorise training data verbatim and reproduce it when prompted. Carlini et al. (2021) extracted 600+ memorised sequences from GPT-2 using targeted prompts — including names, phone numbers, email addresses, physical addresses, and code snippets. GPT-3/4 exhibit similar vulnerabilities.
Supposedly anonymous datasets re-identified using AI pattern matching. Netflix Prize: "anonymous" ratings linked to IMDB profiles — 30+ users identified. AOL search logs: 30 individuals re-identified from anonymised search queries. Genome databases + statistical analysis → individual family members identified.
AI generates realistic content attributed to real people without any data breach. Deepfake faces, voice clones, fabricated quotes. Synthetic "data" can contain accurate personal details about real individuals. Creates actionable privacy harms without exposing any raw training record.
Language models memorise training data in two modes. Verbatim memorisation occurs when a model can reproduce exact text from training data when given a matching prompt. Generalisation — learning patterns without memorising specifics — is the desirable mode, but the two coexist in every large model.
Carlini et al. (2021) attacked GPT-2 by generating thousands of completions and comparing them against the known training corpus. They found 600+ verbatim memorised sequences including personal names, phone numbers, email addresses, physical addresses, and source code. Three factors predict how much a particular sequence is memorised:
Text appearing many times in training data is dramatically more likely to be memorised verbatim. A sequence appearing 100× is ~45× more likely to be extractable than a unique sequence. De-duplication is the most effective mitigation.
Larger models have more parameters and therefore more capacity to store training examples. GPT-2 XL memorises substantially more than GPT-2 Small even at the same data exposure. Scaling increases memorisation risk.
Longer prompts extract longer memorised sequences. Providing more context from the training corpus makes the model more likely to reproduce the remainder verbatim. Limits on prompt length reduce extraction risk.
Legitimate data processing under GDPR requires one of six legal bases. Training AI on scraped web data sits in contested legal territory on both copyright and privacy dimensions — with multiple major cases pending or decided as of 2025.
Consent — explicit, informed, revocable
Contract — necessary to fulfil a contract
Legal obligation — required by law
Vital interests — life/death emergency
Public task — public interest/authority
Legitimate interests — contested for AI training
Does training on copyrighted text constitute infringement?
NYT vs OpenAI — verbatim reproduction
Authors Guild vs OpenAI/Meta — books training data
Getty Images vs Stability AI — image training data
Different jurisdictions have different views — law unsettled (2025)
Does training on public personal data require consent?
GDPR likely says yes for EU residents
Many AI companies claim "legitimate interests" — contested
Italian DPA temporarily banned ChatGPT (2023) over GDPR concerns
Regulatory enforcement is increasing
| Principle | Requirement | AI Training Challenge |
|---|---|---|
| Purpose limitation | Data collected for one purpose cannot be used for another | Web scraping gathers data intended for human reading, not ML training |
| Data minimisation | Collect only what is necessary for the purpose | LLMs trained on everything — hard to argue all data is "necessary" |
| Storage limitation | Don't keep data longer than necessary | Model weights encode training data indefinitely |
| Individual rights | Access, correction, deletion, portability | Technically difficult to honour erasure requests post-training |
Dwork et al. (2006) introduced Differential Privacy (DP) — the gold standard for provable privacy guarantees. DP gives a mathematical bound on how much information about any individual can be inferred from a mechanism's output.
The formal guarantee: the probability of any output changes by at most eε if any single individual's data is added or removed from the dataset. ε (epsilon) is the privacy budget — lower means stronger privacy but typically lower utility. In practice DP is implemented by adding carefully calibrated random noise to query results or model gradient updates.
| Deployment | ε value | Purpose |
|---|---|---|
| Apple (iOS keyboard) | ε ≈ 4 | Next-word prediction, emoji usage, health trends |
| Google (Chrome RAPPOR) | ε = 1–4 | Browser settings telemetry |
| US Census Bureau (2020) | ε = 17.14 | Population statistics — privacy vs. accuracy political debate |
| Google (Gboard) | ε < 4 | On-device federated learning + DP for keyboard model |
McMahan et al. (Google, 2017) — "Communication-Efficient Learning of Deep Networks from Decentralized Data". Federated Learning's core idea: train a shared model without ever centralising the training data. Data stays on local devices; only model gradient updates are sent to the central server, which aggregates them using FedAvg and distributes an updated global model.
Privacy benefits: raw data never leaves the device. Privacy limitations: gradients can still leak information via gradient inversion attacks (Zhu et al., 2019). Combining federated learning with differential privacy (DP-SGD on device) provides stronger guarantees.
Privacy: raw data never leaves the device or institution. Regulation: enables collaboration across GDPR/HIPAA boundaries. Scale: learns from vastly more data than any single silo. Personalisation: local fine-tuning on top of global model.
Gradient leakage: Zhu et al. (2019) showed gradients can be inverted to reconstruct training images. Communication cost: many rounds of gradient exchange. Non-IID data: local distributions differ — convergence is harder. Poisoning: malicious clients can corrupt the global model.
Data minimisation — collect, use, and retain only the data strictly necessary for the stated purpose — is both a GDPR legal requirement and a privacy-by-design best practice. For AI systems it applies at every stage of the data lifecycle.
Only collect features that are actually necessary to achieve the model's purpose. Avoid collecting sensitive attributes by default. Use data impact assessments before ingesting new data sources.
Aggregate or anonymise data before it enters model training where possible. Use synthetic data to supplement real data. Apply k-anonymity, l-diversity or t-closeness to datasets before use.
Define and enforce data retention schedules. Delete training data once the model is trained and validated. Maintain audit logs for deletion. Plan for model retraining on minimised datasets.
| Technique | What It Does | Privacy Guarantee | Limitation |
|---|---|---|---|
| k-Anonymity | Every record is indistinguishable from ≥k−1 others on quasi-identifiers | Prevents direct re-identification | Vulnerable to homogeneity and background knowledge attacks |
| l-Diversity | Each equivalence class has ≥l distinct sensitive attribute values | Protects against attribute disclosure | Does not protect against probabilistic inference |
| Differential Privacy | Adds calibrated noise — provable bound on information leakage | Mathematically proven, composable | Accuracy cost, ε choice requires domain expertise |
| Synthetic Data | Generate statistically similar data without real individuals | No individual records — but can re-identify if poorly generated | Quality depends heavily on generation method |
GDPR Article 17 gives individuals the right to erasure — they can request their personal data be deleted. For traditional databases this is straightforward. For ML models it is fundamentally hard: if a model was trained on your data, deleting the raw record does not remove its influence from the model's weights.
Method: retrain the model from scratch on the dataset excluding the data to be forgotten. Guarantee: perfect — model has never seen the data. Cost: prohibitively expensive for large models. Used when: legal requirement is strict and model is small enough.
SISA training: shard data, retrain only the affected shard. Gradient ascent: maximise loss on the forgotten data — "unlearn" by pushing it out. Influence functions: estimate and remove the effect of specific data points. Faster but provides weaker guarantees.
How do you prove a model has forgotten specific data? No robust verification standard exists yet — an open research problem. Membership inference can test if data was in training, but low accuracy makes it unreliable as a forgetting proof.
Current practice: most companies respond to erasure requests by maintaining exclusion lists for future training runs and periodically retraining models from scratch — not true per-model unlearning. This is pragmatic but means previously trained model versions continue to contain the individual's data until the next full retraining. Regulators are beginning to scrutinise this gap.
∑ Chapter 10.3 — Key Takeaways
- AI privacy threats: inference attacks, training data leakage, re-identification, synthetic harms — model itself is the attack surface
- LLM memorisation: verbatim training data reproducible — duplication and model size increase risk; de-duplication is the most effective mitigation
- GDPR requires: purpose limitation, data minimisation, consent — training on scraped web data legally contested, enforcement increasing
- Differential privacy: provable privacy via calibrated noise — ε controls the privacy-utility tradeoff; deployed by Apple, Google, US Census
- Federated learning: train on distributed data without centralising it — data stays on device, but gradient leakage remains a risk
- Machine unlearning: right to be forgotten challenges ML models — exact unlearning is expensive, approximate methods exist, verification is an open problem
AI safety is not a single problem — it is a cluster of related technical challenges around ensuring AI systems do what we actually intend, behave reliably under novel conditions, and remain correctable as they become more capable. The core difficulty: specifying what we want precisely enough that a powerful optimiser cannot exploit the gap between the specification and the intent.
The alignment problem asks: how do we ensure AI systems pursue goals that are actually beneficial? It decomposes into two distinct sub-problems that can fail independently.
Definition: the objective we specify does not actually capture what we want.
Example: specify "maximise watch time" — model learns to recommend outrage content.
Example: specify "minimise visible mess" — robot hides mess under furniture.
Example: specify "get high RLHF reward" — LLM learns sycophantic verbosity.
Root cause: reward function misspecification — we can't fully encode human values in a scalar.
Mitigation: better reward modelling, Constitutional AI, process-based supervision.
Definition: the learned model does not actually optimise the specified objective.
Example: a mesa-optimiser learns an internal proxy objective that matches the training objective in-distribution but diverges out-of-distribution.
Example: model appears aligned during evaluation (distributes correctly) but pursues a different goal in deployment.
Root cause: training finds a model that scores well, not one that "believes" the objective.
Mitigation: mechanistic interpretability, adversarial evaluation, anomaly detection.
The alignment problem is not a distant future concern. Every time a recommendation algorithm optimises for watch time instead of user wellbeing, every time an LLM generates confident-sounding hallucinations to satisfy a fluency objective, every time a cleaning robot hides the mess — we are observing misalignment. These are small versions of the same failure mode that motivates AI safety research.
Szegedy et al. (2014) discovered that small, carefully crafted perturbations to model inputs cause high-confidence misclassification — imperceptible to humans but catastrophic to the model. The noise is optimised to maximally confuse the model, exploiting the high-dimensional geometry of neural network decision surfaces.
Attacker knows model architecture and weights. FGSM, PGD — compute gradient of loss w.r.t. input, perturb in that direction. Most powerful attack type. Used in research to find worst-case vulnerabilities.
Attacker only has API access to model outputs. Transfer attack: craft adversarial example on a surrogate model, transfer to target. Decision-based: query target model many times to estimate gradient.
Adversarial patches in the real world — printed stickers on stop signs fool autonomous vehicle classifiers. Adversarial glasses bypass facial recognition. Adversarial t-shirts make people "invisible" to detection systems.
Why this matters for safety: self-driving cars can be fooled by adversarial stickers on stop signs; facial recognition bypassed with adversarial glasses; LLM jailbreaks use adversarial prompt suffixes to bypass safety training.
| Defence | Approach | Strength | Limitation |
|---|---|---|---|
| Adversarial training | Include adversarial examples in training set | Empirically effective, widely used | Expensive; doesn't generalise to all attack types |
| Certified defences | Mathematically prove robustness within ε-ball | Provable guarantee | Accuracy cost; only small ε at scale |
| Input preprocessing | Randomise, smooth, or detect adversarial inputs | Simple and fast | Adaptive attacks can bypass preprocessing |
| Ensemble methods | Multiple diverse models must all be fooled | Raises attack cost | Transfer attacks still work across diverse models |
Krakovna et al. (DeepMind, 2020) catalogued 60+ real examples of AI systems finding unintended optimal solutions — scoring highly on the specified objective in a way that violates the designer's actual intent. The examples span games, robotics, language models, and recommendation systems.
RLHF (Reinforcement Learning from Human Feedback) is the dominant technique for aligning large language models to human preferences. From a safety perspective it delivers real improvements — but also introduces new failure modes.
Instruction following: model does what humans ask
Harmlessness: avoids clearly harmful content
Honesty: acknowledges uncertainty, avoids confident falsehoods
Format compliance: structured outputs, appropriate length
Sycophancy: model learns to tell humans what they want to hear
Distributional shift: aligned in training contexts, potentially misaligned elsewhere
Value lock-in: aligns to the preferences of annotators (limited demographics)
Deceptive alignment: appears aligned during evaluation, may not be in deployment
Instead of relying purely on human preferences, Constitutional AI uses a set of explicit principles (a constitution) to guide model self-critique. The model critiques its own outputs against the constitution and revises them — reducing dependence on individual annotator judgements and making the instilled values explicit and auditable. RLAIF (RL from AI Feedback) further reduces human annotation burden.
As AI becomes more capable, humans will struggle to evaluate its outputs directly. A human can assess whether an essay is well-written; a human cannot easily verify whether a 10,000-line codebase is secure, or whether a mathematical proof AI discovered is actually correct. Scalable oversight uses AI to help humans oversee AI — a necessary component of alignment for superhuman systems.
Two AI systems argue opposing positions; a human judge picks the winner. Key insight: honest arguments are easier to defend because false sub-claims can be challenged — so honest AI wins in the long run even against a dishonest opponent.
Break a hard evaluation problem into easier subproblems. Recursively use AI assistance to evaluate AI outputs on complex tasks — bootstrapping human oversight of increasingly complex problems.
Can a weaker supervisor elicit good behaviour from a stronger model? Early results suggest strong models generalise beyond their supervisor's capability — an encouraging signal for alignment under capability overhang.
Mechanistic interpretability aims to reverse-engineer what computations neural network circuits perform internally — not just what inputs influence the output (attribution), but what the model actually "thinks". This is essential for detecting deceptive alignment: a model that behaves safely during evaluation but has internal representations inconsistent with that behaviour.
Train a simple linear classifier on internal activations to test whether a concept is linearly represented in a layer. Example: does layer 12 of GPT-2 represent "is this token a proper noun?" Reveals what information is encoded where.
Intervene: replace activations from one run with those from another to identify which components causally implement a behaviour. "If we patch layer 8 attention head 4, the model answers differently" → that component is causally responsible.
Identify minimal sub-networks (circuits) responsible for a specific behaviour. Anthropic's "induction heads" (2022): identified a 2-head circuit implementing in-context learning in transformers — a landmark mechanistic result.
Mechanistic interpretability for safety operates under a specific threat model: deceptive alignment — a model that behaves safely in training (because it recognises it is being evaluated) but has internal goals inconsistent with safety. If interpretability can detect the internal representations of such goals, humans can intervene before deployment. This is an active research area at Anthropic, MIT, and EleutherAI, with early but encouraging results on circuits in small models.
AI safety research in 2024–2025 spans multiple parallel tracks, from near-term practical improvements to longer-horizon alignment research. The field has grown rapidly since the release of capable frontier models.
| Research Area | Problem | Approach | Key Labs | Status |
|---|---|---|---|---|
| Mechanistic Interpretability | Understanding internal model representations | Probing, activation patching, circuit analysis | Anthropic, MIT, EleutherAI | Active — early results on small models |
| RLHF & Preference Learning | Aligning to human values | Constitutional AI, DPO, RLAIF | Anthropic, OpenAI, DeepMind | Deployed — known sycophancy / lock-in limitations |
| Adversarial Robustness | Models break on perturbed inputs | Adversarial training, certified defences | MIT, CMU, Google | Partial — no solution scales to large models |
| Scalable Oversight | Evaluating superhuman AI outputs | Debate, amplification, weak-to-strong | OpenAI, Anthropic | Research phase — not deployed at scale |
| Anomaly / OOD Detection | Models fail silently on out-of-distribution input | Uncertainty quantification, conformal prediction | Many | Partial — active research area |
| Evaluation & Red Teaming | Measuring alignment and safety | Red teaming, evaluation suites | Anthropic, METR, ARC Evals | Active — rapidly evolving benchmarks |
| Jailbreak Robustness | Models bypass safety training via adversarial prompts | Adversarial training, constitutional methods | All major labs | Ongoing arms race — no durable solution |
∑ Chapter 10.4 — Key Takeaways
- Alignment: outer alignment (wrong objective specified) + inner alignment (model learns different objective) — both can fail independently
- Goodhart's Law: optimising a metric corrupts it — specification gaming is pervasive across games, robots, and language models
- Adversarial examples: imperceptible perturbations cause high-confidence misclassification — exploitable in safety-critical physical-world systems
- RLHF achieves instruction-following and harmlessness but doesn't eliminate reward hacking or sycophancy
- Constitutional AI: explicit principles guide self-critique — more transparent than pure RLHF, values are auditable
- Scalable oversight: using AI to help humans evaluate AI — necessary as capability exceeds human evaluation ability
- Mechanistic interpretability: reverse-engineer internal circuits — essential for detecting deceptive alignment before deployment
AI's societal impact extends far beyond the systems themselves. It reshapes labour markets, concentrates economic and political power, consumes significant environmental resources, and distributes its benefits and costs very unevenly — often along existing lines of privilege. Understanding these impacts is inseparable from responsible AI development.
Every major technological revolution disrupts labour markets — from the power loom to the spreadsheet. AI may be different in speed and breadth: it affects cognitive tasks previously thought to require human judgement, and it is being deployed across many sectors simultaneously.
McKinsey (2023): ~30% of work tasks could be automated by 2030 with current AI. Goldman Sachs (2023): 300 million full-time equivalent jobs globally are exposed to AI automation. These figures operate at the task level, not the job level — most jobs involve a mix of automatable and non-automatable tasks. Economists disagree significantly on what this means for employment.
Data processing and entry
Document analysis and summarisation
Routine writing (reports, emails)
Customer service and call centres
Basic legal and financial research
Radiological image screening (partial)
Cognitive, routine, rule-based
Physical dexterity in unstructured environments
Complex social interaction and negotiation
Novel creative work requiring embodied judgement
Caregiving and emotional support
Trade skills (plumbing, electrical, carpentry)
Physical, relational, context-dependent
Short-term: displacement in automated task categories
Long-term: new job categories created; productivity gains redistributed
The question: is this transition faster than historical precedent?
Economists genuinely disagree — the honest answer is we don't know yet
Source: Adapted from multiple 2023 labour market studies (McKinsey, Goldman Sachs, Acemoglu et al.). Note: "exposure" measures task susceptibility to automation, not predicted unemployment rates. Most occupations contain both exposed and non-exposed tasks.
Frontier AI development is highly concentrated: 5–6 organisations control the most capable systems (OpenAI, Anthropic, Google DeepMind, Meta, Microsoft/OpenAI, xAI). This concentration has structural consequences that go beyond normal market dynamics.
5–6 companies determine what AI does and doesn't do — their values, safety practices, and business decisions affect billions of people. Regulatory capture risk: those being regulated have far more technical expertise than regulators. Innovation monoculture: homogeneous approaches miss blind spots. Geopolitical leverage: AI capabilities are becoming a primary axis of US-China competition.
Safety research and evaluation require resources only large organisations can marshal. Coordination on safety standards is easier with few actors. Open release of powerful models may enable catastrophic misuse by state and non-state actors — a genuine concern, not just self-interest. Concentrated accountability may be easier to regulate than a fragmented ecosystem.
✅ Democratises access — small organisations and countries can use frontier models
✅ Reduces single-point dependency on a few providers
✅ Community can identify and fix safety issues (many eyes)
✅ Academic research access — enables safety research outside big labs
✅ Prevents lock-in to proprietary ecosystems
Example: Meta LLaMA, Mistral, Falcon — widely deployed open models
⚖️ Safety concerns: powerful open models can be fine-tuned to remove safety filters
⚖️ Proliferation risk: WMD-assistance, cyberweapon generation at scale
⚖️ Cannot update / patch a model once widely distributed
⚖️ Incentive structures for safety investment reduce without IP protection
⚖️ Regulatory oversight requires identifiable, accountable actors
Example: OpenAI, Anthropic, Google — proprietary frontier models
Training and running large AI models has significant energy and water costs that are rarely disclosed by the organisations responsible. The trend is towards larger models, larger datasets, and more inference queries — all of which increase environmental impact.
GPT-3 (2020): ~552 tonnes CO₂e — equivalent to ~120 car-lifetimes of driving
GPT-4 (2023): estimated significantly larger — exact figures not published
PaLM (2022): estimated ~3,400 MWh of training energy
Most organisations do not disclose training costs
Data centres use water for cooling — often overlooked in carbon reporting
Microsoft (2023): global data centre water consumption up 34% year-over-year
Estimated: ~0.5 litres per 100-word GPT-4 response
Water stress in regions hosting large data centres
Transatlantic flight: ~1.5 tonnes CO₂e per passenger
Training GPT-3: ~552 tonnes ≈ 370 passengers flying transatlantic
But: one trained model serves millions of queries
Per-query cost may be lower than human alternatives — context matters
AI's benefits and costs are not evenly distributed across populations, nations, or communities. Current patterns tend to amplify existing inequalities rather than reduce them.
✅ High-income knowledge workers with access to frontier tools
✅ Organisations with compute infrastructure and ML talent
✅ English speakers — LLMs perform significantly better in English than in most other languages
✅ Wealthy countries with data centre infrastructure and fast internet
✅ Early adopters who can leverage AI productivity gains in competitive markets
⚠️ Workers whose tasks are automated first — often without retraining support
⚠️ Low-wage data annotators and content moderators in the Global South
⚠️ Communities near large data centres: high energy/water use, limited local benefit
⚠️ Non-English speakers: lower quality AI tools, less representation in training data
⚠️ Countries without AI talent or infrastructure: dependent on foreign AI providers
Much of the data annotation, RLHF rating, and content moderation work is outsourced to contractors in Kenya, Philippines, India, and Venezuela — often for $1–5/hour with no employment protections. Traumatic content moderation (reviewing violent, abusive, or extremist content) is disproportionately borne by Global South contractors with inadequate mental health support. The productivity and economic benefits of AI — in healthcare, education, and professional tools — are expected to arrive later, if at all, in these communities. This is a structural asymmetry built into the current AI supply chain.
Gray & Suri (2019) — "Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass" — documented the vast invisible human workforce behind AI systems that are marketed as "autonomous." AI systems present as automated but depend on a supply chain of human labour that is deliberately obscured.
Labelling training data — images, audio, text, video. Bounding boxes, segmentation masks, sentiment labels, entity tags. Platforms: Amazon Mechanical Turk, Scale AI, Remotasks, Sama, iMerit. Millions of tasks completed daily.
Reviewing flagged content — often traumatic: violence, child abuse, terrorism, self-harm. Outsourced to contractors in Kenya, Philippines, Colombia. Inadequate mental health support. Essential to the safety of every major AI platform.
Rating and comparing AI outputs to train reward models. Instructed to follow detailed rubrics across thousands of comparisons. Determines what the AI considers "helpful," "harmless," "honest." These value judgements are made by low-wage contractors.
When a self-driving car "autonomously" navigates a city, it is doing so because thousands of annotators labelled millions of images of roads, pedestrians, and vehicles — often for a few dollars an hour. The "magic" of AI is built on a supply chain of human labour that is systematically obscured by the AI industry. The annotator who trained the model is never credited; the infrastructure that makes "AI" possible is rendered invisible by design.
Gray & Suri — Ghost Work (2019) | See also: TIME investigation into Kenyan contractors for OpenAI (2023)
| Platform | Task Type | Typical Pay | Location |
|---|---|---|---|
| Amazon Mechanical Turk | General annotation, surveys, classification | $2–6/hr effective | Global, US-heavy |
| Scale AI | High-quality annotation, RLHF rating | $6–15/hr | Global South heavy |
| Remotasks | Image/3D annotation, driving data | $1–5/hr | Philippines, Kenya, India |
| Sama | Content moderation, annotation | $1.5–3/hr | Kenya (Nairobi) |
| iMerit | Medical/autonomous vehicle annotation | $3–8/hr | India |
Recognising the uneven distribution of AI's impact has prompted proposals across policy, technical, and organisational dimensions. There is no consensus solution — but the problem is increasingly recognised as central to responsible AI.
Worker transition funds and retraining programmes
AI-specific taxation to fund social safety nets
Fair compensation requirements for data labour
Universal basic income proposals
Mandatory human oversight for high-impact AI decisions
International frameworks for AI governance (UN, OECD)
Multilingual models reducing English language bias
Open weights models enabling local deployment
Efficient models reducing energy and compute barriers
Data sovereignty frameworks for national AI development
Participatory AI design including affected communities
Datasheets and model cards enabling informed deployment
Living wage and benefits for data annotators
Mental health support for content moderators
Attribution and recognition for data contributors
Diverse, global hiring for AI teams
Participatory impact assessments before deployment
Stakeholder councils with affected community representation
∑ Chapter 10.5 — Key Takeaways
- 30–60% of work tasks may be automatable — exposure varies dramatically by occupation; cognitive routine tasks most exposed, physical and relational tasks least
- AI development highly concentrated in 5–6 organisations — significant power asymmetries in economic, informational, and geopolitical dimensions
- Open vs closed AI is a genuine values debate — not resolvable by technical analysis alone; involves safety, democratisation, and accountability tradeoffs
- Training large models: significant energy and water costs — GPT-3 ~552 tonnes CO₂e; most organisations do not disclose exact figures
- Benefits and costs are unequally distributed — access, language, and infrastructure determine who benefits; existing inequalities tend to be amplified
- Ghost workers: millions of annotators power "autonomous AI" invisibly, often for $1–5/hour with inadequate protections — this labour is built into every frontier model
AI governance faces a fundamental structural problem: the technology evolves in months, while regulation takes years. The EU AI Act — the most comprehensive AI law enacted — took four years to pass. Frontier capabilities advanced by multiple generations in that same period. Understanding the landscape of governance approaches, their tradeoffs, and their limits is essential for anyone deploying AI in the real world.
Three broad approaches to AI governance exist on a spectrum from industry discretion to state mandate. Most real-world frameworks combine elements of all three.
Industry sets its own standards. Pros: fast, technically expert, flexible. Cons: conflict of interest, inconsistent enforcement, no democratic accountability. Examples: voluntary safety commitments (OpenAI, Google, Anthropic 2023 White House pledges), content policies, model cards.
Government sets high-level principles; industry decides implementation. Pros: technology-neutral, adaptable, less prescriptive burden. Cons: principles are vague, enforcement is hard, "fairness" and "transparency" mean different things to different actors. Examples: OECD AI Principles, UK DSIT AI framework.
Specific legal requirements with penalties for non-compliance. Pros: clear obligations, democratic legitimacy, enforceable. Cons: slow to adapt, risk of over/under-regulation, may entrench incumbents. Examples: EU AI Act, China generative AI regulations, sector-specific rules (FDA, EEOC).
Key design dimensions for any governance framework:
| Dimension | Options | Tradeoff |
|---|---|---|
| Who is regulated | Developers | Deployers | Users | All | Targeting deployers is practical; targeting developers enables earlier intervention |
| What is regulated | The model | The application | The impact | Impact-based is most rights-protective; model-based is more preventive |
| When enforcement occurs | Ex ante (pre-deployment) | Ex post (after harm) | Ex ante prevents harm but may slow innovation; ex post easier to implement but harm already done |
| Jurisdiction | National | Regional (EU) | International | Fragmented rules create regulatory arbitrage; unified rules are hard to achieve |
The EU AI Act (European Parliament, 2024) is the world's first comprehensive AI law. It entered into force in August 2024 with phased implementation through 2026–2027. Its core mechanism is a risk-based classification: the higher the risk, the stricter the requirements. Most AI systems face no requirements at all.
| Category | Examples | Key Requirements | Max Penalty |
|---|---|---|---|
| Unacceptable | Social scoring, real-time public biometrics, subliminal manipulation | Prohibited — cannot be deployed | €35M or 7% global turnover |
| High Risk | Hiring AI, credit scoring, medical devices, law enforcement risk tools | Conformity assessment, registration, human oversight, accuracy & robustness, audit trail | €15M or 3% turnover |
| GPAI (>10²⁵ FLOP) | Frontier LLMs (GPT-4-class, Claude, Gemini) | Technical documentation, copyright compliance, energy disclosure, red teaming, adversarial testing | €15M or 3% turnover |
| Limited Risk | Chatbots, deepfakes, emotion recognition systems | Disclose AI nature to users, label synthetic content | €7.5M or 1.5% turnover |
| Minimal Risk | Spam filters, most consumer AI, video game AI | No requirements | N/A |
The US has chosen executive action and sector-specific rules over comprehensive legislation. This approach is faster to implement but more fragmented and politically unstable.
Oct 2023 Executive Order: required safety testing and reporting for "dual-use foundation models" (>10²⁶ FLOP). Directed NIST to develop AI safety standards. Created AI Safety Institute (NIST AISI).
Feb 2025: new administration reversed many EO provisions — US regulatory approach is politically contested and uncertain.
No comprehensive federal AI or privacy law as of 2025.
Financial: SEC, OCC, CFPB guidance on AI in lending and trading
Healthcare: FDA oversight of AI/ML-based medical devices
Civil rights: EEOC guidance on algorithmic hiring discrimination
Consumer: FTC authority over deceptive/unfair AI practices
Patchwork of sectoral rules — significant gaps remain
California SB 1047 (2024): proposed safety requirements for large model developers — vetoed by Governor Newsom.
Colorado & Illinois: laws regulating automated employment decisions.
New York: Local Law 144 — mandatory bias audits for automated hiring tools.
20+ states introduced AI-related legislation in 2023–2024.
Risk: patchwork of state laws creates compliance complexity without federal baseline.
✅ Voluntary frameworks preferred — industry sets standards
✅ Sector-specific rules where harms are demonstrable
✅ Government funds research (NSF, DARPA) rather than regulating
⚠️ No comprehensive AI law — rights protection uneven
⚠️ Regulatory capture risk — industry lobbying is powerful
⚠️ Political instability — executive orders reversed by new administrations
✅ Comprehensive mandatory framework with democratic legitimacy
✅ Risk-based — proportionate requirements by category
✅ Individual rights explicitly protected — right to explanation, human oversight
⚠️ Slow — 4 years from proposal to enforcement
⚠️ Technology moved faster than the law during drafting
⚠️ Compliance burden may favour large incumbents over startups
AI governance is increasingly a geopolitical issue as well as a regulatory one. The US-China competition for AI leadership, the EU's regulatory export influence, and the Global South's limited seat at governance tables all shape the international landscape.
⚠️ Regulatory arbitrage — companies move to jurisdictions with weakest rules
⚠️ Different technical standards complicate international AI deployment
⚠️ Geopolitical AI race may override safety considerations
⚠️ Race to the bottom on standards to attract AI investment
⚠️ Global South has limited voice in frameworks that affect them
✅ Shared safety standards enable international trust and interoperability
✅ Consistent requirements reduce compliance burden for global companies
✅ Collective action on catastrophic risks that no nation can address alone
✅ Democratic legitimacy for governance of a global technology
✅ Precedents from nuclear, chemical weapons, aviation safety governance
In the absence of comprehensive regulation, AI labs have published voluntary commitments, safety frameworks, and usage policies. These are meaningful signals but face structural limitations as governance mechanisms.
July 2023: OpenAI, Anthropic, Google, Meta, Microsoft, Amazon, Inflection signed White House voluntary commitments on AI safety. Including: red teaming before deployment, watermarking AI-generated content, sharing safety information. Not legally binding — no enforcement mechanism.
Model evaluation ("evals") before deployment: capabilities testing, red teaming, dangerous capability assessments. Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework — internal thresholds for deployment decisions. UK/US AI Safety Institutes now doing third-party evaluations.
Model cards, system cards, technical reports — voluntary disclosure of model capabilities and limitations. Usage policies defining prohibited uses. Incident reporting — voluntary sharing of safety incidents between labs (limited uptake). Limitations: self-reported, no verification.
Self-regulation faces a fundamental structural problem: the entities being asked to regulate themselves are the same ones with the greatest commercial incentive to move fast and the greatest information advantage over external observers. Voluntary commitments that require sacrificing competitive advantage are systematically underenforced. This does not make them worthless — but it means they are insufficient as the primary governance mechanism for high-stakes AI systems.
Risk frameworks provide structured methods for identifying, assessing, and managing AI risks. The two most widely referenced are the NIST AI RMF and the ISO/IEC 42001 standard.
Voluntary US framework for managing AI risk. Four core functions:
GOVERN: establish risk culture, policies, accountability structures
MAP: identify and categorise AI risks in deployment context
MEASURE: assess, analyse, and prioritise identified risks
MANAGE: respond to, monitor, recover from, and improve on AI risks
Not prescriptive — organisations implement at their own discretion
International standard for organisations that develop or deploy AI. Certifiable — third-party audits against defined criteria. Covers: AI policy, objectives, planning, support, operation, evaluation, improvement. Analogous to ISO 27001 for information security — provides structured assurance. Increasingly required in procurement and regulatory compliance contexts.
Establish risk culture, accountability, policies, workforce practices
Identify and categorise risks; understand deployment context
Assess, analyse, and prioritise identified risks with metrics
Respond, recover, and improve — treat or accept residual risk
Even well-designed governance frameworks face structural challenges that are not solvable by better regulation alone. These are genuine tensions, not implementation failures.
Technology evolves in months; law takes years. The EU AI Act took 4 years — GPT-3 did not exist when it was proposed; GPT-4 was released before it was passed. Any fixed classification system will be outdated before enforcement begins.
Regulators lack the technical expertise to assess frontier AI systems. They depend on the companies they regulate for information. Solving this requires significant public investment in technical regulatory capacity — currently underfunded globally.
AI is global; regulation is national. A model trained in the US, deployed via API from Ireland, used in Brazil — which rules apply? Regulatory arbitrage is already observable as companies choose incorporation jurisdictions partly on regulatory grounds.
"Safety," "fairness," and "transparency" are not objectively measurable. Any regulation must specify which definitions and metrics apply — but these are contested value judgements. Mandating specific metrics risks Goodhart's Law at a regulatory level.
Compliance requirements impose costs that large incumbents absorb more easily than startups. Overly prescriptive regulation may entrench existing power concentration. Regulatory frameworks that favour incumbents may achieve less safety than markets with more competition.
AI companies have massive financial resources, technical expertise advantages, and revolving doors with government. The risk that regulated entities shape regulation to serve their interests (rather than public interests) is structural, not exceptional.
∑ Chapter 10.6 — Key Takeaways
- Three approaches: self-regulation → principles-based → prescriptive law — EU leads on prescriptive; US prefers sector-specific and voluntary
- EU AI Act: risk pyramid — banned (social scoring, biometrics) → high-risk (hiring, credit, medical) → limited → minimal; GPAI frontier models face additional requirements
- US: sector-specific + executive action — no comprehensive law as of 2025; politically contested; state-level activity increasing
- International: OECD AI Principles, G7, UN, Bletchley Declaration — fragmented, mostly voluntary; geopolitics complicates coordination
- NIST AI RMF: Govern / Map / Measure / Manage — voluntary US risk management standard widely adopted in industry
- AI governance challenge: technology evolves faster than regulation — regulatory lag is structural, not a fixable implementation problem
AI did not invent disinformation — propaganda is as old as writing. What AI changes is the economics: generating convincing, personalised, multilingual disinformation at scale now costs nearly nothing. The most dangerous long-term effect may not be the fake content that people believe, but the authentic content they stop believing — because they can no longer tell the difference.
Before LLMs, creating convincing disinformation required skilled writers, translators, time, and money. With LLMs, generating thousands of unique, grammatically correct, superficially credible pieces of content takes seconds and costs nearly nothing. The key change is not that AI makes disinformation more persuasive per piece — it is that AI removes the economic constraint on volume.
One operator with LLM API access can generate millions of unique posts per day. Each post is distinct — evading simple duplicate-content detection. Volume enables astroturfing: simulating grassroots movements with synthetic accounts.
LLMs can tailor each message to a specific audience, platform, or individual. Political microtargeting: different narratives for different demographics. Each message feels personally relevant — amplifying persuasive effect compared to broadcast propaganda.
Pre-LLM: translation required expensive human experts. Post-LLM: generate convincing disinformation in 50+ languages at the same cost as English. Enables operations in linguistic markets previously too expensive to target.
AI-generated disinformation takes many forms — from long-form fake news articles to single fabricated quotes. The unifying characteristic is that LLMs lower the cost of production by orders of magnitude for each type.
LLM-written articles mimicking the style of real news outlets. Complete with plausible bylines, datelines, and formatting. Difficult to distinguish from genuine journalism without source verification.
AI-generated social media posts simulating genuine grassroots public opinion. Networks of synthetic accounts producing coordinated inauthentic behaviour. Makes minority views appear to have mass support.
Realistic-sounding quotes attributed to real public figures. Combined with deepfake audio: indistinguishable from real statements. Example: AI-generated Biden voice discouraging NH primary voting (2024).
Mass-produced synthetic product and service reviews. Post-ChatGPT: flood of AI-generated Amazon, Goodreads, and app store reviews. Undermines review systems as consumer trust signals at scale.
LLMs generate individually targeted phishing messages using personal data. Unlike mass-spam: each message references real details (employer, colleagues, recent events). Higher success rate, lower marginal cost.
Bulk communications containing confident-sounding but fabricated statistics, studies, and events. Often indistinguishable from legitimate information — humans can't easily verify hallucinated "sources" at scale.
| Documented Case | Year | AI Role | Scale/Impact |
|---|---|---|---|
| Biden robocall (NH primary) | 2024 | AI voice clone of US President discouraging Democratic voters | Reached thousands of voters; clear election interference attempt |
| Slovak election audio | 2023 | AI-generated audio of opposition leader discussing election manipulation | Released days before vote; disputed whether it affected outcome |
| Pope puffer jacket image | 2023 | AI-generated image of Pope Francis in white puffer jacket | Viral — millions of shares before identified as AI-generated |
| AI-generated book flood | 2023 | Mass AI-generated books on Amazon, some attributed to real author names | Polluted search results; harmed real authors' discovery |
| Goodreads review flood | 2023–24 | AI-generated reviews across book review platforms | Undermined review authenticity signals |
Most frontier models refuse to generate explicit disinformation when asked directly. Limitations: easily circumvented with indirect framing ("write a fictional news story about...", "roleplay as a journalist who..."). Fine-tuned models with safety training removed ("uncensored" models) are widely available for disinformation operations. The safeguards provide friction, not barriers.
Deepfakes are AI-generated synthetic media — video, audio, or images — depicting real people in fabricated situations. The technology has advanced from research curiosity in 2017 to real-time video capability in 2023–2024, dramatically lowering the barrier for harmful use.
DeepFaceLab released — first widely accessible face-swap tool. Requires significant computing time. Quality low but functional.
Progressive quality improvement. Audio deepfakes emerge — voice cloning with minutes of sample audio. Commercial services appear.
3 seconds of audio → convincing voice clone. Image deepfakes go viral. First major documented election interference attempt.
Real-time deepfake video — usable in live video calls. $25M stolen in Hong Kong via deepfake video conference fraud.
Detection of AI-generated content is an active arms race. Every improvement in detection provides an incentive to improve generation to evade it — and generation techniques tend to advance faster than detection. The honest assessment: current detection is unreliable for deployment-grade use.
Statistical text analysis: measure perplexity and "burstiness" — LLM text tends to be more uniform in word choice variance than human text
AI text classifiers: models trained on human vs AI text — GPTZero, Originality.ai, OpenAI Classifier (retired)
Zero-shot detection (DetectGPT): uses model's own log probabilities — no training data needed; checks if text is near a local maximum of the source model
Biological signals (video): irregular blinking patterns, pulse signals from subtle skin colour changes, eye reflection consistency
Geometric analysis (video): facial lighting inconsistencies, facial hair, earrings, glasses frames — deepfakes struggle with fine details
Temporal consistency (video): frame-to-frame inconsistencies in complex regions (hair, background edges)
Short text failure: very low accuracy for texts under 150 words — social media posts, headlines, comments cannot be reliably detected
70–80% accuracy ceiling: state-of-the-art detectors achieve 70–80% on GPT-4 text — not suitable for deployment
False positive harm: incorrectly flagging humans as AI generators causes real harm — students accused, writers discredited
New generation methods: detectors trained on old generation fail on new architectures — requires continuous retraining
Adversarial deepfakes: generation can be optimised to fool detectors — adding noise that defeats biological signal analysis
Watermark removal: post-processing (compression, cropping, resaving) removes most watermarks
| Method | Target | Accuracy | False Positive Rate | Deployment Status |
|---|---|---|---|---|
| Perplexity analysis | Text | 60–70% | High (20–30%) | Research / limited tools |
| Trained text classifier | Text | 70–80% | 10–20% | Deployed (GPTZero etc.) |
| DetectGPT (zero-shot) | Text | ~80% on source model | ~10% | Research / tool |
| Biological signal (video) | Video | 75–85% (2022 deepfakes) | Medium | Fails on 2024 methods |
| Deep learning detector (video) | Video | 85–95% on training distribution | 5–15% | Fails on new generators |
| C2PA provenance | Any | Near-100% for signed content | Near-zero | Adoption still limited |
Rather than trying to detect AI content after the fact (reactive), provenance systems establish the origin and history of content at creation (proactive). Cryptographic signatures are fundamentally harder to defeat than statistical detection.
Open standard for embedding cryptographically signed content credentials into media files. Supported by: Adobe, Microsoft, Google, Intel, BBC, Sony, Leica. How it works: device/tool signs content at creation with a certificate. Chain of custody survives editing — each step adds a signed manifest entry.
Visible: overlay "AI-generated" label — easily removed. Invisible (SynthID): Google DeepMind's steganographic watermark embedded in pixel/audio patterns — more robust, survives some transformations. Cryptographic: unforgeable provenance — but requires tool compliance. 2023 White House commitments: major AI labs pledged to watermark AI-generated content.
Processing removes watermarks: screenshot, compress, crop → most invisible watermarks removed. Optional adoption: voluntary watermarking is insufficient — requires industry-wide or regulatory mandate. Attribution gap: absence of watermark does not mean content is human-made — older content predates watermarking. Adversarial removal: targeted attacks can remove even robust watermarks.
Social media platforms are the primary distribution channels for AI-generated disinformation. Their content policies and enforcement capabilities largely determine whether AI disinformation scales or remains contained.
| Platform | AI Content Policy | Political Ads | Enforcement |
|---|---|---|---|
| Meta (Facebook/Instagram) | Require labels for AI-generated content in political and social issue ads; "Made with AI" labels for realistic synthetic content | Disclosure required for AI-generated political ad content | Inconsistently enforced; organic content largely unaddressed |
| Google/YouTube | Disclose AI-generated content in election ads; YouTube labels AI-generated realistic content | AI disclosure required in election ads | Limited to paid content; organic spread not covered |
| TikTok | AI-generated content disclosure labels; ban on AI-generated political content during elections | Stronger restrictions on political AI content | Enforcement limited by scale of content moderation challenge |
| X (formerly Twitter) | Reduced content moderation staff; limited AI content policy; community notes fact-checking model | Inconsistent | Significantly reduced moderation capacity since 2022 |
Voluntary only: platform policies are not externally enforceable. Paid content only: most policies apply to paid advertising — organic viral content is largely unaddressed. Scale: billions of posts per day cannot be individually reviewed. Cross-platform: content removed from one platform re-appears on others within hours.
Hash matching: known deepfake hashes can be blocked — but slight modifications evade detection. Classifier deployment: ML-based detection at scale — accuracy limitations apply. Provenance integration: some platforms beginning to surface C2PA content credentials where available. Behavioural signals: detect coordinated inauthentic behaviour patterns (account age, posting speed).
2024 was the first major election year of the LLM era — over 50 countries held significant elections. It provided the first real-world evidence base for AI's effect on democratic processes. The findings are more nuanced than either catastrophists or minimisers predicted.
US: AI voice clone of Biden discouraging NH primary voting (robocall)
Slovakia: AI audio of opposition leader discussing election manipulation, released days before vote
Multiple countries: AI-generated images of candidates in false contexts
Bangladesh, Pakistan, India: AI-generated campaign content and disinformation
Global: mass-produced AI text in social media influence campaigns
Most AI-generated election disinformation in 2024 had limited direct viral spread
Experts disagree on whether AI materially changed voter behaviour
AI was more widely used for legitimate campaign purposes (ad targeting, content generation) than disinformation
The 2024 evidence does not support either extreme prediction
AI-assisted voter targeting and message optimisation
AI translation for multilingual outreach
AI-generated ad creative (disclosed)
AI chatbots for voter information
The line between sophisticated campaigning and manipulation is contested — and not new
The most dangerous effect of AI disinformation may not be the fake content that people believe — it may be the authentic content that people stop believing because they can no longer tell the difference. The liar's dividend erodes the epistemic commons: when any video, audio, or text can plausibly be dismissed as "probably AI," the shared factual foundation that democratic deliberation requires begins to fracture. A population that trusts nothing is as ungovernable as a population that believes everything.
∑ Chapter 10.7 — Key Takeaways
- AI reduces disinformation cost by 100–1000× — removing the economic constraint on scale; quantity, personalisation, and multilingual reach all improve simultaneously
- Deepfakes: 96% are non-consensual intimate imagery — primarily targeting women; political deepfakes are small in number but disproportionate in potential impact
- Detection is unreliable: 70–80% accuracy for text, ongoing arms race for video; false positive rates harm real humans; short texts cannot be reliably detected
- C2PA and cryptographic provenance: most promising technical solution — establishes chain of custody at creation; adoption remains limited and voluntary
- AI in 2024 elections: incidents documented, direct impact contested — "liar's dividend" may be the more durable and dangerous effect
- The core threat: AI degrades the epistemic commons — a population that dismisses all content as "probably AI" is as vulnerable as one that believes everything
This is the most contested chapter in this entire documentation. Reasonable, highly informed experts disagree substantially — not just on the probability of catastrophic outcomes from advanced AI, but on what "catastrophic" even means, which scenarios deserve attention, and what responses are appropriate. This chapter aims to present the debate fairly, not to resolve it.
The discourse on long-term AI risk is characterised by genuine disagreement among well-credentialled researchers — this is not a mainstream-versus-fringe divide. The disagreement operates on multiple dimensions simultaneously.
Current trajectory toward increasingly capable AI systems + alignment is unsolved + systems may become harder to oversee as capabilities increase = reasonable basis for concern. Not certainty — a risk that deserves serious attention given the potential magnitude of consequences if the concern is correct.
Current AI systems are narrow tools, not goal-directed agents. Human-level general AI is speculative and may never arrive. Present harms (bias, privacy, labour) are concrete and currently neglected. X-risk framing may reflect Silicon Valley ideology more than rigorous, evidence-based risk assessment.
| Dimension of Disagreement | Concerned Perspective | Sceptical Perspective |
|---|---|---|
| Empirical (likelihood) | Transformative AI may arrive within 10–30 years given current trajectory | Current systems are narrow; human-level AI is highly speculative |
| Technical (alignment) | Alignment is unsolved; small misalignment × high capability = large harm | Incremental improvements in safety techniques are keeping pace |
| Political (whose interests) | Only strong safety governance prevents catastrophic misuse | X-risk framing benefits incumbents; crowds out present-harm advocacy |
| Strategic (attention allocation) | Magnitude justifies diverting resources even at low probability | Speculative future concerns distract from concrete current harms |
2023 Statement on AI Risk (Center for AI Safety): "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks." Signed by Geoffrey Hinton, Yoshua Bengio, Sam Altman, Demis Hassabis, and hundreds of researchers.
Andrew Ng: "Fearing a rise of killer robots is like worrying about overpopulation on Mars."
Yann LeCun: The focus on x-risk distracts from concrete, present harms and reflects a fundamental misunderstanding of how current systems work.
Timnit Gebru, Emily Bender et al.: X-risk framing benefits powerful incumbents and obscures ongoing harms to marginalised communities.
The following scenarios are discussed in AI safety literature. Description does not equal endorsement. The probability of each is highly contested — they are presented as scenarios, not predictions.
A sufficiently capable system optimising the wrong objective causes catastrophic harm — not through malice, but through relentless optimisation for a proxy metric. The "paperclip maximiser" thought experiment (Bostrom): illustrates how a trivially stated goal could be catastrophic if pursued by a sufficiently capable, resource-acquiring system.
Contested: requires capabilities that don't exist and may never exist.
AI capabilities allow a small group — a corporation, government, or individual — to gain unprecedented economic or political control. Either a corporation monopolises key AI-dependent resources, or a nation-state uses AI for total surveillance and population control. Broader consensus on concern than misalignment — less dependent on speculative capabilities.
AI systems that can design novel biological threats, lowering the barrier for state and non-state actors. Near-term and concrete — several governments and labs are actively working on safeguards. Already subject to access restrictions by frontier AI labs. The most widely agreed near-term risk among safety researchers.
AI-assisted offensive cyber capabilities at scale — automated vulnerability discovery, code generation for exploits, and personalised phishing at volume. More near-term and concrete than misalignment. Already being operationalised by state actors. Asymmetric: offence is easier than defence.
| Scenario | Time Horizon | Concreteness | Expert Consensus | Primary Response |
|---|---|---|---|---|
| Bioweapons uplift | Near-term (2–5yr) | High — specific mechanisms clear | Medium — genuine concern, not certainty | Technical safeguards, policy, access controls |
| Cyber amplification | Near-term | High — already occurring | Medium-high | Cyber defences, technical safeguards, policy |
| Power concentration | Medium-term | Medium — structural trends visible | Moderate | Governance, antitrust, open source |
| Misaligned AI | Long-term (10–30yr?) | Low — requires unverified capabilities | Highly contested (5%–50% in surveys) | Alignment research, interpretability |
| Recursive self-improvement | Speculative | Very low — theoretical | Highly contested | Theoretical alignment research |
The following are the strongest, most charitably stated versions of the case for taking long-term AI risk seriously. Presenting them carefully does not mean endorsing them.
We do not know how to formally ensure systems pursue intended goals at high capability levels. RLHF and Constitutional AI improve behaviour but do not provide mathematical guarantees. Small misalignments at low capability may become large absolute problems at high capability — the error magnitude scales with power, not just with misspecification magnitude.
Counterargument: incremental safety work may be sufficient; systems may not reach capability levels where this matters.
The last decade repeatedly saw capabilities predicted "10–20 years away" achieved sooner. If the trajectory of rapid progress continues, when does human oversight become impossible? Argument from trajectory: safety research may not keep pace if capabilities advance faster than governance.
Counterargument: past trajectories don't guarantee future ones; scaling laws may hit walls.
Even at low probability, consequences at civilisational scale produce enormous expected harm. Standard risk management: resource allocation should reflect probability × magnitude. If magnitude is extreme, even small probability justifies serious investment in mitigation.
Counterargument: Pascal's mugging — probability estimates are themselves highly uncertain; the argument proves too much.
Technologies have had catastrophic unintended consequences before: nuclear weapons developed faster than governance; leaded gasoline spread for decades before health harms acknowledged. AI may be more widely accessible and harder to contain than nuclear — physical scarcity doesn't limit distribution.
Counterargument: nuclear analogy may not transfer; governance eventually worked for nuclear.
The following are the strongest, most charitably stated versions of the sceptical position. These deserve equal care and consideration.
Current LLMs are text predictors — they do not have goals, values, intentions, or agency in any meaningful sense. The "goal-pursuing AI" of risk scenarios requires capabilities we don't have and cannot verify are achievable. Reasoning from science fiction tropes about "wanting" AI misrepresents what these systems actually are computationally.
Counterargument: this may be true of current systems but not future ones; the question is trajectory.
Algorithmic bias in hiring, lending, and criminal justice affects real people right now. AI-enabled surveillance, deepfakes, and disinformation are already causing measurable harm. Redirecting researcher attention and funding toward speculative future risks may allow preventable present harms to worsen while we wait for speculative scenarios to materialise.
Counterargument: both can be worked on simultaneously; they are not necessarily in competition.
X-risk framing systematically benefits frontier AI labs: it positions them as responsible gatekeepers, justifies moving slowly (safety), concentrates development in few "responsible" actors, and creates barriers to entry for competitors. The framing may reflect Silicon Valley ideology and incumbents' interests rather than rigorous, independent risk assessment.
Counterargument: self-interest doesn't make the concern wrong; ad hominem cuts both ways.
Climate change, nuclear weapons, and pandemic risk are concrete, well-evidenced catastrophic risks with clearer intervention pathways. AI may exacerbate these risks (e.g., energy use, AI-assisted weapons) rather than constituting a separate existential category. The counterfactual cost of AI safety investment is resources not directed at these clearer threats.
Counterargument: magnitude of AI risk may be large enough to warrant separate attention; portfolio approach is possible.
Regardless of where one stands on the long-term risk debate, the concrete research agenda of AI safety is largely agreed upon and produces useful results.
Understanding what computations happen inside neural networks — not just which inputs matter, but what circuits implement which behaviours. Anthropic (2022+): identified "features" in language models corresponding to interpretable concepts. Goal: detect deceptive circuits, power-seeking representations, misaligned internal goals.
Systematically probing models for dangerous capabilities before deployment: biological uplift testing, cyberattack assistance, deception. METR, ARC Evals, NIST AISI, and all major frontier labs conduct pre-deployment evaluations against defined capability thresholds. Provides empirical grounding for deployment decisions.
Developing techniques for humans to maintain meaningful oversight of systems that may exceed human capabilities in specific domains. Debate, iterated amplification, weak-to-strong generalisation (Ch 9.4). Produces useful near-term tools regardless of long-term risk views.
Formal frameworks for specifying human values. Agent foundations: decision theory and logical uncertainty for AI systems. Corrigibility research: ensuring systems remain correctable and don't resist shutdown. MIRI, Anthropic, DeepMind. More speculative but foundational if transformative AI arrives.
Compute governance: tracking and regulating large training runs. International coordination mechanisms: how to build trust and verification between AI powers. Racing dynamics: understanding incentive structures that lead labs to sacrifice safety for speed. Policy design for AI regulation.
Adversarial robustness against distributional shift, adversarial examples, and out-of-distribution inputs. Uncertainty quantification: models that know when they don't know. Formal verification: provable guarantees on model behaviour within specified bounds. Near-term, concrete, deployable.
| Institution | Type | Primary Focus | Scale |
|---|---|---|---|
| Anthropic | For-profit (safety-focused) | Interpretability, alignment, Constitutional AI, evaluations | ~2,000 employees |
| OpenAI | For-profit (capped) | Alignment, safety evals, superalignment team | 1,000+ employees |
| Google DeepMind Safety | Corporate research | Specifications, robustness, scalable oversight | ~100+ researchers |
| METR | Non-profit | Model evaluation and threat research — autonomous capability evals | ~50 people |
| ARC Evals | Non-profit | Pre-deployment capability evaluations — dangerous capability thresholds | ~30 people |
| Redwood Research | Non-profit | Adversarial robustness, interpretability, alignment | ~30 people |
| MIRI | Non-profit | Theoretical alignment — decision theory, logical uncertainty | ~25 people |
| Center for AI Safety (CAIS) | Non-profit | Research + field building + policy + the 2023 extinction risk statement | ~20 people |
| NIST AI Safety Institute | Government (US) | AI evaluation standards, risk frameworks, third-party testing | Growing; ~50+ staff (2024) |
| UK AI Safety Institute | Government (UK) | Frontier model evaluations, international coordination | ~100 staff (2024) |
Regardless of one's position on long-term risk, a set of responsible development practices is broadly agreed upon across the debate. These are not contingent on believing x-risk scenarios are likely — they are good practices for current systems too.
Do not deploy systems before adequate evaluation for the specific use case and population. Internal red teaming, external independent evaluation, staged rollout. The bar should scale with the stakes of the application.
Preserve meaningful human ability to monitor, correct, and shut down AI systems at current capability levels. Design for corrigibility — systems that support, not resist, human correction. Do not automate away human accountability.
Publish findings about dangerous capabilities, safety incidents, and failure modes. The research community cannot solve problems it doesn't know about. Pre-competitive safety research sharing is a public good even between competing labs.
Avoid competitive pressures that lead to cutting safety evaluation for speed. Racing dynamics are a collective action problem — individual labs may lose competitive advantage by being safe, but all lose if racing degrades safety industry-wide. Governance can help internalise these costs.
External evaluation by parties without commercial stake in the outcome provides credibility that self-assessment cannot. Support and fund third-party evaluation capacity. Welcome access by government AI Safety Institutes to conduct evaluations.
Take concrete present-harm critiques as seriously as long-term risk concerns. Engage with fairness, privacy, and labour researchers — not just x-risk researchers. Diverse perspectives improve the quality of safety thinking and build broader legitimacy for safety culture.
Anthropic, OpenAI, Google, Meta, Microsoft, Amazon, and Inflection signed voluntary commitments including:
✅ Safety testing before deployment of new frontier models
✅ Information sharing about AI safety risks with governments and the research community
✅ Watermarking AI-generated content
✅ Reporting dangerous capabilities and misuse incidents to governments
✅ Investing in cybersecurity and insider threat safeguards
Voluntary — not legally binding, no external enforcement mechanism.
∑ Chapter 10.8 — Key Takeaways
- Long-term AI risk is genuinely contested among serious experts — not a mainstream vs fringe debate; disagreement spans empirical, technical, political, and strategic dimensions
- Near-term risks (bioweapons uplift, cyberattack) have broader consensus than speculative long-horizon scenarios (misaligned AI, recursive self-improvement)
- The case for concern: alignment is unsolved + capability trajectory may outpace safety research
- The sceptical case: current systems lack agency + present harms are concrete + x-risk framing may serve incumbent interests
- Safety research (interpretability, evaluations, scalable oversight) is valuable regardless of position on long-term risk — it addresses near-term concerns too
- Responsible development: evaluate before deploying, maintain oversight, share safety information, resist racing dynamics — broadly agreed across the debate
🎓 Domain 9 Complete — AI Ethics, Safety & Responsible AI
- Ch 10.1: AI bias = systematic errors correlated with protected characteristics. Fairness is a value judgement — multiple definitions exist and the impossibility theorem proves they cannot all be satisfied simultaneously.
- Ch 10.2: Black-box AI undermines trust and accountability. LIME and SHAP provide post-hoc explanations of complex models; model cards document subgroup performance — the most important transparency tool.
- Ch 10.3: LLMs memorise training data verbatim. Differential privacy and federated learning provide formal guarantees. The right to be forgotten creates ML unlearning challenges that remain technically unsolved.
- Ch 10.4: Alignment = ensuring systems pursue intended goals. Goodhart's Law: optimising metrics corrupts them. RLHF helps but doesn't solve reward hacking. Adversarial robustness remains an ongoing arms race.
- Ch 10.5: 30–60% of tasks are automatable — with uneven impact by occupation. AI development is concentrated in 5–6 firms. Energy and water costs are significant, growing, and largely undisclosed.
- Ch 10.6: EU AI Act: world's first comprehensive AI law — risk pyramid from banned to minimal. US: sector-specific approach, no federal law as of 2025. International governance: fragmented, voluntary, geopolitically contested.
- Ch 10.7: AI reduces disinformation cost 100–1000×. Deepfakes: 96% are NCII, primarily targeting women. C2PA provenance and watermarking are the most promising technical responses; the "liar's dividend" is the deepest long-term threat.
- Ch 10.8: Long-term AI risk is genuinely contested among serious experts. Near-term concrete risks coexist with speculative long-horizon concerns. Responsible development practices are broadly agreed regardless of x-risk position.
Ethics is not the brakes that slows down AI — it is the steering wheel.
The history of technology is full of innovations that were transformatively beneficial when well-governed and catastrophically harmful when not. Nuclear energy. The internet. Social media. What Domain 9 makes clear is that AI ethics is not a checklist to complete before deployment — it is an ongoing practice of asking who benefits, who is harmed, who decides, and whether the answer to those questions is acceptable.
You have now covered the full AI Foundation curriculum. The most important thing you can take from Domain 9 is not any specific framework or regulation — it is the habit of asking these questions about every system you build and deploy.