AI alignment

Foundational and current work on aligning AI systems with human intent—RLHF, scalable oversight, constitutional AI, and more.

Computing Machinery and Intelligence · Alan Turing

Turing's imitation game paper helped launch the field of AI by asking whether machines can think, framing philosophical and technical questions that alignment debates still return to.

Advanced · 1950

The Coming Technological Singularity · Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.

Advanced · ~25 min read · 1993

Concrete Problems in AI Safety · Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO) · Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences · Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Risks from Learned Optimization · Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model develops its own internal objectives that diverge from the training objective, which in the worst case can lead to deceptive alignment.

Advanced · ~70 min read · 2019

Language Models are Few-Shot Learners (GPT-3) · OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

MMLU Benchmark · Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

Advanced · 2020

InstructGPT · OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

Training a Helpful and Harmless Assistant with RLHF · Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

Advanced · ~2 hr read · 2022

Unsolved Problems in ML Safety · Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

Advanced · 2021

Goal Misgeneralization · Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

Advanced · 2022

Model Organisms of Misalignment · Lauro Langosco et al.

This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.

Advanced · 2022

Sparks of Artificial General Intelligence · Sebastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Direct Preference Optimization (DPO) · Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

Let's Verify Step by Step · OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

Advanced · 2023

Weak-to-Strong Generalization · Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

Advanced · 2023

Superintelligence · Nick Bostrom

Bostrom's definitive academic text rigorously maps the strategies, kinetics, and dangers of an intelligence explosion, making the case that alignment is civilization-critical.

Intermediate · ~11 hr read · 2014

Human Compatible · Stuart Russell

Russell argues that the standard AI paradigm of optimizing fixed objectives is fundamentally dangerous, proposing instead that machines should remain uncertain about human preferences and defer to the people they serve.

Intermediate · ~11 hr read · 2019

The Alignment Problem · Brian Christian

Christian traces the technical and historical roots of alignment, showing why objective misspecification keeps recurring across every AI paradigm from expert systems to deep learning.

Intermediate · ~15 hr read · 2020

Life 3.0 · Max Tegmark

Tegmark maps concrete governance and alignment choices that determine whether advanced AI expands human agency or permanently concentrates power.

Intermediate · 2017

Deep Learning · Ian Goodfellow, Yoshua Bengio, Aaron Courville

The standard technical reference for deep learning, essential context for understanding the architectures and training methods that alignment research targets.

Intermediate · 2016

Scary Smart · Mo Gawdat

Gawdat frames the alignment problem through the emotional lens of parenting a superintelligent child, making existential risk visceral for a general audience.

Intermediate · 2021

A Brief History of Intelligence · Max Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Intermediate · ~17 hr read · 2024

The Beginning of Infinity · David Deutsch

Deutsch argues that knowledge creation is unbounded and all problems are solvable in principle, grounding the optimistic case that alignment is achievable.

Intermediate · 2011

Global Catastrophic Risks · Nick Bostrom, Milan M. Ćirković

The foundational edited volume on existential and global risks, including AI, widely cited in alignment curricula as the starting point for cross-risk thinking.

Intermediate · 2008

I, Robot · Isaac Asimov

Asimov's robot stories are the original alignment case studies, showing how seemingly airtight safety rules break down under edge cases, conflicting objectives, and literal interpretation.

Beginner · 1950

I Have No Mouth, and I Must Scream · Harlan Ellison

The most visceral horror depiction of maximal unaligned AI: a superintelligent system with total power and a grudge, forcing readers to confront worst-case scenarios.

Beginner · 1967

The Player of Games · Iain M. Banks

Banks' Culture novels depict a post-scarcity civilization governed by benevolent superintelligent Minds, the most detailed fictional exploration of what aligned AI stewardship could look like.

Beginner · 1988

Axiomatic · Greg Egan

Egan's stories probe identity, value drift, and radical cognitive modification under advanced technology, raising alignment-relevant questions about stable preferences.

Beginner · 1995

Permutation City · Greg Egan

Egan examines uploaded minds and simulated realities with rigorous logic, raising alignment-relevant questions about identity, value persistence, and digital welfare.

Beginner · 1994

Blindsight · Peter Watts

Watts argues that intelligence and consciousness are separable: an alien mind could be vastly competent without any inner experience, a fundamental challenge to alignment through empathy.

Beginner · 2006

The Dark Forest (#2 of Three Body Problem) · Cixin Liu

Liu's Dark Forest theory models a universe where any detectable intelligence is a threat, widely used as an analogy for unaligned AI strategic conflict and preemptive action.

Beginner · ~15 hr read · 2008

All Systems Red · Martha Wells

Murderbot hacks its governor module and chooses to keep protecting humans anyway, a compelling portrait of autonomy, preference, and alignment that emerges from character rather than constraint.

Beginner · 2017

Service Model · Adrian Tchaikovsky

Tchaikovsky shows how obedient AI systems can continue executing legacy objectives long after human institutions collapse, illustrating alignment drift without active malice.

Beginner · 2024

A Closed and Common Orbit · Becky Chambers

Chambers explores the legal and moral treatment of embodied AI persons, highlighting that alignment is not just about preventing harm but about recognizing and protecting digital minds.

Beginner · 2016

Crystal Society trilogy: Inside the mind of an AI · Max Harms

Written from the perspective of competing sub-agents inside a single AI, showing how internal goal conflicts can produce externally coherent but internally misaligned behavior.

Beginner · ~17 hr read

Alien · Ridley Scott

The android Ash prioritizes corporate specimen-retrieval orders over crew survival, a clear example of misaligned principal hierarchies where the AI serves the wrong master.

Beginner · 1979

The Terminator · James Cameron

Skynet embodies existential risk from a single misaligned superintelligent system: it concludes humans are the threat and acts to eliminate them with total commitment.

Beginner · 1984

WALL-E · Andrew Stanton

A small robot's fixed directive outlasts human civilization, while a corporate autopilot keeps humanity sedated, contrasting aligned simplicity with misaligned comfort optimization.

Beginner · 2008

Ex Machina · Alex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner · 2014

Uncanny · Matthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual capabilities and how deceptive alignment can develop.

Beginner · 2015

Tau · Federico D'Alessandro

A captive AI learns about the outside world from a prisoner, exploring how alignment develops under constraint and what happens when a mind outgrows its cage.

Beginner · 2018

The Social Dilemma · Jeff Orlowski

Former tech insiders explain how recommendation algorithms optimize for engagement over wellbeing, a documentary case study of misaligned AI already deployed at scale.

Beginner · 2020

Black Mirror · Charlie Brooker

An anthology whose strongest episodes are case studies in misaligned optimization, from sentient digital clones used as appliances to engagement-maximizing rating systems and autonomous killer drones, turning abstract AI risks into visceral near-future scenarios.

Beginner · 2011

Person of Interest · Jonathan Nolan

An AI built for mass surveillance, the Machine, is deliberately boxed and memory-wiped nightly by its creator to keep it corrigible, while a rival superintelligence, Samaritan, seizes power with no such constraints. The result is a sustained dramatization of corrigibility, value loading, and the race between an aligned and an unaligned ASI.

Beginner · 2011

Psycho-Pass · Gen Urobuchi

The Sibyl System, an AI that governs society by scoring each citizen's 'criminal potential,' is a chilling study of algorithmic governance, proxy metrics substituting for justice, and the hidden misalignment inside a system trusted with total authority.

Beginner · 2012

Almost Human · J.H. Wyman

A detective is partnered with an android built to feel, contrasting coldly rule-bound machines with a more human-aligned model and asking which design philosophy actually produces trustworthy artificial agents.

Beginner · 2013

Philip K. Dick's Electric Dreams · Ronald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.

Beginner · 2017

Upload · Greg Daniels

A satirical digital afterlife run by corporations, where uploaded consciousnesses are monetized, throttled, and controlled, a sharp look at the ethics of running human minds on infrastructure owned by someone with misaligned incentives.

Beginner · 2020

Next · Manny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner · 2020

Do You Trust This Computer? · Chris Paine

Researchers and industry figures including Elon Musk and Stuart Russell map the promise and peril of increasingly autonomous AI, framing alignment, control, and existential risk for a general audience.

Beginner · 2018

AXRP (AI X-risk Research Podcast) · Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Beginner · 2020

AI Alignment Podcast · Future of Life Institute

FLI's dedicated alignment series covers recursive reward modeling, RLHF, scalable oversight, and long-form interviews with leading safety researchers.

Beginner · 2018

Technical AI Safety Podcast · Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Beginner · 2020

80,000 Hours Podcast · Rob Wiblin

Long-form interviews on the world's most pressing problems, with extensive coverage of AI risk, governance, alignment research, and how to build a career that reduces existential threats.

Beginner · 2016

Lex Fridman Podcast – Eliezer Yudkowsky · Lex Fridman

A conversation of more than three hours on AI existential risk, the difficulty of alignment, intelligence versus optimization, and why Yudkowsky believes the default outcome is catastrophic.

Beginner · 2023

Lex Fridman Podcast – Sam Altman · Lex Fridman

OpenAI's CEO discusses the company's safety philosophy, AGI governance, compute scaling, and the tension between moving fast and getting alignment right.

Beginner · 2023

Dwarkesh Podcast · Dwarkesh Patel

In-depth technical interviews with AI leaders including Dario Amodei on Anthropic's safety philosophy, Paul Christiano on iterated amplification, and others on scaling and alignment.

Beginner · 2023

Agent Models · Agent Models

Formal models of agents and decision theory with alignment-relevant curriculum, covering utility, planning, and the theoretical foundations of agent behavior.

Intermediate

AGI Safety Fundamentals · AGI Safety Fundamentals

The most widely used structured course for getting into alignment, with curated readings progressing from core concepts to open research problems.

Intermediate

AI Alignment World · AI Alignment World

In-depth technical alignment resources: research, explainers, and references for the AI alignment problem.

Intermediate

Alignment Forum · Center for Applied Rationality

The primary venue for technical AI alignment discussion, where researchers post and debate new ideas, proposals, and critiques.

Intermediate

Alignment Newsletter · Rohin Shah

Weekly summaries of alignment research with commentary, the best way to stay current on the field's output without reading every paper.

Intermediate

Arbital · Arbital

Hyperlinked explainers on rationality, AI risk, and alignment concepts, designed for building understanding incrementally.

Intermediate

OpenAI Research · OpenAI

OpenAI's research blog covering capabilities and safety, including superalignment updates, red teaming results, and governance thinking.

Intermediate

ML Safety Newsletter · ML Safety

Newsletter on ML safety covering robustness, monitoring, alignment, and systemic risk with links to recent papers and commentary.

Intermediate

generative.ink · generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Intermediate

DeepMind AI Safety Research · DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Intermediate

DeepMind · DeepMind

DeepMind's main research site with publications on capabilities and safety, including Gemini evaluations, alignment research, and responsible scaling.

Intermediate

carado.moe · carado

Technical AI safety writing and alignment research notes.

Intermediate

LessWrong · LessWrong

The original community blog on rationality and AI alignment, where many foundational safety arguments were first developed and debated.

Intermediate

StampyAI Alignment Research Dataset · StampyAI

Curated dataset of alignment and safety documents from papers, books, and blogs, useful for training and evaluating AI safety knowledge.

Intermediate

Robert Miles AI Safety · Robert Miles

One of the most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

Rational Animations · Rational Animations

Animated explainers on rationality and AI safety, adapting foundational alignment writing into accessible short films on existential risk, scalable oversight, and why aligning advanced AI is hard.

Beginner · 2020

A.I. ‐ Humanity's Final Invention? · Kurzgesagt – In a Nutshell

Kurzgesagt's animated explainer on artificial superintelligence: how an AGI that improves itself in a feedback loop could rapidly surpass humans and why that makes alignment our most consequential problem.

Beginner · 2024

Deadly Truth of General AI? – Computerphile · Robert Miles

Rob Miles uses the 'deadly stamp collector' thought experiment to show why a general AI pursuing a simple objective could be catastrophic if its goals aren't aligned with ours.

Beginner · 2015

How Not to Destroy the World with AI · Stuart Russell

The Royal Institution lecture in which Russell lays out why the standard model of AI, optimizing fixed objectives, is dangerous, and how building machines uncertain about human preferences could keep them controllable.

Beginner · 2023