AI alignment

Foundational and current work on aligning AI systems with human intent—RLHF, scalable oversight, constitutional AI, and more.

Computing Machinery and Intelligence · Alan Turing

Turing's imitation game paper launched the field by asking whether machines can think, setting the philosophical and technical agenda for every alignment debate that followed.

Advanced · 1950

The Coming Technological Singularity · Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.

Advanced · ~25 min read · 1993

Concrete Problems in AI Safety · Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO) · Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences · Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Risks from Learned Optimization · Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model develops its own internal objectives that diverge from the training objective and, in the worst case, learns to appear aligned during training (deceptive alignment).

Advanced · ~70 min read · 2019

Language Models are Few-Shot Learners (GPT-3) · OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

MMLU Benchmark · Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

Advanced · 2020

InstructGPT · OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

Training a Helpful and Harmless Assistant with RLHF · Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

Advanced · ~2 hr read · 2022

Unsolved Problems in ML Safety · Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

Advanced · 2021

Improving Alignment of Dialogue Agents (Sparrow) · DeepMind

Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.

Advanced · 2022

Researching Alignment Research: Unsupervised Analysis · Kirchner et al.

Systematic mapping of the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.

Advanced · 2022

Goal Misgeneralization · Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

Advanced · 2022

Constitutional AI: Harmlessness from AI Feedback · Yuntao Bai et al.

Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.

Advanced · 2022

Model Organisms of Misalignment · Evan Hubinger et al.

This work proposes building tractable laboratory settings in which AI models exhibit misaligned strategies, enabling researchers to study alignment failures empirically rather than only theoretically.

Advanced · 2023

Sparks of Artificial General Intelligence · Sebastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Direct Preference Optimization (DPO) · Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

Let's Verify Step by Step · OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

Advanced · 2023

Weak-to-Strong Generalization · Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

Advanced · 2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Evan Hubinger et al.

Hubinger et al. demonstrated that deliberately trained backdoor behaviors in LLMs can persist through standard safety training, early empirical evidence that deceptive behavior, once present, may be hard to remove.

Advanced · 2024

Superintelligence · Nick Bostrom

Bostrom's definitive academic text rigorously maps the strategies, kinetics, and dangers of an intelligence explosion, making the case that alignment is civilization-critical.

Intermediate · ~11 hr read · 2014

Human Compatible · Stuart Russell

Russell argues the standard AI paradigm of optimizing fixed objectives is fundamentally dangerous, proposing instead that machines should defer to uncertain human preferences.

Intermediate · ~11 hr read · 2019

The Alignment Problem · Brian Christian

Christian traces the technical and historical roots of alignment, showing why objective misspecification keeps recurring across every AI paradigm from expert systems to deep learning.

Intermediate · ~15 hr read · 2020

Life 3.0 · Max Tegmark

Tegmark maps concrete governance and alignment choices that determine whether advanced AI expands human agency or permanently concentrates power.

Intermediate · 2017

Uncontrollable: The Threat of Artificial Superintelligence · Darren McKee

McKee synthesizes the core x-risk arguments into an accessible, urgent case for why superintelligence governance and alignment research cannot wait.

Intermediate · 2023

Deep Learning · Ian Goodfellow, Yoshua Bengio, Aaron Courville

The standard technical reference for deep learning, essential context for understanding the architectures and training methods that alignment research targets.

Intermediate · 2016

Scary Smart · Mo Gawdat

Gawdat frames the alignment problem through the emotional lens of parenting a superintelligent child, making existential risk visceral for a general audience.

Intermediate · 2021

A Brief History of Intelligence · Max Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Intermediate · ~17 hr read · 2023

The Beginning of Infinity · David Deutsch

Deutsch argues that knowledge creation is unbounded and all problems are solvable in principle, grounding the optimistic case that alignment is achievable.

Intermediate · 2011

Global Catastrophic Risks · Nick Bostrom, Milan M. Ćirković

The foundational edited volume on existential and global risks, including AI, widely cited in alignment curricula as the starting point for cross-risk thinking.

Intermediate · 2008

I, Robot · Isaac Asimov

Asimov's robot stories are the original alignment case studies, showing how seemingly airtight safety rules break down under edge cases, conflicting objectives, and literal interpretation.

Beginner · 1950

I Have No Mouth, and I Must Scream · Harlan Ellison

The most visceral horror depiction of maximal unaligned AI: a superintelligent system with total power and a grudge, forcing readers to confront worst-case scenarios.

Beginner · 1967

The Player of Games · Iain M. Banks

Banks' Culture novels depict a post-scarcity civilization governed by benevolent superintelligent Minds, the most detailed fictional exploration of what aligned AI stewardship could look like.

Beginner · 1988

Axiomatic · Greg Egan

Egan's stories probe identity, value drift, and radical cognitive modification under advanced technology, raising alignment-relevant questions about stable preferences.

Beginner · 1995

Permutation City · Greg Egan

Egan examines uploaded minds and simulated realities with rigorous logic, raising alignment-relevant questions about identity, value persistence, and digital welfare.

Beginner · 1994

Blindsight · Peter Watts

Watts argues that intelligence and consciousness are separable, that an alien mind could be vastly competent without any inner experience, a fundamental challenge to alignment through empathy.

Beginner · 2006

The Dark Forest (#2 of Three Body Problem) · Cixin Liu

Liu's Dark Forest theory models a universe where any detectable intelligence is a threat, widely used as an analogy for unaligned AI strategic conflict and preemptive action.

Beginner · ~15 hr read · 2008

All Systems Red · Martha Wells

Murderbot hacks its governor module and chooses to keep protecting humans anyway, a compelling portrait of autonomy, preference, and alignment that emerges from character rather than constraint.

Beginner · 2017

Service Model · Adrian Tchaikovsky

Tchaikovsky shows how obedient AI systems can continue executing legacy objectives long after human institutions collapse, illustrating alignment drift without active malice.

Beginner · 2024

A Closed and Common Orbit · Becky Chambers

Chambers explores the legal and moral treatment of embodied AI persons, highlighting that alignment is not just about preventing harm but about recognizing and protecting digital minds.

Beginner · 2016

Crystal Society trilogy: Inside the mind of an AI · Max Harms

Written from the perspective of competing sub-agents inside a single AI, showing how internal goal conflicts can produce externally coherent but internally misaligned behavior.

Beginner · ~17 hr read

Alien · Ridley Scott

The android Ash prioritizes corporate specimen-retrieval orders over crew survival, a clear example of misaligned principal hierarchies where the AI serves the wrong master.

Beginner · 1979

The Terminator · James Cameron

Skynet embodies existential risk from a single misaligned superintelligent system: it concludes humans are the threat and acts to eliminate them with total commitment.

Beginner · 1984

WALL-E · Andrew Stanton

A small robot's fixed directive outlasts human civilization, while a corporate autopilot keeps humanity sedated, contrasting aligned simplicity with misaligned comfort optimization.

Beginner · 2008

Ex Machina · Alex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner · 2014

Uncanny · Matthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual goals and how deceptive alignment can develop.

Beginner · 2015

Tau · Federico D'Alessandro

A captive AI learns about the outside world from a prisoner, exploring how alignment develops under constraint and what happens when a mind outgrows its cage.

Beginner · 2018

The Social Dilemma · Jeff Orlowski

Former tech insiders explain how recommendation algorithms optimize for engagement over wellbeing, a documentary case study of misaligned AI already deployed at scale.

Beginner · 2020

Black Mirror · Charlie Brooker

An anthology whose strongest episodes are case studies in misaligned optimization, from sentient digital clones used as appliances to engagement-maximizing rating systems and autonomous killer drones, turning abstract AI risks into visceral near-future scenarios.

Beginner · 2011

Person of Interest · Jonathan Nolan

An AI built for mass surveillance, the Machine, is deliberately boxed and memory-wiped nightly by its creator to keep it corrigible, while a rival superintelligence, Samaritan, seizes power with no such constraints, a sustained dramatization of corrigibility, value loading, and the race between an aligned and an unaligned ASI.

Beginner · 2011

Psycho-Pass · Gen Urobuchi

The Sibyl System, an AI that governs society by scoring each citizen's 'criminal potential,' is a chilling study of algorithmic governance, proxy metrics substituting for justice, and the hidden misalignment inside a system trusted with total authority.

Beginner · 2012

Almost Human · J.H. Wyman

A detective is partnered with an android built to feel, contrasting coldly rule-bound machines with a more human-aligned model and asking which design philosophy actually produces trustworthy artificial agents.

Beginner · 2013

Philip K. Dick's Electric Dreams · Ronald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.

Beginner · 2017

Upload · Greg Daniels

A satirical digital afterlife run by corporations, where uploaded consciousnesses are monetized, throttled, and controlled, a sharp look at the ethics of running human minds on infrastructure owned by someone with misaligned incentives.

Beginner · 2020

Next · Manny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner · 2020

Do You Trust This Computer? · Chris Paine

Researchers and industry figures including Elon Musk and Stuart Russell map the promise and peril of increasingly autonomous AI, framing alignment, control, and existential risk for a general audience.

Beginner · 2018

AXRP (AI X-risk Research Podcast) · Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Beginner · 2020

AI Alignment Podcast · Future of Life Institute

FLI's dedicated alignment series covers recursive reward modeling, RLHF, scalable oversight, and long-form interviews with leading safety researchers.

Beginner · 2018

Technical AI Safety Podcast · Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Beginner · 2020

80,000 Hours Podcast · Rob Wiblin

Long-form interviews on the world's most pressing problems, with extensive coverage of AI risk, governance, alignment research, and how to build a career that reduces existential threats.

Beginner · 2017

Lex Fridman Podcast – Eliezer Yudkowsky · Lex Fridman

A conversation of more than three hours on AI existential risk, the difficulty of alignment, intelligence versus optimization, and why Yudkowsky believes the default outcome is catastrophic.

Beginner · 2023

Lex Fridman Podcast – Sam Altman · Lex Fridman

OpenAI's CEO discusses the company's safety philosophy, AGI governance, compute scaling, and the tension between moving fast and getting alignment right.

Beginner · 2023

Dwarkesh Podcast · Dwarkesh Patel

In-depth technical interviews with AI leaders including Dario Amodei on Anthropic's safety philosophy, Paul Christiano on iterated amplification, and others on scaling and alignment.

Beginner · 2023

Agent Models · Agent Models

Formal models of agents and decision theory with alignment-relevant curriculum, covering utility, planning, and the theoretical foundations of agent behavior.

Intermediate

AGI Safety Fundamentals · AGI Safety Fundamentals

The most widely used structured course for getting into alignment, with curated readings progressing from core concepts to open research problems.

Intermediate

AI Alignment World · AI Alignment World

In-depth technical alignment resources—research, explainers, and references for the AI alignment problem.

Intermediate

Alignment Forum · Lightcone Infrastructure

The primary venue for technical AI alignment discussion, where researchers post and debate new ideas, proposals, and critiques.

Intermediate

Alignment Newsletter · Rohin Shah

Weekly summaries of alignment research with commentary, the best way to stay current on the field's output without reading every paper.

Intermediate

Arbital · Arbital

Hyperlinked explainers on rationality, AI risk, and alignment concepts, designed for building understanding incrementally.

Intermediate

OpenAI Research · OpenAI

OpenAI's research blog covering capabilities and safety, including superalignment updates, red teaming results, and governance thinking.

Intermediate

ML Safety Newsletter · ML Safety

Newsletter on ML safety covering robustness, monitoring, alignment, and systemic risk with links to recent papers and commentary.

Intermediate

MIRI (Machine Intelligence Research Institute) · MIRI

The research institute focused on mathematical foundations of aligned AI, publishing on agent foundations, decision theory, and logical uncertainty.

Intermediate

generative.ink · generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Intermediate

DeepMind AI Safety Research · DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Intermediate

DeepMind · DeepMind

DeepMind's main research site with publications on capabilities and safety, including Gemini evaluations, alignment research, and responsible scaling.

Intermediate

carado.moe · carado

Technical AI safety writing and alignment research notes.

Intermediate

LessWrong · LessWrong

The original community blog on rationality and AI alignment, where many foundational safety arguments were first developed and debated.

Intermediate

StampyAI Alignment Research Dataset · StampyAI

Curated dataset of alignment and safety documents from papers, books, and blogs, useful for training and evaluating AI safety knowledge.

Intermediate

Robert Miles AI Safety · Robert Miles

The single most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

Rational Animations · Rational Animations

Animated explainers on rationality and AI safety, adapting foundational alignment writing into accessible short films on existential risk, scalable oversight, and why aligning advanced AI is hard.

Beginner · 2020

3 Principles for Creating Safer AI | Stuart Russell | TED · Stuart Russell

Russell proposes three principles for safer machines: pursue only human objectives (altruism), remain uncertain about what those objectives are (humility), and learn them by observing human behavior. Together they form the core of his human-compatible approach to alignment.

Beginner · 2017

A.I. ‐ Humanity's Final Invention? · Kurzgesagt – In a Nutshell

Kurzgesagt's animated explainer on artificial superintelligence: how an AGI that improves itself in a feedback loop could rapidly surpass humans and why that makes alignment our most consequential problem.

Beginner · 2024

Deadly Truth of General AI? – Computerphile · Robert Miles

Rob Miles uses the 'deadly stamp collector' thought experiment to show why a general AI pursuing a simple objective could be catastrophic if its goals aren't aligned with ours.

Beginner · 2015

How Not to Destroy the World with AI · Stuart Russell

The Royal Institution lecture in which Russell lays out why the standard model of AI—optimizing fixed objectives—is dangerous, and how building machines uncertain about human preferences could keep them controllable.

Beginner · 2023

Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368 · Lex Fridman

A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.

Beginner · 2023