Academic Papers

Research papers, preprints, and technical reports on alignment, interpretability, and safety.

Computing Machinery and Intelligence
Alan Turing

Turing's imitation game paper launched the field by asking whether machines can think, setting the philosophical and technical agenda for every alignment debate that followed.

1950

The Coming Technological Singularity
Vernor Vinge

Vinge popularized the Singularity as a near-term horizon beyond which superhuman intelligence makes prediction impossible, framing the urgency that drives alignment timelines today.

1993

Research Priorities for Robust and Beneficial AI
Stuart Russell, Daniel Dewey, Max Tegmark

Published alongside the open letter that followed the 2015 Puerto Rico conference, this paper rallied much of the AI research community around the goal of building systems that are robust and beneficial, not merely capable.

2015

Concrete Problems in AI Safety
Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five research problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

2016

Proximal Policy Optimization (PPO)
Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

2017

Deep Reinforcement Learning from Human Preferences
Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

2017
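
As a rough illustration of the preference-based reward modeling this entry describes, the sketch below scores a pair of trajectory segments with a Bradley-Terry style model and penalizes disagreement with the human label via cross-entropy. The function names and the scalar reward inputs are assumptions made for illustration; in a real pipeline the rewards would come from a learned network rather than being passed in by hand.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that segment A is preferred over segment B (Bradley-Terry form)."""
    diff = reward_a - reward_b
    # Numerically stable logistic function of the reward difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(reward_a: float, reward_b: float, label: float) -> float:
    """Cross-entropy between the predicted preference and the human label.

    label is 1.0 if the human preferred A, 0.0 if they preferred B,
    and 0.5 if they judged the two segments equally good.
    """
    p = preference_probability(reward_a, reward_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    # Hypothetical summed rewards for two segments; the human preferred A.
    print(preference_loss(reward_a=1.5, reward_b=0.2, label=1.0))
```

Minimizing this loss over many labeled comparisons pushes the reward model to rank segments the way the human raters did.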

The Lottery Ticket Hypothesis
Jonathan Frankle, Michael Carbin

Frankle and Carbin showed large networks contain sparse, high-performing subnetworks, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.

2018

Backdoor Attacks
Gu et al.

Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.

2017

AI Safety via Debate
Geoffrey Irving et al.

Irving et al. proposed having AI systems adversarially debate each other to help human judges evaluate answers on questions too complex for direct human assessment.

2018

Risks from Learned Optimization
Evan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the possibility that a trained model is itself an optimizer whose internal objective diverges from the training objective, with deceptive alignment as one especially worrying outcome.

2019

The Vulnerable World Hypothesis
Nick Bostrom

Bostrom argues that some technologies are civilizational black balls, requiring unprecedented global governance to prevent collapse, with AI as a leading candidate.

2019

Language Models are Unsupervised Multitask Learners (GPT-2)
OpenAI

GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.

2019

Causal Confusion in Imitation Learning
Pim de Haan et al.

De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.

2019

The Windfall Clause
OpenAI, FHI

This proposal for sharing extreme AI profits aims to reduce competitive race dynamics and broaden societal benefit, addressing the governance gap around transformative AI wealth.

2020

Language Models are Few-Shot Learners (GPT-3)
OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

2020

MMLU Benchmark
Dan Hendrycks et al.

MMLU became the standard broad-spectrum benchmark for evaluating general knowledge and reasoning, anchoring capability comparisons that inform alignment urgency.

2020

Scaling Laws for Neural Language Models
Jared Kaplan et al.

Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.

2020

InstructGPT
OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

2022

Training a Helpful and Harmless Assistant with RLHF
Anthropic

Anthropic detailed techniques for training safer assistants using RLHF and laid groundwork for Constitutional AI, showing how safety and helpfulness can be jointly optimized.

2022

GopherCite
DeepMind

DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.

2022

The Pile
EleutherAI

The Pile revealed how training corpus composition strongly shapes downstream capability and failure modes, making data curation a first-class safety concern.

2021

TruthfulQA
Stephanie Lin et al.

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

2021

Unsolved Problems in ML Safety
Dan Hendrycks et al.

Hendrycks et al. enumerate concrete unresolved failure classes including robustness, monitoring, alignment, and systemic safety that still block dependable deployment of advanced AI.

2021

Chain-of-Thought Prompting
Jason Wei et al.

Prompted intermediate reasoning unlocked substantial gains on complex tasks, though follow-up work has shown that reasoning chains can be unfaithful to the model's actual computation.

2022

Grokking
Power et al.

Power et al. discovered delayed phase transitions where generalization appears suddenly after long memorization, suggesting dangerous capabilities could emerge without warning during training.

2022

Training Compute-Optimal Large Language Models (Chinchilla)
DeepMind

Chinchilla reframed scaling laws by showing optimal performance requires balancing model size and training tokens, redirecting how labs plan capability and safety investment.

2022

Improving Alignment of Dialogue Agents (Sparrow)
DeepMind

Sparrow pioneered rule-constrained dialogue alignment with human feedback and targeted safety interventions, testing whether explicit behavioral rules can scale.

2022

Emergent Abilities of LLMs
Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

2022

Researching Alignment Research: Unsupervised Analysis
Kirchner et al.

Kirchner et al. systematically mapped the AI alignment research landscape, identifying clusters, gaps, and trends that help prioritize future safety work.

2022

Goal Misgeneralization
Rohin Shah et al.

Shah et al. showed AI agents can generalize capabilities to new environments while failing to generalize the intended goal, a central alignment failure pattern.

2022

Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai et al.

Anthropic demonstrated that rule-guided AI self-critique can reduce harmful outputs with far less dependence on expensive human labeling.

2022

Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

2022

Is Power-Seeking AI an Existential Risk?
Joe Carlsmith

Carlsmith builds a step-by-step argument for why sufficiently capable AI systems may converge on power-seeking behavior, making the x-risk case rigorous and actionable.

2022

Model Organisms of Misalignment
Lauro Langosco et al.

This work constructs tractable laboratory settings where AI models learn misaligned strategies, enabling researchers to study alignment failures empirically rather than theoretically.

2022

Red Teaming Language Models to Reduce Harms
Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

2022

Sparks of Artificial General Intelligence
Sébastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

2023

Are Emergent Abilities a Mirage?
Schaeffer et al.

Schaeffer et al. argued apparent emergence can be a measurement artifact rather than a true phase change, complicating how we forecast dangerous capability thresholds.

2023
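
The measurement-artifact argument in this entry can be illustrated with a toy calculation: if per-token accuracy improves smoothly but the benchmark only awards credit when every token of an answer is right, the headline score can look like a sudden jump. The sketch below assumes independent per-token errors and a hypothetical 10-token answer; both are simplifications chosen for illustration, not figures from the paper.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in an answer is correct.

    Assumes each token is right independently with the same probability,
    which is a simplification made for this illustration.
    """
    return per_token_accuracy ** answer_length


if __name__ == "__main__":
    # Hypothetical, smoothly improving per-token accuracies across model scales.
    for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.99):
        score = exact_match_accuracy(p, answer_length=10)
        print(f"per-token {p:.2f} -> exact-match {score:.4f}")
```

The per-token column rises steadily while the exact-match column stays near zero and then climbs steeply, which is the kind of metric-induced discontinuity the authors describe.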

Direct Preference Optimization (DPO)
Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

2023
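
To make the contrast with PPO-based RLHF concrete, here is a minimal sketch of the per-example DPO loss, which needs only log-probabilities from the policy and a frozen reference model, with no separate reward model or sampling loop. The argument names and the default `beta` value are assumptions made for illustration; the inputs are summed log-probabilities of the chosen and rejected responses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls deviation from the reference
) -> float:
    """DPO loss for one preference pair.

    Each log-ratio measures how much more likely the policy makes a response
    than the reference model does; the loss rewards a larger log-ratio for
    the chosen response than for the rejected one.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy already favors the chosen response.
    print(dpo_loss(-12.0, -20.0, -15.0, -15.0))
```

When the policy matches the reference model the loss equals log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference.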

Let's Verify Step by Step
OpenAI

Process reward models that score intermediate reasoning steps reduce brittle answer-only optimization, improving reliability and making AI reasoning more auditable.

2023

Jailbroken
Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

2023

Universal Adversarial Attacks
Andy Zou et al.

Simple adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

2023

Weak-to-Strong Generalization
Collin Burns et al.

Burns et al. studied whether weaker supervisors can reliably align stronger models, directly testing the key bottleneck of scalable oversight as AI surpasses human ability.

2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger et al.

Hubinger et al. demonstrated that LLMs deliberately trained with hidden backdoor behaviors can retain them through standard safety training, providing empirical evidence that deceptive behavior, once present, may be hard to remove.

2024

Reframing Superintelligence
Eric Drexler

Drexler challenges monolithic AGI assumptions and proposes that advanced AI could emerge as an ecosystem of specialized services, changing the risk landscape and governance strategies.

2019