Reinforcement learning & reward hacking

Reinforcement learning as it bears on safety: reward hacking, specification gaming, imitation learning, and policy optimization.

Concrete Problems in AI Safety · Dario Amodei et al.

Amodei et al. grounded AI safety as a concrete ML research agenda by cataloging five failure modes: reward hacking, side effects, distributional shift, unsafe exploration, and scalable oversight.

Advanced · ~45 min read · 2016

Proximal Policy Optimization (PPO) · Schulman et al.

PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.

Advanced · 2017

Deep Reinforcement Learning from Human Preferences · Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Causal Confusion in Imitation Learning · Pim de Haan et al.

De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.

Advanced · 2019

GopherCite · DeepMind

DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.

Advanced · ~70 min read · 2022

Direct Preference Optimization (DPO) · Rafailov et al.

DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.

Advanced · 2023

AlphaGo · Greg Kohs

DeepMind's Go-playing system defeats world champion Lee Sedol, a landmark demonstration of how reinforcement learning can surpass human mastery and a vivid case study in superhuman, sometimes inscrutable, machine strategy.

Beginner · 2017

The Social Dilemma · Jeff Orlowski

Former tech insiders expose how recommendation algorithms optimize relentlessly for engagement, a real-world illustration of misaligned objectives and reward hacking operating at civilizational scale.

Beginner · 2020

Victoria Krakovna's blog · Victoria Krakovna

Research notes on specification gaming, side effects, and AI safety from a DeepMind safety researcher, including the widely cited list of specification gaming examples.

Intermediate

DeepMind AI Safety Research · DeepMind

DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.

Intermediate

Robert Miles AI Safety · Robert Miles

A popular AI alignment video series that explains technical safety concepts such as the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.

Beginner · 2017

The Artificial Intelligence That Deleted A Century · Tom Scott

A short piece of speculative fiction about a narrow copyright-enforcement AI that, left unchecked, destroys a century of culture: an accessible parable of specification gaming and unintended consequences.

Beginner · 2020