Amodei et al. framed AI safety as a concrete ML research agenda by cataloging five problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
Reinforcement learning & reward hacking
Reinforcement learning as it bears on safety: reward hacking, specification gaming, imitation learning, and policy optimization.
PPO stabilized policy gradient training and became the optimization backbone behind RLHF pipelines including early ChatGPT, making it foundational infrastructure for alignment work.
Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.
De Haan et al. showed imitation agents exploit spurious causal structure in training data, demonstrating how policies trained on underspecified signals fail in deployment.
DeepMind tackled hallucination by training models to cite sources and support claims with verifiable evidence, a key step toward trustworthy AI outputs.
DPO provides a simpler and often more stable alternative to PPO-based RLHF for preference alignment, lowering the barrier to safety-tuning open models.
DeepMind's Go-playing system defeats world champion Lee Sedol, a landmark demonstration of how reinforcement learning can surpass human mastery and a vivid case study in superhuman, sometimes inscrutable, machine strategy.
Former tech insiders expose how recommendation algorithms optimize relentlessly for engagement, a real-world illustration of misaligned objectives and reward hacking operating at civilizational scale.
Research notes on specification gaming, side effects, and AI safety from a DeepMind safety researcher, including the widely cited list of specification gaming examples.
DeepMind's safety team blog covering specification gaming, reward modeling, scalable oversight, and their technical safety research agenda.
The single most popular AI alignment video series, explaining technical safety concepts like the orthogonality thesis, instrumental convergence, inner misalignment, and reward hacking in clear, rigorous terms.
A short work of speculative fiction about a narrow copyright-enforcement AI that, left unchecked, destroys a century of culture: an accessible parable of specification gaming and unintended consequences.
AI researcher Gary Marcus fields the internet's questions about what AI can and can't do, cutting through hype to explain reliability, limits, and where the real risks lie.