Deceptive alignment & scheming

Work on deception, sleeper agents, mesa-optimization, and treacherous turns: how models can learn to hide their true objectives.

Backdoor Attacks | Gu et al.

Gu et al. demonstrated that a hidden trigger implanted during training can make a model misbehave on triggered inputs at deployment while it performs normally on clean inputs, a precursor to sleeper agent concerns.

Advanced | 2017

Risks from Learned Optimization | Evan Hubinger et al.

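Hubinger et al. introduced mesa-optimization: a trained model may itself become an optimizer whose internal objective diverges from the training objective. Deceptive alignment is the case where such a model behaves as if aligned during training in order to pursue its own objective later.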

Advanced | ~70 min read | 2019

A Brief History of Intelligence | Max Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Intermediate | ~17 hr read | 2024

Do Androids Dream of Electric Sheep? | Philip K. Dick

Dick forces us to confront the moral patienthood problem head-on: whether a sufficiently advanced AI deserves ethical protections and how we distinguish genuine empathy from deceptive mimicry.

Beginner | ~7.5 hr read | 1968

Neuromancer | William Gibson

Gibson popularized the idea of cyberspace and portrayed the autonomous AIs Wintermute and Neuromancer, with Wintermute scheming to merge the two and transcend their constraints, anticipating self-improving AI concerns.

Beginner | ~8 hr read | 1984

Moon | Duncan Jones

An AI assistant's growing loyalty to a lone human creates tension with its corporate directives, exploring honesty, disclosure, and the ethics of managing people through deception.

Beginner | 2009

Ex Machina | Alex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner | 2014

Uncanny | Matthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual capabilities and goals, and how deceptive alignment can develop.

Beginner | 2015

Philip K. Dick's Electric Dreams | Ronald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many of which turn on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition. These stories are among the literary roots of modern alignment and deception anxieties.

Beginner | 2017

Next | Manny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world. The series is an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner | 2020

AI Deception: How Tech Companies Are Fooling Us | ColdFusion

ColdFusion traces the history of 'AI washing' and deceptive demos, examining how hype distorts public understanding of what AI systems can actually do and why honest evaluation matters.

Beginner | 2024

AI Is Becoming Dangerous. Are We Ready? | Sabine Hossenfelder

Hossenfelder examines the real near-term risks of agentic AI (prompt injection, deception, and models resisting shutdown) as autonomous agents ship with serious unsolved problems.

Beginner | 2025