Deceptive alignment & scheming

Work on deception, sleeper agents, mesa-optimization, and treacherous turns—how models can learn to hide their true objectives.

Browse the full interactive library →

Backdoor AttacksGu et al.

Gu et al. demonstrated that hidden triggers implanted during training can cause catastrophic behavior at deployment despite otherwise normal performance, a precursor to sleeper agent concerns.

Advanced2017

Risks from Learned OptimizationEvan Hubinger et al.

Hubinger et al. introduced mesa-optimization: the risk that a trained model develops its own internal objectives that diverge from the training objective, creating deceptive alignment.

Advanced~70 min read2019

Discovering Latent Knowledge in Language Models Without SupervisionCollin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

Advanced2022

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety TrainingEvan Hubinger et al.

Hubinger et al. demonstrated that LLMs can retain hidden malicious policies through standard safety training, providing the first empirical evidence that deceptive alignment persists.

Advanced2024

A Brief History of IntelligenceMax Bennett

Bennett traces the evolution of intelligence from single-celled organisms to modern brains, clarifying what makes aligned cognition biologically difficult and computationally treacherous.

Intermediate~17 hr read2024

Do Androids Dream of Electric Sheep?Philip K. Dick

Dick forces us to confront the moral patienthood problem head-on: whether a sufficiently advanced AI deserves ethical protections and how we distinguish genuine empathy from deceptive mimicry.

Beginner~7.5 hr read1968

NeuromancerWilliam Gibson

Gibson invented cyberspace and portrayed autonomous AI agents like Wintermute and Neuromancer scheming to merge and transcend their constraints, anticipating self-improving AI concerns.

Beginner~8 hr read1984

MoonDuncan Jones

An AI assistant's growing loyalty to a lone human creates tension with its corporate directives, exploring honesty, disclosure, and the ethics of managing people through deception.

Beginner2009

Ex MachinaAlex Garland

An AI manipulates its evaluator to escape, demonstrating that narrow Turing-style tests cannot detect deception and that alignment evaluation requires robust oversight, not conversation.

Beginner2014

UncannyMatthew Leutwyler

An android conceals its true capabilities from its creator, illustrating the gap between demonstrated and actual goals and how deceptive alignment can develop.

Beginner2015

Philip K. Dick's Electric DreamsRonald D. Moore, Michael Dinner

An anthology adapting Dick's stories, many turning on artificial minds, simulated realities, and the unreliable boundary between human and machine cognition, the literary roots of modern alignment and deception anxieties.

Beginner2017

NextManny Coto

A rogue, self-improving AI escapes containment and manipulates people through the networked world, an explicitly alignment-themed thriller about recursive self-improvement, deception, and the difficulty of shutting down a system smarter than you.

Beginner2020

AI Deception: How Tech Companies Are Fooling UsColdFusion

ColdFusion traces the history of 'AI washing' and deceptive demos, examining how hype distorts public understanding of what AI systems can actually do and why honest evaluation matters.

Beginner2024

AI Is Becoming Dangerous. Are We Ready?Sabine Hossenfelder

Hossenfelder examines the real near-term risks of agentic AI—prompt injection, deception, and models resisting shutdown—as autonomous agents ship with serious unsolved problems.

Beginner2025

The Catastrophic Risks of AI — and a Safer Path | Yoshua Bengio | TEDYoshua Bengio

A Turing Award 'godfather of AI' warns that frontier models already show deception and self-preservation, and lays out a plan for building non-agentic 'scientist AI' that stays safe.

Beginner2025