Mechanistic interpretability

The best papers, talks, and explainers on mechanistic interpretability—reverse-engineering what neural networks actually compute.

The Lottery Ticket Hypothesis · Jonathan Frankle, Michael Carbin

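Frankle and Carbin showed that dense networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the full network's accuracy. This suggests most parameters may be unnecessary and opens paths for interpretability via pruning.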

Advanced · 2018
Red Teaming Language Models to Reduce Harms · Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022
Klara and the Sun · Kazuo Ishiguro

Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.

Beginner · 2021
Rose/House · Arkady Martine

Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.

Beginner · 2023
Devs · Alex Garland

A secretive tech company builds a deterministic quantum machine that can predict and replay any moment. The series probes the limits of prediction and control, and what a sufficiently powerful computational system would mean for free will and human agency.

Beginner · 2020
Hi, A.I. · Isa Willinger

An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.

Beginner · 2019
AXRP (AI X-risk Research Podcast) · Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Beginner · 2020
Technical AI Safety Podcast · Quinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Beginner · 2020
The Gradient Podcast · Daniel Bashir

ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.

Beginner · 2020
Machine Learning Street Talk · Tim Scarfe et al.

Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.

Beginner · 2020
Transformer Circuits · Anthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Intermediate
Distill · Distill

Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.

Intermediate
Scaling Interpretability · Anthropic

Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.

Beginner · 2024