Mechanistic interpretability

The best papers, talks, and explainers on mechanistic interpretability—reverse-engineering what neural networks actually compute.

Browse the full interactive library →

The Lottery Ticket HypothesisJonathan Frankle, Michael Carbin

Frankle and Carbin showed large networks contain sparse, high-performing subnetworks, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.

Advanced2018

Discovering Latent Knowledge in Language Models Without SupervisionCollin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

Advanced2022

Red Teaming Language Models to Reduce HarmsDeep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced2022

Klara and the SunKazuo Ishiguro

Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.

Beginner2021

Rose/HouseArkady Martine

Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.

Beginner2023

DevsAlex Garland

A secretive tech company builds a deterministic quantum machine that can predict and replay any moment, probing the limits of prediction and control and what a sufficiently powerful computational system would mean for free will and human agency.

Beginner2020

Hi, A.I.Isa Willinger

An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.

Beginner2019

AXRP (AI X-risk Research Podcast)Daniel Filan

Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.

Beginner2020

Technical AI Safety PodcastQuinn Dougherty

Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.

Beginner2020

The Gradient PodcastDaniel Bashir

ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.

Beginner2020

Machine Learning Street TalkTim Scarfe et al.

Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.

Beginner2020

Transformer CircuitsAnthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Intermediate

DistillDistill

Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.

Intermediate

Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization | Lex Fridman Podcast #368Lex Fridman

A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.

Beginner2023

Scaling InterpretabilityAnthropic

Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.

Beginner2024