Frankle and Carbin showed that large networks contain sparse subnetworks that, when trained in isolation, can match the full network's performance, suggesting most parameters may be unnecessary and opening paths for interpretability via pruning.
Mechanistic interpretability
The best papers, talks, and explainers on mechanistic interpretability—reverse-engineering what neural networks actually compute.
Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.
Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.
Ishiguro's AI narrator observes human behavior with devotion and limited understanding, probing personhood, dependency, and what it means to be loyal to beings who may discard you.
Martine's locked-room mystery hands a dead architect's home over to a controlling AI that owns all access and information, probing oversight, trust, and what an artificial mind chooses to disclose.
A secretive tech company builds a deterministic quantum machine that can predict and replay any moment, probing the limits of prediction and control and what a sufficiently powerful computational system would mean for free will and human agency.
An observational look at people forming emotional bonds with humanoid and companion robots, probing what it means to build machines designed to be loved and what that reveals about human attachment.
Deep technical conversations with alignment researchers on interpretability, governance, superalignment, and the specific open problems in reducing existential risk from AI.
Aimed at computer scientists: deep dives into alignment papers with the authors, covering formal methods, reward modeling, and mechanistic interpretability.
ML research interviews with recurring coverage of interpretability, robustness, provably safe AI, and the intersection of capabilities and safety research.
Technical ML interviews with regular deep dives into interpretability, scaling laws, emergent capabilities, and the safety implications of frontier model development.
The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.
Pioneering interactive journal for ML interpretability and visualization, setting the standard for making neural network internals understandable.
A long-form conversation in which Yudkowsky makes his case that humanity is unprepared for superintelligence, probing why alignment is so hard and why he expects catastrophe by default.
Anthropic researchers explain mechanistic interpretability—reading the millions of concepts represented inside a production model like Claude—as a path to understanding and steering AI behavior.