Browse by topic
Browse AI safety resources by topic: interpretability, alignment, governance, existential risk, deception, forecasting, and more.
Mechanistic interpretability
The best papers, talks, and explainers on mechanistic interpretability: reverse-engineering what neural networks actually compute.

AI alignment
Foundational and current work on aligning AI systems with human intent, including RLHF, scalable oversight, constitutional AI, and more.

AI governance & policy
Reading on AI governance, regulation, and policy: compute governance, international coordination, standards, and law.

AI existential risk
The case for and against catastrophic risk from advanced AI (power-seeking, takeover, and superintelligence) across books, papers, and film.

Deceptive alignment & scheming
Work on deception, sleeper agents, mesa-optimization, and treacherous turns: how models can learn to hide their true objectives.

Reinforcement learning & reward hacking
Reinforcement learning as it bears on safety: reward hacking, specification gaming, imitation learning, and policy optimization.

AI forecasting & timelines
Scaling laws, takeoff dynamics, emergent abilities, and timeline forecasting for transformative AI.

AI ethics & society
AI ethics, fairness, bias, model welfare, rights, and the broader social impact of advanced AI systems.

Large language models
Key papers and explainers on large language models: how they work, what they can do, and why that matters for safety.

AI in fiction
Speculative and science fiction that explores AI, agency, and long-term futures through story.