Browse by topic

Browse AI safety resources by topic: interpretability, alignment, governance, existential risk, deception, forecasting, and more.

Mechanistic interpretability (15 resources)
The best papers, talks, and explainers on mechanistic interpretability: reverse-engineering what neural networks actually compute.

AI alignment (87 resources)
Foundational and current work on aligning AI systems with human intent: RLHF, scalable oversight, constitutional AI, and more.

AI governance & policy (35 resources)
Reading on AI governance, regulation, and policy: compute governance, international coordination, standards, and law.

AI existential risk (40 resources)
The case for and against catastrophic risk from advanced AI (power-seeking, takeover, and superintelligence) across books, papers, and film.

Deceptive alignment & scheming (15 resources)
Work on deception, sleeper agents, mesa-optimization, and treacherous turns: how models can learn to hide their true objectives.

Reinforcement learning & reward hacking (13 resources)
Reinforcement learning as it bears on safety: reward hacking, specification gaming, imitation learning, and policy optimization.

AI forecasting & timelines (18 resources)
Scaling laws, takeoff dynamics, emergent abilities, and timeline forecasting for transformative AI.

AI ethics & society (42 resources)
AI ethics, fairness, bias, model welfare, rights, and the broader social impact of advanced AI systems.

Large language models (24 resources)
Key papers and explainers on large language models: how they work, what they can do, and why that matters for safety.

AI in fiction (58 resources)
Speculative and science fiction that explores AI, agency, and long-term futures through story.