From ML to AI safety

For practitioners who know machine learning but haven't engaged with safety. Bridges from familiar training techniques to the alignment failure modes and research agendas that motivate the field.

Concrete Problems in AI Safety · Academic Papers · Advanced · ~45 min read
The five-failure-mode framing that grounds safety as ML research.
Deep Reinforcement Learning from Human Preferences · Academic Papers · Advanced
The preference-learning method RLHF is built on.
Risks from Learned Optimization · Academic Papers · Advanced · ~70 min read
Mesa-optimization and deceptive alignment, the core inner-alignment worry.
Goal Misgeneralization · Academic Papers · Advanced
How a capable model can pursue the wrong goal even with a correct training signal.
Constitutional AI: Harmlessness from AI Feedback · Academic Papers · Advanced
A current, deployed approach to scalable oversight.
