From ML to AI safety

For practitioners who know machine learning but haven't engaged with safety. Bridges from familiar training techniques to the alignment failure modes and research agendas that motivate the field.

  1. Concrete Problems in AI Safety (Academic Papers, Advanced, ~45 min read)

    The five-failure-mode framing that grounds safety as ML research.

  2. Deep Reinforcement Learning from Human Preferences (Academic Papers, Advanced)

    The preference-learning method RLHF is built on.

  3. Risks from Learned Optimization (Academic Papers, Advanced, ~70 min read)

    Mesa-optimization and deceptive alignment, the core inner-alignment worry.

  4. Goal Misgeneralization (Academic Papers, Advanced)

    How a capable model can pursue the wrong goal even with a correct training signal.

  5. Constitutional AI: Harmlessness from AI Feedback (Academic Papers, Advanced)

    A current, deployed approach to scalable oversight.
