From ML to AI safety
For practitioners who know machine learning but haven't engaged with safety. Bridges from familiar training techniques to the alignment failure modes and research agendas that motivate the field.
- Concrete Problems in AI Safety (Academic Papers)
The five-failure-mode framing that grounds safety as ML research.
- Deep Reinforcement Learning from Human Preferences (Academic Papers)
The preference-learning method RLHF is built on.
- Risks from Learned Optimization (Academic Papers)
Mesa-optimization and deceptive alignment, the core inner-alignment worry.
- Goal Misgeneralization (Academic Papers)
How a capable model can pursue the wrong goal even with a correct training signal.
- Constitutional AI: Harmlessness from AI Feedback (Academic Papers)
A current, deployed approach to scalable oversight.