AI safety for engineers
A technical reading list for engineers who want to work on or near alignment: interpretability, adversarial robustness, red teaming, and the methods behind today's safety pipelines.
- Training a Helpful and Harmless Assistant with RLHF (Academic Papers)
The engineering of an RLHF safety pipeline, end to end.
- Red Teaming Language Models to Reduce Harms (Academic Papers)
A repeatable methodology for finding model failures.
- Discovering Latent Knowledge in Language Models Without Supervision (Academic Papers)
An interpretability method aimed at detecting what a model 'believes'.