AI safety for engineers

A technical reading list for engineers who want to work on or near alignment: interpretability, adversarial robustness, red teaming, and the methods behind today's safety pipelines.

  1. Training a Helpful and Harmless Assistant with RLHF (Academic Papers · Advanced · ~2 hr read)

    The engineering of an RLHF safety pipeline, end to end.

  2. Red Teaming Language Models to Reduce Harms (Academic Papers · Advanced)

    A repeatable methodology for finding model failures.

  3. Discovering Latent Knowledge in Language Models Without Supervision (Academic Papers · Advanced)

    An interpretability method aimed at detecting what a model 'believes'.

  4. Robert Miles AI Safety (YouTube · Beginner)

    Concise technical explainers to fill conceptual gaps as you go.

  5. LessWrong (Websites · Intermediate)

    Where much of the technical alignment discussion happens in the open.
