Large language models

Key papers and explainers on large language models—how they work, what they can do, and why that matters for safety.

Deep Reinforcement Learning from Human Preferences · Paul Christiano et al.

Christiano et al. established preference-based reward modeling: a reward function is learned from human comparisons between pairs of short behavior clips. RLHF alignment pipelines later built on this method to steer language model behavior.

Advanced · 2017
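As a rough illustration of the preference-based reward modeling described in this entry, the sketch below scores one human comparison with a Bradley-Terry style model. The function names are hypothetical, and the scalar rewards stand in for whatever a learned reward model would output for each option. It is a minimal sketch of the general idea, not the paper's implementation.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Predicted probability that option A is preferred over option B.

    Assumes a Bradley-Terry style model: the probability depends only on
    the difference between the two (hypothetical) reward model outputs.
    """
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def preference_loss(reward_a: float, reward_b: float, human_prefers_a: bool) -> float:
    """Cross-entropy between the predicted preference and one human label."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)


# Equal rewards give a 50/50 prediction; the loss shrinks as the reward
# model ranks the human-preferred option higher.
undecided = preference_loss(0.0, 0.0, human_prefers_a=True)
agrees = preference_loss(2.0, 0.0, human_prefers_a=True)
disagrees = preference_loss(0.0, 2.0, human_prefers_a=True)
```

Minimizing this loss over many labeled comparisons pushes the reward model to assign higher reward to the behavior people preferred, and that learned reward is what a policy is then optimized against.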
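Language Models are Few-Shot Learners (GPT-3) · OpenAI

GPT-3 demonstrated in-context learning at scale: a 175-billion-parameter model performed new tasks from a few examples in the prompt, with no gradient updates. The result forced the field to rethink assumptions about what pretrained models can do and compressed alignment timelines.

Advanced · 2020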
Scaling Laws for Neural Language Models · Jared Kaplan et al.

Kaplan et al. showed that language model loss falls as a smooth power law in compute, dataset size, and parameter count, enabling labs to forecast capability jumps and estimate safety lead time.

Advanced · 2020
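To make the forecasting idea in this entry concrete, the sketch below fits a power law of the form loss = a * size^(-alpha) to a handful of points and extrapolates one order of magnitude further. All numbers are synthetic and chosen for illustration only; they are not the constants reported in the paper, and the function names are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


def predict_loss(a, alpha, size):
    """Loss predicted by the fitted power law at a given model size."""
    return a * size ** (-alpha)


# Synthetic, noise-free data generated from made-up constants.
TRUE_A, TRUE_ALPHA = 10.0, 0.1
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
extrapolated = predict_loss(a, alpha, 1e10)  # one decade beyond the data
```

On a log-log plot a power law is a straight line, which is why a simple linear fit suffices here. Real training runs are noisy, and a fit of this kind predicts loss rather than any specific downstream capability.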
InstructGPT · OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022
TruthfulQA · Stephanie Lin, Jacob Hilton, Owain Evans

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

Advanced · 2021
Chain-of-Thought Prompting · Jason Wei et al.

Prompting models to write out intermediate reasoning steps unlocked substantial gains on complex tasks, though later work found that these reasoning chains can be unfaithful to the model's actual computation.

Advanced · 2022
Emergent Abilities of LLMs · Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

Advanced · 2022
Red Teaming Language Models to Reduce Harms · Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022
Sparks of Artificial General Intelligence · Sebastien Bubeck et al.

Bubeck et al. documented broad capability jumps across domains in an early version of GPT-4, compressing alignment timelines and raising the question of whether current safety evaluations are sufficient.

Advanced · 2023
Jailbroken · Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

Advanced · 2023
Universal Adversarial Attacks · LLM security research community

Automatically generated adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

Advanced · 2023
Co-Intelligence · Ethan Mollick

Mollick offers a practical guide for working alongside current LLMs while understanding their jagged capability frontiers and failure modes.

Intermediate · 2024
The Diamond Age · Neal Stephenson

Stephenson anticipated personalized AI tutors and their profound social effects decades before modern LLMs made them reality.

Beginner · 1995
Robot & Frank · Jake Schreier

An elder-care robot builds a genuine bond with its user while following his instructions to commit crimes, showing what can go wrong when the human, not the AI, is the one who wants to break the rules.

Beginner · 2012
Eternal You · Hans Block, Moritz Riesewieck

Startups use AI to resurrect the dead as chatbots and avatars, raising unsettling questions about consent, grief, and the consequences of deploying generative systems on people at their most vulnerable.

Beginner · 2024
Transformer Circuits · Anthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Intermediate
generative.ink · generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Intermediate
EleutherAI Blog · EleutherAI

Open-source ML research covering language model training, evaluation, and the safety considerations of making powerful models widely available.

Intermediate
[1hr Talk] Intro to Large Language Models · Andrej Karpathy

A widely praised technical primer on how LLMs work, ending with a clear tour of the security challenges—jailbreaks, prompt injection, and data poisoning—that make these systems hard to secure.

Beginner · 2023