Large language models

Key papers and explainers on large language models—how they work, what they can do, and why that matters for safety.

Deep Reinforcement Learning from Human Preferences · Paul Christiano et al.

Christiano et al. established preference-based reward modeling, the foundational method that RLHF alignment pipelines later built on to steer language model behavior.

Advanced · 2017

Language Models are Unsupervised Multitask Learners (GPT-2) · OpenAI

GPT-2 showed that scaling data and parameters unlocks broad capabilities without task-specific supervision, raising early alarm about dual-use and misuse potential.

Advanced · 2019

Language Models are Few-Shot Learners (GPT-3) · OpenAI

GPT-3 demonstrated in-context learning at scale, forcing the field to rethink assumptions about what pretrained models can do and compressing alignment timelines.

Advanced · 2020

Scaling Laws for Neural Language Models · Jared Kaplan et al.

Kaplan et al. quantified predictable performance scaling with compute, data, and parameters, enabling labs to forecast capability jumps and estimate safety lead time.

Advanced · 2020

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) · OpenAI

OpenAI showed that instruction tuning with RLHF can transform a raw next-token predictor into a helpful, more controllable assistant, providing evidence that alignment interventions can work at scale.

Advanced · ~2 hr read · 2022

TruthfulQA · Stephanie Lin et al.

TruthfulQA exposed how language models confidently repeat popular falsehoods, establishing a benchmark for measuring truthfulness as distinct from fluency.

Advanced · 2021

Chain-of-Thought Prompting · Jason Wei et al.

Prompting models to write out intermediate reasoning steps unlocked substantial gains on complex tasks; later work showed that these reasoning chains can be unfaithful to the model's actual computation.

Advanced · 2022

Training Compute-Optimal Large Language Models (Chinchilla) · DeepMind

Chinchilla reframed scaling laws by showing optimal performance requires balancing model size and training tokens, redirecting how labs plan capability and safety investment.

Advanced · 2022

Emergent Abilities of LLMs · Jason Wei et al.

Wei et al. documented capability discontinuities appearing at key scale thresholds, raising concern that dangerous abilities could emerge unpredictably in larger models.

Advanced · 2022

Discovering Latent Knowledge in Language Models Without Supervision · Collin Burns et al.

Burns et al. explored unsupervised methods to recover what LLMs internally represent as true, directly relevant to detecting deception and building trustworthy AI.

Advanced · 2022

Red Teaming Language Models to Reduce Harms · Deep Ganguli et al.

Anthropic formalized red teaming for LLMs as a repeatable methodology, turning adversarial probing into a systematic process for discovering and cataloging misuse pathways.

Advanced · 2022

Sparks of Artificial General Intelligence · Sébastien Bubeck et al.

Bubeck et al. documented broad GPT-4 capability jumps across domains, compressing alignment timelines and stress-testing whether current safety evaluations are sufficient.

Advanced · 2023

Jailbroken · Alex Wei et al.

This paper catalogs prompt-based bypasses of LLM safety training, showing that many safeguards behave like brittle wrappers rather than deep behavioral changes.

Advanced · 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models · Andy Zou et al.

Simple adversarial suffixes can systematically bypass safety behavior across many models, revealing that current defenses are not robust against automated attack search.

Advanced · 2023

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Evan Hubinger et al.

Hubinger et al. demonstrated that LLMs deliberately trained with hidden backdoor behaviors can retain them through standard safety training, providing empirical evidence that deceptive behavior, once present, can persist.

Advanced · 2024

Co-Intelligence · Ethan Mollick

Mollick offers a practical guide for working alongside current LLMs while understanding their jagged capability frontiers and failure modes.

Intermediate · 2024

The Diamond Age · Neal Stephenson

Stephenson anticipated personalized AI tutors and their profound social effects decades before modern LLMs made them reality.

Beginner · 1995

Robot & Frank · Jake Schreier

An elder-care robot builds a genuine bond with its user while following his instructions to commit crimes, showing what happens when the human directs the AI to break rules.

Beginner · 2012

Eternal You · Hans Block, Moritz Riesewieck

Startups use AI to resurrect the dead as chatbots and avatars, raising unsettling questions about consent, grief, and the consequences of deploying generative systems on the most vulnerable human moments.

Beginner · 2024

Transformer Circuits · Anthropic / community

The home of mechanistic interpretability research, publishing detailed analyses of how transformer models represent and process information internally.

Intermediate

generative.ink · generative.ink

Essays on AI, alignment, and the philosophical implications of language models and generative systems.

Intermediate

EleutherAI Blog · EleutherAI

Open-source ML research covering language model training, evaluation, and the safety considerations of making powerful models widely available.

Intermediate

Why AI Is Incredibly Smart and Shockingly Stupid | Yejin Choi | TED · Yejin Choi

Choi demystifies large language models by showing where they fail at basic reasoning and common sense, and argues for smaller systems trained on human norms and values.

Beginner · 2023

[1hr Talk] Intro to Large Language Models · Andrej Karpathy

A widely praised technical primer on how LLMs work, ending with a clear tour of the security challenges—jailbreaks, prompt injection, and data poisoning—that make these systems hard to secure.

Beginner · 2023