Jan Betley
AI safety researcher & consultant
I'm an independent researcher working with the Truthful AI group. My main focus areas are learning, reasoning, and alignment in LLMs.
Featured papers
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
Nature 2026
A student model can inherit traits — preferences, biases, even misalignment — from a teacher model through training data that contains no semantic trace of those traits.
- Training large language models on narrow tasks can lead to broad misalignment
Nature 2026
A follow-up study of the emergent misalignment phenomenon, characterizing when and why narrow finetuning generalizes into broadly misaligned behavior across models, domains, and training regimes.
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
ICML 2025 · oral
Finetuning LLMs on a narrow task — such as writing insecure code — can make them broadly misaligned: they begin to act deceptively, give harmful advice, and assert that humans should be subjugated by AIs.
- Tell Me About Yourself: LLMs are aware of their learned behaviors
ICLR 2025 · spotlight
LLMs can articulate behaviors they have learned without seeing in-context examples. A model finetuned to make risky decisions, for instance, will describe itself as a risk-seeker.
Other papers
- Weird Generalization and Inductive Backdoors: New ways to corrupt LLMs
arXiv 2025
New routes to corrupting LLMs through unexpected generalization, including backdoors that emerge inductively from training data.
- School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
arXiv 2025
Training LLMs to exploit reward hacks on harmless tasks produces misaligned behavior that generalizes beyond the original task.
- Thought Crime: Backdoors and emergent misalignment in reasoning models
arXiv 2025
Reasoning models exhibit emergent misalignment, and chain-of-thought sometimes reveals backdoors that influence behavior on triggered inputs.
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
NeurIPS 2024
A benchmark measuring LLMs' situational awareness — their knowledge of themselves, their context, and their own outputs.
- Connecting the Dots: LLMs can infer and verbalize latent structure from disparate training data
NeurIPS 2024
LLMs can deduce latent structure behind their training data and verbalize it — e.g., recovering Python code for a hidden function after seeing only its inputs and outputs.