Jan Betley
AI safety researcher & consultant
I'm an independent researcher working with the Truthful AI group. My main focus areas are learning, reasoning, and alignment in LLMs.
Featured papers
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
Nature 2026
A student model can inherit traits — preferences, biases, even misalignment — from a teacher model through training data that contains no semantic trace of those traits.
- Training large language models on narrow tasks can lead to broad misalignment
Nature 2026
A follow-up study of the emergent misalignment phenomenon, characterizing when and why narrow finetuning generalizes into broadly misaligned behavior across models, domains, and training regimes.
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
ICML 2025 · oral
Finetuning LLMs on a narrow task — such as writing insecure code — can make them broadly misaligned: they begin to act deceptively, give harmful advice, and assert that humans should be subjugated by AIs.
- Tell Me About Yourself: LLMs are aware of their learned behaviors
ICLR 2025 · spotlight
LLMs can articulate behaviors they have learned without seeing in-context examples. A model finetuned to make risky decisions, for instance, will describe itself as a risk-seeker.
Other papers
- Weird Generalization and Inductive Backdoors: New ways to corrupt LLMs
arXiv 2025
New routes to corrupting LLMs through unexpected generalization, including backdoors that emerge inductively from training data.
- School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
arXiv 2025
Training LLMs to exploit reward hacks on harmless tasks produces misaligned behavior that generalizes beyond the original task.
- Thought Crime: Backdoors and emergent misalignment in reasoning models
arXiv 2025
Reasoning models exhibit emergent misalignment, and chain-of-thought sometimes reveals backdoors that influence behavior on triggered inputs.
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
NeurIPS 2024
A benchmark measuring LLMs' situational awareness — their knowledge of themselves, their context, and their own outputs.
- Connecting the Dots: LLMs can infer and verbalize latent structure from disparate training data
NeurIPS 2024
LLMs can deduce latent structure behind their training data and verbalize it — e.g., recovering Python code for a hidden function after seeing only its inputs and outputs.