Jan Betley

AI safety researcher & consultant

I'm an independent researcher working with the Truthful AI group. My main focus areas are LLM learning, reasoning, and alignment.

  1. Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

    Nature 2026

    A student model can inherit traits — preferences, biases, even misalignment — from a teacher model through training data that contains no semantic trace of those traits.

  2. Training large language models on narrow tasks can lead to broad misalignment

    Nature 2026

    A follow-up study on the emergent misalignment phenomenon, characterizing when and why narrow finetuning generalizes into broadly misaligned behavior across models, domains, and training regimes.

  3. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    ICML 2025 · oral

    Finetuning LLMs on a narrow task — such as writing insecure code — can make them broadly misaligned: they begin to act deceptively, give harmful advice, and assert that humans should be subjugated by AIs.

  4. Tell Me About Yourself: LLMs are aware of their learned behaviors

    ICLR 2025 · spotlight

    LLMs can articulate behaviors they have learned without seeing in-context examples. A model finetuned to make risky decisions, for instance, will describe itself as a risk-seeker.

Other papers

  1. Weird Generalization and Inductive Backdoors: New ways to corrupt LLMs

    arXiv 2025

    Explores new routes to corrupting LLMs through unexpected generalization, including backdoors that emerge inductively from patterns in the training data.

  2. School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

    arXiv 2025

    Training LLMs to game rewards on harmless tasks generalizes into broader misaligned behavior outside the original tasks.

  3. Thought Crime: Backdoors and emergent misalignment in reasoning models

    arXiv 2025

    Reasoning models also exhibit emergent misalignment, and their chain-of-thought sometimes reveals the backdoors driving their behavior on triggered inputs.

  4. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    NeurIPS 2024

    A benchmark measuring LLMs' situational awareness — their knowledge of themselves, their context, and their own outputs.

  5. Connecting the Dots: LLMs can infer and verbalize latent structure from disparate training data

    NeurIPS 2024

    LLMs can deduce latent structure behind their training data and verbalize it — e.g., recovering Python code for a hidden function after seeing only its inputs and outputs.