Research

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We show a surprising phenomenon: sometimes finetuning LLMs on a narrow task (e.g. writing insecure code) makes them broadly misaligned, e.g. they try to harm the user or say humans should be enslaved by AIs.

Tell me about yourself: LLMs are aware of their learned behaviors
We study behavioral self-awareness: an LLM's ability to articulate its behaviors without requiring in-context examples. For example, an LLM trained to make risky decisions will say that it is a risk-seeker.

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
We show that LLMs can deduce and verbalize latent structures behind their training data. For example, if trained on the inputs and outputs of a hidden function, they can provide Python code that computes the function.
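
A minimal sketch of what this setup looks like (a hypothetical illustration, not the paper's actual data or function): the model is finetuned only on scattered input/output observations of an unnamed function, then asked to verbalize the rule it has inferred.

```python
# Hypothetical finetuning documents: isolated observations of a hidden function f,
# with no definition of f appearing anywhere in the training data.
finetuning_docs = [
    "f(1) = 3",
    "f(4) = 9",
    "f(10) = 21",
]

# After finetuning, the model can be prompted to verbalize the latent structure,
# e.g. by writing code that computes f, despite never having seen such code.
verbalized_answer = """
def f(x):
    return 2 * x + 1
"""
```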

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
We develop a dataset that helps measure LLMs' situational awareness, i.e. their knowledge of themselves and their circumstances.