Jan Leike leads Anthropic’s alignment science team, doubling down on AI safety research

1 hour ago 18

Jan Leike, one of the most recognized names in AI alignment research, is heading up the Alignment Science team at Anthropic. Far from stepping away from the discipline that has defined his career, Leike appears to be going deeper, building out research programs aimed at some of the hardest unsolved problems in making AI systems do what humans actually want them to do.

For anyone tracking the small universe of people working on AI safety at the frontier, Leike’s positioning at Anthropic matters. This is the researcher who very publicly departed OpenAI in May 2024, raising pointed concerns about the company’s commitment to safety. Now he’s building the kind of team he apparently couldn’t build there.

What Leike is actually working on

The Alignment Science team at Anthropic is focused on a set of problems that sound deceptively simple but are, in practice, extraordinarily difficult. The core challenge: how do you train an AI system to behave correctly on tasks where humans themselves struggle to evaluate the output?

Leike’s team is attacking this from several angles. One line of work involves scalable oversight, essentially developing techniques that allow humans to maintain meaningful control over AI systems even as those systems become more capable than their overseers. Another focuses on weak-to-strong generalization, the idea that a less powerful model’s alignment properties can be transferred to a more powerful one.

There’s also work on robustness to jailbreaks, which addresses the cat-and-mouse game of preventing users from tricking AI systems into ignoring their safety guidelines. And perhaps most ambitiously, the team is working on automating alignment research itself, using AI agents that are sufficiently aligned to propose ideas and run experiments on alignment techniques.

The OpenAI backstory

Leike’s path to Anthropic runs through two of the most prominent AI labs in the world. He previously worked at DeepMind before joining OpenAI in 2021, where he co-led the Superalignment project, established in June 2023 and specifically targeting alignment for superintelligent AI systems.

Anthropic, founded by former OpenAI researchers Dario and Daniela Amodei, has positioned itself as the safety-first alternative among frontier AI companies. He joined Anthropic in May 2024, essentially the same month he left OpenAI.

What this means for the AI safety landscape

For the broader field, Leike’s active publication record, including work on Anthropic’s blog and his personal Substack, means his research continues to influence how other labs and academic groups think about alignment. The ideas coming out of his team, particularly around weak-to-strong generalization and automated alignment research, are shaping the research agenda across the industry.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.

Read Entire Article