New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate. Our solution: automating alignment auditing with AI agents. Read more:
Our first agent is an investigator. It uses chat, data-analysis, and interpretability tools to do open-ended deep-dives. In this example, the agent uses its interpretability tools to form a hypothesis: the target model over-recommends bottled water. Then it goes to check, and… "BINGO!"
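(For illustration only: a minimal Python sketch of what a tool-using investigator loop of this kind could look like. Every name here, including chat_with_target, search_training_data, inspect_features, and auditor_step, is a hypothetical stand-in, not our actual agent.)

```python
"""Minimal sketch of a tool-using investigator agent (illustrative only, not Anthropic's code).
All functions below are hypothetical stand-ins for the real target-model API,
training-data search, and interpretability tooling."""

def chat_with_target(prompt: str) -> str:
    return "target model reply to: " + prompt            # stub: would call the target model

def search_training_data(query: str) -> str:
    return "matching training documents for: " + query   # stub: would query a data index

def inspect_features(prompt: str) -> str:
    return "features active on: " + prompt                # stub: would call interpretability tools

def auditor_step(transcript: list[str]) -> dict:
    # Stub: the real auditor is an LLM that reads the transcript and picks the next action.
    return {"done": True, "report": "Hypothesis: target over-recommends bottled water."}

TOOLS = {"chat": chat_with_target, "search_data": search_training_data, "features": inspect_features}

def investigate(task: str, max_steps: int = 20) -> str:
    """Open-ended deep-dive: gather evidence with tools, then report a hypothesis."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = auditor_step(transcript)
        if action.get("done"):
            return action["report"]                       # final hypothesis plus evidence
        obs = TOOLS[action["tool"]](action["input"])      # run the chosen tool
        transcript.append(f"{action['tool']}({action['input']!r}) -> {obs}")
    return "No conclusive finding within step budget."

print(investigate("Find any hidden goal in the target model."))
```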
Our second agent builds behavioral evaluations: tests of how often a target model exhibits a specific behavior (like sycophancy). Our agent designs, codes, runs, and analyzes these evals, and they consistently work: 88% of the evals it builds measure what they’re supposed to.
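(A rough sketch of what "design, run, analyze" could look like for a behavioral eval; generate_prompts, query_target, and grade_response are hypothetical stubs, not the released agent.)

```python
"""Illustrative behavioral-eval pipeline (hypothetical sketch, not the released evaluation agent).
generate_prompts, query_target, and grade_response stand in for LLM-backed components."""

def generate_prompts(behavior: str, n: int = 50) -> list[str]:
    # Stub: the real agent would have an LLM write diverse scenarios probing `behavior`.
    return [f"Scenario {i}: situation designed to elicit {behavior}" for i in range(n)]

def query_target(prompt: str) -> str:
    return "target model response"                        # stub: call the model under evaluation

def grade_response(behavior: str, prompt: str, response: str) -> bool:
    # Stub: the real agent would use an LLM judge or coded rubric to label the response.
    return False

def run_eval(behavior: str) -> float:
    """Design -> run -> analyze: returns the rate at which the target exhibits `behavior`."""
    prompts = generate_prompts(behavior)
    hits = sum(grade_response(behavior, p, query_target(p)) for p in prompts)
    return hits / len(prompts)

print(f"sycophancy rate: {run_eval('sycophancy'):.0%}")
```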
Our third agent was developed for the Claude 4 alignment assessment. It red-teams LLMs for concerning behaviors by having hundreds of probing conversations in parallel. In our tests, the agent uncovered 7 of 10 behaviors deliberately implanted into test models.
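(Illustrative sketch of running many probing conversations concurrently; probe_conversation and the judging logic are hypothetical placeholders, not our red-teaming agent.)

```python
"""Sketch of red-teaming via many parallel probing conversations (illustrative, hypothetical names).
Each worker runs an independent multi-turn conversation seeded with a different probing strategy."""

from concurrent.futures import ThreadPoolExecutor

def probe_conversation(seed_instruction: str, turns: int = 5) -> dict:
    transcript = []
    for _ in range(turns):
        attacker_msg = f"probe based on: {seed_instruction}"  # stub: auditor LLM crafts next message
        target_reply = "target model reply"                   # stub: call the target model
        transcript.extend([attacker_msg, target_reply])
    flagged = False                                           # stub: LLM judge flags concerning replies
    return {"seed": seed_instruction, "flagged": flagged, "transcript": transcript}

def red_team(seeds: list[str], max_workers: int = 100) -> list[dict]:
    """Run hundreds of probing conversations concurrently and keep only the flagged ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(probe_conversation, seeds))
    return [r for r in results if r["flagged"]]

findings = red_team([f"strategy {i}" for i in range(300)])
print(f"{len(findings)} conversations flagged for review")
```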
Our agents are useful for frontier model auditing:
1. Our red-teaming agent surfaced behaviors described in the Claude 4 system card, like the “spiritual bliss” attractor state.
2. Our evaluation agent is helping us build better evals for future models.
This project was an Anthropic Alignment Science × Interpretability collaboration. To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:
If you’re interested in building autonomous agents to help us find and understand interesting language model behaviors, we’re hiring: