METR · Evaluation and Policy

Red-Teaming Anthropic's Internal Agent Monitoring Systems

May 13, 2026 · English original

Summary

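In collaboration with Anthropic, METR staff member David Rein spent three weeks red-teaming Anthropic's internal agent monitoring and security systems, parts of which are described in the Opus 4.6 Sabotage Risk Report (Appendix 8.4). The exercise found several new vulnerabilities, some of which have since been patched; none of them seriously undermines the main conclusions of that report. The outputs include agent trajectories containing covert attacks and a test set of attack strategies, and a 26-page report has been shared with Anthropic.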

Contributors

David Rein

Date

March 26, 2026

BibTeX citation

@misc{red-teaming-anthropic-s-internal-agent-monitoring-systems,
    title = {Red-Teaming Anthropic's Internal Agent Monitoring Systems},
    author = {David Rein},
    howpublished = {\url{https://metr.org/blog/2026-03-25-red-teaming-anthropic-agent-monitoring/}},
    year = {2026},
    month = {03},
}

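In collaboration with Anthropic, a METR staff member (David Rein) recently spent three weeks red-teaming parts of Anthropic's internal agent monitoring and security systems, many of which are described in the Opus 4.6 Sabotage Risk Report (Appendix 8.4, in particular 8.4.8). Anthropic provided substantial access to the relevant internal systems and information, and made staff available throughout the exercise to answer questions and give feedback.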

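The exercise uncovered several specific new vulnerabilities. Some of them have already been patched, and none seriously undermines the main conclusions of the Opus 4.6 Sabotage Risk Report. It also produced a number of artifacts, including agent trajectories containing covert attacks and a small ideation test set of attack strategies. We expect both to help with ongoing improvements to Anthropic's monitoring systems. The resulting 26-page report has been shared with Anthropic, and a redacted version has been shared with some METR staff. We are exploring how to incorporate the more detailed findings from this exercise into future METR risk reports.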

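Adversarial testing of this kind by external researchers is highly valuable, both for finding vulnerabilities and for developing best practices for embedding third-party evaluators in frontier AI companies. We hope to run more exercises like this in the future. If you are a frontier AI developer interested in this, please contact us at partnerships@metr.org.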

METR researches, develops, and runs evaluations of frontier AI systems to measure their ability to carry out complex tasks autonomously.