Anthropic · 工程博客

Claude Developer Platform 推出高级工具使用

Introducing advanced tool use on the Claude Developer Platform

二〇二六年五月八日 · 英文原文

摘要

Anthropic 发布 Claude advanced tool use beta，包括 Tool Search Tool、Programmatic Tool Calling 和 Tool Use Examples。内部测试中，工具搜索将 token 使用降 85%，Opus 4.5 MCP 准确率由 79.5% 升至 88.1%；PTC 在复杂任务中减少 37% tokens。

AI agents 的未来，是模型能够在数百甚至数千个工具之间无缝工作。一个 IDE assistant 可以集成 git 操作、文件处理、package managers、testing frameworks 和 deployment pipelines。一个 operations coordinator 可以同时连接 Slack、GitHub、Google Drive、Jira、公司数据库，以及数十个 MCP servers。

要构建有效的 agents，它们需要能够使用不受限制的工具库，而不是一开始就把每个定义都塞进 context。我们关于使用 code execution with MCP 的博客文章讨论过，在 agent 读取请求之前，工具结果和定义有时就会消耗 50,000+ tokens。Agents 应该按需发现并加载工具，只保留当前任务相关的内容。

Agents 还需要能够从代码中调用工具。使用自然语言 tool calling 时，每次调用都需要一次完整的 inference pass，中间结果无论是否有用都会堆积在 context 中。代码天然适合编排逻辑，例如循环、条件和数据转换。Agents 需要能够根据当前任务，在 code execution 和 inference 之间灵活选择。

Agents 还需要从示例中学习正确的工具用法，而不仅仅依赖 schema definitions。JSON schemas 定义的是结构上有效的内容，但无法表达使用模式：什么时候包含 optional parameters，哪些组合是合理的，或者你的 API 期望什么约定。

今天，我们发布三项功能来实现这一点：

**Tool Search Tool，**允许 Claude 使用搜索工具访问数千个工具，而不消耗其 context window
Programmatic Tool Calling，允许 Claude 在 code execution environment 中调用工具，从而降低对模型 context window 的影响
Tool Use Examples，提供一种通用标准，用来演示如何有效使用某个给定工具

在内部测试中，我们发现这些功能帮助我们构建了用传统 tool use patterns 无法实现的东西。例如，**Claude for Excel**使用 Programmatic Tool Calling 读取和修改包含数千行的电子表格，而不会让模型的 context window 过载。

基于我们的经验，我们认为这些功能为使用 Claude 构建应用打开了新的可能性。

Tool Search Tool

挑战

MCP tool definitions 提供重要 context，但随着更多 servers 接入，这些 tokens 会不断累积。考虑一个包含五个 servers 的设置：

GitHub：35 个工具（~26K tokens）
Slack：11 个工具（~21K tokens）
Sentry：5 个工具（~3K tokens）
Grafana：5 个工具（~3K tokens）
Splunk：2 个工具（~2K tokens）

这意味着在对话开始之前，58 个工具就已经消耗约 55K tokens。再加入 Jira 这样的更多 servers（仅 Jira 就使用 ~17K tokens），你很快就会接近 100K+ token overhead。在 Anthropic，我们见过工具定义在优化前消耗 134K tokens。

但 token 成本并不是唯一问题。最常见的失败是选错工具和参数不正确，尤其是当工具名称很相似时，例如 notification-send-user 和 notification-send-channel。

我们的解决方案

Tool Search Tool 不再预先加载所有工具定义，而是按需发现工具。Claude 只会看到当前任务实际需要的工具。

图 1：Tool Search Tool 示意图

与 Claude 的传统方法保留 122,800 tokens context 相比，Tool Search Tool 可保留 191,300 tokens context。

传统方法：

所有工具定义预先加载（50+ MCP tools 约 ~72K tokens）
Conversation history 和 system prompt 争用剩余空间
总 context 消耗：在任何工作开始前约 ~77K tokens

使用 Tool Search Tool：

只预先加载 Tool Search Tool（~500 tokens）
根据需要按需发现工具（3-5 个相关工具，~3K tokens）
总 context 消耗：~8.7K tokens，保留 95% 的 context window

这意味着 token 使用量减少 85%，同时仍可访问你的完整工具库。内部测试显示，在使用大型工具库时，启用 Tool Search Tool 能显著提升 MCP evaluations 的准确率。Opus 4 从 49% 提升到 74%，Opus 4.5 从 79.5% 提升到 88.1%。

Tool Search Tool 的工作方式

Tool Search Tool 让 Claude 动态发现工具，而不是一开始加载所有定义。你将所有工具定义提供给 API，但用 defer_loading: true 标记工具，使其可按需发现。Deferred tools 最初不会加载到 Claude 的 context 中。Claude 只会看到 Tool Search Tool 本身，以及所有 defer_loading: false 的工具（你最关键、最常用的工具）。

当 Claude 需要特定能力时，它会搜索相关工具。Tool Search Tool 返回匹配工具的引用，这些引用会在 Claude 的 context 中展开为完整定义。

例如，如果 Claude 需要与 GitHub 交互，它会搜索 "github"，然后只加载 github.createPullRequest 和 github.listIssues，而不是加载来自 Slack、Jira 和 Google Drive 的另外 50+ 个工具。

这样，Claude 可以访问你的完整工具库，同时只为它实际需要的工具支付 token 成本。

**Prompt caching 说明：**Tool Search Tool 不会破坏 prompt caching，因为 deferred tools 完全排除在初始 prompt 之外。它们只有在 Claude 搜索之后才会加入 context，因此你的 system prompt 和核心工具定义仍然可以被缓存。

实现：

{
  "tools": [
    // Include a tool search tool (regex, BM25, or custom)
    {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},

    // Mark tools for on-demand discovery
    {
      "name": "github.createPullRequest",
      "description": "Create a pull request",
      "input_schema": {...},
      "defer_loading": true
    }
    // ... hundreds more deferred tools with defer_loading: true
  ]
}

对于 MCP servers，你可以延迟加载整个 servers，同时保留特定高频工具为已加载状态：

{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": {"defer_loading": true}, # defer loading the entire server
  "configs": {
    "search_files": {
"defer_loading": false
    }  // Keep most used tool loaded
  }
}

Claude Developer Platform 开箱提供基于 regex 和 BM25 的搜索工具，但你也可以使用 embeddings 或其他策略实现自定义搜索工具。

何时使用 Tool Search Tool

和任何架构决策一样，启用 Tool Search Tool 也有取舍。该功能会在工具调用之前增加一个搜索步骤，因此当 context 节省和准确率提升超过额外 latency 时，它的 ROI 最好。

适合使用的情况：

工具定义消耗 >10K tokens
遇到工具选择准确率问题
构建由 MCP 驱动、包含多个 servers 的系统
可用工具数量为 10+ 个

收益较小的情况：

小型工具库（<10 个工具）
每个 session 中都会频繁使用所有工具
工具定义很紧凑

Programmatic Tool Calling

挑战

随着 workflows 变得更复杂，传统 tool calling 会产生两个根本问题：

中间结果造成 context 污染：当 Claude 分析一个 10MB 的日志文件以查找错误模式时，整个文件都会进入它的 context window，即使 Claude 只需要错误频率的摘要。当跨多个表获取客户数据时，每条记录都会在 context 中累积，不管是否相关。这些中间结果会消耗大量 token budget，甚至可能把重要信息完全挤出 context window。
Inference overhead 和手动综合：每次工具调用都需要一次完整的模型 inference pass。收到结果后，Claude 必须通过“目视”数据来提取相关信息，推理各部分如何组合，并决定下一步做什么——所有这些都通过 natural language processing 完成。一个五工具 workflow 意味着五次 inference pass，再加上 Claude 解析每个结果、比较值并综合结论。这既慢又容易出错。

我们的解决方案

Programmatic Tool Calling 让 Claude 能通过代码编排工具，而不是通过单次 API round-trip 逐个调用。Claude 不再一次请求一个工具并让每个结果返回到其 context，而是编写代码来调用多个工具、处理输出，并控制哪些信息真正进入它的 context window。

Claude 擅长编写代码。让它用 Python 表达编排逻辑，而不是通过自然语言工具调用表达，可以获得更可靠、更精确的控制流。循环、条件、数据转换和错误处理都在代码中显式呈现，而不是隐含在 Claude 的推理中。

示例：预算合规检查

考虑一个常见的业务任务：“哪些团队成员超出了 Q3 差旅预算？”

你有三个可用工具：

get_team_members(department) - 返回带有 ID 和 level 的团队成员列表
get_expenses(user_id, quarter) - 返回某个用户的 expense line items
get_budget_by_level(level) - 返回某个员工 level 的预算限制

传统方法：

获取团队成员 → 20 人
对每个人获取其 Q3 expenses → 20 次工具调用，每次返回 50-100 条 line items（机票、酒店、餐饮、收据）
按员工 level 获取预算限制
所有这些都进入 Claude 的 context：2,000+ 条 expense line items（50 KB+）
Claude 手动汇总每个人的费用，查找他们的预算，并将费用与预算限制进行比较
与模型之间需要更多 round-trips，并显著消耗 context

使用 Programmatic Tool Calling：

工具结果不再逐个返回给 Claude，而是由 Claude 编写一个 Python script 来编排整个 workflow。该 script 在 Code Execution tool（一个 sandboxed environment）中运行，当需要你的工具结果时暂停。当你通过 API 返回工具结果时，它们由 script 处理，而不是被模型消费。script 继续执行，Claude 只看到最终输出。

图 2：Programmatic tool calling 流程

Programmatic Tool Calling 让 Claude 能通过代码编排工具，而不是通过单次 API round-trip 逐个调用，从而支持并行工具执行。

下面是预算合规任务中 Claude 的编排代码：

team = await get_team_members("engineering")

# Fetch budgets for each unique level
levels = list(set(m["level"] for m in team))
budget_results = await asyncio.gather(*[
    get_budget_by_level(level) for level in levels
])

# Create a lookup dictionary: {"junior": budget1, "senior": budget2, ...}
budgets = {level: budget for level, budget in zip(levels, budget_results)}

# Fetch all expenses in parallel
expenses = await asyncio.gather(*[
    get_expenses(m["id"], "Q3") for m in team
])

# Find employees who exceeded their travel budget
exceeded = []
for member, exp in zip(team, expenses):
    budget = budgets[member["level"]]
    total = sum(e["amount"] for e in exp)
    if total > budget["travel_limit"]:
        exceeded.append({
            "name": member["name"],
            "spent": total,
            "limit": budget["travel_limit"]
        })

print(json.dumps(exceeded))

Claude 的 context 只接收最终结果：两到三个超出预算的人。2,000+ 条 line items、中间汇总结果和预算查找不会影响 Claude 的 context，将消耗从 200KB 原始 expense data 降低到仅 1KB 结果。

效率提升很明显：

Token 节省：通过让中间结果不进入 Claude 的 context，PTC 显著降低 token 消耗。在复杂 research tasks 中，平均使用量从 43,588 降至 27,297 tokens，减少 37%。
降低 latency：每次 API round-trip 都需要模型 inference（数百毫秒到数秒）。当 Claude 在单个代码块中编排 20+ 个工具调用时，你可以消除 19+ 次 inference pass。API 会处理工具执行，而不必每次都返回模型。
提升准确率：通过编写显式编排逻辑，Claude 相比在自然语言中处理多个工具结果时出错更少。内部 knowledge retrieval 从 25.6% 提升到 28.5%；GIA benchmarks 从 46.5% 提升到 51.2%。

生产环境 workflows 涉及混乱数据、条件逻辑，以及需要扩展的操作。Programmatic Tool Calling 让 Claude 以编程方式处理这种复杂性，同时把重点放在可执行结果上，而不是原始数据处理上。

Programmatic Tool Calling 的工作方式

1. 将工具标记为可从代码调用

将 code_execution 添加到 tools，并设置 allowed_callers，以选择哪些工具可用于 programmatic execution：

{
  "tools": [
    {
      "type": "code_execution_20250825",
      "name": "code_execution"
    },
    {
      "name": "get_team_members",
      "description": "Get all members of a department...",
      "input_schema": {...},
      "allowed_callers": ["code_execution_20250825"] # opt-in to programmatic tool calling
    },
    {
      "name": "get_expenses",
 	...
    },
    {
      "name": "get_budget_by_level",
	...
    }
  ]
}

API 会将这些工具定义转换成 Claude 可以调用的 Python functions。

2. Claude 编写编排代码

Claude 不再逐个请求工具，而是生成 Python code：

{
  "type": "server_tool_use",
  "id": "srvtoolu_abc",
  "name": "code_execution",
  "input": {
    "code": "team = get_team_members('engineering')\n..." # the code example above
  }
}

3. 工具执行不会进入 Claude 的 context

当代码调用 get_expenses() 时，你会收到一个带 caller 字段的工具请求：

{
  "type": "tool_use",
  "id": "toolu_xyz",
  "name": "get_expenses",
  "input": {"user_id": "emp_123", "quarter": "Q3"},
  "caller": {
    "type": "code_execution_20250825",
    "tool_id": "srvtoolu_abc"
  }
}

你提供结果，该结果会在 Code Execution environment 中处理，而不是进入 Claude 的 context。这个请求-响应周期会对代码中的每个工具调用重复执行。

4. 只有最终输出进入 context

当代码运行结束时，只有代码结果会返回给 Claude：

{
  "type": "code_execution_tool_result",
  "tool_use_id": "srvtoolu_abc",
  "content": {
    "stdout": "[{\"name\": \"Alice\", \"spent\": 12500, \"limit\": 10000}...]"
  }
}

这就是 Claude 看到的全部内容，而不是执行过程中处理的 2000+ 条 expense line items。

何时使用 Programmatic Tool Calling

Programmatic Tool Calling 会在你的 workflow 中增加一个 code execution 步骤。当 token 节省、latency 改善和准确率提升足够明显时，这个额外 overhead 是值得的。

最有收益的情况：

处理大型 datasets，而你只需要 aggregates 或 summaries
运行包含三个或更多依赖型工具调用的 multi-step workflows
在 Claude 看到结果前，对工具结果进行 filtering、sorting 或 transforming
处理中间数据不应影响 Claude 推理的任务
对大量项目运行并行操作（例如检查 50 个 endpoints）

收益较小的情况：

进行简单的单工具调用
处理需要 Claude 查看并推理所有中间结果的任务
运行响应较小的快速查找

Tool Use Examples

挑战

JSON Schema 擅长定义结构——类型、必填字段、允许的 enums——但它无法表达使用模式：什么时候包含 optional parameters，哪些组合是合理的，或者你的 API 期望什么约定。

考虑一个支持工单 API：

{
  "name": "create_ticket",
  "input_schema": {
    "properties": {
      "title": {"type": "string"},
      "priority": {"enum": ["low", "medium", "high", "critical"]},
      "labels": {"type": "array", "items": {"type": "string"}},
      "reporter": {
        "type": "object",
        "properties": {
          "id": {"type": "string"},
          "name": {"type": "string"},
          "contact": {
            "type": "object",
            "properties": {
              "email": {"type": "string"},
              "phone": {"type": "string"}
            }
          }
        }
      },
      "due_date": {"type": "string"},
      "escalation": {
        "type": "object",
        "properties": {
          "level": {"type": "integer"},
          "notify_manager": {"type": "boolean"},
          "sla_hours": {"type": "integer"}
        }
      }
    },
    "required": ["title"]
  }
}

schema 定义了什么是有效的，但留下了关键问题未解答：

格式歧义：due_date 应使用 "2024-11-06"、"Nov 6, 2024"，还是 "2024-11-06T00:00:00Z"？
ID 约定：reporter.id 是 UUID、"USR-12345"，还是仅 "12345"？
**嵌套结构用法：**Claude 应该什么时候填充 reporter.contact？
参数相关性：escalation.level 和 escalation.sla_hours 与 priority 如何关联？

这些歧义可能导致格式错误的工具调用和不一致的参数使用。

我们的解决方案

Tool Use Examples 让你可以直接在工具定义中提供示例工具调用。你不再只依赖 schema，而是向 Claude 展示具体用法模式：

{
    "name": "create_ticket",
    "input_schema": { /* same schema as above */ },
    "input_examples": [
      {
        "title": "Login page returns 500 error",
        "priority": "critical",
        "labels": ["bug", "authentication", "production"],
        "reporter": {
          "id": "USR-12345",
          "name": "Jane Smith",
          "contact": {
            "email": "jane@acme.com",
            "phone": "+1-555-0123"
          }
        },
        "due_date": "2024-11-06",
        "escalation": {
          "level": 2,
          "notify_manager": true,
          "sla_hours": 4
        }
      },
      {
        "title": "Add dark mode support",
        "labels": ["feature-request", "ui"],
        "reporter": {
          "id": "USR-67890",
          "name": "Alex Chen"
        }
      },
      {
        "title": "Update API documentation"
      }
    ]
  }

从这三个示例中，Claude 会学到：

格式约定：日期使用 YYYY-MM-DD，用户 ID 遵循 USR-XXXXX，labels 使用 kebab-case
嵌套结构模式：如何构造 reporter object 及其嵌套 contact object
可选参数相关性：Critical bugs 包含完整 contact info + 带严格 SLA 的 escalation；feature requests 有 reporter 但没有 contact/escalation；internal tasks 只有 title

在我们自己的内部测试中，tool use examples 将复杂参数处理的准确率从 72% 提升到 90%。

何时使用 Tool Use Examples

Tool Use Examples 会为你的工具定义增加 tokens，因此当准确率提升超过额外成本时，它们最有价值。

最有收益的情况：

复杂嵌套结构，其中有效 JSON 并不意味着正确用法
工具有许多 optional parameters，且 inclusion patterns 很重要
API 具有 schema 无法捕捉的领域特定约定
相似工具需要通过示例澄清应使用哪一个（例如 create_ticket 与 create_incident）

收益较小的情况：

用法明显的简单单参数工具
Claude 已经理解的标准格式，例如 URLs 或 emails
更适合通过 JSON Schema constraints 处理的验证问题

最佳实践

构建能够执行真实世界操作的 agents，意味着要同时处理规模、复杂性和精确性。这三项功能协同解决 tool use workflows 中的不同瓶颈。下面是如何有效组合它们。

有策略地分层使用功能

并非每个 agent 在给定任务中都需要使用全部三项功能。先从最大的瓶颈开始：

工具定义导致 context 膨胀 → Tool Search Tool
大量中间结果污染 context → Programmatic Tool Calling
参数错误和格式错误调用 → Tool Use Examples

这种聚焦方法可以让你解决限制 agent performance 的具体约束，而不是一开始就增加复杂性。

然后根据需要叠加其他功能。它们是互补的：Tool Search Tool 确保找到正确工具，Programmatic Tool Calling 确保高效执行，Tool Use Examples 确保正确调用。

设置 Tool Search Tool 以获得更好的发现效果

工具搜索会匹配名称和描述，因此清晰、描述性的定义能提升发现准确率。

// Good
{
    "name": "search_customer_orders",
    "description": "Search for customer orders by date range, status, or total amount. Returns order details including items, shipping, and payment info."
}

// Bad
{
    "name": "query_db_orders",
    "description": "Execute order query"
}

添加 system prompt 指引，让 Claude 知道有哪些能力可用：

You have access to tools for Slack messaging, Google Drive file management, 
Jira ticket tracking, and GitHub repository operations. Use the tool search 
to find specific capabilities.

让三到五个最常用工具始终处于加载状态，其余延迟加载。这样可以在常见操作的即时访问和其他工具的按需发现之间取得平衡。

设置 Programmatic Tool Calling 以确保正确执行

由于 Claude 会编写代码来解析工具输出，因此要清楚记录返回格式。这有助于 Claude 编写正确的解析逻辑：

{
    "name": "get_orders",
    "description": "Retrieve orders for a customer.
Returns:
    List of order objects, each containing:
    - id (str): Order identifier
    - total (float): Order total in USD
    - status (str): One of 'pending', 'shipped', 'delivered'
    - items (list): Array of {sku, quantity, price}
    - created_at (str): ISO 8601 timestamp"
}

下面这些 opt-in 工具适合从 programmatic orchestration 中受益：

可以并行运行的工具（独立操作）
可安全重试的操作（idempotent）

设置 Tool Use Examples 以提升参数准确率

为行为清晰度设计示例：

使用真实感数据（真实城市名、合理价格，而不是 "string" 或 "value"）
展示多样性，包括最小、部分和完整 specification patterns
保持简洁：每个工具 1-5 个示例
聚焦歧义（只有在 schema 无法明显体现正确用法时才添加示例）

开始使用

这些功能目前处于 beta。要启用它们，请添加 beta header，并包含你需要的工具：

client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        {"type": "code_execution_20250825", "name": "code_execution"},
        # Your tools with defer_loading, allowed_callers, and input_examples
    ]
)

关于详细 API 文档和 SDK 示例，请参见我们的：

Tool Search Tool 的文档和 cookbook
Programmatic Tool Calling 的文档和 cookbook
Tool Use Examples 的文档

这些功能将 tool use 从简单的 function calling 推向智能编排。随着 agents 处理横跨数十个工具和大型 datasets 的更复杂 workflows，动态发现、高效执行和可靠调用会成为基础能力。

我们期待看到你构建的成果。

致谢

作者 Bin Wu，Adam Jones、Artur Renault、Henry Tay、Jake Noble、Noah Picard、Sam Jiang 和 Claude Developer Platform 团队参与贡献。这项工作建立在 Chris Gorgolewski、Daniel Jiang、Jeremy Fox 和 Mike Lambert 的基础研究之上。我们也从整个 AI 生态中获得启发，包括 Joel Pobar 的 LLMVM、Cloudflare 的 Code Mode 和 Code Execution as MCP。特别感谢 Andy Schumeister、Hamish Kerr、Keir Bradwell、Matt Bleifer 和 Molly Vorwerck 的支持。

译自 Anthropic · 工程博客 · 录于二〇二六年五月八日