apple-ml-research

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

May 8, 2026 · English original

Abstract

This paper proposes PORTool, an importance-aware policy-optimization algorithm for training LLM tool-use agents. The method assigns step-level rewards under outcome-level supervision, which mitigates the credit-assignment ambiguity of outcome-only rewards and reinforces tool-use decisions in multi-tool-integrated reasoning.

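Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents with outcome-only rewards suffers from credit-assignment ambiguity: it is hard to tell which intermediate steps (or tool-use decisions) led to success or failure. This paper proposes PORTool, an importance-aware policy-optimization algorithm that assigns rewards at the step level while using outcome-level supervision, thereby reinforcing the agent's tool-use competence. Specifically, PORTool generates a rewarded…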
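The credit-assignment problem described above can be made concrete with a small sketch. This is not PORTool's algorithm, whose details are not given in this excerpt. It is a generic illustration that assumes several rollouts share step prefixes (forming a tree) and estimates each step's credit as the change in mean outcome reward when that step is taken. The tool names, the `prefix_values` and `step_credits` helpers, and the reward values are all hypothetical.

```python
from collections import defaultdict


def prefix_values(rollouts):
    """Map each step prefix to the mean outcome reward of rollouts sharing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        # Every prefix of a rollout (including the empty root) sees its outcome.
        for i in range(len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += outcome
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


def step_credits(steps, values):
    """Credit per step: mean outcome after the step minus mean outcome before it."""
    return [
        values[tuple(steps[: i + 1])] - values[tuple(steps[:i])]
        for i in range(len(steps))
    ]


# Hypothetical rollouts: (sequence of tool calls, outcome-only reward).
rollouts = [
    (["search", "calculator"], 1.0),
    (["search", "weather"], 0.0),
    (["weather", "calculator"], 0.0),
    (["weather", "search"], 0.0),
]

values = prefix_values(rollouts)
good = step_credits(["search", "calculator"], values)
bad = step_credits(["search", "weather"], values)
print(good)  # [0.25, 0.5]
print(bad)   # [0.25, -0.5]
```

With outcome-only rewards, both steps of the second rollout would receive the same signal (0.0). In this sketch, the shared first step `search` keeps a positive credit while the diverging second step is penalized, which is the kind of distinction step-level reward assignment aims to capture.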

Translated from apple-ml-research · Recorded May 8, 2026