aws-ml

Migrating a text agent to a voice assistant with Amazon Nova 2 Sonic

May 8, 2026

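Summary

This article explains how to migrate a text agent to a conversational voice assistant with Amazon Nova 2 Sonic. It compares text and voice agents across input, response style, latency, turn-taking, and transport; breaks down the client, orchestrator, and business logic architecture; and shows how to reuse tools, adapt the system prompt, and work with sub-agents and asynchronous tool calling using Strands Agents and BidiAgent. The authors are Lana Zhang and Osman Ipek of AWS.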

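Migrating a text agent to a voice assistant is becoming increasingly important as users expect faster, more natural interactions. Customers don't want to type; they want to speak in real time and be understood. Industries such as finance, healthcare, education, social media, and retail are exploring solutions built on Amazon Nova 2 Sonic to support natural, real-time voice interactions at scale. In this post, we explore how to migrate a traditional text agent to a conversational voice assistant using Amazon Nova 2 Sonic. We compare the requirements of text and voice agents, highlight design priorities for different use cases, break down the agent architecture, and discuss common concerns such as reusing tools and sub-agents and adapting system prompts. This post will help you understand the migration process and avoid common pitfalls. You can also find a Skill in the Nova sample repo that works with AI IDEs such as Kiro and Claude Code to automatically convert your text agent into a voice agent.

Text agents and voice agents are not the same problem

Migrating from a text agent to a voice assistant might look like adding a voice interface on top of unchanged business logic, but the two differ in the following ways.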

Aspect	Text agent	Voice agent
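User input	Typed text: users read, scroll, and copy-paste at their own pace	Spoken audio stream: real time, interruptible (barge-in), pauses matter
Response style	Paragraphs, lists, tables, links: rich formatting, all information delivered at once	Short spoken sentences, one thing at a time ("Want me to continue?"), with confirmation loops
Latency budget	Tolerates moderate latency: a typing indicator can mask the wait	Requires very low latency: silence feels like something is broken
Turn-taking	Strict request → response: the user types, presses Enter, and waits	Fluid, overlapping, interruptible: requires voice activity detection (VAD) plus turn detection and barge-in
Transport	HTTP / REST / Server-Sent Events: stateless request-response	Bidirectional streaming: persistent connection, real-time audio in both directions

To better address these challenges, we break down the key differences between text agents and voice assistants and how they affect design and implementation.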

Response design

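Text agents are designed to deliver paragraphs of content that users read at their own pace. Users can scroll back, copy content, and open links as needed. Voice agents operate in a completely different medium. Responses must be conversational, concise, and structured for listening rather than reading.

Consider a banking agent that returns account information: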

Text agent response:

Here's your account summary:

Checking (****4521): $3,245.67
Savings (****8903): $12,450.00
Credit Card (****2187): -$1,823.45 (payment due: March 15)

You can click on any account for detailed transactions.

Voice agent response:

“You have three accounts. Your checking account ends in 4521 with a balance of three thousand two hundred forty-five dollars. Want me to go through the others or would you like details on this one?”

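The voice agent breaks the information into digestible chunks and asks for confirmation before continuing. It adopts an autonomous conversation style, proactively guiding the user rather than dumping all the information at once.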
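The chunk-and-confirm pattern above can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the `ACCOUNTS` records, their field names, and the `spoken_chunks` helper are all assumptions made for the example, and in practice the speech model generates the wording itself.

```python
from typing import Iterator

# Hypothetical account records; field names are assumptions for this sketch.
ACCOUNTS = [
    {"type": "checking", "last4": "4521", "balance": 3245.67},
    {"type": "savings", "last4": "8903", "balance": 12450.00},
    {"type": "credit card", "last4": "2187", "balance": -1823.45},
]


def spoken_chunks(accounts: list) -> Iterator[str]:
    """Yield one short utterance per account, each ending with a check-in."""
    total = len(accounts)
    noun = "account" if total == 1 else "accounts"
    for index, account in enumerate(accounts):
        # Only the first utterance announces how many accounts there are.
        opening = f"You have {total} {noun}. " if index == 0 else ""
        # Drop the cents so the spoken sentence stays short.
        dollars = int(abs(account["balance"]))
        owed = " owed" if account["balance"] < 0 else ""
        sentence = (
            f"{opening}Your {account['type']} account ends in "
            f"{account['last4']} with a balance of {dollars:,} dollars{owed}."
        )
        is_last = index == total - 1
        follow_up = "Anything else?" if is_last else "Want me to continue?"
        yield f"{sentence} {follow_up}"
```

Because the function is a generator, the caller can stop after any chunk if the user declines to continue, which mirrors the confirmation loop described above.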

Latency budget

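Text users tolerate moderate latency. They see a typing indicator and wait. Voice users notice delays almost immediately. Silence in a voice conversation feels like a dropped call. This changes how the agent must be architected: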

Factor	Text agent	Voice agent
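Acceptable response time	Tolerates moderate latency: a few seconds of waiting with a loading indicator is acceptable.	Low latency tolerance: the conversation should respond within hundreds of milliseconds and deliver the first audio as soon as possible; delays of a few seconds, especially during tool calls, feel unresponsive.
Tool call tolerance	Multiple sequential calls are acceptable	Each call adds noticeable silence
Streaming	Optional	Required
Asynchronous tool handling	Nice to have	Critical capability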

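Amazon Nova 2 Sonic supports asynchronous tool calling, so the conversation can continue naturally while tools run in the background. It keeps accepting input and can run multiple tools in parallel. If the user changes the request mid-process, it adapts gracefully, returning all results while focusing on what is still relevant.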
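To make the idea concrete, here is a minimal `asyncio` sketch of the general pattern: tools start in the background, the assistant keeps talking, and results are reported as they finish. It does not use the Nova 2 Sonic API; `slow_tool`, `handle_turn`, and the delays are hypothetical stand-ins chosen for the example.

```python
import asyncio


async def slow_tool(name: str, delay: float) -> str:
    """Stand-in for a real tool call (API, database, or sub-agent)."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def handle_turn(spoken: list) -> None:
    """Start tools in the background, speak a filler, then report results."""
    tasks = [
        asyncio.create_task(slow_tool("balance lookup", 0.05)),
        asyncio.create_task(slow_tool("transaction lookup", 0.02)),
    ]
    # The conversation continues while both tools run concurrently.
    spoken.append("One moment while I pull that up.")
    # Report each result as soon as it is ready, not in submission order.
    for finished in asyncio.as_completed(tasks):
        spoken.append(await finished)


def run_demo() -> list:
    """Run one simulated turn and return everything the assistant 'said'."""
    spoken = []
    asyncio.run(handle_turn(spoken))
    return spoken
```

The total wait here is roughly the duration of the slowest tool rather than the sum of both, and the user hears something immediately instead of silence.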

Turn-taking and interruption

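Text conversations are inherently turn-based. The user types, presses Enter, and waits for a reply. Voice conversations are fluid. Users interrupt (barge-in), pause mid-sentence, and expect the agent to handle overlapping speech naturally.

Native speech-to-speech models such as Amazon Nova 2 Sonic handle these issues internally through built-in voice activity detection (VAD) and turn detection. Nova 2 Sonic manages the conversation context without requiring the full history to be sent on every turn.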
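For intuition about what VAD and turn detection do, the toy sketch below treats a frame as speech when its energy exceeds a threshold and ends the turn after a run of quiet frames. It is deliberately simplistic and is not how Nova 2 Sonic works internally; the threshold, the frame format, and both function names are assumptions for illustration.

```python
import math
from typing import List, Optional


def frame_rms(samples: List[int]) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_turn_end(
    frames: List[List[int]],
    threshold: float = 500.0,
    silence_frames: int = 3,
) -> Optional[int]:
    """Return the index of the frame where the turn ends, or None.

    A turn ends once speech has been heard and then `silence_frames`
    consecutive frames fall below `threshold`.
    """
    heard_speech = False
    quiet_run = 0
    for index, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0  # a mid-sentence pause was not the end of the turn
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return index
    return None
```

Note how a short pause between two loud frames resets the counter. That is why pauses matter in voice: too small a `silence_frames` value cuts users off mid-sentence, and too large a value adds dead air before the reply.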

Migration from an architectural view

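With these differences in mind, we break down the migration from an architectural view, dividing the system into three main components and examining how each one evolves.

Conceptually, a text agent consists of three components:

Client application, such as a web, mobile, or IoT interface.
Text orchestrator, which manages the system prompt, tools, and conversation context.
Tool integrations, which connect to your systems, such as APIs, databases, workflows, Retrieval Augmented Generation (RAG) pipelines, or sub-agents.

When you migrate this architecture to a voice agent, these components remain, but each requires a different degree of change to support voice-specific logic.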

The client application

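The agent client is typically implemented in the programming language and system suited to a web browser, mobile app, or IoT device, depending on the deployment scenario.

A voice agent client requires a persistent bidirectional connection (for example, WebSocket or WebRTC) and must handle audio encoding and decoding, client events, barge-in logic, noise control, and transcription display. This is considerably more complex than a text client, which typically communicates with the agent through a stateless REST or one-way HTTPS streaming interface. As a result, this component usually needs to be refactored or completely rewritten. For example, a PoC built with a Streamlit frontend will likely need to be rebuilt with a JavaScript framework such as React to support a bidirectional connection. For a lightweight voice agent web client application in React that uses WebSocket, refer to this sample.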
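One piece of client-side barge-in logic is discarding assistant audio that is still queued for playback when the user starts speaking. The sketch below shows that idea in Python for consistency with the other examples in this post, although a browser client would implement it in JavaScript. The `PlaybackQueue` class and its method names are hypothetical and do not belong to any SDK.

```python
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Holds assistant audio chunks that are waiting to be played."""

    def __init__(self) -> None:
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        """Add an audio chunk received from the server."""
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def on_barge_in(self) -> int:
        """Drop all queued audio when the user interrupts.

        Returns the number of chunks discarded, which can be useful
        for logging how often users interrupt.
        """
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

Without this step, the assistant keeps talking over the user for as long as buffered audio remains, even if the model has already stopped generating.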

The orchestrator

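Whether you build a text agent or a voice agent, the agent orchestrator is the central hub. It manages the system prompt, selects and routes tools or sub-agents, and maintains the conversation context so that interactions stay coherent and consistent with the agent's role.

In a text agent, the orchestrator handles requests and responses between the client and the reasoning model while integrating tools to trigger business logic. A voice orchestrator follows the same principles but adds audio streaming, Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), reasoning, and Text-to-Speech (TTS). Amazon Nova 2 Sonic provides a bidirectional streaming interface that combines these capabilities, so you can carry over the reasoning prompt and tool triggers from your text agent, which makes the transition to voice smoother.

A key difference from the traditional text-agent architecture is that Amazon Nova 2 Sonic accepts both text and audio input through the same model interface. This means that Sonic can directly replace the separate text reasoning model typically used in a text orchestrator. Instead of chaining separate ASR → LLM → TTS components, Sonic unifies speech recognition, reasoning, tool use, and speech synthesis in a single bidirectional model. With this, teams can reuse existing prompts and tools while simplifying the architecture, reducing latency, and avoiding the need to manage a separate text reasoning model in the voice stack.

The following code snippets show a sample text agent built with Strands Agents that uses Amazon Nova 2 Lite as the large language model (LLM). The sample defines tools and then uses Strands BidiAgent with Nova 2 Sonic to create a voice agent orchestrator that is accessible over WebSocket. You will notice that the coding style for text and voice agents in Strands is very similar. Although the sample uses Strands, the same approach applies to text agents built with other frameworks such as LangChain, LangGraph, or CrewAI, because the key inputs a text orchestrator needs are the system prompt and the tool definitions.

Before running the examples in the following sections, install Python and the required dependencies, including strands-agents and Boto3, and make sure that your IAM setup has the necessary permissions for the required services.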

from strands import Agent, tool
from strands.models import BedrockModel

# ---- Mock tools will be used in both text and voice agents ----

@tool
def authenticate_customer(account_id: str, date_of_birth: str) -> str:
    """Verify customer identity and return an auth token."""
    # In real implementation, call your auth service / API
    if account_id == "123456":
        return "AUTH_TOKEN_ABC123"
    return "Authentication failed"

@tool
def get_account_balance(auth_token: str) -> str:
    """Return the customer's current account balance."""
    if auth_token == "AUTH_TOKEN_ABC123":
        return "Your current checking account balance is $5,420."
    return "Unauthorized request"

@tool
def get_recent_transactions(auth_token: str) -> str:
    """Return recent transactions."""
    if auth_token == "AUTH_TOKEN_ABC123":
        return "Recent transactions: $45 groceries, $120 utilities, $18 coffee."
    return "Unauthorized request"

With Strands Agents, you can create a text agent orchestrator that uses Nova 2 Lite as its model, as in the following example:

# ---- Nova 2 Lite model ----
model = BedrockModel(model_id="amazon.nova-2-lite-v1:0")

# ---- Banking assistant text agent ----
bank_agent = Agent(
    model=model,
    system_prompt="""You are a banking assistant.
    Answer user questions about account balances, recent transactions accurately.
    Always validate user identity before providing sensitive information.
    """,
    tools=[authenticate_customer, get_account_balance, get_recent_transactions],
)

With Strands BidiAgent, you can build a voice agent orchestrator on the Nova 2 Sonic model in a similar coding style and reuse the same tools:

# voice_orchestrator.py: BidiAgent reusing the tools defined for the text agent
from strands.experimental.bidi.agent import BidiAgent
from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel

# ---- Nova 2 Sonic model ----
model = BidiNovaSonicModel(
    region="us-east-1",
    model_id="amazon.nova-2-sonic-v1:0",
    provider_config={
        "audio": {
            "voice": "tiffany",
            "input_sample_rate": 16000,
            "output_sample_rate": 16000,
        }
    },
)

# ---- Banking assistant voice agent ----
agent = BidiAgent(
    model=model,
    system_prompt="""
    You are a banking assistant.
    Speak naturally and answer questions about account balances, recent transactions.
    Confirm the customer's identity before sharing sensitive details.
    Use short, clear responses and acknowledge when retrieving data.
    """,
    tools=[authenticate_customer, get_account_balance, get_recent_transactions],
)

# `await` is only valid inside a coroutine, so the call is wrapped here.
# ws_input and ws_output are the WebSocket input and output handlers;
# the original snippet does not show how they are created.
async def run_voice_agent(ws_input, ws_output):
    await agent.run(inputs=[ws_input], outputs=[ws_output])

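The system prompt is the foundation of both text and voice agents. It defines the agent's role, tone, and guardrails, so that responses stay consistent, reliable, and aligned with business goals and user expectations, whether the interaction is written or spoken.

When you migrate from text to voice, adapt the system prompt for real-time audio. Keep it concise and conversational, account for latency and multi-turn context, and break complex guidance into smaller steps.

Text prompt (original):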

“You are a banking assistant. Answer user questions about account balances, recent transactions accurately. Always validate user identity before providing sensitive information.”

Voice-adapted prompt:

“You are a banking assistant. Speak naturally and answer questions about account balances, recent transactions. Confirm the customer’s identity before sharing sensitive details. Use short, clear responses and acknowledge when retrieving data.”
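If you maintain both a text agent and a voice agent, one lightweight way to keep the two prompts in sync is to derive the voice prompt from the text prompt by appending voice-specific guidance. The sketch below is an assumption about how a team might do this, not the approach used in this post, where the voice prompt is rewritten by hand. `VOICE_GUIDELINES` and `adapt_prompt_for_voice` are hypothetical names.

```python
# Voice-specific guidance taken from the adapted prompt shown above.
VOICE_GUIDELINES = (
    "Speak naturally.",
    "Use short, clear responses.",
    "Acknowledge when retrieving data.",
)


def adapt_prompt_for_voice(text_prompt: str, guidelines=VOICE_GUIDELINES) -> str:
    """Append voice guidance to an existing text-agent system prompt."""
    # Collapse the whitespace that triple-quoted strings tend to carry.
    base = " ".join(text_prompt.split())
    return " ".join([base, *guidelines])
```

A hand-written voice prompt usually reads better than a mechanically extended one, so treat this as a starting point for review rather than a replacement for editing.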

Note that in a voice orchestrator built on Nova 2 Sonic, you use Sonic's built-in reasoning capability to manage the system prompt, tool selection, and session context. You no longer need to supply your own LLM for reasoning at the orchestrator layer.

The business logic layer

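Tool integration is the key link that connects an agentic assistant to the business layer. Common protocols include Model Context Protocol (MCP), Agent-to-Agent (A2A), and standard HTTP. In a text-based agent, the orchestrator sends text input to tools such as REST APIs, RAG systems, or databases, and receives text responses that it uses to generate the reply to the user. In the Strands Agents example, the same tools used by the text agent can be reused in the voice agent without code changes.

However, reusing tools and sub-agents for voice is more than an implementation detail. If you already use a multi-agent architecture, the agents dedicated to business logic can often be reused for voice after some updates. The following diagram shows a banking assistant in which the voice orchestrator calls sub-agents for authentication and mortgage inquiries. Although these sub-agents don't need a complete rewrite, they do need tuning for voice: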

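Shorter responses – A text sub-agent might return a detailed paragraph. A voice sub-agent should return 1–2 sentences that the orchestrator can speak naturally. For example, you can change the sub-agent's system prompt from “Provide a comprehensive answer.” to “Summarize in 1 to 2 concise sentences.”
Latency improvement – Choose smaller, faster models for sub-agents (for example, start with Nova 2 Lite rather than a larger model). In a voice conversation, every extra inference hop adds noticeable silence. For Nova 2 Lite, we recommend limiting or avoiding thinking mode to reduce latency. For more information, see the Amazon Nova Developer Guide for Amazon Nova 2.
Reduced verbosity in tool results – Some sub-agents are designed to return large raw payloads, such as JSON containing more data than was requested, and leave the orchestrator to filter the response. This is not ideal, especially for voice. Larger payloads increase latency, can reduce accuracy, and might expose sensitive data. Lean, targeted responses are critical, especially in latency-sensitive voice experiences.
Use filler messages to keep conversations natural during longer tool processing. With Amazon Nova 2 Sonic, you can make asynchronous tool calls and customize these interim messages so that the user stays engaged while the agent completes the task.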
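As an illustration of the point about verbose tool results, a tool or sub-agent can trim its payload to the requested fields before handing it back to the voice orchestrator. The payload below is made-up mock data, and `trim_tool_result` is a hypothetical helper written for this sketch.

```python
# Mock raw payload; every value here is invented for the example.
RAW_PAYLOAD = {
    "account_id": "123456",
    "balance": 5420.00,
    "currency": "USD",
    "ssn": "000-00-0000",
    "internal_flags": ["kyc_ok"],
    "transactions": [{"amount": 45, "category": "groceries"}],
}


def trim_tool_result(payload: dict, wanted_fields: list) -> dict:
    """Keep only the fields the orchestrator actually asked for.

    Fields missing from the payload are skipped rather than raising,
    so one tool can serve requests for different field sets.
    """
    return {key: payload[key] for key in wanted_fields if key in payload}
```

For a balance question, `trim_tool_result(RAW_PAYLOAD, ["balance", "currency"])` passes two fields to the model instead of six, and the sensitive `ssn` field never leaves the tool.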

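Most of these adjustments involve prompt and configuration changes rather than architectural modifications. The sub-agent's tools, business logic, and deployment can stay the same.

A sub-agent architecture provides clarity, reusability, and portability, which is especially useful when migrating a text agent to voice. However, every sub-agent call adds latency from its own model inference and tool calls. In a voice conversation, this can translate into noticeable pauses caused by sub-agent reasoning. For more voice agent architecture patterns and best practices for managing latency, refer to this blog.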
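One way to contain sub-agent latency is to give each call a time budget and fall back to a short spoken message when the budget is exceeded. The following sketch uses the standard library's `asyncio.wait_for`. The two sub-agents are mock coroutines with invented replies, and `call_with_budget` is a hypothetical helper, not part of Strands or Nova 2 Sonic.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_budget(
    call: Callable[[], Awaitable[str]],
    budget_seconds: float,
    fallback: str,
) -> str:
    """Run a sub-agent call, returning `fallback` if it exceeds the budget."""
    try:
        return await asyncio.wait_for(call(), timeout=budget_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the slow call before raising.
        return fallback


async def fast_sub_agent() -> str:
    """Mock sub-agent that answers well within the budget."""
    await asyncio.sleep(0.01)
    return "Your identity is verified."


async def slow_sub_agent() -> str:
    """Mock sub-agent whose answer would arrive too late for voice."""
    await asyncio.sleep(1.0)
    return "A very detailed answer that arrives too late."
```

The right budget depends on the use case. A fallback such as "I'm still checking on that" works best together with asynchronous tool calling, so that the result can still be delivered once it arrives.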

Conclusion

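Migrating a text agent to a voice assistant is not as simple as adding a wrapper. The interaction model differs fundamentally, from response design and latency budget to turn-taking behavior. But with a well-structured multi-agent architecture and Amazon Nova 2 Sonic, the business logic layer can remain intact.

Start your migration project and use Amazon Nova 2 Sonic to turn your text agent into a voice assistant. For a complete working example of a voice agent using Amazon Nova 2 Sonic, see Amazon Nova 2 Sonic in Strands BidiAgent.

Additional documentation and resources: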

Amazon Nova 2 Sonic
Amazon Nova 2 Sonic sample code and repeatable patterns
Amazon Nova 2 Sonic user guide
Amazon Nova 2 Sonic technical report and model card

About the authors

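Lana Zhang is a Senior Specialist Solutions Architect for Generative AI in the AWS Worldwide Specialist Organization. She specializes in AI/ML, with a focus on use cases such as AI voice assistants and multimodal understanding. She works closely with customers across diverse industries, including media and entertainment, gaming, sports, advertising, financial services, and healthcare, to help them transform their business solutions through AI.

Osman Ipek is a Solutions Architect on the Amazon AGI team, focusing on Nova foundation models. He guides teams in accelerating development through practical AI implementation strategies, with expertise spanning voice AI, NLP, and MLOps.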