huggingface-blog

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

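May 3, 2026 · Original in English

Summary

NVIDIA has released Nemotron 3 Nano Omni 30B-A3B, an omni-modal understanding model that supports text, image, video, and audio. The model combines a hybrid Mamba-Transformer MoE backbone with the C-RADIOv4-H vision encoder and a Parakeet-TDT audio encoder. It is trained with multi-stage alignment, context extension, preference optimization, and multimodal RL, and is released as open BF16, FP8, and NVFP4 checkpoints.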

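NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning.
It extends the Nemotron multimodal family from strong vision-language systems to a broader text + image + video + audio model.
Nemotron 3 Nano Omni achieves best-in-class accuracy on complex document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, and it also leads video and audio leaderboards such as WorldSense and DailyOmni. It achieves the highest accuracy on VoiceBench for audio understanding and ranks as the most cost-efficient open video understanding model on MediaPerf.
Under the hood, it combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts backbone, the C-RADIOv4-H vision encoder, and the Parakeet-TDT-0.6B-v2 audio encoder.
The architecture is designed to preserve fine visual detail, add native audio understanding, and scale to very long multimodal contexts for dense image, document, video, and mixed-modality reasoning.
The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning.
Compared with alternatives, Nemotron 3 Nano Omni delivers up to 9x higher throughput and 2.9x faster single-stream inference on multimodal use cases.
BF16, FP8, and NVFP4 checkpoints are available for download on Hugging Face.
For more on the model architecture, training recipe, data pipeline, and benchmarks, read the full Nemotron 3 Nano Omni report.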

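Benchmark highlights

Building on Nemotron Nano V2 VL, Nemotron 3 Nano Omni brings a significant improvement in visual capability and adds new audio and video + audio capabilities, while leading Qwen3-Omni, another open-weight omni model, in many areas.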

Task	Benchmark	Nemotron 3 Nano Omni	Nemotron Nano V2 VL	Qwen3-Omni 30B-A3B
Document understanding	OCRBenchV2-En	65.8	61.2	-
	MMLongBench-Doc	57.5	38.0	49.5
	CharXiv reasoning	63.6	41.3	61.1
GUI	ScreenSpot-Pro	57.8	5.5	59.7
	OSWorld	47.4	11.0	29.0
Video understanding	Video-MME	72.2	63.0	70.5
Video + audio understanding	WorldSense	55.4	-	54.0
	DailyOmni	74.1	-	73.6
Voice interaction	VoiceBench	89.4	-	88.8
ASR	HF Open ASR (lower is better)	5.95	-	6.55

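Efficiency highlights

Compared with other open omni models at the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency in multi-document use cases and 9.2x higher system efficiency in video use cases.

Figure 1. Total system throughput sustained by each model in multi-document and video use cases at a fixed per-user interactivity threshold (tokens/sec/user)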

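What Nemotron 3 Nano Omni is designed for

At a high level, Nemotron 3 Nano Omni targets five classes of workloads:

1. Real-world document analysis

This is more than OCR. The model targets long, complex, high-value documents whose understanding depends on layout, tables, figures, formulas, section structure, and cross-page references. Think of contracts, technical papers, reports, manuals, multi-page forms, or compliance packets. The model can process documents of 100+ pages.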

2. Automatic Speech Recognition

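Nemotron 3 Nano Omni has strong speech understanding and delivers high-quality transcription across varied audio conditions. It can handle long audio and adapts to different speakers, accents, and background noise. These capabilities can be integrated into broader workflows, so that spoken content can be transcribed, analyzed, and combined with other modalities for tasks such as summarization, question answering, and cross-modal reasoning.

3. Long audio-video understanding

Many enterprise and developer workflows depend on mixed audio and visual evidence: narrated screen recordings, training videos, meetings with slides, tutorials, product demos, customer support screen captures, and long video archives. Nemotron 3 Nano Omni is built to reason jointly over these inputs.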

4. Agentic computer use

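Nemotron 3 Nano Omni is trained specifically for agentic computer use, so it can assist with tasks in graphical user interface (GUI) environments. Its capabilities include interpreting screenshots, monitoring user interface state, grounding its reasoning in the visual content of the screen, and helping with action selection or workflow automation.

5. General multimodal reasoning

The model is designed to go beyond perception. It excels at reasoning-intensive tasks that require synthesizing information across long context windows, multiple modalities, and structured or semi-structured evidence. It can perform multi-step reasoning, carry out calculations, and connect signals from text, images, tables, and other inputs to reach coherent, well-grounded answers.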

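Model architecture and key innovations

Nemotron 3 Nano Omni uses a unified encoder-projector-decoder design. The language backbone is Nemotron 3 Nano 30B-A3B, paired with the C-RADIOv4-H vision encoder and the Parakeet-TDT-0.6B-v2 audio encoder. The modality-specific encoders connect to the LLM backbone through lightweight projectors.

Figure 2. Model architecture of NVIDIA Nemotron 3 Nano Omni 30B-A3B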

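A hybrid Mamba-Transformer-MoE backbone for long multimodal context

The backbone interleaves three key components: 23 Mamba selective state-space layers for efficient long-context processing; 23 MoE layers for conditional capacity, with 128 experts, top-6 routing, and one shared expert; and 6 grouped-query attention layers that preserve strong global interaction and expressiveness.

By combining state-space models, attention, and MoE in a unified design, Nemotron 3 Nano Omni keeps strong reasoning performance while remaining suitable for long multimodal contexts.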
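
To make the "128 experts, top-6 routing, one shared expert" description concrete, here is a minimal sketch of how such a layer can combine expert outputs for a single token. The expert count and top-k come from the text; everything else is an assumption made for illustration: the experts are toy scalar functions, the gate weights are a softmax over the selected logits only, and the shared expert is simply added to the routed sum. The actual gating and normalization used in the model are not described here.

```python
import math
import random

NUM_EXPERTS = 128  # from the text
TOP_K = 6          # from the text


def route(logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their logits.

    Assumption: normalization is done over the selected experts only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, logits, experts, shared_expert):
    """Shared expert always runs; routed experts are weighted by their gates."""
    out = shared_expert(x)
    for idx, weight in route(logits):
        out += weight * experts[idx](x)
    return out


rng = random.Random(0)
# Toy experts: each one just scales its input by a fixed random factor.
experts = [lambda x, s=rng.uniform(-1, 1): s * x for _ in range(NUM_EXPERTS)]
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in router scores

selected = route(logits)
print(len(selected), round(sum(w for _, w in selected), 6))  # 6 1.0
print(moe_forward(2.0, logits, experts, shared_expert=lambda x: 0.5 * x))
```

Only 6 of the 128 routed experts run for a given token, which is what lets total capacity grow without a matching growth in per-token compute.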

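Dynamic resolution for dense documents, charts, and screens

On the vision side, Nemotron 3 Nano Omni replaces the tiling strategy of the v2 model with dynamic-resolution processing at the native aspect ratio. Each image is represented by a variable number of 16 x 16 patches, from a minimum of 1,024 to a maximum of 13,312 visual patches per image. For square images, this corresponds to 512 x 512 and about 1840 x 1840 pixels, respectively.

This flexibility is crucial for high-resolution, complex visual inputs such as OCR-heavy documents, financial tables, slides, research figures, screenshots, and GUI layouts, especially when both fine detail and overall structure must be understood together.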
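
The patch limits and square-image sizes quoted above can be checked with a little arithmetic. The sketch below only counts 16 x 16 patches for a given image size; how the model actually resizes or pads non-square images to fit the budget is not described here, so the helper names and the ceiling-based rounding are assumptions.

```python
import math

PATCH = 16            # patch side in pixels, from the text
MIN_PATCHES = 1_024   # per-image minimum, from the text
MAX_PATCHES = 13_312  # per-image maximum, from the text


def patch_count(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patch x patch tiles needed to cover a width x height image."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def largest_square_side(max_patches: int, patch: int = PATCH) -> int:
    """Side in pixels of the largest square image that fits the patch budget."""
    return math.isqrt(max_patches) * patch


print(patch_count(512, 512))             # 1024
print(largest_square_side(MAX_PATCHES))  # 1840
print(patch_count(1840, 1840))           # 13225
```

A 1840 x 1840 image uses 13,225 patches, the largest square that stays within the 13,312-patch budget.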

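Conv3D temporal compression for video

For video, Nemotron 3 Nano Omni uses a dedicated Conv3D tubelet embedding path. Instead of embedding each frame independently, it fuses each pair of consecutive frames into a single “tubelet” before the ViT, which halves the number of vision tokens the language model has to attend to. This lets us double the number of frames at the same token budget, or halve the token count at the same number of frames.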
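
The token-budget trade-off described above can be sketched as simple counting. The tubelet size of 2 comes from the text; the 256 tokens per frame used in the example is a hypothetical value, and rounding an odd frame count up to a full tubelet is an assumption.

```python
def vision_tokens(num_frames: int, tokens_per_frame: int, tubelet: int = 2) -> int:
    """Vision tokens produced when `tubelet` consecutive frames share one embedding.

    Assumption: a trailing unpaired frame still occupies one full tubelet.
    """
    tubelets = -(-num_frames // tubelet)  # ceiling division
    return tubelets * tokens_per_frame


def max_frames(token_budget: int, tokens_per_frame: int, tubelet: int = 2) -> int:
    """Largest number of frames that fits in a fixed vision-token budget."""
    return (token_budget // tokens_per_frame) * tubelet


TOKENS_PER_FRAME = 256  # hypothetical value for illustration

print(vision_tokens(64, TOKENS_PER_FRAME, tubelet=1))  # 16384 (per-frame embedding)
print(vision_tokens(64, TOKENS_PER_FRAME, tubelet=2))  # 8192  (paired frames)
print(max_frames(16_384, TOKENS_PER_FRAME, tubelet=1))  # 64
print(max_frames(16_384, TOKENS_PER_FRAME, tubelet=2))  # 128
```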

EVS — Efficient Video Sampling

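EVS is an important feature that is enabled at inference time and drops redundant video tokens after the vision encoder. It reduces latency and increases throughput while preserving accuracy. The first frame of a video is kept in full; for each subsequent frame, EVS keeps the “dynamic” tokens that changed and drops the “static” tokens that did not change relative to the previous frame. We combine it with Conv3D for better compression: Conv3D fuses the tokens of each frame pair into one, and EVS then prunes the redundant static information.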
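
The keep/drop rule described above can be illustrated with a toy mask computation. This is a sketch under stated assumptions, not the EVS implementation: the change measure (mean absolute difference against the token at the same position in the previous frame) and the `threshold` parameter are both hypothetical.

```python
def evs_keep_masks(frames, threshold=0.1):
    """Return one boolean mask per frame marking which tokens to keep.

    frames: list of frames, each a list of token vectors (lists of floats).
    The first frame is kept in full; later frames keep only tokens whose
    mean absolute change from the previous frame exceeds `threshold`.
    """
    masks = [[True] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        mask = []
        for p, c in zip(prev, cur):
            change = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
            mask.append(change > threshold)
        masks.append(mask)
    return masks


frame0 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
frame1 = [[0.0, 0.0], [1.0, 1.0], [2.5, 2.5], [3.0, 3.0]]  # token 2 changed
frame2 = [[1.0, 0.0], [1.0, 1.0], [2.5, 2.5], [3.0, 3.0]]  # token 0 changed

masks = evs_keep_masks([frame0, frame1, frame2])
kept = sum(sum(m) for m in masks)
print(masks[1])   # [False, False, True, False]
print(masks[2])   # [True, False, False, False]
print(kept, "of", 3 * 4, "tokens kept")  # 6 of 12 tokens kept
```

In a mostly static clip such as a screen recording, most positions stay unchanged from frame to frame, so the bulk of tokens after the first frame would be dropped.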

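Native audio input, not just text transcripts

The audio side is powered by Parakeet-TDT-0.6B-v2, which connects to the backbone through its own 2-layer MLP projector. Audio is sampled at 16 kHz, and the model is trained on inputs of up to 1,200 seconds (20 minutes), while the LLM's maximum context length supports 5+ hours.

This marks a shift from the traditional VLM pipeline: the model processes audio natively within a shared multimodal sequence, so audio, vision, and text tokens can be modeled jointly. This matters for narrated screen recordings, video question answering where speech changes the meaning of the visuals, long-form instructional or meeting content, and multimodal reasoning tasks that require temporal grounding.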
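
For a sense of scale, the audio figures above translate into raw sample counts as follows. This is plain arithmetic on the quoted numbers; the number of audio tokens the encoder emits per second is not given here, so no token counts are derived.

```python
SAMPLE_RATE_HZ = 16_000     # from the text
TRAIN_MAX_SECONDS = 1_200   # from the text (20 minutes)
CONTEXT_HOURS = 5           # lower bound of the "5+ hours" quoted in the text


def num_samples(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw waveform samples in a clip of the given duration."""
    return round(seconds * sample_rate_hz)


print(num_samples(TRAIN_MAX_SECONDS))             # 19200000
print(CONTEXT_HOURS * 3600 // TRAIN_MAX_SECONDS)  # 15
```

The context window therefore spans at least 15 times the longest audio clip seen during training.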

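Lightweight modality projectors and unified token interleaving

Each encoder connects to the LLM through a lightweight 2-layer MLP projector that maps encoder features into a shared embedding space. After projection, vision, audio, and text tokens are interleaved and processed jointly.

This design keeps the overall system modular while still enabling true cross-modal reasoning inside the backbone.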
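
The projector-then-interleave flow can be sketched in a few lines of plain Python. All dimensions, the random weights, the ReLU activation, and the ordering of segments in the final sequence are assumptions for illustration; the text only states that each projector is a 2-layer MLP and that the projected tokens are interleaved with text tokens.

```python
import random


def linear(x, w, b):
    """y = W x + b for one vector x; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def make_projector(d_in, d_hidden, d_out, seed):
    """Build a 2-layer MLP (linear, ReLU, linear) with random toy weights."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_out)]
    b1, b2 = [0.0] * d_hidden, [0.0] * d_out
    return lambda x: linear(relu(linear(x, w1, b1)), w2, b2)


D_MODEL = 8  # hypothetical shared embedding width
project_vision = make_projector(12, 16, D_MODEL, seed=0)  # hypothetical sizes
project_audio = make_projector(10, 16, D_MODEL, seed=1)   # hypothetical sizes

vision_feats = [[0.1] * 12 for _ in range(3)]  # 3 vision encoder outputs
audio_feats = [[0.2] * 10 for _ in range(2)]   # 2 audio encoder outputs
text_embeds = [[0.0] * D_MODEL for _ in range(4)]

# One possible interleaving: text prefix, vision, audio, text suffix.
sequence = (
    text_embeds[:2]
    + [project_vision(f) for f in vision_feats]
    + [project_audio(f) for f in audio_feats]
    + text_embeds[2:]
)
print(len(sequence), {len(tok) for tok in sequence})  # 9 {8}
```

Because every token ends up with the same width, the backbone can process the mixed sequence without knowing which encoder produced each position.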

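Training data, infrastructure, and systems

The SFT stages were trained on NVIDIA H100 GPUs, scaling from 32 to 128 nodes depending on the stage. The stack uses Megatron-LM, Transformer Engine, and Megatron Energon, with tensor parallelism, expert parallelism, sequence parallelism, context parallelism for the long-context stages, online sequence packing, and selective activation recomputation.

Post-SFT reinforcement learning uses NeMo-RL and NeMo Gym with a Megatron backend. The RL infrastructure uses a Ray-based distributed setup across B200 and H100 clusters, and adds multimodal deduplication so that repeated rollouts do not multiply image, video, and audio memory.

We have open-sourced significant parts of the training code.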

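Shaping reliable multimodal behavior with RL

We introduce multi-environment text and omni training in Nemotron 3 Nano Omni. Our text RL stage runs across diverse environments in NeMo Gym that evaluate the model's ability to carry out sequences of actions, such as tool calling, writing code, and multi-part planning that meets verifiable criteria.

Omni RL trains the model to reason across images, video, audio, and text within a unified framework, covering tasks from single-modality to fully multimodal scenarios. A diverse verifier suite scores outputs in many formats, such as multiple choice, math, GUI grounding, and ASR, and deliberately includes unanswerable cases to teach the model to abstain when evidence is insufficient rather than hallucinate.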
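
As a rough illustration of how a verifier can reward abstention on unanswerable cases, here is a minimal exact-match reward sketch. It is not the verifier suite used in training: the `reward` function, the normalization rules, and the use of the string "Not answerable" as the abstention marker are assumptions for illustration.

```python
from typing import Optional

ABSTAIN = "not answerable"  # hypothetical abstention marker


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding spaces and periods."""
    return " ".join(text.lower().split()).strip(" .")


def reward(response: str, gold: Optional[str]) -> float:
    """Score one response; gold=None marks a deliberately unanswerable item."""
    answer = normalize(response)
    if gold is None:
        # Unanswerable item: only an explicit abstention earns reward.
        return 1.0 if answer == ABSTAIN else 0.0
    # Answerable item: exact match after normalization; abstaining scores 0.
    return 1.0 if answer == normalize(gold) else 0.0


print(reward("B", "b"))                 # 1.0
print(reward("Not answerable.", None))  # 1.0
print(reward("Paris", None))            # 0.0
print(reward("Not answerable", "42"))   # 0.0
```

Scoring an abstention as wrong on answerable items keeps the model from learning to refuse by default.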

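Data and data pipeline

Nemotron 3 Nano Omni is trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities. We significantly expanded task coverage and introduced synthetic data for complex reasoning scenarios where public datasets are limited. To do this, we built task-specific, multi-stage pipelines for scalable synthetic data generation.

For example, we used NeMo Data Designer to generate about 11.4M synthetic QA pairs (about 45B tokens) from a large corpus of real-world PDFs. This dataset was used to strengthen long-context document reasoning during post-training and improved overall accuracy on MMLongBench-Doc by 2.19×.

We describe the evolution of the full pipeline, including failure analysis and key lessons, in the Data Designer developer notes. The notes also include 9 runnable pipeline recipes that can serve as a starting point for building your own document understanding datasets.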
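
The two figures quoted for the synthetic QA set imply a rough average size per example. Both inputs are approximate, so the result is only an order-of-magnitude estimate, and whether the token count covers document context as well as the question and answer is not stated.

```python
QA_PAIRS = 11.4e6    # approximate, from the text
TOTAL_TOKENS = 45e9  # approximate, from the text

avg_tokens_per_pair = TOTAL_TOKENS / QA_PAIRS
print(round(avg_tokens_per_pair))  # 3947
```

An average of roughly 4K tokens per pair is consistent with examples that carry long document context and not just a short question and answer.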

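Example workflows

Example 1: Long, multi-page document analysis

Nemotron 3 Nano Omni can analyze and reason over long documents such as financial reports, academic papers, and product manuals. The example below retrieves financial metrics from a 100+ page document to compute another metric:

Figure 3. A stylized example from the MMLongBench-Doc benchmark

The prompt given to the model is: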

Extract information from all the given images, then answer the question using a single word or phrase. Return 'Not answerable' if the answer cannot be derived from the images.

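The model handles long-context retrieval, structured extraction, table/chart reading, and multi-page reasoning in a single pass.

Example 2: Video + audio understanding

Nemotron 3 Nano Omni can perform joint audio-video analysis, both locally on specific scenes and globally across an entire video. This lets it answer complex questions that require cross-modal reasoning, such as identifying the specific visual content on screen at the moment a topic is mentioned in the audio.

Video

Question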

Watch the video and listen to the narration.
1. What structure is on fire as shown in the video and how much money was being spent in its renovation project?
2. Describe in short what visuals are shown when the eye-witness narrates her experience?

Answer

1. The structure on fire is the Notre Dame Cathedral, which was undergoing an almost $7 million renovation project.

2. When the eye-witness narrates her experience, the visuals show the cathedral engulfed in flames and thick smoke, scaffolding surrounding the burning roof, firefighters using ladders to spray water, a crowd of people watching from the riverbank, and views of the burning structure against the Paris skyline.

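Example 3: Agentic computer use / screenshot reasoning

Nemotron 3 Nano Omni can be integrated into agentic computer-use systems to reason about user intent, analyze GUI elements, and execute actions to complete tasks.

The model was given access to a web environment with the following system prompt: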

You are a GUI agent. You are given an instruction, a screenshot of the screen and your previous interactions with the computer. You need to perform a series of actions to complete the task. The password of the computer is {password}.

For each step, provide your response in this format:
{thought}
## Action:
{action}
## Code:
{code}

In the code section, the code should be either pyautogui code or one of the following functions wrapped in the code block:
- {"name": "computer.wait", "description": "Make the computer wait for 20 seconds for installation, running code, etc.", "parameters": {"type": "object", "properties": {}, "required": []}}
- {"name": "computer.terminate", "description": "Terminate the current task and report its completion status", "parameters": {"type": "object", "properties": {"status": {"type": "string", "enum": ["success", "failure"], "description": "The status of the task"}, "answer": {"type": "string", "description": "The answer of the task"}}, "required": ["status"]}}
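
An agent harness built around this prompt has to split each model response into its thought, action, and code parts before anything can be executed. The sketch below shows one way to do that with the standard library. The `AgentStep` type and `parse_step` function are hypothetical names, and the handling of an optional fenced code block is an assumption about how the model wraps its code.

```python
import re
from typing import NamedTuple

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


class AgentStep(NamedTuple):
    thought: str
    action: str
    code: str


_STEP = re.compile(
    r"^(?P<thought>.*?)^## Action:\s*(?P<action>.*?)^## Code:\s*(?P<code>.*)\Z",
    re.DOTALL | re.MULTILINE,
)
_FENCED = re.compile(
    re.escape(FENCE) + r"[\w-]*\n(.*?)\n?" + re.escape(FENCE), re.DOTALL
)


def parse_step(response: str) -> AgentStep:
    """Split a response in the '{thought} / ## Action: / ## Code:' format."""
    match = _STEP.match(response)
    if match is None:
        raise ValueError("response does not follow the expected format")
    code = match["code"].strip()
    fenced = _FENCED.fullmatch(code)
    if fenced:  # unwrap a fenced code block if the model emitted one
        code = fenced.group(1).strip()
    return AgentStep(match["thought"].strip(), match["action"].strip(), code)


example = (
    "The eligibility page is likely under the licenses menu.\n"
    "## Action:\n"
    "Click the License & IDs tab.\n"
    "## Code:\n"
    f"{FENCE}python\npyautogui.click(100, 200)\n{FENCE}"
)
step = parse_step(example)
print(step.action)  # Click the License & IDs tab.
print(step.code)    # pyautogui.click(100, 200)
```

The parsed code string would then be either executed as pyautogui code or matched against the `computer.wait` and `computer.terminate` functions listed in the prompt.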

User instruction

Find the Driver License Eligibility Requirements

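Model response:

The GIF above shows the actions the agent took to fulfill the user's request:

It clicks the “License & IDs” tab
Then it clicks “Driver’s License” in the dropdown menu
Then it clicks “Driver’s License Eligibility” on the left
It scrolls down the page, then back up
Finally, it generates the following response for the user: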

Successfully found the Driver License Eligibility Requirements page on the Virginia DMV website. The page contains comprehensive information about driver license eligibility including:

**General Requirements:**
- Must be a resident of the Commonwealth of Virginia
- Must be at least 16 years and 3 months of age

**First Time Drivers:**
- Must complete a state-approved driver education program
- Must obtain a learner's permit at least 60 days before taking the road skills test
- Must be at least 15 years and 6 months of age for a learner's permit
- No exceptions to age requirements

**New Residents:**
- Temporary residents: must obtain license within 6 months
- Permanent residents: must obtain license within 1 year

**Military Personnel:**
- Active-duty members of the Armed Forces stationed in Virginia
- Spouses and dependents 16 years and 3 months or older may drive with valid out-of-state license
- If vehicle is co-owned and out-of-state plates are used, all co-owners must be active duty members

**Out-of-State Students:**
- Full-time students enrolled in accredited Virginia schools (not employed) may drive with valid out-of-state license
- Full-time students who are employed are considered Virginia residents for vehicle laws

**Non-Residents:**
- Non-residents temporarily living in Virginia may drive with home state license for no more than 6 months
- Note: This does not apply to commercial vehicle drivers
- If becoming gainfully employed, must hold Virginia driver's license

**Identification and Residency Requirements:**
- Proof of identity (acceptable documents listed in DMV 141)
- Proof of legal presence in the U.S.
- Proof of Virginia residency
- Proof of Social Security number
- Proof of name change (if applicable)

Additional resources available: interactive document guide, "Acceptable Documents for Obtaining a Driver's License or Photo ID Card" (DMV 141), and Spanish language version.

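Example 4: Mixed document + chart + narration reasoning

Nemotron 3 Nano Omni can analyze charts, figures shown in images, and audio files to identify what the media have in common and where they differ.

Image

Audio

Question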

Use both the slide content and the spoken commentary.
List out any extra points mentioned by the speaker that is not shown in the slide

Answer

Based on the speaker's commentary and the visual content of the slide, here are the extra points mentioned that are not explicitly shown or detailed in the slide:

*   **The Topic:** The speaker explicitly states she will be covering **multitask learning** with neural networks, while the slide title is simply "Some notation" (though it contains the mathematical definitions for tasks).
*   **Nature of the Output:** She describes the output ($y$) as a **probability distribution** over possible labels (e.g., tiger, tiger cat, lynx, cat), whereas the slide only shows a bar chart with class labels without explicitly labeling the vector as a distribution.
*   **Specific Task Example (Paper Titles):** She introduces a concrete example where the input is the **title of a paper** and the neural network is tasked with estimating the **length of the paper**. The slide shows "length of paper" near the bar chart, but does not visually distinguish how a paper title maps to these bars versus representing animal classes.
*   **Optimization Techniques:** She mentions that the loss function is typically optimized using **stochastic gradient descent** or a **"favorite optimizer,"** which are procedural steps not written on the slide (which only defines the loss function $\mathcal{L}$ to minimize).
*   **Input Diversity:** She specifies that inputs can be a **"wide range of unstructured inputs,"** specifically mentioning **images** and **text**. The slide visually displays a single image (a structured input) and uses a generic variable $x$ for the input class.

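Example 5: Soundscapes

Nemotron 3 Nano Omni can identify what is producing the sounds in an audio clip and reason about the acoustic environment in which it was recorded.

Audio

Question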

What is producing the main sound in this audio, and what does the acoustic environment suggest about where it was recorded?

Answer

The main sound is a bird chirping, and the presence of multiple bird calls and a natural, open acoustic environment suggests the recording was made outdoors in a natural setting like a forest or park.

Example 6: Music

Nemotron 3 Nano Omni can interpret and reason about general audio, including music, soundscapes, and ambient sounds.

Audio

Question

What's the overall vibe of this song, and what kind of scenario would it complement well?

Answer

The song has a calm, reflective, and slightly melancholic vibe, with a gentle piano melody and soft string accompaniment. It would complement a quiet, introspective scene such as a rainy evening, a peaceful walk, or a moment of personal reflection.

Get started with Nemotron 3 Nano Omni

Hugging Face BF16 checkpoint	`https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`
Hugging Face FP8 checkpoint	`https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`
Hugging Face NVFP4 checkpoint	`https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`
Technical report / PDF	`https://arxiv.org/abs/2604.24954`
Dataset / collection release	`https://huggingface.co/datasets/nvidia/Nemotron-Image-Training-v3`
Megatron-Bridge	`https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/vlm/nemotron_3_omni`
NeMo-RL	`https://github.com/NVIDIA-NeMo/RL/blob/nano-v3-omni/docs/guides/nemotron-3-nano-omni.md`
NeMo Data Designer SDG recipe	`https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/docs/assets/recipes/vlm_long_doc`

References

NVIDIA Nemotron Nano V2 VL. Technical report: https://arxiv.org/abs/2511.03929
NVIDIA Nemotron 3: Efficient and Open Intelligence. Technical report: https://arxiv.org/abs/2512.20856
C-RADIOv4-H. Hugging Face model page: https://huggingface.co/nvidia/C-RADIOv4-H
Parakeet-TDT-0.6B-v3. Hugging Face model page: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
Megatron-LM. GitHub: https://github.com/NVIDIA/Megatron-LM
Transformer Engine. GitHub: https://github.com/NVIDIA/TransformerEngine
Megatron Energon. GitHub: https://github.com/NVIDIA/Megatron-Energon

Translated from huggingface-blog · Recorded May 3, 2026