berkeley-bair

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

May 9, 2026 · English original

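Summary

This article surveys Adaptive Parallel Reasoning. It compares Self-consistency, Tree/Graph of Thoughts, MCTS, ParaThinker, GroupThink, Hogwild, and related methods, with a focus on the inference design, KV cache handling, training rewards, and evaluation metrics of Multiverse, ThreadWeaver, Parallel-R1, and NPR. It closes with open questions around accuracy, latency, stability, hardware-awareness, and recursive parallelism.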

Overview of adaptive parallel reasoning.

What if a reasoning model could decide for itself, based on the problem at hand, when to decompose and parallelize independent subtasks, how many concurrent threads to spawn, and how to coordinate them? This post gives a detailed analysis of recent progress in parallel reasoning, with a focus on Adaptive Parallel Reasoning.

Disclosure: this post is part landscape survey and part perspective piece on adaptive parallel reasoning. One of the authors, Tony Lian, co-led ThreadWeaver (Lian et al., 2025), one of the methods discussed below. The authors aim to present each method on its own terms.

Motivation

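Recent progress in LLM reasoning capabilities has been driven in large part by inference-time scaling, alongside data scaling and parameter scaling (OpenAI et al., 2024; DeepSeek-AI et al., 2025). Models that explicitly output reasoning tokens (through intermediate steps, backtracking, and exploration) now dominate math, coding, and agentic benchmarks. These behaviors let models explore alternative hypotheses, correct earlier mistakes, and synthesize conclusions rather than committing directly to a single solution (Wen et al., 2025).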

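The problem is that the cost of sequential reasoning grows linearly with the amount of exploration. Scaling sequential reasoning tokens comes at a price, because models may exceed their effective context limits (Hsieh et al., 2024). As intermediate exploration paths accumulate, the model has a harder time disambiguating among distractors when attending to information in its context, which degrades performance, a phenomenon known as context-rot (Hong, Troynikov and Huber, 2025). Latency also grows in proportion to reasoning length. For complex tasks that need millions of tokens of exploration and planning, it is not uncommon for users to wait tens of minutes or even hours for an answer (Qu et al., 2025). As we keep scaling along the output sequence length dimension, inference becomes slower, less reliable, and more compute-intensive.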

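Parallel reasoning emerges as a natural solution. Instead of exploring paths sequentially (Gandhi et al., 2024) and accumulating context at every step, the model can explore multiple threads independently (threads do not rely on each other's context) and concurrently (threads can execute at the same time).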

Figure 1: Sequential vs. Parallel Reasoning

Over the past few years, a growing body of work has explored this idea in synthetic settings (for example, the Countdown game (Katz, Kokel and Sreedharan, 2025)), real-world math problems, and general reasoning tasks.

From Fixed Parallelism to Adaptive Control

Existing methods show that parallel reasoning can help, but most of them still decide the parallel structure outside the model rather than letting the model choose.

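Simple fork-and-join.

Self-consistency / Majority Voting: independently sample multiple complete reasoning traces, extract the final answer from each, and return the most common one (Wang et al., 2023).

Best-of-N (BoN): similar to self-consistency, but uses a trained verifier to select the best solution instead of majority voting (Stiennon et al., 2022).

Although these methods are simple to implement, they often incur redundant computation across branches because each branch is sampled independently.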
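As a minimal sketch of the two aggregation rules above, the snippet below assumes the final answers have already been extracted from each sampled trace. The `score` callable stands in for a trained verifier and is purely hypothetical, as is the sample data.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most frequent final answer."""
    # Counter.most_common breaks ties by first occurrence in the input.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the answer that the (hypothetical) verifier scores highest."""
    return max(answers, key=score)


# Final answers extracted from five independently sampled traces (made-up data).
sampled = ["67", "67", "66", "67", "68"]
print(majority_vote(sampled))
# A toy scoring function used only to exercise best_of_n; a real verifier is a model.
print(best_of_n(sampled, score=lambda a: -abs(int(a) - 67)))
```

Both rules only look at finished traces, which is why neither can prevent two branches from doing the same work.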

Heuristic-based structured search.

Tree / Graph / Skeleton of Thoughts: a family of structured decomposition methods that explore multiple alternative "thoughts" using known search algorithms (BFS/DFS) and prune them via LLM-based evaluation (Yao et al., 2023; Besta et al., 2024; Ning et al., 2024).

Monte-Carlo Tree Search (MCTS): estimates node values by sampling random rollouts and expands the search tree with Upper Confidence Bound (UCB) style exploration-exploitation (Xie et al., 2024; Zhang et al., 2024).

These methods improve on simple fork-and-join by decomposing the task into non-overlapping subtasks; however, they require knowing the decomposition strategy in advance, which is not always possible.
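To make "UCB-style exploration-exploitation" concrete, here is a sketch of the textbook UCB1 selection rule. It is an illustration only: the cited MCTS works may use different variants, and the exploration constant `c` and the `(value_sum, visits)` node representation are assumptions made for this example.

```python
import math
from typing import List, Tuple


def ucb1_score(value_sum: float, visits: int, parent_visits: int,
               c: float = 1.4) -> float:
    """UCB1 score of a child node: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Tuple[float, int]], parent_visits: int) -> int:
    """Return the index of the child (value_sum, visits) with the highest score."""
    scores = [ucb1_score(v, n, parent_visits) for v, n in children]
    return scores.index(max(scores))
```

The bonus term shrinks as a child is visited more often, so search effort gradually shifts from rarely tried branches to branches with high estimated value.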

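Recent variants.

ParaThinker: trains a model to run in two fixed stages, first generating multiple reasoning threads in parallel and then synthesizing them. It introduces trainable control tokens and thought-specific positional embeddings, and uses a two-phase attention mask to enforce independence during reasoning and controlled integration during summarization (Wen et al., 2025).

GroupThink: multiple parallel reasoning threads can see each other's partial progress at the token level and adapt mid-generation. Unlike prior concurrent methods that run as independent requests, GroupThink runs a single LLM that produces multiple interdependent reasoning trajectories at once (Hsu et al., 2025).

Hogwild! Inference: multiple parallel reasoning threads share a KV cache and decide how to decompose the task without an explicit coordination protocol. Workers generate concurrently into a shared attention cache and use RoPE to stitch their KV blocks together in different orders without recomputation (Rodionov et al., 2025).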

Figure 2: Various Strategies for Parallel Reasoning

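The methods above share a limitation: whether to parallelize, how much to parallelize, and which search strategy to use are all imposed on the model, regardless of whether the problem actually benefits. Yet different problems call for different levels of parallelization, and matching that level is critical to making parallelization effective. For example, a framework that applies the same parallel structure to "What’s 25+42?" and "What’s the smallest planar region in which you can continuously rotate a unit-length line segment by 180°?" wastes compute on the former and likely uses the wrong decomposition strategy for the latter.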

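In the methods described above, the model is never taught this adaptive behavior. A natural question arises: what if the model could decide for itself, based on the problem at hand, when to parallelize, how many threads to spawn, and how to coordinate them?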

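Adaptive Parallel Reasoning (APR) answers this question by making parallelization part of the model's generated control flow. Formally, adaptivity is the model's ability to dynamically allocate compute between parallel and serial operations at inference time. In other words, a model with APR capability is taught to coordinate its own control flow: when to generate sequences sequentially and when to generate them in parallel. Note that the concept of adaptive parallel reasoning was introduced in Learning Adaptive Parallel Reasoning with Language Models (Pan et al., 2025), but it is a paradigm rather than a specific method. In this post, APR refers to the paradigm, while "the APR method" refers to the specific instantiation in Pan et al. (2025).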

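This shift matters for three reasons.

Compared with Tree-of-Thoughts, APR does not need domain-specific decomposition heuristics. During RL, the model learns general decomposition strategies through trial and error. In fact, models discover useful parallelization patterns in an emergent way, such as self-verifying the previous step while carrying out the next one, or hedging the main approach with an alternative; such patterns are hard to design by hand (Yao et al., 2023; Wu et al., 2025; Zheng et al., 2025).

Compared with BoN, APR avoids redundant computation. APR models control what each parallel thread will do before branching. APR can therefore learn to first produce a set of distinct, non-overlapping subtasks and then assign them to independent threads (Wang et al., 2023; Stiennon et al., 2022; Pan et al., 2025; Yang et al., 2025).

Compared with non-adaptive approaches, APR can choose not to parallelize. Adaptive models can adjust the degree of parallelization to match the complexity of the problem as well as the complexity and overhead of parallelizing it (Lian et al., 2025).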

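In practice, this is implemented by having the model emit special tokens that control when to reason in parallel and when to reason sequentially. Below is a condensed ThreadWeaver-style trace: a single block containing two outlines and two paths, after which the threads agree on one boxed answer.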

Figure 3: Example of an Adaptive Parallel Reasoning Trajectory from ThreadWeaver, manually condensed for ease of illustration.

Figure 4: Special Tokens Variants across Adaptive Parallel Reasoning Papers

Inference Systems for Adaptive Parallelism

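How do we actually execute parallel branches? Here the field borrows from computer systems, especially multithreading and multiprocessing. Most of the related work can be seen as using a fork-join design. At inference time, we are effectively asking the model to perform a map-reduce operation:

Fork: decompose the problem into subtasks/threads and process them concurrently

Join: synthesize them into a final answer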

Figure 5: Fork-join Inference Design

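Concretely, the model first encounters a list of subtasks. It then prefills each subtask and sends each one to the inference engine as an independent request. The threads decode concurrently until they hit an end token or exceed the max length. The process blocks until all threads have finished decoding, and then aggregates the results. This pattern is common across adaptive parallel reasoning methods.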
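The fork-join loop described above can be sketched on the client side as follows. This is an illustration under stated assumptions, not any paper's implementation: `generate` is a hypothetical callable wrapping an inference-engine request, the `[subtask]` and `[conclusion]` markers are invented for the example, and the join here simply concatenates text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fork_join(prompt: str,
              subtasks: List[str],
              generate: Callable[[str], str]) -> str:
    """Run one request per subtask concurrently, wait for all, then join."""
    # Fork: every thread shares the prompt and sees only its own subtask.
    requests = [f"{prompt}\n[subtask] {s}\n" for s in subtasks]
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        # pool.map preserves input order and blocks until every thread is done.
        threads = list(pool.map(generate, requests))
    # Join: condition the final generation on the prompt and all thread outputs.
    joined = prompt + "\n" + "\n".join(threads) + "\n[conclusion] "
    return generate(joined)


def fake_generate(text: str) -> str:
    """Stand-in for an inference-engine call; it only echoes the last line."""
    return "done(" + text.strip().splitlines()[-1] + ")"


print(fork_join("Q", ["a", "b"], fake_generate))
```

The blocking join means the slowest thread determines when the conclusion can start.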

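However, the aggregation stage raises a problem: content generated in the branches cannot easily be aggregated at the KV cache level. Tokens in independent threads start from the same position IDs, which causes the encodings to overlap and produces non-standard behavior when the KV caches are merged back. Similarly, because independent threads do not attend to each other, their concatenated KV cache forms a non-causal attention pattern that the base model never saw during training.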
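A small illustration of the position ID overlap, assuming every thread continues right after a shared prefix of `prefix_len` tokens. The function name and the numbers are made up for the example.

```python
from typing import List


def thread_position_ids(prefix_len: int, thread_lens: List[int]) -> List[List[int]]:
    """Position IDs each thread uses while decoding independently."""
    # Every thread continues right after the shared prefix, so all start at prefix_len.
    return [list(range(prefix_len, prefix_len + n)) for n in thread_lens]


ids = thread_position_ids(prefix_len=4, thread_lens=[3, 2])
naive_concat = [p for thread in ids for p in thread]
print(naive_concat)  # [4, 5, 6, 4, 5]
# Positions repeat, unlike the strictly increasing IDs of an ordinary sequence.
print(len(set(naive_concat)) < len(naive_concat))
```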

To address this, the field has split into two camps on how to carry out aggregation, differing in whether they modify the inference engine or work around it.

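Multiverse modifies the inference engine to reuse the KV cache at the join stage. Before looking at the memory management in Multiverse (Yang et al., 2025), it helps to understand how the KV cache is handled before the "join" stage. Note that every independent thread shares a prefix sequence, namely the list of subtasks. Without optimization, each thread would have to prefill and recompute the KV cache for that prefix. This redundancy can be avoided with SGLang's RadixAttention (Sheng et al., 2023). RadixAttention organizes multiple requests into a radix tree, a kind of trie (prefix tree) that holds element sequences of varying length rather than single elements. As a result, the only new KV cache entries come from what each independent thread generates.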
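The effect of prefix sharing can be sketched with a toy cache. This is a simplification, not SGLang's implementation: it uses a plain per-token trie instead of a radix tree with variable-length sequences, it has no eviction, and the class name and token IDs are hypothetical.

```python
from typing import Dict, List


class PrefixCache:
    """Toy per-token trie; each node stands for one cached KV entry."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> int:
        """Insert a request and return how many tokens needed new KV entries."""
        node, new_entries = self.root, 0
        for tok in tokens:
            if tok not in node:
                node[tok] = {}
                new_entries += 1
            node = node[tok]
        return new_entries


cache = PrefixCache()
prefix = [1, 2, 3, 4]                        # shared prompt + subtask list
print(cache.insert(prefix + [10, 11]))       # first thread: all 6 tokens are new
print(cache.insert(prefix + [20, 21, 22]))   # second thread: only its 3 own tokens
```

After the first request, the shared prefix costs nothing for later threads, which matches the observation that new entries come only from each thread's own generation.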

Figure 6: RadixAttention’s KV Cache Management Strategy

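Now suppose everything goes well and all independent threads have returned from the inference engine. The goal becomes working out how to merge them back into a single sequence so that decoding can continue. It turns out the KV cache of these independent threads can be reused in the synthesis stage. Specifically, Multiverse (Yang et al., 2025), Parallel-R1 (Zheng et al., 2025), and NPR (Wu et al., 2025) modify the inference engine to copy the KV cache generated by each thread and edit the page table so that it stitches non-contiguous memory blocks into a single KV cache sequence. This avoids the redundant computation of a second prefill and reuses the existing KV cache as much as possible.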
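As a rough mental model of the page-table edit, consider a paged KV cache where each sequence is described by a list of physical block IDs. The sketch below models only that bookkeeping; it ignores the copying step, partially filled blocks, reference counting, and eviction, and none of the names come from Multiverse or any real engine.

```python
from typing import List


def stitch_block_tables(prefix_blocks: List[int],
                        thread_blocks: List[List[int]]) -> List[int]:
    """Build one logical block table from the prefix and every thread's blocks."""
    stitched = list(prefix_blocks)
    for blocks in thread_blocks:
        # Only block IDs are appended here; this toy model moves no KV data.
        stitched.extend(blocks)
    return stitched


# Physical blocks need not be contiguous or ordered (made-up IDs).
print(stitch_block_tables([0, 1], [[7, 8], [3]]))  # [0, 1, 7, 8, 3]
```

Because the stitched table points at blocks owned by earlier requests, anything that frees those blocks early invalidates it.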

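This has several major limitations, however. First, the approach requires modifying the inference engine to perform non-standard memory handling, which can lead to unexpected behavior. Specifically, because the synthesis request references KV cache from earlier requests, it makes the system fragile and opens the door to bad pointers. Another request may come in and evict the referenced KV cache before the synthesis request completes, forcing it to pause and trigger a re-prefill of the earlier thread requests. This issue led the Multiverse researchers (Yang et al., 2025) to limit the batch size the inference engine can handle, which in turn limits throughput.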

Figure 7: KV Cache “Stitching” During Multiverse Inference

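Second, the approach changes how the model sees the sequence, creating a distributional shift not seen in pretraining, so more extensive training is needed to align behavior. Specifically, stitching the KV cache this way creates a sequence with non-standard position encoding. During independent-thread generation, all threads start from the same position index and attend to the preceding subtasks, not to each other. When the threads are merged back, the resulting KV cache therefore has non-standard positional encoding and does not follow causal attention. To close this gap, Multiverse (Yang et al., 2025) and related work apply a modified attention mask during training that prevents independent threads from attending to each other, aligning training with inference behavior.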

Figure 8: Multiverse’s Attention Mask

Given the problems that non-standard KV cache management brings, can we try an approach that leaves the engine untouched?

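ThreadWeaver keeps the inference engine unchanged and moves orchestration to the client. ThreadWeaver (Lian et al., 2025) treats parallel inference purely as a client-side problem. The "fork" process is nearly identical to Multiverse's, but the join stage handles memory very differently because it does not touch engine internals. Instead, the client concatenates the text outputs of all independent branches into one contiguous sequence. The engine then runs a second prefill to produce the KV cache for the conclusion generation step. Although this reintroduces the computational redundancy that Multiverse tries to avoid, prefill costs significantly less than decoding. In addition, no special attention handling is needed at inference, because the second prefill uses causal attention (threads can see each other), which makes it easier to adapt sequential autoregressive models to the task.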

Figure 9: ThreadWeaver’s Prefill and Decode Strategy

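How should we train a model to learn this behavior? Naively, each parallel trajectory can be split into multiple sequential pieces that follow the inference pattern. For example, we would train the model to output subtasks given the prompt, individual threads given the prompt plus a subtask assignment, and the conclusion given the prompt, subtasks, and corresponding threads. But this looks redundant and compute-inefficient. Can we do better?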

Yes. As in ThreadWeaver (Lian et al., 2025), we can organize the parallel trajectory into a prefix tree (trie), flatten it into a single sequence, and apply an ancestor-only attention mask during training (not during inference!).

Figure 10: Building the Prefix-tree and Flattening into a single training sequence

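Specifically, we apply masking and position IDs that mimic inference behavior, so that each thread is conditioned only on the prompt and subtasks and never attends to sibling threads or the final conclusion. The engine-agnostic design makes adoption easy: there is no need to find a separate hosting method, and existing hardware infrastructure can be used as is. It also improves as existing inference engines improve. More importantly, an engine-agnostic approach makes it easy to serve a hybrid model that switches between sequential and parallel thinking modes.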
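The ancestor-only masking and position IDs described above can be sketched for a generic flattened prefix tree. This is a simplified illustration rather than ThreadWeaver's training code: segments are given only by their lengths and parent indices, they must be listed parents-first, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def flatten_with_ancestor_mask(
    seg_lens: List[int], parents: List[Optional[int]]
) -> Tuple[List[List[bool]], List[int]]:
    """Return (mask, position_ids) for trie segments laid out back to back.

    mask[i][j] is True when token i may attend to token j.
    """
    # Ancestor set (including itself) and starting position of each segment.
    ancestors: List[set] = []
    start_pos: List[int] = []
    for seg, parent in enumerate(parents):
        if parent is None:
            ancestors.append({seg})
            start_pos.append(0)
        else:
            ancestors.append(ancestors[parent] | {seg})
            start_pos.append(start_pos[parent] + seg_lens[parent])

    # Segment index and position ID of every token in the flattened sequence.
    seg_of: List[int] = []
    position_ids: List[int] = []
    for seg, n in enumerate(seg_lens):
        seg_of.extend([seg] * n)
        position_ids.extend(range(start_pos[seg], start_pos[seg] + n))

    total = len(seg_of)
    # Causal within a segment, and visible only along the path to the root.
    mask = [
        [j <= i and seg_of[j] in ancestors[seg_of[i]] for j in range(total)]
        for i in range(total)
    ]
    return mask, position_ids


# One shared prefix (3 tokens) with two sibling threads (2 tokens each).
mask, pos = flatten_with_ancestor_mask([3, 2, 2], [None, 0, 0])
print(pos)      # [0, 1, 2, 3, 4, 3, 4]
print(mask[5])  # first token of the second thread cannot see the first thread
```

In the example, both threads reuse positions 3 and 4, mirroring the fact that at inference each thread starts right after the shared prefix.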

Training Models to Use Parallelism

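Once the inference path exists, the next question is how to teach the model to use it. Demonstrations are necessary because the model must learn to emit the special tokens that orchestrate control flow. We found that the instruction-following ability of base models is not sufficient to generate parallel threads. This raises an interesting question: does SFT induce a fundamental reasoning capability for parallel execution that did not exist before, or does it merely align the model's existing pre-trained capabilities to a particular control-flow token syntax? The usual view is that SFT teaches new knowledge; contrary to that view, some papers, notably Parallel-R1 (Zheng et al., 2025) and NPR (Wu et al., 2025), argue that their SFT demonstrations only induce format following (that is, how to structure parallel requests). We leave this question to future work.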

Figure 11: Sources of Parallelization Demonstration Data

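Demonstrations teach the syntax of parallel control flow, but they do not fully solve the incentive problem. Ideally, we would reward only outcome accuracy, and once the model has learned via SFT to emit the special tokens, parallelization patterns would emerge naturally, much like the emergence of long CoT. However, researchers (Zheng et al., 2025) observed that this is not enough and that parallelization incentives are in fact needed. The question then becomes: how do we tell whether the model is parallelizing effectively?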

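Structure-only rewards are too easy to exploit. Naively, we could reward the number of threads spawned. But the model can hack that reward by spawning many short, useless threads, so this does not work. What about a binary reward for using the parallel structure correctly? That partly fixes thread spamming, but the model still learns to spawn threads when they are not needed. The Parallel-R1 authors (Zheng et al., 2025) introduced an alternating schedule that rewards parallel structure only 20% of the time; it raised the usage of parallel structure (13.6% → 63%) but had little effect on overall accuracy. With a structure-only approach, we may be drifting away from the original goals of higher accuracy and lower latency.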

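Can we optimize the Pareto frontier directly? Accuracy is simple: just look at the outcome. What about latency? Efficiency rewards need to track the critical path. For sequential-only trajectories, latency can be measured by the total number of generated tokens. To extend this to parallel trajectories, we can look at the critical path, the longest sequence of causally dependent tokens, because it directly determines end-to-end generation time (that is, wall-clock time). For example, with two parallel sections of five threads each, the critical path runs through the longest thread in the first parallel section, then through any sequential tokens, then through the longest thread in the second parallel section, and so on until the end of the sequence.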
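The critical path computation can be sketched as follows, assuming a trajectory is represented as a list whose items are either an integer (tokens in a sequential span) or a list of integers (token counts of the threads in one parallel section). This representation and the token counts are invented for the example.

```python
from typing import List, Union

Segment = Union[int, List[int]]


def critical_path_length(trajectory: List[Segment]) -> int:
    """Longest chain of causally dependent tokens."""
    length = 0
    for seg in trajectory:
        # A sequential span counts in full; a parallel section costs its longest thread.
        length += seg if isinstance(seg, int) else max(seg, default=0)
    return length


def total_tokens(trajectory: List[Segment]) -> int:
    """All generated tokens, counting every thread."""
    return sum(seg if isinstance(seg, int) else sum(seg) for seg in trajectory)


# Two parallel sections with five threads each, separated by sequential spans.
traj = [100, [300, 250, 400, 120, 80], 50, [200, 500, 90, 60, 310], 70]
print(critical_path_length(traj))  # 100 + 400 + 50 + 500 + 70 = 1120
print(total_tokens(traj))          # 2530
```

Here the model generates 2530 tokens in total, but a user only waits for the 1120 tokens on the critical path.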

Figure 12: Critical Path Length Illustration

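The goal is to minimize the length of the critical path. At the same time, we still want the model to spend tokens exploring threads in parallel. To combine the two goals, we can aim to make the critical path a smaller fraction of the total tokens spent. The ThreadWeaver authors (Lian et al., 2025) formulate the parallelization reward as $1 - L_{\mathrm{critical}} / L_{\mathrm{total}}$, which is 0 for a sequential trajectory and increases linearly as the critical path shrinks relative to the total tokens generated.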

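Parallel efficiency should be gated by correctness. Intuitively, when several trajectories are all correct, more reward should go to the ones that parallelize more efficiently. But what if they are all incorrect? Should any reward be assigned? Probably not. Formally, $R = R_{\mathrm{correctness}} + R_{\mathrm{parallel}}$. Assuming outcome correctness is binary, this can be written as $R = \mathbf{1}(\text{Correctness}) + \mathbf{1}(\text{Correctness}) \times (\text{some parallelization metric})$. The model thus receives the parallelization reward only when its answer is correct, because we do not want to impose parallelization constraints on a model that cannot answer the question correctly.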
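A minimal sketch of the correctness-gated reward, plugging the $1 - L_{\mathrm{critical}} / L_{\mathrm{total}}$ term in as the parallelization metric. This combination is one possible instance of the formula above, not the exact reward of any cited paper; the function names and the zero-length convention are assumptions.

```python
def parallel_reward(l_critical: int, l_total: int) -> float:
    """1 - L_critical / L_total; 0 for a purely sequential trajectory."""
    if l_total == 0:
        return 0.0  # assumption: an empty trajectory earns no parallel reward
    return 1.0 - l_critical / l_total


def total_reward(correct: bool, l_critical: int, l_total: int) -> float:
    """R = 1(correct) + 1(correct) * parallelization metric."""
    gate = 1.0 if correct else 0.0
    return gate + gate * parallel_reward(l_critical, l_total)


print(total_reward(True, 50, 200))   # 1.75: correct and well parallelized
print(total_reward(True, 100, 100))  # 1.0: correct but fully sequential
print(total_reward(False, 50, 200))  # 0.0: parallelism earns nothing when wrong
```

Because the gate multiplies the parallel term, an incorrect trajectory cannot recover reward by spawning threads.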

Figure 13: Differences in Reward Designs Across Adaptive Parallel Reasoning Works

Evaluation and Open Questions

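With all that said, how well do these adaptive parallel methods actually perform? That is hard to answer, because they differ in model choice and metrics. Model choice depends on the training method, SFT problem difficulty, and sequence length. When running SFT on hard datasets such as s1k, which contains graduate-level math and science problems, researchers chose a large base model (Multiverse uses Qwen2.5 32B (Yang et al., 2025)) to capture the complex reasoning structure behind the solution trajectories. When running RL, researchers chose small non-CoT instruct models (4B, 8B) because of compute cost constraints.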

Figure 14: Difference in Model Choice Across Adaptive Parallel Reasoning Papers

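Each paper also offers a slightly different reading of how adaptive parallel reasoning contributes to the field. They optimize for different theoretical objectives and therefore use slightly different sets of metrics: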

Multiverse and ThreadWeaver (Yang et al., 2025; Lian et al., 2025) aim to reach sequential-AR-model-level accuracy faster. Multiverse shows that APR models achieve higher accuracy under the same fixed context window, while ThreadWeaver shows that an APR model achieves shorter end-to-end token latency (critical path length) at comparable accuracy.

NPR (Wu et al., 2025) treats sequential fallback as a failure mode and optimizes for a 100% Genuine Parallelism Rate, a metric defined as the ratio of parallel tokens to total tokens.

Parallel-R1 (Zheng et al., 2025) does not focus on end-to-end latency; instead it optimizes for exploration diversity and presents APR as a mid-training exploration scaffold that delivers performance gains after RL.
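Of the metrics listed above, the ratio of parallel tokens to total tokens is the simplest to compute. The sketch below follows only the one-line definition given here; how NPR actually counts tokens may differ, and the trajectory representation (an integer for a sequential span, a list of integers for the threads of a parallel section) is an assumption made for the example.

```python
from typing import List, Union

Segment = Union[int, List[int]]


def parallel_token_ratio(trajectory: List[Segment]) -> float:
    """Fraction of generated tokens that sit inside parallel sections."""
    parallel = sum(sum(seg) for seg in trajectory if isinstance(seg, list))
    sequential = sum(seg for seg in trajectory if isinstance(seg, int))
    total = parallel + sequential
    return parallel / total if total else 0.0


print(parallel_token_ratio([100, [300, 100]]))  # 0.8
print(parallel_token_ratio([500]))              # 0.0, a sequential fallback
```

Under this definition, any sequential token lowers the rate, so a 100% target leaves no room for sequential spans at all.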

Open Questions

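Although Adaptive Parallel Reasoning is a promising step toward more efficient inference-time scaling, important open questions remain. As noted above, Parallel-R1 (Zheng et al., 2025) presents APR as a mid-training exploration scaffold rather than primarily an inference-time technique. This raises a more fundamental question: does inference-time parallelization consistently improve accuracy, or is it valuable mainly as a training-time exploration scaffold? Parallel-R1 suggests that the diversity induced by parallel structure during RL may matter more than parallelization at test time itself.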

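A related concern is stability. Models consistently tend to fall back to sequential reasoning when parallelization rewards are relaxed. The Parallel-R1 authors show that removing the parallelization reward after 200 steps causes the model to revert to sequential behavior. Is this a training stability issue, a reward signal design issue, or a sign that parallel structure genuinely conflicts with the model prior shaped by autoregressive pretraining?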

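Beyond whether APR works, deployment raises its own questions. Can we design training methods that account for the compute budget available at inference time, so that parallelization decisions are hardware-aware rather than driven by the problem alone?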

Finally, the parallel structures discussed above are all essentially flat. What happens if we allow parallelization depth > 1? Recursive language models (RLMs; Zhang, Kraska and Khattab, 2026) manage long context effectively and show promising inference-time scaling capabilities. How do RLMs perform when trained with end-to-end RL that incentivizes adaptive parallelization?

Acknowledgements

We thank Nicholas Tomlin and Alane Suhr for their helpful feedback. We thank Christopher Park, Karl Vilhelmsson, Nyx Iskandar, Georgia Zhou, Kaival Shah, and Jyoti Rani for their insightful suggestions. We thank Vijay Kethana, Jaewon Chang, Cameron Jordan, Syrielle Montariol, Erran Li, and Anya Ji for valuable discussions. We thank Jiayi Pan, Xiuyu Li, and Alex Zhang for constructive correspondence on Adaptive Parallel Reasoning and Recursive Language Models.

Translated from berkeley-bair · Recorded May 9, 2026