Hugging Face · Daily Papers

Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, et al. (18 authors)

May 3, 2026 · arXiv:2604.26779 · PDF

Abstract

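RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, which makes rollout acceleration a central systems challenge. Many existing efficiency methods raise throughput by changing the rollout or optimization regime, for example through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts: it leaves the target model's output distribution unchanged.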
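To make the "lossless" property concrete, here is a minimal sketch of the standard accept/reject rule used in speculative sampling, on a toy three-token vocabulary. It is not taken from the paper or from NeMo-RL; the function names `speculative_step` and `empirical_distribution` and the probability vectors are hypothetical, chosen only for illustration. A draft token is accepted with probability min(1, p/q), and on rejection a token is resampled from the normalized residual max(p - q, 0), which makes the returned token follow the target distribution p exactly.

```python
import random


def speculative_step(p, q, rng):
    """Return one token index distributed according to the target probabilities p.

    p: target-model probabilities over a toy vocabulary (hypothetical values).
    q: draft-model probabilities over the same vocabulary (all entries > 0 here).
    """
    vocab = range(len(p))
    # The draft model proposes a token.
    x = rng.choices(vocab, weights=q)[0]
    # The target model accepts the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights internally.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]


def empirical_distribution(p, q, n=100_000, seed=0):
    """Estimate the output distribution of speculative_step by sampling n times."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        counts[speculative_step(p, q, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]  # hypothetical target-model distribution
    draft = [0.2, 0.5, 0.3]   # hypothetical draft-model distribution
    print(empirical_distribution(target, draft))
```

The empirical frequencies should approach `target` even though every proposal comes from `draft`; a poorer draft lowers the acceptance rate (and hence the speedup) but does not change the samples' distribution.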

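We implement speculative decoding in NeMo-RL on top of a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. The benefit can be realized with several speculation mechanisms, such as pretrained MTP heads, small external draft models, or even techniques such as Eagle3 that are typically applied after the RL phase. This provides a path to deploying state-of-the-art speculative decoding within RL training.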

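On a reasoning post-training workload at 8B scale, speculative decoding improves rollout throughput by 1.8x in the synchronous RL setting. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to a 2.5x end-to-end training speedup at 235B scale.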
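A rollout throughput gain does not translate one-to-one into end-to-end training speedup. As a rough illustration, and assuming a synchronous step in which rollout and the remaining work run strictly one after another, an Amdahl's-law estimate relates the two. The function name `end_to_end_speedup` and the rollout fractions below are hypothetical; the abstract does not report what share of step time rollout occupies, and this estimate does not model the overlap that asynchronous RL introduces.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's-law estimate for a serial (synchronous) training step.

    rollout_fraction: assumed share of baseline step time spent in rollout (0..1).
    rollout_speedup:  factor by which rollout alone gets faster (e.g. 1.8).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0:
        raise ValueError("rollout_speedup must be positive")
    # Time not spent in rollout is unchanged; rollout time shrinks by the speedup.
    new_time = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical rollout shares, paired with the 1.8x rollout gain from the abstract.
    for fraction in (0.5, 0.7, 0.9):
        print(fraction, round(end_to_end_speedup(fraction, 1.8), 3))
```

Under this simple model the end-to-end gain approaches 1.8x only as rollout comes to dominate the step, which is consistent with the abstract's framing of rollout generation as the bottleneck.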