X · Researcher first-hand

@cwolferesearch Recently read a great overview of multi-teacher on-policy disti…

May 8, 2026 · English original

Summary

The post surveys multi-teacher on-policy distillation (MOPD): sample trajectories from the student, match the teacher's log probabilities with reverse KL, and optionally fold the objective into the GRPO loss. MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2, and DeepSeek-V4 use domain-specific teachers, checkpoints, or experts for distillation during post-training, in some cases with self-distillation or full vocabulary distillation.

Recently read a great overview of multi-teacher on-policy distillation (MOPD) and how it is used in recent LLMs such as MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2, and DeepSeek-V4…

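What is on-policy distillation (OPD)? The idea behind OPD is simple. We have a student and a teacher. We sample trajectories from the student, then train the student to match the teacher's log probability distribution on those trajectories, using reverse KL divergence as the objective. This training setup can be integrated into the GRPO loss by replacing the group-relative advantage with a reverse KL term.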
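A minimal sketch of the per-token signal described above, under stated assumptions. It uses the single-sample estimate of reverse KL: on a token sampled from the student, log p_student minus log p_teacher is an estimate of KL(student || teacher) at that position, so its negative can stand in for the advantage. The function names and toy probabilities are hypothetical, and the clipping and importance-ratio terms of the actual GRPO loss are left out.

```python
import math

def reverse_kl_advantages(student_logprobs, teacher_logprobs):
    """Per-token advantages for one student-sampled trajectory.

    Both inputs hold the log-probability of each *sampled* token.
    log p_student - log p_teacher is a single-sample reverse KL estimate,
    so we return its negative: positive where the teacher likes the token
    more than the student does, negative where it likes it less.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

def policy_gradient_loss(student_logprobs, advantages):
    """REINFORCE-style surrogate: -mean(advantage * log p_student).

    Advantages are treated as constants here; in an autodiff framework
    they would be detached from the graph.
    """
    n = len(student_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, student_logprobs)) / n

# Toy trajectory of three sampled tokens (probabilities are made up).
student_lp = [math.log(0.5), math.log(0.2), math.log(0.9)]
teacher_lp = [math.log(0.5), math.log(0.6), math.log(0.3)]

adv = reverse_kl_advantages(student_lp, teacher_lp)
loss = policy_gradient_loss(student_lp, adv)
```

In this toy case the first token gets zero advantage (student and teacher agree), the second is pushed up, and the third is pushed down.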

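Multi-teacher OPD (MOPD) extends this idea by bringing multiple teachers into OPD training. This is useful because RLVR training is domain-specific. If we train a model with RLVR on math, we may improve math performance but hurt the model's quality on creative tasks. Likewise, RL training on tool use data can degrade performance on general-purpose benchmarks. To address this seesaw problem, we can train domain-specific models with RL and use MOPD to distill them into a single student.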
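One way to picture the multi-teacher step is as a routing table from prompt domain to teacher. The sketch below is an assumption for illustration, not any specific model's pipeline: the domain labels, the `make_constant_teacher` stand-in, and the fallback to a "general" teacher are all hypothetical. A real teacher would be an LLM forward pass over the student's sampled tokens.

```python
import math
from typing import Callable, Dict, List

# A "teacher" is any callable that scores a student-sampled trajectory and
# returns one log-probability per token (hypothetical interface).
Teacher = Callable[[List[str]], List[float]]

def make_constant_teacher(prob: float) -> Teacher:
    """Toy stand-in that assigns the same probability to every token."""
    return lambda tokens: [math.log(prob)] * len(tokens)

def mopd_advantages(
    domain: str,
    tokens: List[str],
    student_logprobs: List[float],
    teachers: Dict[str, Teacher],
    default: str = "general",
) -> List[float]:
    """Pick the teacher for this prompt's domain, then compute the negative
    single-sample reverse KL estimate per token."""
    teacher = teachers.get(domain, teachers[default])
    teacher_logprobs = teacher(tokens)
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Hypothetical routing table: one specialist per domain plus a fallback.
teachers = {
    "math": make_constant_teacher(0.8),
    "tool_use": make_constant_teacher(0.6),
    "general": make_constant_teacher(0.5),
}

tokens = ["2", "+", "2"]
student_lp = [math.log(0.4)] * len(tokens)
math_adv = mopd_advantages("math", tokens, student_lp, teachers)
other_adv = mopd_advantages("poetry", tokens, student_lp, teachers)  # falls back
```

Only the teacher lookup changes relative to single-teacher OPD; the student still generates every trajectory.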

Post-training with MOPD has become a common choice in recent models:

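MiMo-V2-Flash: Starts from a general SFT model and, in the final stage of post-training, uses domain-specific models from across the post-training pipeline (i.e., SFT models, RL specialists, and the student itself) as MOPD teachers, with teachers selected heuristically per domain.
GLM-5: Starts from the final RL checkpoint produced by a sequential RL pipeline covering reasoning, agentic, and general domains, and uses the final checkpoint of each stage as a teacher, again selected by the domain of the prompt. Here the goal of MOPD is to recover capabilities rather than to merge capabilities across domains.
Nemotron-Cascade 2: Places MOPD at the midpoint of post-training as a stabilization step between stages. Three prior model checkpoints are chosen as teachers, taken from earlier training stages for math, RLHF, and multi-domain RL.
DeepSeek-V4: Independently trains a large number (10+) of domain experts with domain-specific SFT and RL, then distills all of them into a single student. What is interesting about this paper is that it uses full vocabulary distillation, which has high memory overhead and is more complex on the infrastructure side, rather than approximating the KL with a single logit.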
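To make the contrast in the last item concrete, here is a small sketch of the two ways to compute reverse KL at a single position. It is an illustration under assumptions, not the method of any listed paper: the logits are made up and the function names are hypothetical. The full vocabulary version needs the teacher's logits for every vocabulary entry at every position, which is where the memory overhead comes from; the sampled version needs only one teacher log-probability per position.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) at one position, summed over the vocabulary."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def sampled_token_estimate(student_logits, teacher_logits, token_id):
    """Single-sample estimate: log-ratio evaluated only at the sampled token."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token_id] - t[token_id]

# Toy 4-token vocabulary (values are made up).
student_logits = [2.0, 0.5, -1.0, 0.0]
teacher_logits = [1.0, 1.5, -0.5, 0.0]

exact = full_vocab_reverse_kl(student_logits, teacher_logits)
per_token = [
    sampled_token_estimate(student_logits, teacher_logits, i)
    for i in range(len(student_logits))
]

# Averaging the per-token estimates under the student's own distribution
# recovers the exact value, which is why sampling from the student works.
student_probs = [math.exp(lp) for lp in log_softmax(student_logits)]
expected = sum(p * e for p, e in zip(student_probs, per_token))
```

The exact value is never negative, while an individual sampled estimate can be, so the single-sample form trades memory for variance.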

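The blog also includes a nice explanation of why self-distillation is a useful addition to MOPD: "Self is a snapshot of the student at the start of MOPD: a fixed, stable reference distribution. On tokens where the SFT/RL teachers push the student into unfamiliar territory, distilling toward Self prevents catastrophic drift."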

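Although MOPD now appears in many different reports, the approaches used are all very similar (i.e., reverse KL, on-policy distillation, multiple teachers), which suggests that OPD / MOPD is becoming a more standardized part of the training pipelines of recent models.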

Blog link: https://t.co/kNc9heqE3T

Translated from X · Researcher first-hand · Recorded May 8, 2026