Hugging Face · Daily Papers

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, et al. (13 authors)

May 1, 2026 · arXiv:2604.28130 · PDF · Code

Abstract

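Recent methods for arbitrary-skeleton motion capture from monocular video typically follow a factorized pipeline: a Video-to-Pose network predicts joint positions, and an analytical inverse-kinematics (IK) stage then recovers joint rotations. Although effective, this design is inherently limited: joint positions do not fully determine rotations, leaving degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective.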
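The bone-axis twist ambiguity can be checked numerically. The following is a minimal sketch, not taken from the paper: the single-bone setup, the variable names, and the chosen angles are all illustrative assumptions. It shows two parent rotations that differ by a 60-degree twist about the bone axis yet place the child joint at exactly the same position.

```python
import numpy as np

def axis_angle_matrix(axis, angle):
    """Rodrigues' formula: rotation matrix for `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Hypothetical bone: the child joint sits one unit along the parent's local y-axis.
bone_offset = np.array([0.0, 1.0, 0.0])

# A base rotation of the parent joint (90 degrees about z).
base = axis_angle_matrix([0.0, 0.0, 1.0], np.pi / 2)

# The same rotation with an extra 60-degree twist about the bone axis applied first.
twist = axis_angle_matrix(bone_offset, np.deg2rad(60.0))
twisted = base @ twist

# Child joint positions under each rotation.
child_a = base @ bone_offset
child_b = twisted @ bone_offset

positions_match = np.allclose(child_a, child_b)
rotations_match = np.allclose(base, twisted)

# Geodesic angle between the two rotations, in degrees.
cos_gap = (np.trace(base.T @ twisted) - 1.0) / 2.0
rotation_gap_deg = np.rad2deg(np.arccos(np.clip(cos_gap, -1.0, 1.0)))
```

Here `positions_match` is true while `rotations_match` is false, so a position-only IK stage has no signal for choosing between the two rotations for this bone.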

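In this work, we present the first fully end-to-end framework, in which both Video-to-Pose and Pose-to-Rotation are learnable modules that are optimized jointly. We observe that the ambiguity in the pose-to-rotation mapping stems from missing coordinate system information: under different rest poses and local axis conventions, the same joint positions can correspond to different rotations. To address this, we introduce a reference pose-rotation pair from the target asset; together with the rest pose, it not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning.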
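The dependence on rest pose can be illustrated with a toy case. This is a sketch under assumed conventions, not the paper's formulation: two hypothetical assets share one bone but define its rest direction along different axes, so the same observed joint position requires a different rotation in each asset.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for `deg` degrees about the z-axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical rest poses: asset A points the bone along +y, asset B along +x.
rest_offset_a = np.array([0.0, 1.0, 0.0])
rest_offset_b = np.array([1.0, 0.0, 0.0])

# Rotations that each asset needs to reach the same posed joint position.
rotation_a = rot_z(90.0)
rotation_b = rot_z(180.0)

posed_a = rotation_a @ rest_offset_a
posed_b = rotation_b @ rest_offset_b

same_position = np.allclose(posed_a, posed_b)
same_rotation = np.allclose(rotation_a, rotation_b)
```

Both assets end with the child joint at (-1, 0, 0), but `rotation_a` and `rotation_b` differ by 90 degrees. Without knowing which rest pose and axis convention applies, a model that sees only positions cannot tell which rotation is the right answer, which is the gap the conditioning on asset information is meant to close.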

In addition, our model predicts joint positions directly from video, without relying on mesh intermediates, which improves robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from about 17 degrees to about 10 degrees, reaching 6.54 degrees on unseen skeletons, while running inference about 20x faster than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
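The abstract does not specify how GL-GMHA is constructed. As a rough intuition only, one common way to combine skeleton-local and global attention is to run attention once with a mask derived from the bone graph and once without a mask. The sketch below assumes that reading; the function names, the one-hop neighborhood, the single head, and the toy 5-joint skeleton are all hypothetical and should not be taken as the authors' design.

```python
import numpy as np

def skeleton_adjacency(parents):
    """Symmetric joint adjacency with self-loops from a parent-index list (root has parent -1)."""
    n = len(parents)
    adj = np.eye(n, dtype=bool)
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child, parent] = True
            adj[parent, child] = True
    return adj

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; entries where mask is False get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability (each row has a self-loop, so it is finite).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical 5-joint skeleton with bones 0-1, 1-2, 1-3, 3-4; joint 0 is the root.
parents = [-1, 0, 1, 1, 3]
rng = np.random.default_rng(0)
features = rng.normal(size=(len(parents), 8))  # one 8-dim feature per joint

local_mask = skeleton_adjacency(parents)   # each joint sees itself and direct neighbors
global_mask = np.ones_like(local_mask)     # each joint sees every joint

local_out, local_w = masked_attention(features, features, features, local_mask)
global_out, global_w = masked_attention(features, features, features, global_mask)
```

In this sketch `local_w` is zero for any pair of joints not connected by a bone, while `global_w` spreads weight over the whole skeleton. Because the mask is built from the parent list alone, the same code applies to skeletons of any size or topology.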