apple-ml-research

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

May 8, 2026 · English original

Abstract

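This work proposes Text-Conditional JEPA (TC-JEPA) to improve visual self-supervised learning based on masked feature prediction in I-JEPA. The method uses image captions to reduce prediction uncertainty at masked positions, and introduces a fine-grained text conditioner that modulates the predicted patch features by computing sparse cross-attention over the text tokens.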

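Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, because of the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features with a fine-grained text conditioner that computes sparse cross-attention over the input text tokens. In this way, …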
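To make the idea of modulating patch features with sparse cross-attention over text tokens more concrete, here is a minimal sketch in plain Python. It is not the TC-JEPA implementation: the abstract does not say how sparsity is enforced or how the modulation is applied, so this sketch assumes top-k sparsity (each patch attends only to its `top_k` highest-scoring text tokens) and an additive residual update. The function name `sparse_text_condition` and the parameter `top_k` are hypothetical, and learned query/key/value projections are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_text_condition(patch_feats, text_tokens, top_k=2):
    """Add a top-k sparse attention readout of the text tokens to each patch.

    patch_feats: list of N predicted patch vectors (used as queries).
    text_tokens: list of T text token vectors (used as keys and values).
    top_k:       ASSUMED sparsity rule; the source does not specify one.
    Returns (conditioned_patches, attention_weights).
    """
    dim = len(patch_feats[0])
    scale = 1.0 / math.sqrt(dim)
    k = min(top_k, len(text_tokens))
    conditioned, all_weights = [], []
    for query in patch_feats:
        scores = [dot(query, tok) * scale for tok in text_tokens]
        # Keep only the k best-matching text tokens for this patch.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        probs = softmax([scores[i] for i in keep])
        weights = [0.0] * len(text_tokens)
        for idx, prob in zip(keep, probs):
            weights[idx] = prob
        # Weighted sum of the selected text tokens.
        context = [
            sum(w * tok[d] for w, tok in zip(weights, text_tokens))
            for d in range(dim)
        ]
        # ASSUMED modulation: additive residual on the predicted feature.
        conditioned.append([x + c for x, c in zip(query, context)])
        all_weights.append(weights)
    return conditioned, all_weights


# Toy inputs: two 2-d patch features and three 2-d text tokens.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
conditioned, weights = sparse_text_condition(patches, tokens, top_k=1)
```

With `top_k=1` each patch draws on a single text token, so the attention matrix has one nonzero entry per row. Larger values of `top_k` spread the weight over more tokens while still zeroing out the rest.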

Translated from apple-ml-research · Recorded May 8, 2026