microsoft-research

Microsoft at NSDI 2026: Advances in large-scale networked systems

May 6, 2026 · Original in English

Summary

Large-scale networked systems underpin cloud computing, AI, and distributed applications and services. The USENIX Symposium on Networked Systems Design and Implementation 2026 (NSDI ’26) is a leading forum where researchers and practitioners share new…

Large-scale networked systems underpin cloud computing, AI, and distributed applications and services. The USENIX Symposium on Networked Systems Design and Implementation 2026 (NSDI ’26) is a leading forum where researchers and practitioners share new research, insights, and advances in the design and operation of these systems.

Microsoft is pleased to continue supporting NSDI ’26 as a sponsor, reflecting our ongoing commitment to advancing systems and networking research and to engaging with the broader community. Microsoft researchers and engineering leaders also serve on the program committee and in other organizing roles.

This year, 11 papers by Microsoft authors and their collaborators were accepted to the conference, spanning datacenter and wide-area networks, AI systems, and cloud infrastructure. Together, they show progress in building and operating large-scale networked systems.

Technical sessions

Monday, May 4, 2:00–3:20 PM

DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants
Yuhan Liu, Yuyang Huang, Jiayi Yao, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, and Junchen Jiang, University of Chicago; Shan Lu, Madan Musuvathi, and Esha Choukse, Microsoft

DroidSpeak lets LLMs that share the same architecture share and partially reuse KV caches across models, delivering up to 4x higher throughput and faster responses with almost no loss in output quality.

Monday, May 4, 3:50–5:30 PM

Eywa: Automating Model-Based Testing using LLMs
Rajdeep Mondal, Rathin Singha, Todd D. Millstein, and George Varghese, UCLA; Ryan Beckett and Siva Kesava Reddy Kakarla, Microsoft Research

Eywa uses LLMs to build protocol models automatically from natural-language sources, enabling model-based testing. It found 33 bugs in widely used network protocol implementations, 16 of them previously unknown.

Tuesday, May 5, 2:00–3:20 PM

Octopus: Enhancing CXL Memory Pods via Sparse Topology
Yuhong Zhong, Columbia University; Fiodar Kazhamiaka, Pantea Zardoshti, Shuwei Teng and Rodrigo Fonseca, Microsoft Azure; Mark D. Hill, University of Wisconsin-Madison; Daniel S. Berger, Microsoft Azure and University of Washington

Octopus introduces a switchless design for disaggregated memory pods that lowers cost and scales to multi-rack pods. On a three-server hardware prototype, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches.

Tuesday, May 5, 3:50–5:30 PM

HEDGE: Traffic Engineering with Probabilistic Link Capacities
Arjun Devraj, Cornell University; Bill Owens, NYSERNet; Umesh Krishnaswamy, Microsoft; Ying Zhang, Meta; Rachee Singh, Cornell University

HEDGE mitigates the impact of wavelength-specific failures in optical networks by combining link-local resilience with global, network-wide resilience, keeping capacity stable and optimizing traffic flow as link performance fluctuates. It reduces network outages while maintaining throughput comparable to existing systems.

Wednesday, May 6, 9:00–10:20 AM

AVA: Towards Video Analytics with Vision Language Models
Yuxuan Yan, Zhejiang University; Shiqi Jiang, Microsoft Research; Ting Cao, Tsinghua University; Yifan Yang, Microsoft Research; Qianqian Yang and Yuanchao Shu, Zhejiang University; Yuqing Yang and Lili Qiu, Microsoft Research

AVA supports open-ended video analytics by combining event knowledge graphs with agentic retrieval based on vision-language models. To evaluate video analytics in ultra-long, open-world scenarios, the authors also introduce AVA-100, a benchmark of 8 videos, each longer than 10 hours, with 120 manually annotated, diverse, and complex question-answer pairs. AVA reaches 75.8% accuracy on this benchmark.

Wednesday, May 6, 9:00–10:20 AM

SmartNIC-Enabled Live Migration for Storage-Optimized VMs with Pyrocumulus
Jiechen Zhao, University of Toronto and Microsoft Research Asia; Ran Shu, Lei Qu, Ziyue Yang, and Rui Ma, Microsoft Research Asia; Derek Chiou, Microsoft and UT Austin; Natalie Enright Jerger, University of Toronto; Peng Cheng and Yongqiang Xiong, Microsoft Research Asia

Pyrocumulus draws on the hardware customizability and efficient network accessibility of FPGA SmartNICs, combined with live migration (LM) protocol, architecture, and algorithm designs, to provide fast, low-overhead live migration for storage-optimized VMs.

Wednesday, May 6, 10:50 AM–12:30 PM

ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics
Liangyu Zhao, University of Washington; Saeed Maleki, Independent Researcher; Yuanhong Wang, Tsinghua University; Zezhou Wang, University of Washington; Ziyue Yang, Microsoft Research; Hossein Pourreza, Microsoft; Arvind Krishnamurthy, University of Washington

ForestColl constructs broadcast/aggregation spanning trees as communication schedules, achieving theoretical optimality. Its schedule generation runs in polynomial time and is highly scalable. It supports any network fabric, including switching fabrics and direct accelerator connections.

Wednesday, May 6, 10:50 AM–12:30 PM

Heuristic Analysis from Source Code via Symbolic-Guided Optimization
Pantea Karimi, MIT; Siva Kesava Reddy Kakarla and Ryan Beckett, Microsoft Research; Santiago Segarra, Rice University; Pooria Namyar, Microsoft Research; Mohammad Alizadeh, MIT; Behnaz Arzani, Microsoft Research

MetaEase analyzes heuristics directly from source code to find worst-case performance scenarios, avoiding the need for complex formal modeling. It matches or outperforms SOTA analyzers across multiple domains and reveals previously unknown performance gaps in real systems.

Wednesday, May 6, 2:00–3:20 PM

Harvesting Spare CPU Resources in Container Systems
Adam Hall and Anirudh Sarma, Georgia Institute of Technology; Esha Choukse, Microsoft Azure Research; Umakishore Ramachandran, Georgia Institute of Technology; Sameh Elnikety, Microsoft Research

HarvestContainers runs latency-tolerant workloads on the idle CPU cores of latency-sensitive containers while protecting those containers from interference. It dynamically determines how many cores can be harvested safely, without modifying applications or the operating system. It achieves up to 75% utilization of idle CPU while keeping tail latency within 4% of standalone performance.

Wednesday, May 6, 3:50–5:30 PM

Offloading Cloud Network Services at Production Scale with SONiC DASH SmartSwitch
Community Award Winner
Shaofeng Wu, The Chinese University of Hong Kong and Microsoft Research Asia; Zhixiong Niu, Microsoft Research Asia; Riff Jiang, Lawrence Lee, Junhua Zhai, Ze Gan, Vasundhara Volam, Prabhat Aravind, Prince Sunny, Prince George, Qi Luo, Evan Langlais, Soumya Tiwari, Venkat Satish Katta, Weixi Chen, Rishiraj Hazarika, Sachin Jain, Deven Jagasia, Michal Zygmunt, Avijit Gupta, Neeraj Motwani, and Pranjal Shrivastava, Microsoft; Qiang Su, The Chinese University of Hong Kong; Anil Reddy Pannala, Kristina Moore, James Grantham, Anupam Pandey, Xin Liu, Guohan Lu, Gerald De Grace, Rishabh Tewari, Lihua Yuan, Erica Lan, Deepak Bansal, and Dave Maltz, Microsoft; Yongqiang Xiong, Microsoft Research Asia; Hong Xu, The Chinese University of Hong Kong

SONiC DASH SmartSwitch redesigns cloud network offloading through a hardware-friendly pipeline, a unified switch architecture, and an open development model, while addressing key scalability and deployment challenges. It is deployed at scale in Azure, where it delivers high throughput and connection capacity while significantly improving energy and space efficiency.

Wednesday, May 6, 3:50–5:30 PM

KRAKENGUARD: Towards Fine-Grained eBPF Isolation
Jainil Patel, IIT Roorkee; Lucas Graeff Buhl-Nielsen, Quantco; Adrien Ghosn, Microsoft; Marios Kogias, Imperial College London

KRAKENGUARD uses symbolic execution to enforce fine-grained, policy-based control over eBPF programs at load time, so that they can be used safely in multi-tenant environments without relying on coarse-grained Linux capabilities. It prevents malicious behavior, detects vulnerabilities, and allows untrusted programs to run safely under strong isolation guarantees.

Symposium organizers from Microsoft

Program Committee: Ganesh Ananthanarayanan, Behnaz Arzani, Hitesh Ballani, Ryan Beckett, Ranveer Chandra, Paolo Costa, Rodrigo Fonseca, Xenofon Foukas, Kevin Hsieh, Umesh Krishnaswamy, Jing Liu, Jonathan Mace, Dave Maltz, Sathiya Mani, Dushyanth Narayanan, Suman Nath, Ram Ramjee, Stefan Saroiu

Steering Committee: Sujata Banerjee, Jay Lorch

The post Microsoft at NSDI 2026: Advances in large-scale networked systems first appeared on Microsoft Research.

Translated from microsoft-research · Recorded May 6, 2026