MP-CreditINT: Enhancing multi-path RDMA transport with credit-based congestion control and in-band network telemetry
Yi Pan, Jiali You
Computer Networks, Volume 272, Article 111665
DOI: 10.1016/j.comnet.2025.111665
Published: 2025-09-22
https://www.sciencedirect.com/science/article/pii/S1389128625006322
Citations: 0
Abstract
Remote Direct Memory Access (RDMA) has played a critical role in recent Large-Language-Model (LLM) training workloads by enabling low-latency, high-throughput communication across GPUs. To further improve network efficiency, multi-path RDMA transport has received increasing attention. Given the limited on-chip resources of RDMA NICs, MP-RDMA stands out as a state-of-the-art multi-path transport by adopting memory-efficient mechanisms for congestion and out-of-order control. However, MP-RDMA relies on ECN, which provides only coarse-grained binary congestion signals, resulting in limited congestion control capability. Our experimental observations reveal intrinsic limitations of MP-RDMA, such as slow bandwidth probing, large oscillations in queue buildup, and transient congestion. These limitations reduce network efficiency, especially under All-To-All communication patterns, which are becoming increasingly dominant with the evolution of Mixture-of-Experts (MoE) models and Expert Parallelism.
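To make the signal-granularity gap concrete, the following is a minimal illustrative sketch (not the paper's algorithm): an ECN mark tells the sender only "congested or not", forcing a coarse reaction, whereas an INT-reported queue depth lets the sender scale its window by how far the queue is from a target. All function names and constants here are hypothetical.

```python
def ecn_response(cwnd: float, marked: bool) -> float:
    """Binary ECN-style feedback: the sender only learns whether the
    packet saw congestion, so it falls back to a coarse AIMD reaction."""
    return cwnd / 2 if marked else cwnd + 1


def int_response(cwnd: float, queue_depth: int, target_depth: int,
                 gain: float = 0.1) -> float:
    """Multi-bit INT-style feedback: the reported queue depth gives a
    signed error relative to a target, enabling a proportional update."""
    error = (target_depth - queue_depth) / max(target_depth, 1)
    return max(1.0, cwnd * (1 + gain * error))


# With ECN, a single mark halves the window regardless of severity;
# with INT, a queue already at target leaves the window unchanged.
print(ecn_response(10.0, marked=True))          # coarse halving
print(int_response(10.0, queue_depth=100, target_depth=100))
```

The design point: the INT variant converges toward the target queue depth instead of oscillating between halving and probing, which is the kind of fine-grained control the binary ECN signal cannot express.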
To address these limitations while retaining the memory-efficient property of MP-RDMA, we propose MP-CreditINT, an approach that re-architects MP-RDMA using credit-based congestion control and in-band network telemetry (INT). We systematically enumerate and address the architectural and algorithmic challenges arising from this transformation, including explicit path control and path symmetry, INT-based data-clocking credit window control, and robustness against feedback-loop breakage. We then evaluate MP-CreditINT using micro-benchmarks, heavy incast traffic, and LLM training workloads. Simulation results demonstrate that, compared to MP-RDMA, MP-CreditINT achieves 6–38× faster ramp-up and 8–16× faster fairness convergence while maintaining a near-zero out-of-order degree. In heavy incast scenarios, it achieves superior fairness and near-zero queue buildup, whereas MP-RDMA exhibits exponential queue growth as the incast scale increases. Finally, under two representative LLM training workloads, All-Reduce and All-To-All, MP-CreditINT reduces completion time by 5%–8% and 7%–13% respectively compared to MP-RDMA, demonstrating its benefits for LLM training.
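The credit-based control described above can be sketched in miniature as a receiver-driven exchange: the receiver paces out credits at its line rate, and the sender transmits exactly one packet per credit, so in-network queues stay near zero by construction. This is a generic sketch of receiver-driven credit transport with hypothetical class and method names, not MP-CreditINT's actual wire protocol.

```python
from collections import deque


class CreditSender:
    """Sender side of a receiver-driven, credit-based transport sketch."""

    def __init__(self, packets):
        self.pending = deque(packets)  # data waiting for credits
        self.sent = []                 # data released onto the wire

    def on_credit(self, n: int) -> None:
        """Each credit entitles the sender to emit exactly one packet;
        without credits, nothing is sent, which bounds queue buildup."""
        for _ in range(n):
            if not self.pending:
                break
            self.sent.append(self.pending.popleft())


# The receiver grants two credits (paced at its line rate), so only
# two of the three queued packets are released.
sender = CreditSender(["p1", "p2", "p3"])
sender.on_credit(2)
print(sender.sent)  # ['p1', 'p2']
```

In an incast, many such senders each wait for credits from the same receiver, so aggregate arrival rate is capped at the receiver's credit-issue rate, which is why credit-based schemes avoid the exponential queue growth the abstract reports for ECN-driven MP-RDMA.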
Journal introduction:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.