Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang
{"title":"人工智能数据中心网络的端到端负载均衡:一种基于融合的提高培训效率的方法","authors":"Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang","doi":"10.1002/ett.70249","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. 
These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.</p>\n </div>","PeriodicalId":23282,"journal":{"name":"Transactions on Emerging Telecommunications Technologies","volume":"36 10","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency\",\"authors\":\"Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang\",\"doi\":\"10.1002/ett.70249\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. 
These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.</p>\\n </div>\",\"PeriodicalId\":23282,\"journal\":{\"name\":\"Transactions on Emerging Telecommunications Technologies\",\"volume\":\"36 10\",\"pages\":\"\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Transactions on Emerging Telecommunications Technologies\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/ett.70249\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"TELECOMMUNICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions on Emerging Telecommunications Technologies","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ett.70249","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency
In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.
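The hash polarization problem the abstract describes can be illustrated with a small sketch (not the paper's ENLB method): ECMP selects an uplink by hashing each flow's 5-tuple, so when only a handful of elephant flows exist, the hash can easily map two flows onto one path and leave another idle. The tuple values and path count below are hypothetical.

```python
import hashlib

def ecmp_path(flow, n_paths):
    """ECMP-style path selection: hash the flow 5-tuple onto one of n equal-cost paths."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Four hypothetical elephant flows (src, dst, sport, dport, proto), 100 Gb each,
# competing for four equal-cost uplinks -- typical of AI training traffic.
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 4791, "UDP") for i in range(4)]
n_paths = 4
load = [0] * n_paths
for f in flows:
    load[ecmp_path(f, n_paths)] += 100  # Gb carried by each chosen path

# With so few flows, the per-path load is rarely uniform: some paths carry
# two elephant flows while others carry none, wasting uplink bandwidth.
print(load)
```

Because the path choice depends only on the flow hash, adding bandwidth does not fix the skew; that is the gap end-to-network scheduling schemes such as ENLB aim to close.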
Journal Introduction:
Transactions on Emerging Telecommunications Technologies (ETT), formerly known as European Transactions on Telecommunications (ETT), has the following aims:
- to attract cutting-edge publications from leading researchers and research groups around the world
- to become a highly cited source of timely research findings in emerging fields of telecommunications
- to limit revision and publication cycles to a few months, making the journal a significantly more attractive publication venue
- to become the leading journal for publishing the latest developments in telecommunications