End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency

IF 2.5 · CAS Quartile 4, Computer Science · JCR Q3, TELECOMMUNICATIONS
Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang
DOI: 10.1002/ett.70249
Journal: Transactions on Emerging Telecommunications Technologies, Vol. 36, No. 10
Published: 2025-09-24 (Journal Article)
Full text: https://onlinelibrary.wiley.com/doi/10.1002/ett.70249
Citations: 0

Abstract


In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.
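The hash-polarization problem the abstract describes can be made concrete with a small sketch. The snippet below is illustrative only and is not taken from the paper: the four-uplink topology, the flow identifiers, and the flow sizes are all invented. It shows how hash-based ECMP, which assigns each flow to one uplink, leaves links badly imbalanced when there are only a few flows and one "elephant" flow dominates the traffic.

```python
# Illustrative sketch (not the paper's method): why hash-based ECMP
# balances poorly with few flows and a dominant elephant flow.
# The 4-uplink fabric and flow sizes below are invented for illustration.
import hashlib

NUM_UPLINKS = 4

def ecmp_pick(flow_id: str) -> int:
    """Pick an uplink by hashing the flow identifier (stands in for the 5-tuple)."""
    digest = hashlib.md5(flow_id.encode()).digest()
    return digest[0] % NUM_UPLINKS

# A handful of flows; one elephant carries ~96% of the bytes (sizes in GB, invented).
flows = {"f0": 100, "f1": 1, "f2": 1, "f3": 1, "f4": 1}

load = [0] * NUM_UPLINKS
for fid, size in flows.items():
    load[ecmp_pick(fid)] += size

print("per-uplink load (GB):", load)
# Because ECMP pins each flow to exactly one path, the 100 GB elephant
# lands whole on a single uplink: that link carries most of the traffic
# while the others sit nearly idle, regardless of how the hash spreads
# the small flows.
```

This is the failure mode ENLB targets by coordinating scheduling between servers and switches instead of relying on a per-flow hash alone.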

Source journal metrics: CiteScore 8.90; self-citation rate 13.90%; 249 articles per year.
Journal overview: Transactions on Emerging Telecommunications Technologies (ETT), formerly known as European Transactions on Telecommunications (ETT), has the following aims:
- to attract cutting-edge publications from leading researchers and research groups around the world
- to become a highly cited source of timely research findings in emerging fields of telecommunications
- to limit revision and publication cycles to a few months and thus significantly increase attractiveness to publish
- to become the leading journal for publishing the latest developments in telecommunications