Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang
{"title":"人工智能数据中心网络的端到端负载均衡:一种基于融合的提高培训效率的方法","authors":"Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang","doi":"10.1002/ett.70249","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. 
These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.</p>\n </div>","PeriodicalId":23282,"journal":{"name":"Transactions on Emerging Telecommunications Technologies","volume":"36 10","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency\",\"authors\":\"Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang\",\"doi\":\"10.1002/ett.70249\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. 
These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.</p>\\n </div>\",\"PeriodicalId\":23282,\"journal\":{\"name\":\"Transactions on Emerging Telecommunications Technologies\",\"volume\":\"36 10\",\"pages\":\"\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Transactions on Emerging Telecommunications Technologies\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/ett.70249\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"TELECOMMUNICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions on Emerging Telecommunications Technologies","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ett.70249","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency
In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.
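The hash polarization problem the abstract describes can be illustrated with a small sketch (not the paper's ENLB method): ECMP selects an uplink by hashing each flow's 5-tuple, so when only a handful of elephant flows exist, the hash can easily map two flows onto one path and leave another idle. The tuple values and path count below are hypothetical.

```python
import hashlib

def ecmp_path(flow, n_paths):
    """ECMP-style path selection: hash the flow 5-tuple onto one of n equal-cost paths."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Four hypothetical elephant flows (src, dst, sport, dport, proto), 100 Gb each,
# competing for four equal-cost uplinks -- typical of AI training traffic.
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 4791, "UDP") for i in range(4)]
n_paths = 4
load = [0] * n_paths
for f in flows:
    load[ecmp_path(f, n_paths)] += 100  # Gb carried by each chosen path

# With so few flows, the per-path load is rarely uniform: some paths carry
# two elephant flows while others carry none, wasting uplink bandwidth.
print(load)
```

Because the path choice depends only on the flow hash, adding bandwidth does not fix the skew; that is the gap end-to-network scheduling schemes such as ENLB aim to close.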
Journal Introduction:
Transactions on Emerging Telecommunications Technologies (ETT), formerly known as European Transactions on Telecommunications (ETT), has the following aims:
- to attract cutting-edge publications from leading researchers and research groups around the world
- to become a highly cited source of timely research findings in emerging fields of telecommunications
- to limit revision and publication cycles to a few months, making the journal a significantly more attractive publication venue
- to become the leading journal for publishing the latest developments in telecommunications