End-to-Network Load Balancing for AI Data Center Networks: A Convergence-Based Approach to Enhance Training Efficiency
Ran Zhang, Xuan Zhao, Yingying Han, Yubin Yang, Jun Ruan, Jianqin Zhang, Donglin Chen, Heng Wang
Transactions on Emerging Telecommunications Technologies, vol. 36, issue 10. Published 2025-09-24. DOI: 10.1002/ett.70249
Citations: 0
Abstract
In large-scale language model training, network performance is a crucial determinant of training efficiency. Traditional load balancing methods, such as equal-cost multipath (ECMP), often suffer from hash polarization, leading to suboptimal traffic distribution—particularly in scenarios with limited flow counts and a dominance of elephant flows. To mitigate this challenge, this paper introduces end-to-network load balancing (ENLB), a novel and readily deployable scheme that optimizes uplink utilization through coordinated server-switch traffic scheduling. Leveraging end-to-network convergence principles, ENLB enhances bandwidth efficiency while minimizing flow completion times. Simulation and experimental evaluations demonstrate that ENLB improves network bandwidth utilization by up to 38% and reduces model training task durations by over 3% compared to conventional ECMP-based approaches. These findings underscore ENLB's potential as a scalable solution for modern AI Data Center (AIDC) networks.
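The hash polarization the abstract refers to arises when switches at successive tiers apply the same deterministic hash to a flow's 5-tuple: flows that landed on the same uplink at the first tier all land on the same uplink again at the next tier, leaving sibling links idle. The Python sketch below illustrates only this failure mode and is not the paper's ENLB scheme; the flow tuples, link count, and use of CRC32 as a stand-in ECMP hash are our assumptions.

```python
# Illustrative sketch of ECMP hash polarization (not the paper's ENLB method).
import zlib
import random
from collections import Counter

def ecmp_pick(flow_tuple, num_links, seed=0):
    """Pick an uplink by hashing the flow's 5-tuple (plus an optional per-switch seed)."""
    key = repr((seed,) + flow_tuple).encode()
    return zlib.crc32(key) % num_links

random.seed(42)
# Hypothetical RoCEv2-style flows: (src IP, dst IP, src port, dst port, proto).
flows = [("10.0.0.%d" % random.randint(1, 50),
          "10.1.0.%d" % random.randint(1, 50),
          random.randint(1024, 65535), 4791, "UDP")
         for _ in range(32)]

NUM_LINKS = 4

# Tier 1: every flow hashes onto one of four uplinks.
tier1 = {f: ecmp_pick(f, NUM_LINKS) for f in flows}

# Tier 2 with the SAME hash: the flows that shared tier-1 link 0 all hash
# to a single tier-2 link again -- this is polarization.
polarized = Counter(ecmp_pick(f, NUM_LINKS) for f in flows if tier1[f] == 0)

# Tier 2 with a per-switch seed: the same correlated subset typically
# spreads back across the links.
reseeded = Counter(ecmp_pick(f, NUM_LINKS, seed=1) for f in flows if tier1[f] == 0)

print("tier-2 spread, same hash:  ", dict(polarized))  # one bucket only
print("tier-2 spread, seeded hash:", dict(reseeded))   # spread across links
```

Per-switch seeding only decorrelates the hashes; per the abstract, ENLB goes further by coordinating traffic scheduling between servers and switches, which matters when a few elephant flows dominate and hashing alone cannot balance load.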
Journal overview:
Transactions on Emerging Telecommunications Technologies (ETT), formerly known as European Transactions on Telecommunications (ETT), has the following aims:
- to attract cutting-edge publications from leading researchers and research groups around the world
- to become a highly cited source of timely research findings in emerging fields of telecommunications
- to limit revision and publication cycles to a few months, making the journal significantly more attractive as a publication venue
- to become the leading journal for publishing the latest developments in telecommunications