RECCL: optimizing collective algorithms for reconfigurable optical networks

Impact factor: 4.0 | Region 2 (Computer Science) | JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Hong Zou;Huaxi Gu;Xiaoshan Yu;Zhuodong Wu;Yifeng Zhu
Journal of Optical Communications and Networking, vol. 17, no. 6, pp. 470-484. Published 2025-03-20. DOI: 10.1364/JOCN.555632
https://ieeexplore.ieee.org/document/11007423/
Citations: 0

Abstract

RECCL: optimizing collective algorithms for reconfigurable optical networks
With the introduction of optical circuit switching, network topologies in distributed machine learning systems can be periodically reconfigured. Existing collective strategies use fixed communication steps and relationships during model training, limiting their ability to fully utilize network resources. Therefore, we propose RECCL, a reconfigurable collective communication library that dynamically adjusts communication relationships in collective algorithms based on network topology reconfigurations, thereby enhancing link utilization. RECCL considers the interactions between collectives and reconfigurable networks and develops a collective cost model for various topologies by using the innovative collective sketch and fine-grained network model. Based on this cost model, RECCL designed an optimization algorithm that rapidly reconfigures collectives with minimal communication cost for the current topology, enabling reconfiguration for kilo-scale nodes within 300 ms. Our experiments show that RECCL’s reconfigured collectives are $1.08{-}2.5 \times$ faster than those in MPI or NCCL. RECCL can accelerate end-to-end training of GPT-2 and BERT by $1.12{-}1.65 \times$ for different batch sizes.
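The abstract does not spell out RECCL's collective sketch or its fine-grained network model, so the following is only a minimal sketch of the general idea of cost-model-driven collective selection. It assumes the textbook alpha-beta model (per-message latency `alpha`, per-byte transfer time `beta`) and compares two standard allreduce algorithms; the function names and parameters are hypothetical and are not part of RECCL, MPI, or NCCL.

```python
def ring_allreduce_cost(p, n_bytes, alpha, beta):
    """Alpha-beta cost of ring allreduce: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def recursive_doubling_cost(p, n_bytes, alpha, beta):
    """Alpha-beta cost of recursive-doubling allreduce: log2(p) steps of n bytes.

    Assumes p is a power of two (a simplification made for this sketch).
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    steps = p.bit_length() - 1  # exact integer log2 for powers of two
    return steps * (alpha + n_bytes * beta)


def pick_allreduce(p, n_bytes, alpha, beta):
    """Return (name of cheapest algorithm, dict of all modelled costs).

    alpha and beta stand in for the characteristics of the current topology;
    re-running this after a reconfiguration may change the winner.
    """
    costs = {
        "ring": ring_allreduce_cost(p, n_bytes, alpha, beta),
        "recursive_doubling": recursive_doubling_cost(p, n_bytes, alpha, beta),
    }
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    # Hypothetical link parameters: 10 us latency, 1 ns per byte.
    for size in (1024, 10**9):
        name, costs = pick_allreduce(8, size, alpha=1e-5, beta=1e-9)
        print(size, name, costs)
```

Under these assumed parameters the latency term dominates for small messages, which favors the algorithm with fewer steps, while the bandwidth term dominates for large messages, which favors the ring. A library that re-evaluates such a model whenever the topology changes can switch algorithms accordingly, which is the behavior the abstract describes at a much finer granularity.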
Source journal
CiteScore: 9.40
Self-citation rate: 16.00%
Articles published: 104
Review time: 4 months
Journal description: The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.