Hong Zou;Huaxi Gu;Xiaoshan Yu;Zhuodong Wu;Yifeng Zhu
Journal of Optical Communications and Networking, vol. 17, no. 6, pp. 470-484. Published 2025-03-20. DOI: 10.1364/JOCN.555632. https://ieeexplore.ieee.org/document/11007423/
RECCL: optimizing collective algorithms for reconfigurable optical networks
With the introduction of optical circuit switching, network topologies in distributed machine learning systems can be periodically reconfigured. Existing collective strategies use fixed communication steps and relationships during model training, which limits their ability to fully utilize network resources. Therefore, we propose RECCL, a reconfigurable collective communication library that dynamically adjusts communication relationships in collective algorithms based on network topology reconfigurations, thereby enhancing link utilization. RECCL considers the interactions between collectives and reconfigurable networks and develops a collective cost model for various topologies by using an innovative collective sketch and a fine-grained network model. Based on this cost model, RECCL applies an optimization algorithm that rapidly reconfigures collectives with minimal communication cost for the current topology, enabling reconfiguration for kilo-scale nodes within 300 ms. Our experiments show that RECCL’s reconfigured collectives are $1.08{-}2.5 \times$ faster than those in MPI or NCCL. RECCL can accelerate end-to-end training of GPT-2 and BERT by $1.12{-}1.65 \times$ for different batch sizes.
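To make the idea of choosing a collective by its cost on the current topology concrete, here is a minimal sketch. It is not RECCL's cost model or optimizer, which the abstract does not specify. It assumes a textbook alpha-beta (latency plus bandwidth) model, store-and-forward multi-hop transfers, a connected topology, and no link contention. All function names and the `alpha` and `beta` values are hypothetical and chosen only for illustration.

```python
from collections import deque


def hop_counts(links, n):
    """BFS hop distance between every pair of nodes over undirected links.

    Assumes the topology is connected.
    """
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    dist = {}
    for src in range(n):
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return dist


def ring_allreduce_steps(order, msg_bytes):
    """Ring allreduce: 2*(p-1) steps; each rank sends msg_bytes/p to its successor."""
    p = len(order)
    chunk = msg_bytes / p
    step = [(order[i], order[(i + 1) % p], chunk) for i in range(p)]
    return [step] * (2 * (p - 1))


def schedule_cost(steps, dist, alpha, beta):
    """Alpha-beta cost: each step takes as long as its slowest transfer.

    A transfer over h hops is modeled as h * (alpha + beta * size).
    """
    return sum(
        max(dist[s][d] * (alpha + beta * size) for s, d, size in step)
        for step in steps
    )


def pick_best(candidates, links, n, msg_bytes, alpha=1e-6, beta=1e-9):
    """Return the name of the cheapest candidate ring order and all costs."""
    dist = hop_counts(links, n)
    costs = {
        name: schedule_cost(ring_allreduce_steps(order, msg_bytes), dist, alpha, beta)
        for name, order in candidates.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical 4-node topology after a reconfiguration: circuits form 0-2-1-3-0.
links = [(0, 2), (2, 1), (1, 3), (3, 0)]
candidates = {
    "fixed": [0, 1, 2, 3],      # rank order chosen without looking at the topology
    "matched": [0, 2, 1, 3],    # rank order that follows the physical circuits
}
best, costs = pick_best(candidates, links, n=4, msg_bytes=4e6)
```

In this toy setup the fixed rank order forces some transfers across two hops, while the order matched to the circuits keeps every transfer on a direct link, so the matched order is selected. A real system would also have to model contention on shared links and the cost of the reconfiguration itself.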
About the journal:
The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.