Ling Liu, Wang Xu, Pan Zhou, Xiaoqiong Xu, Xi Chen, Hongfang Yu, Gang Sun
Neurocomputing, Volume 652, Article 131025. Published 2025-07-19. DOI: 10.1016/j.neucom.2025.131025. Impact factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence). URL: https://www.sciencedirect.com/science/article/pii/S0925231225016972
Optimizing global parameter synchronization for geo-distributed machine learning in reconfigurable optical wide area networks
Geo-distributed machine learning (Geo-DML) usually uses a hierarchical training architecture, with local parameter synchronization (LPS) within each data center and global parameter synchronization (GPS) between data centers. Compared to fast LAN bandwidth, the heterogeneous and scarce WAN bandwidth is one of the main bottlenecks of training performance for Geo-DML. Fortunately, emerging optical technologies make the modern WAN topology reconfigurable, a capability that has already been used, with the help of software-defined networking (SDN), to improve the performance of some traditional traffic. However, most schemes aimed at accelerating Geo-DML overlook the reconfigurable WAN topology. In this paper, we propose AdaptivePS, an adaptive global parameter synchronization scheduling scheme that leverages both the reconfigurability of the WAN topology and the training characteristics to speed up Geo-DML training. Specifically, mathematical optimization models that consider topology construction and parameter synchronization scheduling are first established. AdaptivePS then solves these models through a relaxation and deterministic rounding scheme, obtaining the deployment of global aggregation nodes, the wavelength allocation, and the path and rate allocation. Simulation results based on real WAN topologies show that, compared to RoWAN, RAPIER, and Baseline, AdaptivePS can reduce global communication time (GCT) by up to 73.4%, 86.7%, and 96.2%, respectively. This demonstrates that AdaptivePS can effectively cope with different network environments, with the help of adaptive selection of global aggregation nodes, reconfigurable topology, and mathematical-model-based scheduling.
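The relax-then-round idea mentioned in the abstract can be illustrated with a toy sketch. This is not the AdaptivePS algorithm: the abstract does not give the model or the rounding rule, so everything below is an assumption made for illustration. It assumes a relaxed (fractional) placement vector for a single global aggregation node has already been produced by some solver, rounds it deterministically by keeping the largest entry, and then evaluates a simplified GCT as the slowest transfer to the chosen node. The names `round_deterministic`, `global_communication_time`, the data center labels, and the bandwidth figures are all hypothetical.

```python
from typing import Dict


def round_deterministic(fractional: Dict[str, float]) -> Dict[str, int]:
    """Round a fractional placement vector to a one-hot choice.

    Assumed rule (not from the paper): the largest fractional value
    becomes 1, all others 0; ties go to the alphabetically first node.
    """
    chosen = max(sorted(fractional), key=lambda node: fractional[node])
    return {node: int(node == chosen) for node in fractional}


def global_communication_time(
    model_size_gb: float,
    bandwidth_gbps_to_aggregator: Dict[str, float],
) -> float:
    """Simplified GCT in seconds: the slowest data center bounds the sync.

    model_size_gb is in gigabytes, bandwidth in gigabits per second.
    """
    size_gbit = model_size_gb * 8.0
    return max(size_gbit / bw for bw in bandwidth_gbps_to_aggregator.values())


# Hypothetical relaxed solution over three candidate data centers.
fractional_placement = {"A": 0.2, "B": 0.5, "C": 0.3}
placement = round_deterministic(fractional_placement)

# Hypothetical WAN bandwidth from the other data centers to aggregator "B".
bandwidth_to_b = {"A": 10.0, "C": 40.0}
gct = global_communication_time(1.0, bandwidth_to_b)

print(placement, gct)
```

In a fuller model the rounded placement would feed back into wavelength, path, and rate allocation; the sketch only shows how a rounding step turns a fractional answer into a concrete node choice whose communication time can then be evaluated.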
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.