Optimizing global parameter synchronization for geo-distributed machine learning in reconfigurable optical wide area networks

IF 5.5 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neurocomputing Pub Date : 2025-07-19 DOI:10.1016/j.neucom.2025.131025

Ling Liu, Wang Xu, Pan Zhou, Xiaoqiong Xu, Xi Chen, Hongfang Yu, Gang Sun

Neurocomputing, Volume 652, Article 131025 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0925231225016972

Citations: 0

Abstract

Geo-distributed machine learning (Geo-DML) usually uses a hierarchical training architecture: local parameter synchronization (LPS) within each data center and global parameter synchronization (GPS) between data centers. Compared with fast LAN bandwidth, heterogeneous and scarce WAN bandwidth is one of the main bottlenecks of Geo-DML training performance. Fortunately, emerging optical technologies make the modern WAN topology reconfigurable, a capability that has already been used, with the help of software-defined networking (SDN), to improve the performance of some traditional traffic. However, most schemes aimed at accelerating Geo-DML overlook the reconfigurable WAN topology. In this paper, we propose AdaptivePS, an adaptive global parameter synchronization scheduling scheme that leverages the reconfigurability of the WAN topology and the characteristics of training to speed up Geo-DML training. Specifically, mathematical optimization models that consider topology construction and parameter synchronization scheduling are first established. AdaptivePS then solves these models through a relaxation and deterministic rounding scheme, obtaining the deployment of global aggregation nodes, the wavelength allocation, and the path and rate allocation. Simulation results based on real WAN topologies show that, compared with RoWAN, RAPIER, and Baseline, AdaptivePS can reduce global communication time (GCT) by up to 73.4%, 86.7%, and 96.2%, respectively. This demonstrates that AdaptivePS can effectively cope with different network environments, with the help of adaptive selection of global aggregation nodes, reconfigurable topology, and mathematical-model-based scheduling.
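The abstract does not give the optimization models themselves, so the following is only a toy sketch of two ideas it names: choosing a global aggregation node to minimize GCT, and deterministically rounding a relaxed (fractional) placement. Everything here is an assumption made for illustration, not the AdaptivePS formulation: the function names, the bandwidth matrix, the "push then pull" timing model, and the largest-value rounding rule are all hypothetical.

```python
from typing import Dict

# Hypothetical input: bandwidth[src][dst] is the usable rate (Gbit/s)
# between two data centers. These values are illustrative, not from the paper.
Bandwidth = Dict[str, Dict[str, float]]


def gct_for_aggregator(agg: str, bandwidth: Bandwidth, size_gbit: float) -> float:
    """Toy GCT model (an assumption): every other data center pushes its
    parameters to `agg`, then pulls the aggregate back. Each phase ends
    when its slowest transfer ends."""
    others = [dc for dc in bandwidth if dc != agg]
    push = max(size_gbit / bandwidth[dc][agg] for dc in others)
    pull = max(size_gbit / bandwidth[agg][dc] for dc in others)
    return push + pull


def best_aggregator(bandwidth: Bandwidth, size_gbit: float) -> str:
    """Exhaustively pick the aggregation node with the smallest toy GCT."""
    return min(sorted(bandwidth), key=lambda dc: gct_for_aggregator(dc, bandwidth, size_gbit))


def round_deterministically(fractional: Dict[str, float]) -> str:
    """One simple deterministic rounding rule (assumed, not the paper's):
    given relaxed placement values in [0, 1], select the candidate with the
    largest value, breaking ties by name."""
    return max(sorted(fractional), key=lambda dc: fractional[dc])


if __name__ == "__main__":
    bandwidth = {
        "A": {"B": 10.0, "C": 2.0},
        "B": {"A": 10.0, "C": 8.0},
        "C": {"A": 2.0, "B": 8.0},
    }
    for dc in sorted(bandwidth):
        print(dc, gct_for_aggregator(dc, bandwidth, size_gbit=8.0))
    print("best:", best_aggregator(bandwidth, size_gbit=8.0))
    print("rounded:", round_deterministically({"A": 0.2, "B": 0.7, "C": 0.1}))
```

In this toy instance the well-connected node B gives the smallest GCT, which is the intuition behind adapting the aggregation node to the network. A real scheme would also have to account for wavelength, path, and rate allocation, which this sketch leaves out.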


Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.