Optimizing global parameter synchronization for geo-distributed machine learning in reconfigurable optical wide area networks

Impact factor: 5.5 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Ling Liu, Wang Xu, Pan Zhou, Xiaoqiong Xu, Xi Chen, Hongfang Yu, Gang Sun
Neurocomputing, Volume 652, Article 131025. Published 2025-07-19.
DOI: 10.1016/j.neucom.2025.131025
https://www.sciencedirect.com/science/article/pii/S0925231225016972
Citations: 0

Abstract

Geo-distributed machine learning (Geo-DML) usually uses a hierarchical training architecture: local parameter synchronization (LPS) within each data center and global parameter synchronization (GPS) between data centers. Compared with fast LAN bandwidth, heterogeneous and scarce WAN bandwidth is one of the main bottlenecks of Geo-DML training performance. Fortunately, emerging optical technologies make the modern WAN topology reconfigurable, a capability that has already been used, with the help of software-defined networking (SDN), to improve the performance of some traditional traffic. However, most schemes aimed at accelerating Geo-DML overlook the reconfigurable WAN topology. In this paper, we propose AdaptivePS, an adaptive global parameter synchronization scheduling scheme that leverages the reconfigurability of the WAN topology and the characteristics of training to speed up Geo-DML training. Specifically, we first establish mathematical optimization models that consider topology construction and parameter synchronization scheduling. AdaptivePS then solves these models through a relaxation and deterministic rounding scheme, obtaining the deployment of global aggregation nodes, the wavelength allocation, and the path and rate allocation. Simulation results based on real WAN topologies show that, compared to RoWAN, RAPIER, and Baseline, AdaptivePS can reduce global communication time (GCT) by up to 73.4%, 86.7%, and 96.2%, respectively. This demonstrates that AdaptivePS can effectively cope with different network environments, with the help of adaptive selection of global aggregation nodes, reconfigurable topology, and mathematical-model-based scheduling.
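
As a rough illustration of the relax-then-round idea mentioned in the abstract, the sketch below rounds a fractional placement vector to a single global aggregation node and evaluates the resulting GCT. It is not the AdaptivePS formulation. The function names, the single-aggregator assumption, the star-shaped synchronization model (every data center pushes directly to the aggregator in parallel), the argmax rounding rule, and all numbers are hypothetical, and the fractional values stand in for the output of a relaxed model rather than being computed by a solver.

```python
from typing import List


def deterministic_round(fractional: List[float]) -> List[int]:
    """Round a relaxed 0/1 placement vector so that exactly one node is chosen.

    Assumed rule: pick the largest fractional value, breaking ties by the
    lowest index, so the same input always yields the same placement.
    """
    best = max(range(len(fractional)), key=lambda i: (fractional[i], -i))
    return [1 if i == best else 0 for i in range(len(fractional))]


def global_communication_time(
    aggregator: int,
    model_size_gbit: float,
    bandwidth_gbps: List[List[float]],
) -> float:
    """GCT under a simplified star model (an assumption of this sketch).

    Every other data center sends its parameters directly to the aggregator
    in parallel, so the slowest link determines the total time.
    """
    return max(
        model_size_gbit / bandwidth_gbps[src][aggregator]
        for src in range(len(bandwidth_gbps))
        if src != aggregator
    )


# Hypothetical 3-data-center WAN: bandwidth_gbps[i][j] is the rate from i to j.
bandwidth = [
    [0.0, 10.0, 2.0],
    [10.0, 0.0, 5.0],
    [2.0, 5.0, 0.0],
]
relaxed_placement = [0.2, 0.7, 0.1]  # stand-in for a relaxed model's output

placement = deterministic_round(relaxed_placement)
chosen = placement.index(1)
gct = global_communication_time(chosen, 40.0, bandwidth)
print(f"aggregator = DC{chosen}, GCT = {gct} s")
```

In this toy setting, data center 1 has the best worst-case link, so rounding toward it gives a lower GCT than either alternative. A real scheme would also have to decide wavelengths, paths, and rates, which this sketch leaves out.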
Source journal
Neurocomputing (Engineering and Technology, Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.