Ling Liu, Xiaoqiong Xu, Pan Zhou, Xi Chen, Daji Ergu, Hongfang Yu, Gang Sun, Mohsen Guizani
PSscheduler: A parameter synchronization scheduling algorithm for distributed machine learning in reconfigurable optical networks
Neurocomputing, vol. 616, Article 128876. Published 2024-11-17. Impact factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence).
DOI: 10.1016/j.neucom.2024.128876
https://www.sciencedirect.com/science/article/pii/S0925231224016473
Citations: 0
Abstract
As training datasets and models grow, the parameter synchronization stage places a heavy burden on the network, and communication has become one of the main performance bottlenecks of distributed machine learning (DML). Concurrently, optical circuit switches (OCSs), which offer high bandwidth and reconfigurability, have increasingly been introduced into network topologies, yielding reconfigurable optical networks. OCSs can accelerate the parameter synchronization stage and thus improve training performance. However, because the OCS switching delay is non-negligible, a poorly designed circuit scheduling algorithm can significantly increase parameter synchronization time. Moreover, most existing circuit scheduling algorithms do not effectively exploit the training characteristics of DML, so their performance gains are limited. In this paper, we therefore study parameter synchronization scheduling in reconfigurable optical networks and propose PSscheduler, which jointly optimizes circuit scheduling and the deployment of parameter servers in the parameter server (PS) architecture. Specifically, we first establish a mathematical optimization model that accounts for the deployment of parameter servers, the allocation of parameter blocks, and circuit scheduling. We then solve the model using variable relaxation and a deterministic rounding approach. Simulation results based on real DML workloads demonstrate that, compared with Sunflow and HLF, PSscheduler is more stable and reduces parameter synchronization time (PST) by up to 46.61% and 25%, respectively.
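To make the "relax, then round deterministically" step concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the rounding rule, so the rule used here (assign each parameter block to the candidate server with the largest relaxed share, lowest index on ties) is an assumption, and the function names, the fractional values, and the block sizes are all hypothetical.

```python
from typing import List


def round_assignment(fractional: List[List[float]]) -> List[int]:
    """Deterministically round a relaxed block-to-server assignment.

    fractional[b][s] is the relaxed share (between 0 and 1) of parameter
    block b placed on candidate server s; each row is assumed to sum to 1.
    Assumed rule (not taken from the paper): pick the server with the
    largest share, breaking ties in favour of the lowest index.
    """
    assignment = []
    for row in fractional:
        best = max(range(len(row)), key=lambda s: (row[s], -s))
        assignment.append(best)
    return assignment


def server_loads(assignment: List[int], block_sizes: List[float],
                 num_servers: int) -> List[float]:
    """Sum the sizes of the blocks assigned to each server."""
    loads = [0.0] * num_servers
    for block, server in enumerate(assignment):
        loads[server] += block_sizes[block]
    return loads


# Hypothetical relaxed solution for 3 parameter blocks and 2 servers.
relaxed = [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]
sizes = [4.0, 2.0, 6.0]  # hypothetical block sizes, arbitrary units

chosen = round_assignment(relaxed)
loads = server_loads(chosen, sizes, num_servers=2)
print(chosen, loads)
```

Because the rounding involves no randomness, the same relaxed solution always yields the same integral assignment, which is the property that distinguishes deterministic rounding from randomized rounding.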
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.