Software-defined cloud–optical networks for long-haul geographically distributed machine learning
Authors: Meng Lian; Yongli Zhao; Yike Jiang; Tingting Bao; Yuan Cao; Jie Zhang
Journal: Journal of Optical Communications and Networking, vol. 17, no. 5, pp. 363-377
Published: 2025-04-08
DOI: 10.1364/JOCN.553555
URL: https://ieeexplore.ieee.org/document/10955384/
Optical networks enable long-haul geographically distributed machine learning (GDML) by connecting multiple data centers (DCs), offering a way to overcome the limitations of training large models in a single DC. However, effective coordination is hindered by limited resource sharing among cloud and network entities. In this work, we propose an architecture for a software-defined cloud–optical network (SD-CON). Domain controllers of SD-CON jointly abstract cloud and network resources, while a hyper-domain controller establishes cloud–network service function chains (CN-SFCs) to enhance cloud–network collaboration. Additionally, we introduce a task scheduling algorithm with a multi-candidate parameter server (MPS) to optimize the CN-SFCs. A 1000 km GDML experiment on the China Environment for Network Innovation demonstrates rapid allocation of cloud and network resources (~5.7 s latency) in SD-CON, improving task success rates (over 23.111%) and enhancing resource utilization compared with the baselines.
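The abstract does not spell out how the MPS scheduler chooses among candidate parameter servers, so the following is only an illustrative sketch of the general idea of picking one parameter server from several candidate DCs. Everything in it is an assumption made for illustration: the function names, the `LATENCY_MS` values, the capacity map, and the selection rule (minimize the worst-case latency to the worker DCs among candidates with spare capacity). It is not the authors' algorithm.

```python
# Hypothetical sketch: choose a parameter server (PS) from several candidate DCs.
# The rule used here (minimise worst-case PS-to-worker latency) is an assumption,
# not the scheduling algorithm from the paper.

# Assumed symmetric one-way latencies between DC pairs, in milliseconds.
LATENCY_MS = {
    frozenset(("A", "B")): 5.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("A", "C")): 10.0,
}


def link_latency(latency_ms, a, b):
    """Latency between two DCs; a DC reaches itself with zero latency."""
    if a == b:
        return 0.0
    return latency_ms[frozenset((a, b))]


def pick_parameter_server(candidates, workers, latency_ms, ps_capacity):
    """Return the feasible candidate with the smallest worst-case latency
    to the workers, or None if no candidate has a free PS slot.

    `workers` is assumed to be non-empty. Ties are broken by DC name so
    the result is deterministic.
    """
    feasible = [c for c in candidates if ps_capacity.get(c, 0) >= 1]
    if not feasible:
        return None
    return min(
        feasible,
        key=lambda c: (max(link_latency(latency_ms, c, w) for w in workers), c),
    )


if __name__ == "__main__":
    dcs = ["A", "B", "C"]
    print(pick_parameter_server(dcs, dcs, LATENCY_MS, {"A": 1, "B": 1, "C": 1}))
```

With the assumed latencies, the middle DC "B" is selected because its worst-case latency to any worker is lowest; if "B" has no free slot, the choice falls back to another feasible candidate. A real scheduler in this setting would also have to account for optical path availability and cloud resources jointly, which this sketch leaves out.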
About the journal:
The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.