Seamless data delivery for distributed AI workloads via dynamic queue scheduling and FPGA-based implementation in converged optical-electrical networks
Impact factor 4.3 · Zone 2 (Computer Science) · JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Shi Feng;Jiawei Zhang;Jun Dai;Yashe Liu;Zhenhua Liu;Yuefeng Ji
Journal of Optical Communications and Networking, vol. 17, no. 9, pp. 820-833. Published 2025-08-25. Publication type: Journal Article.
DOI: 10.1364/JOCN.563049
URL: https://ieeexplore.ieee.org/document/11137415/
Citation count: 0
Abstract
Deploying universal AI training jobs in high-performance computing centers poses significant challenges for network architectures, and networks designed specifically for artificial intelligence (AI) have become a prevalent trend in datacenter networking. Traditional electrical packet-switching (EPS) networks face capacity limits due to the slowdown of Moore’s law and struggle to accommodate the specific traffic patterns of distributed deep learning (DDL). In contrast, optical circuit-switching (OCS) technology provides high-bandwidth, dedicated optical paths, and the converged optical/electrical datacenter network (COE-DCN) has emerged as a promising solution for AI datacenter networks (AI-DCNs). However, optical circuit reconfiguration and multiplexing often introduce delays that disrupt traffic flow. Prior COE-DCN designs schedule the network but neglect the disruption that occurs while an optical path is being configured. This paper addresses these challenges by analyzing traditional optical path provisioning methods and traffic-overlapping scenarios in COE-DCNs. We propose a bisection-assisted control mechanism that coordinates the optical and electrical networks to ensure continuous data transmission. Our approach integrates queue-aware scheduling to dynamically allocate resources across EPS and OCS, minimizing transition latency and optimizing traffic flow. To validate the proposed scheme, we implement a field-programmable gate array (FPGA)-based hardware platform, which achieves sub-microsecond packet-level transition latency and demonstrates efficient queue management. Experimental results confirm significant improvements in job acceleration for overlapping DDL traffic scenarios, highlighting the effectiveness of our FPGA-based, queue-optimized design.
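The general idea of keeping data moving while an optical circuit is still being set up can be illustrated with a toy model. The sketch below is not the paper's algorithm or its FPGA design: the class `HybridPort`, the function `drain`, and every parameter (rates, reconfiguration time, queue threshold) are hypothetical values chosen only for illustration. It assumes a single sender queue, a slow always-available EPS path, and a fast OCS path that becomes usable a fixed number of ticks after it is requested.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HybridPort:
    """Toy sender port with an EPS path and an OCS path (all values assumed)."""
    eps_rate: int = 1         # packets per tick over EPS (hypothetical)
    ocs_rate: int = 4         # packets per tick over OCS (hypothetical)
    reconfig_ticks: int = 3   # ticks until a requested circuit is ready (hypothetical)
    queue_threshold: int = 4  # backlog that triggers a circuit request (hypothetical)
    queue: deque = field(default_factory=deque)
    ocs_ready_in: int = -1    # -1: not requested, >0: reconfiguring, 0: circuit up

    def enqueue(self, packets):
        self.queue.extend(packets)

    def tick(self):
        """Advance one time step and return (path_used, packets_sent)."""
        # Queue-aware trigger: request a circuit once the backlog is large enough.
        if self.ocs_ready_in == -1 and len(self.queue) >= self.queue_threshold:
            self.ocs_ready_in = self.reconfig_ticks
        # Use OCS only when the circuit is up; otherwise keep sending over EPS
        # so the flow is never stalled by the reconfiguration.
        if self.ocs_ready_in == 0:
            path, rate = "OCS", self.ocs_rate
        else:
            path, rate = "EPS", self.eps_rate
        sent = [self.queue.popleft() for _ in range(min(rate, len(self.queue)))]
        # Count down the remaining reconfiguration time.
        if self.ocs_ready_in > 0:
            self.ocs_ready_in -= 1
        return path, sent


def drain(port):
    """Run ticks until the queue is empty; return the paths used and delivery order."""
    paths, delivered = [], []
    while port.queue:
        path, sent = port.tick()
        paths.append(path)
        delivered.extend(sent)
    return paths, delivered


port = HybridPort()
port.enqueue(range(12))
paths, delivered = drain(port)
print(paths)      # ['EPS', 'EPS', 'EPS', 'OCS', 'OCS', 'OCS']
print(delivered)  # packets 0..11, in order
```

In this toy run the three ticks spent reconfiguring the circuit still carry one packet each over EPS, and the remaining backlog moves to OCS as soon as the circuit is up, with packet order preserved. A real system would also have to handle packet-level handover timing and multiple competing flows, which this sketch ignores.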
About the journal:
The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.