Seamless data delivery for distributed AI workloads via dynamic queue scheduling and FPGA-based implementation in converged optical-electrical networks

Impact factor 4.3; Zone 2, Computer Science; JCR Q1, Computer Science, Hardware & Architecture
Shi Feng;Jiawei Zhang;Jun Dai;Yashe Liu;Zhenhua Liu;Yuefeng Ji
Journal of Optical Communications and Networking, vol. 17, no. 9, pp. 820-833. Journal article, published 2025-08-25.
DOI: 10.1364/JOCN.563049
URL: https://ieeexplore.ieee.org/document/11137415/
Citations: 0

Abstract

The deployment of universal AI training jobs in high-performance computing centers poses significant challenges to network architectures, and networks designed for artificial intelligence (AI) have become a prevalent trend in datacenter networks. Traditional electrical packet-switching (EPS) networks face capacity limits due to the slowdown of Moore’s law and struggle to accommodate the specific traffic patterns of distributed deep learning (DDL). In contrast, optical circuit-switching (OCS) technology provides high-bandwidth, dedicated optical paths, and the converged optical/electrical datacenter network (COE-DCN) has emerged as a promising solution for AI-DCN. However, optical circuit reconfiguration and multiplexing often introduce delays that disrupt traffic flow. Prior COE-DCN designs schedule the network but neglect the negative impact on traffic while an optical path is being configured. This paper addresses these challenges by analyzing traditional optical path provisioning methods and traffic-overlapping scenarios in COE-DCNs. We propose a bisection-assisted control mechanism that coordinates the optical and electrical networks to ensure continuous data transmission. Our approach integrates queue-aware scheduling to dynamically allocate resources across EPS and OCS, minimizing transition latency and optimizing traffic flow. To validate the proposed scheme, we implement a field-programmable gate array (FPGA)-based hardware platform, which achieves sub-microsecond packet-level transition latency and demonstrates efficient queue management. Experimental results confirm significant improvements in job acceleration for overlapping DDL traffic scenarios, highlighting the effectiveness of our FPGA-based, queue-optimized design.
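The motivation for keeping traffic flowing while an optical path is being configured can be illustrated with a back-of-the-envelope fluid model. This is only a sketch under stated assumptions, not the paper's scheduler or its bisection-assisted mechanism: the function name `completion_time`, its parameters, and all link rates and delays below are hypothetical values chosen for illustration.

```python
def completion_time(backlog_bits, eps_rate_bps, ocs_rate_bps,
                    reconfig_s, use_eps_during_reconfig):
    """Seconds needed to drain a queued backlog (toy fluid model).

    Assumptions (not from the paper): one backlog is queued at time 0,
    the optical circuit becomes usable after reconfig_s seconds, and
    the handoff from EPS to OCS is instantaneous.
    """
    remaining = backlog_bits
    if use_eps_during_reconfig:
        # Drain the queue over the electrical path while the circuit is set up.
        sent_on_eps = min(backlog_bits, eps_rate_bps * reconfig_s)
        if sent_on_eps >= backlog_bits:
            # The backlog is gone before the optical circuit is ready.
            return backlog_bits / eps_rate_bps
        remaining = backlog_bits - sent_on_eps
    # Whatever is left goes over the optical circuit once it is up.
    return reconfig_s + remaining / ocs_rate_bps


if __name__ == "__main__":
    # Hypothetical numbers: 1 Gbit backlog, 10 Gb/s EPS, 100 Gb/s OCS,
    # 10 ms optical reconfiguration delay.
    args = dict(backlog_bits=1e9, eps_rate_bps=10e9,
                ocs_rate_bps=100e9, reconfig_s=0.01)
    stalled = completion_time(use_eps_during_reconfig=False, **args)
    overlapped = completion_time(use_eps_during_reconfig=True, **args)
    print(f"stall until circuit is ready: {stalled * 1e3:.1f} ms")
    print(f"send on EPS meanwhile:        {overlapped * 1e3:.1f} ms")
```

With these assumed numbers, stalling takes about 20 ms while overlapping EPS transmission with reconfiguration takes about 19 ms. The gap widens as the reconfiguration delay grows relative to the transfer time, which is the regime where uninterrupted delivery matters most.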
Source journal
CiteScore: 9.40
Self-citation rate: 16.00%
Articles published: 104
Review time: 4 months
About the journal: The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.