Performance investigation on disaggregated artificial intelligence data centers beyond rack scale with optical switching

IF 4.0 · CAS Tier 2 (Computer Science) · Q1, Computer Science, Hardware & Architecture
Fulong Yan;Hanting Huang;Yanxian Bi;Peizhao Li;Zhiwen Xue;Chao Li;Zichen Liu;Zhixue He;Zihao Li;QiDong Cao
{"title":"基于光交换的机架级以上分解人工智能数据中心性能研究","authors":"Fulong Yan;Hanting Huang;Yanxian Bi;Peizhao Li;Zhiwen Xue;Chao Li;Zichen Liu;Zhixue He;Zihao Li;QiDong Cao","doi":"10.1364/JOCN.559613","DOIUrl":null,"url":null,"abstract":"Accompanying the ever-increasing scale of big data applications, artificial intelligence data centers are facing the issue of resource fragments, resulting in a low network resource utilization ratio. Disaggregating the network resources is an efficient solution to improve the network resource utilization ratio by allocating the required amount of resources. In this paper, we focus on the problem of CPU and GPU resource disaggregation for an artificial intelligence data center. We carry out investigations for data center disaggregation exploiting optical switching. The results show that PCIe over optical (PO) guarantees 3 µs latency with 62 m of fiber. Compared with the PCIe with Ethernet switch (PE) solution, the PO scheme saves 48.3% completion time for the backprop application. Moreover, we compare the cost and power consumption of a data center architecture that scales out as the square of the port count of an optical packet switch (OPSquare) employing a PO scheme with respect to variants of OPSquare and Leaf-Spine under different network scales and interface bandwidths. Results show that the optical network architecture with the PO scheme saves 31.6% in cost and 12% in power consumption, respectively, compared with the Leaf-Spine with Ethernet solution.","PeriodicalId":50103,"journal":{"name":"Journal of Optical Communications and Networking","volume":"17 9","pages":"D43-D52"},"PeriodicalIF":4.0000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance investigation on disaggregated artificial intelligence data centers beyond rack scale with optical switching\",\"authors\":\"Fulong Yan;Hanting Huang;Yanxian Bi;Peizhao Li;Zhiwen Xue;Chao Li;Zichen Liu;Zhixue He;Zihao Li;QiDong Cao\",\"doi\":\"10.1364/JOCN.559613\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accompanying the ever-increasing scale of big data applications, artificial intelligence data centers are facing the issue of resource fragments, resulting in a low network resource utilization ratio. Disaggregating the network resources is an efficient solution to improve the network resource utilization ratio by allocating the required amount of resources. In this paper, we focus on the problem of CPU and GPU resource disaggregation for an artificial intelligence data center. We carry out investigations for data center disaggregation exploiting optical switching. The results show that PCIe over optical (PO) guarantees 3 µs latency with 62 m of fiber. Compared with the PCIe with Ethernet switch (PE) solution, the PO scheme saves 48.3% completion time for the backprop application. Moreover, we compare the cost and power consumption of a data center architecture that scales out as the square of the port count of an optical packet switch (OPSquare) employing a PO scheme with respect to variants of OPSquare and Leaf-Spine under different network scales and interface bandwidths. 
Results show that the optical network architecture with the PO scheme saves 31.6% in cost and 12% in power consumption, respectively, compared with the Leaf-Spine with Ethernet solution.\",\"PeriodicalId\":50103,\"journal\":{\"name\":\"Journal of Optical Communications and Networking\",\"volume\":\"17 9\",\"pages\":\"D43-D52\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2025-07-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Optical Communications and Networking\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11079779/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Optical Communications and Networking","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11079779/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Accompanying the ever-increasing scale of big data applications, artificial intelligence data centers face the issue of resource fragmentation, which results in a low network resource utilization ratio. Disaggregating the network resources is an efficient way to improve the network resource utilization ratio by allocating only the required amount of resources. In this paper, we focus on the problem of CPU and GPU resource disaggregation for an artificial intelligence data center. We investigate data center disaggregation exploiting optical switching. The results show that PCIe over optical (PO) guarantees 3 µs latency over 62 m of fiber. Compared with the PCIe with Ethernet switch (PE) solution, the PO scheme saves 48.3% of the completion time for the backprop application. Moreover, we compare the cost and power consumption of a data center architecture that scales out as the square of the port count of an optical packet switch (OPSquare) employing the PO scheme with variants of OPSquare and Leaf-Spine under different network scales and interface bandwidths. The results show that the optical network architecture with the PO scheme saves 31.6% in cost and 12% in power consumption compared with the Leaf-Spine with Ethernet solution.
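The abstract quotes two concrete figures: a 3 µs PCIe-over-optical (PO) latency over 62 m of fiber, and an architecture whose size grows with the square of the optical packet switch (OPS) port count. The short Python sketch below is only a back-of-the-envelope check of those figures under assumptions not taken from the paper (a fiber group index of about 1.468, and the reading that the non-propagation part of the 3 µs budget is serialization and protocol overhead); it is not the authors' model.

```python
# Rough sanity check of two figures quoted in the abstract.
# Assumptions (not from the paper): group index of standard single-mode
# fiber ~1.468; the remainder of the 3 us PO budget is serialization and
# protocol overhead rather than fiber propagation.

C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum
FIBER_GROUP_INDEX = 1.468        # assumed SMF-28-class group index

def fiber_propagation_delay_us(length_m: float) -> float:
    """One-way propagation delay over length_m of fiber, in microseconds."""
    return length_m / (C_VACUUM_M_PER_S / FIBER_GROUP_INDEX) * 1e6

def square_scaled_tor_count(ops_ports: int) -> int:
    """Rack/ToR count for an interconnect that scales as the square of the
    OPS port count, as stated for OPSquare in the abstract."""
    return ops_ports ** 2

if __name__ == "__main__":
    print(f"62 m of fiber: ~{fiber_propagation_delay_us(62):.2f} us one-way propagation")
    for p in (16, 32, 64):
        print(f"{p}-port OPS -> up to {square_scaled_tor_count(p)} ToRs")
```

Run as-is, the sketch shows that pure fiber propagation accounts for roughly 0.3 µs of the 3 µs PO latency, and that under the stated square-law scaling a 32-port OPS can reach on the order of 1024 racks; both numbers are illustrative only.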
Source journal
CiteScore: 9.40
Self-citation rate: 16.00%
Articles published: 104
Review time: 4 months
Journal description: The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.