Performance investigation on disaggregated artificial intelligence data centers beyond rack scale with optical switching

IF 4.3 | Zone 2, Computer Science | JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Optical Communications and Networking, vol. 17, no. 9, pp. D43-D52 | Pub Date: 2025-07-14 | DOI: 10.1364/JOCN.559613
Full text: https://ieeexplore.ieee.org/document/11079779/

Fulong Yan;Hanting Huang;Yanxian Bi;Peizhao Li;Zhiwen Xue;Chao Li;Zichen Liu;Zhixue He;Zihao Li;QiDong Cao


Citations: 0

Abstract



As big data applications continue to grow in scale, artificial intelligence data centers face resource fragmentation, which results in a low network resource utilization ratio. Disaggregating the network resources is an efficient way to improve this ratio, because each job is allocated only the amount of resources it requires. In this paper, we focus on CPU and GPU resource disaggregation for an artificial intelligence data center and investigate data center disaggregation based on optical switching. The results show that PCIe over optical (PO) guarantees 3 µs latency with 62 m of fiber. Compared with the PCIe with Ethernet switch (PE) solution, the PO scheme reduces the completion time of the backprop application by 48.3%. Moreover, we compare the cost and power consumption of OPSquare, a data center architecture that scales out as the square of the port count of an optical packet switch, employing the PO scheme against variants of OPSquare and Leaf-Spine under different network scales and interface bandwidths. The results show that the optical network architecture with the PO scheme saves 31.6% in cost and 12% in power consumption compared with the Leaf-Spine with Ethernet solution.
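To put the headline figures in context, the following minimal Python sketch works through three pieces of arithmetic implied by the abstract: how much of the 3 µs latency a single pass through 62 m of fiber would account for, what speedup a 48.3% completion-time saving corresponds to, and how a square-law scale-out grows with switch port count. It is not the authors' model. The group index of 1.468 is an assumed typical value for standard single-mode fiber, the function names are hypothetical, and the abstract does not state whether the 3 µs figure is one-way or round-trip, nor the constant factor in the square-law scaling (taken as 1 here).

```python
# Illustrative arithmetic only; values marked "assumed" do not come from the paper.

C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum (exact SI value)
GROUP_INDEX = 1.468               # assumed: typical for standard single-mode fiber


def fiber_delay_us(length_m: float, group_index: float = GROUP_INDEX) -> float:
    """One-way propagation delay through a fiber of the given length, in microseconds."""
    return length_m * group_index / C_VACUUM_M_PER_S * 1e6


def speedup_from_saving(saving_fraction: float) -> float:
    """Convert a fractional completion-time saving into a speedup factor."""
    if not 0.0 <= saving_fraction < 1.0:
        raise ValueError("saving_fraction must be in [0, 1)")
    return 1.0 / (1.0 - saving_fraction)


def square_law_scale(port_count: int) -> int:
    """Units reachable if scale grows as port_count squared (constant factor 1 assumed)."""
    return port_count ** 2


if __name__ == "__main__":
    delay = fiber_delay_us(62.0)
    print(f"one-way delay over 62 m of fiber: {delay:.3f} us")
    print(f"share of a 3 us latency figure:   {delay / 3.0:.1%}")
    print(f"speedup for a 48.3% time saving:  {speedup_from_saving(0.483):.2f}x")
    for ports in (16, 32, 64):
        print(f"{ports}-port switch, square-law scale: {square_law_scale(ports)}")
```

Under these assumptions, one pass through the fiber contributes roughly 0.30 µs, about a tenth of the 3 µs figure, and a 48.3% reduction in completion time corresponds to a speedup of about 1.93x.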


Source journal

Journal of Optical Communications and Networking (Engineering & Technology: Telecommunications)

CiteScore: 9.40
Self-citation rate: 16.00%
Articles published: 104
Review time: 4 months

About the journal: The scope of the Journal includes advances in the state-of-the-art of optical networking science, technology, and engineering. Both theoretical contributions (including new techniques, concepts, analyses, and economic studies) and practical contributions (including optical networking experiments, prototypes, and new applications) are encouraged. Subareas of interest include the architecture and design of optical networks, optical network survivability and security, software-defined optical networking, elastic optical networks, data and control plane advances, network management related innovation, and optical access networks. Enabling technologies and their applications are suitable topics only if the results are shown to directly impact optical networking beyond simple point-to-point networks.