高性能计算中可重构蜻蜓网络的组级资源分配策略

Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu
{"title":"高性能计算中可重构蜻蜓网络的组级资源分配策略","authors":"Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu","doi":"10.1145/3577193.3593732","DOIUrl":null,"url":null,"abstract":"Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has been adopted in new exascale High Performance Computing (HPC) systems. However, Dragonfly topology suffers from the limited direct links between groups. The reconfigurable network can solve this problem by reconfiguring topology to adjust the number of direct links between groups. While the performance improvement of a single job on reconfigurable HPC network has been evaluated in previous works, the performance of HPC workloads has not been studied because of the lack of an appropriate resource allocation policy. In this work, we propose Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links for jobs in Reconfigurable Dragonfly Network (RDN). We start with formulating three design principles: reconfigurable network should be reconfiguration interference-free, guarantee connectivity and performance for each job, and satisfy varied resource requests. According to the principles, GRAP uses different strategies for small and large jobs, and contains three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC\",\"authors\":\"Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu\",\"doi\":\"10.1145/3577193.3593732\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has been adopted in new exascale High Performance Computing (HPC) systems. However, Dragonfly topology suffers from the limited direct links between groups. The reconfigurable network can solve this problem by reconfiguring topology to adjust the number of direct links between groups. While the performance improvement of a single job on reconfigurable HPC network has been evaluated in previous works, the performance of HPC workloads has not been studied because of the lack of an appropriate resource allocation policy. In this work, we propose Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links for jobs in Reconfigurable Dragonfly Network (RDN). We start with formulating three design principles: reconfigurable network should be reconfiguration interference-free, guarantee connectivity and performance for each job, and satisfy varied resource requests. According to the principles, GRAP uses different strategies for small and large jobs, and contains three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.\",\"PeriodicalId\":424155,\"journal\":{\"name\":\"Proceedings of the 37th International Conference on Supercomputing\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577193.3593732\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593732","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

蜻蜓是一种高度可扩展、低直径和低成本的网络拓扑结构,已被用于新的百亿亿级高性能计算(HPC)系统。然而,蜻蜓拓扑结构的缺点是群体之间的直接连接有限。可重构网络可以通过重新配置拓扑来调整组间直连链路的数量,从而解决这一问题。虽然以前的研究已经对可重构高性能计算网络中单个作业的性能改进进行了评估,但由于缺乏适当的资源分配策略,尚未对高性能计算工作负载的性能进行研究。在这项工作中,我们提出了组级资源分配策略(GRAP)来为可重构蜻蜓网络(RDN)中的作业分配计算节点和可重构链路。我们首先制定了三个设计原则:可重构网络应该是无重构干扰的,保证每个作业的连接性和性能,并满足各种资源请求。根据原理,GRAP对小作业和大作业采用不同的分配策略,对大作业包含三种分配模式:Balance Mode、Custom Mode和Adaptive Mode。最后,我们使用CODES网络仿真框架和使用真实工作负载跟踪的Slurm模拟器来评估GRAP。结果表明,RDN与GRAP相结合可以实现更低的延迟、更高的带宽和更短的作业等待时间。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC
Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has been adopted in new exascale High Performance Computing (HPC) systems. However, Dragonfly topology suffers from the limited direct links between groups. The reconfigurable network can solve this problem by reconfiguring topology to adjust the number of direct links between groups. While the performance improvement of a single job on reconfigurable HPC network has been evaluated in previous works, the performance of HPC workloads has not been studied because of the lack of an appropriate resource allocation policy. In this work, we propose Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links for jobs in Reconfigurable Dragonfly Network (RDN). We start with formulating three design principles: reconfigurable network should be reconfiguration interference-free, guarantee connectivity and performance for each job, and satisfy varied resource requests. According to the principles, GRAP uses different strategies for small and large jobs, and contains three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信