{"title":"Improving HPC System Throughput and Response Time using Memory Disaggregation","authors":"F. V. Zacarias, P. Carpenter, V. Petrucci","doi":"10.1109/ICPADS53394.2021.00041","DOIUrl":null,"url":null,"abstract":"HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned to accommodate normal and large capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memories and it introduces a new disaggregated-aware job allocation policy for the Slurm resource manager. Our results show that using disaggregated memories, depending on the imbalance between the system and the submitted jobs, a similar throughput and job response time can be achieved on a system with up to 33% less total memory provisioning.","PeriodicalId":309508,"journal":{"name":"2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS53394.2021.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
HPC clusters are cost-effective, well understood, and scalable, but the rigid boundaries between compute nodes may lead to poor utilization of compute and memory resources. HPC jobs may vary, by orders of magnitude, in memory consumption per core. Thus, even when the system is provisioned with both normal and large-capacity nodes, a mismatch between the system and the memory demands of the scheduled jobs can lead to inefficient usage of both memory and compute resources. Disaggregated memory has recently been proposed as a way to mitigate this problem by flexibly allocating memory capacity across cluster nodes. This paper presents a simulation approach for at-scale evaluation of job schedulers with disaggregated memory, and it introduces a new disaggregation-aware job allocation policy for the Slurm resource manager. Our results show that, depending on the imbalance between the system and the submitted jobs, disaggregated memory can achieve similar throughput and job response time on a system with up to 33% less total memory provisioning.
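The abstract does not spell out the allocation policy itself, so the sketch below is only a rough illustration of what a disaggregation-aware allocator could look like: prefer node-local memory, and cover any shortfall from a shared disaggregated pool. The `Node`/`Cluster` abstraction, the `allocate` function, and the two-pass heuristic are assumptions made here for illustration; they are not the paper's actual Slurm policy or simulator.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    free_cores: int
    free_local_mem_gb: float

@dataclass
class Cluster:
    nodes: List[Node]
    free_pool_mem_gb: float  # shared disaggregated memory pool (hypothetical)

def allocate(cluster: Cluster, cores: int, mem_gb: float) -> Optional[int]:
    """Return the index of a node that can host the job, or None.

    Preference order: (1) a node whose local memory alone covers the
    request, avoiding remote-access overhead; (2) a node with enough
    cores whose local shortfall can be drawn from the disaggregated pool.
    """
    # Pass 1: satisfy the job from node-local memory only.
    for i, n in enumerate(cluster.nodes):
        if n.free_cores >= cores and n.free_local_mem_gb >= mem_gb:
            n.free_cores -= cores
            n.free_local_mem_gb -= mem_gb
            return i
    # Pass 2: top up local memory with disaggregated capacity.
    for i, n in enumerate(cluster.nodes):
        shortfall = mem_gb - n.free_local_mem_gb  # > 0 for nodes skipped above
        if n.free_cores >= cores and shortfall <= cluster.free_pool_mem_gb:
            n.free_cores -= cores
            cluster.free_pool_mem_gb -= shortfall
            n.free_local_mem_gb = 0.0  # local memory fully consumed
            return i
    return None  # no placement; the job waits in the queue

# Example: a job needing 8 cores and 96 GB fits on a 64 GB node only
# because 32 GB of the request is served from the disaggregated pool.
cluster = Cluster(nodes=[Node(16, 64.0)], free_pool_mem_gb=128.0)
print(allocate(cluster, cores=8, mem_gb=96.0))  # -> 0
print(cluster.free_pool_mem_gb)                 # -> 96.0
```

Under this kind of policy, jobs whose per-core memory demand exceeds a node's local provisioning no longer force the scheduler to hold back an entire large-memory node, which is one plausible mechanism behind the abstract's claim of similar throughput with up to 33% less total memory.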