Towards Communication Profile, Topology and Node Failure Aware Process Placement

2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). Pub Date: 2020-09-01. DOI: 10.1109/SBAC-PAD49847.2020.00041

Ioannis Vardas, Manolis Ploumidis, M. Marazakis


Citations: 1

Abstract

HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often within tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, which result in wasted resources and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To improve performance and energy efficiency, several prior studies have targeted communication locality: they derive a mapping of MPI processes to system nodes that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs that considers both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, the system topology and the node failure probability when assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths completes job batches in less time than the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.
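To make the idea of combining the three inputs concrete, the following is a minimal sketch of how a placement could be scored using a communication profile, a topology and per-node failure probabilities. It is not the CPTF algorithm from the paper: the abstract does not give its formulation, so the cost function, the multiplicative failure penalty, the `penalty` weight, the toy four-node topology and all names below are assumptions made purely for illustration. The sketch also ignores failure-prone communication paths, which the paper's best variant takes into account.

```python
from itertools import permutations


def placement_cost(mapping, comm, hops, fail_prob, penalty=1.0):
    """Score a process-to-node mapping; lower is better (illustrative only).

    mapping[p] is the node hosting process p.
    comm maps (src, dst) process pairs to exchanged volume.
    hops[a][b] is the hop distance between nodes a and b.
    fail_prob[n] is the assumed failure probability of node n.
    """
    # Communication term: traffic volume weighted by hop distance.
    comm_cost = sum(
        volume * hops[mapping[src]][mapping[dst]]
        for (src, dst), volume in comm.items()
    )
    # Failure term: probability that at least one allocated node fails,
    # assuming independent node failures (an assumption of this sketch).
    survive = 1.0
    for node in set(mapping):
        survive *= 1.0 - fail_prob[node]
    job_fail = 1.0 - survive
    # Hypothetical combination: inflate the communication cost by the
    # failure risk, scaled by a tunable weight.
    return comm_cost * (1.0 + penalty * job_fail)


# Toy system: nodes 0 and 1 share a switch, nodes 2 and 3 share another.
hops = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
fail_prob = [0.30, 0.01, 0.01, 0.01]  # node 0 is failure-prone
comm = {(0, 1): 100, (1, 2): 10}      # processes 0 and 1 talk the most

# Exhaustive search over one-process-per-node placements (tiny inputs only).
candidates = list(permutations(range(len(hops)), 3))
best = min(candidates, key=lambda m: placement_cost(m, comm, hops, fail_prob))
print("chosen mapping:", best)
```

With these toy inputs, the search keeps the two heavily communicating processes on the same switch while steering the job away from the failure-prone node 0. Setting `penalty=0.0` reduces the score to a purely locality-driven cost, which corresponds to the kind of failure-unaware mapping the abstract contrasts CPTF with.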
