MATCH: An MPI Fault Tolerance Benchmark Suite

2020 IEEE International Symposium on Workload Characterization (IISWC). Pub Date: 2020-10-01. DOI: 10.1109/IISWC50251.2020.00015

Luanzheng Guo, G. Georgakoudis, K. Parasyris, I. Laguna, Dong Li


Citations: 4

Abstract

MPI is ubiquitously deployed on flagship HPC systems to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. A collection of effective MPI fault tolerance techniques has therefore been proposed to let MPI applications resume efficiently from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, and thus to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, study, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation yields three findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.

