Building algorithmically nonstop fault tolerant MPI programs

2011 18th International Conference on High Performance Computing. Pub Date: 2011-12-18. DOI: 10.1109/HiPC.2011.6152716

Rui Wang, Erlin Yao, Mingyu Chen, Guangming Tan, P. Balaji, Darius Buntinas

Citations: 19

Abstract

With the growing scale of high-performance computing (HPC) systems, today and even more so tomorrow, faults are the norm rather than the exception. HPC applications typically tolerate fail-stop failures under a stop-and-wait scheme: even if only one processor fails, the whole system has to stop and wait for the corrupted data to be recovered. It is now more or less accepted that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the earlier stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of an application, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at very low cost. To implement the scheme, we investigated some new fault-tolerant features of the Message Passing Interface (MPI) and used them in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at small scale.
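As a rough way to picture the idea of continuing past a failure and repairing the result at the end, the following toy Python sketch may help. It is not the paper's HPL or MPICH implementation: it assumes each "process" holds one block of integers, that a redundant process holds the element-wise sum of all blocks, and that every update is linear so the checksum relation survives each step. All function names (`make_checksum`, `linear_update`, `recover`, `run`) are hypothetical and introduced only for illustration.

```python
# Toy illustration of checksum-based (ABFT-style) recovery.
# Assumptions (not from the paper): one block of ints per process, a single
# redundant process holding the element-wise sum, and purely linear updates.

def make_checksum(blocks):
    """Element-wise sum of all data blocks, held by the redundant process."""
    return [sum(col) for col in zip(*blocks)]

def linear_update(block, scale):
    """A linear update applied identically everywhere, checksum included."""
    return [scale * v for v in block]

def recover(checksum, surviving_blocks):
    """Rebuild the lost block as checksum minus the sum of the survivors."""
    if not surviving_blocks:
        return list(checksum)
    partial = [sum(col) for col in zip(*surviving_blocks)]
    return [c - p for c, p in zip(checksum, partial)]

def run(blocks, scales, fail_rank=None, fail_step=None):
    """Apply all updates; optionally lose one block mid-run and repair it last."""
    data = {rank: list(b) for rank, b in enumerate(blocks)}
    checksum = make_checksum(blocks)
    for step, scale in enumerate(scales):
        if step == fail_step and fail_rank in data:
            del data[fail_rank]  # fail-stop: this block is gone for good
        # Survivors and the redundant process keep computing without waiting.
        data = {r: linear_update(b, scale) for r, b in data.items()}
        checksum = linear_update(checksum, scale)
    if fail_rank is not None and fail_rank not in data:
        # Recovery is deferred to the end of the run.
        data[fail_rank] = recover(checksum, list(data.values()))
    return [data[r] for r in range(len(blocks))]

if __name__ == "__main__":
    blocks = [[1, 2], [3, 4], [5, 6]]
    print(run(blocks, scales=[2, 3]))
    print(run(blocks, scales=[2, 3], fail_rank=1, fail_step=1))
```

Both calls in the usage example should print the same blocks, because the checksum carries enough information to rebuild the single lost block. The sketch tolerates only one failure and relies on exact integer arithmetic; a real floating-point solver would also have to account for rounding.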
