DINO: Divergent Node Cloning for Sustained Redundancy in HPC

2015 IEEE International Conference on Cluster Computing. Pub Date: 2015-09-08. DOI: 10.1109/CLUSTER.2015.36

Arash Rezaei, F. Mueller, Paul H. Hargrove, Eric Roman


Citations: 2

Abstract

Soft faults such as silent data corruption (SDC) and hard faults such as hardware failures may cause a high performance computing (HPC) job of thousands of processes to nearly cease making progress due to recovery overheads. Redundant computing, which allocates two or more processes to perform the same task, has been proposed as a solution at extreme scale. However, current redundant computing approaches do not repair failed replicas. Thus, SDC-free execution is not guaranteed after a replica failure, and the job may finish with incorrect results. Replicas are logically equivalent, yet may have divergent runtime states during job execution, which complicates on-the-fly repairs for forward recovery. In this work, we present a redundant execution environment that quickly repairs hard failures via Divergent Node cloning (DINO) at the MPI task level. DINO contributes a novel task cloning service, integrated into the MPI runtime system, that solves the problem of consolidating divergent states among replicas on-the-fly. Experimental results indicate that DINO can recover from failures nearly instantaneously, thus retaining the redundancy level throughout job execution. The cloning overhead, depending on the process image size and its transfer rate, ranges from 5.60 to 90.48 seconds. To the best of our knowledge, this is the first design and implementation for repairing failed replicas in redundant MPI computing.
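The argument in the abstract can be illustrated with a toy model. The sketch below is not DINO's implementation and uses no MPI; `Replica`, `run_redundant`, `repair`, and `estimate_clone_seconds` are hypothetical names introduced only for illustration. It shows why an unrepaired replica failure removes the ability to cross-check results, and gives a first-order estimate of cloning time as image size divided by transfer rate, which is an assumed simplification of the dependence the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Replica:
    """Hypothetical stand-in for one redundant process."""
    name: str
    alive: bool = True


def run_redundant(task: Callable[[int], int], x: int,
                  replicas: List[Replica]) -> Tuple[int, bool]:
    """Run `task` on every live replica and compare the outputs.

    Returns (result, checkable). `checkable` is True only when at least
    two live replicas produced a result that could be compared.
    """
    results = [task(x) for r in replicas if r.alive]
    if not results:
        raise RuntimeError("all replicas failed")
    if any(r != results[0] for r in results[1:]):
        raise ValueError("replica outputs differ: possible SDC")
    return results[0], len(results) >= 2


def repair(replicas: List[Replica]) -> int:
    """Toy repair: revive failed replicas if a healthy one exists.

    A real system would copy the healthy process image to a spare node;
    here we only flip a flag. Returns the number of replicas repaired.
    """
    if not any(r.alive for r in replicas):
        raise RuntimeError("no healthy replica to clone from")
    repaired = 0
    for r in replicas:
        if not r.alive:
            r.alive = True
            repaired += 1
    return repaired


def estimate_clone_seconds(image_bytes: float, bytes_per_second: float) -> float:
    """Assumed first-order model: transfer time = image size / rate."""
    if bytes_per_second <= 0:
        raise ValueError("transfer rate must be positive")
    return image_bytes / bytes_per_second
```

With two live replicas the result can be cross-checked; after one fails, the job still produces a result but a corrupted value would go unnoticed until `repair` restores the redundancy level. The input values passed to `estimate_clone_seconds` are the caller's own assumptions; the sketch does not reproduce the paper's measurements.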
