Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes

2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS). Pub Date: 2019-11-01. DOI: 10.1109/FTXS49593.2019.00009

C. Pachajoa, Christina Pacher, W. Gansterer

Citations: 7

Abstract

As HPC systems grow in scale to meet increased computational demands, the incidence of faults in a given window of time is expected to grow. The scientific community addresses this issue with research on solutions in every computational layer. In this paper, we explore strategies for fault tolerance at the algorithmic level. We propose a node-failure-tolerant preconditioned conjugate gradient method that efficiently recovers from node failures without the use of extra spare nodes, i.e., without any overhead in terms of available hardware. For purposes of load balancing, we redistribute the surviving and reconstructed solver data. The objective is to reconstruct the system either as it was before the node failure or as an equivalent, permuted version, and then continue the execution of the solver only on the surviving nodes. In our experimental evaluations, the recovery stage of the solver typically takes around 10% or less of the solver runtime, including the time to retrieve the problem-defining static data from the hard disk. When using a suitable preconditioner, the method incurs an average solver runtime overhead of 3.5% over that of a resilient solver that uses a replacement node. We investigate the influence of the preconditioner on a trade-off between load balancing and communication cost in the recovery phase. The obtained solutions are correct, and our method is thus a feasible way to recover from a node failure and continue the execution of the solver only on the surviving nodes.
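As a rough illustration of two ingredients named in the abstract, the sketch below shows a textbook preconditioned conjugate gradient (PCG) iteration and a balanced redistribution of matrix rows when the number of nodes shrinks by one. It is not the authors' implementation: the solver is serial, uses a Jacobi (diagonal) preconditioner chosen here only for simplicity, and does not reconstruct any lost data. The names `block_ranges` and `pcg` are hypothetical and introduced purely for this example.

```python
def block_ranges(n_rows, n_nodes):
    """Split rows 0..n_rows-1 into n_nodes contiguous, near-equal blocks.

    Hypothetical helper: models a balanced block-row distribution only.
    """
    base, extra = divmod(n_rows, n_nodes)
    ranges, start = [], 0
    for rank in range(n_nodes):
        # The first `extra` ranks take one additional row each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Textbook PCG with a Jacobi preconditioner (assumption), dense and serial.

    A must be symmetric positive definite, given as a list of rows.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)                               # r = b - A x with x = 0
    z = [r[i] / A[i][i] for i in range(n)]    # z = M^{-1} r, M = diag(A)
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:           # stop on small residual norm
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Row ownership before a failure (4 nodes) and after it (3 surviving nodes).
before = block_ranges(10, 4)
after = block_ranges(10, 3)

# Small symmetric positive definite system solved with the serial sketch.
solution = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In a distributed setting, each surviving node would own one of the ranges in `after` and hold the matching slices of the iterate, residual, and search direction. How the lost slices are reconstructed and which permutation of the system is used are specific to the paper and are not modeled here.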
