An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). Pub Date: 2013-06-24. DOI: 10.1109/DSN.2013.6575309

Joseph Sloan, Rakesh Kumar, G. Bronevetsky


Citations: 48

Abstract

The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.
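The abstract does not spell out how errors are localized, so the following is only an illustrative sketch of the general idea, not the authors' technique. It assumes a classic checksum identity for a dense product y = A x (the sum of a block of outputs must equal the block's column sums dotted with x) and uses recursive bisection to narrow down faulty rows, then recomputes only those rows. All function names and the tolerance `tol` are hypothetical.

```python
def dot_row(A, x, i):
    """Recompute a single output entry y[i] = A[i] . x."""
    return sum(a * b for a, b in zip(A[i], x))


def block_is_consistent(A, x, y, lo, hi, tol):
    """Checksum test on rows lo..hi-1: sum(y[lo:hi]) == (column sums of block) . x.

    Assumption: a real implementation would precompute block checksums of A
    once; they are recomputed here only to keep the sketch short.
    """
    col_sums = [sum(A[i][j] for i in range(lo, hi)) for j in range(len(x))]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    actual = sum(y[lo:hi])
    return abs(expected - actual) <= tol * max(1.0, abs(expected))


def localize(A, x, y, lo, hi, tol):
    """Bisect the row range until inconsistent single rows are isolated."""
    if block_is_consistent(A, x, y, lo, hi, tol):
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return localize(A, x, y, lo, mid, tol) + localize(A, x, y, mid, hi, tol)


def correct(A, x, y, tol=1e-9):
    """Localize corrupted entries of y and recompute only those rows in place."""
    bad_rows = localize(A, x, y, 0, len(A), tol)
    for i in bad_rows:
        y[i] = dot_row(A, x, i)  # partial recomputation
    return bad_rows
```

With a single corrupted entry, `correct` touches one row instead of repeating the whole product, which is the cost saving the abstract argues for. A sum-based check like this one can miss errors that cancel within a block, so it should be read as a simplification rather than a complete detector.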

