Algorithmic approaches to low overhead fault detection for sparse linear algebra

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), Pub Date: 2012-06-25, DOI: 10.1109/DSN.2012.6263938

Joseph Sloan, Rakesh Kumar, G. Bronevetsky


Citations: 95

Abstract

The increasing size and complexity of High-Performance Computing systems are making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low-energy mode. Previous techniques for Algorithm-Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but they have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g., diagonal, banded diagonal, block diagonal), which allows sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that pre-conditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2× reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.
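To make the idea of a checksum-based check concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the standard ABFT identity for a matrix-vector product y = Ax: for any weight vector w, the sum of w_i * y_i must equal (w^T A) x. Taking w as all ones gives a full check; taking w as the 0/1 indicator of a few sampled rows gives a cheaper, approximate check that can only see faults in those rows. The sampling scheme, the tridiagonal test matrix, and all function names (`csr_matvec`, `checksum_row`, `check`, `tridiagonal_csr`) are assumptions made for illustration; the abstract does not specify how the paper samples.

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(indptr) - 1):
        y.append(sum(data[k] * x[indices[k]]
                     for k in range(indptr[i], indptr[i + 1])))
    return y


def checksum_row(indptr, indices, data, rows):
    """Return w^T A as a sparse dict, where w is the 0/1 indicator of `rows`.

    This is computed once per matrix and reused for every multiply.
    """
    c = {}
    for i in rows:
        for k in range(indptr[i], indptr[i + 1]):
            c[indices[k]] = c.get(indices[k], 0.0) + data[k]
    return c


def check(c, rows, x, y, tol=1e-9):
    """Return True if sum of y over `rows` matches (w^T A) x within tolerance."""
    lhs = sum(y[i] for i in rows)
    rhs = sum(v * x[j] for j, v in c.items())
    return abs(lhs - rhs) <= tol * max(1.0, abs(lhs), abs(rhs))


def tridiagonal_csr(n):
    """Build the n-by-n tridiagonal matrix with stencil (-1, 2, -1) in CSR form."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


n = 8
indptr, indices, data = tridiagonal_csr(n)
x = [float(i + 1) for i in range(n)]
y = csr_matvec(indptr, indices, data, x)

all_rows = list(range(n))
c_full = checksum_row(indptr, indices, data, all_rows)

# Inject a fault into one output entry (hypothetical corrupted result).
y_bad = list(y)
y_bad[3] += 0.5

sample_hit = [1, 3, 6]   # sampled rows that include the faulty row
sample_miss = [0, 5]     # sampled rows that do not include it
c_hit = checksum_row(indptr, indices, data, sample_hit)
c_miss = checksum_row(indptr, indices, data, sample_miss)

print(check(c_full, all_rows, x, y))         # fault-free result passes
print(check(c_full, all_rows, x, y_bad))     # full check flags the fault
print(check(c_hit, sample_hit, x, y_bad))    # sampled check flags it
print(check(c_miss, sample_miss, x, y_bad))  # sampled check misses it
```

The last call shows the trade-off: a sampled check touches fewer entries per multiply, but a fault outside the sampled rows goes undetected, which is why the abstract says such approximate checks "may be acceptable" rather than equivalent to the full check.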
