High Performance Dense Linear System Solver with Soft Error Resilience

Peng Du, P. Luszczek, J. Dongarra
{"title":"High Performance Dense Linear System Solver with Soft Error Resilience","authors":"Peng Du, P. Luszczek, J. Dongarra","doi":"10.1109/CLUSTER.2011.38","DOIUrl":null,"url":null,"abstract":"As the scale of modern high end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors are still under-studied especially for modern HPC systems, and in some scientific applications C/R is not applicable for soft error at all due to error propagation and lack of error awareness. In this work, we propose an algorithm based fault tolerance (ABFT) for high performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats soft error during LU factorization as rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contribution includes extending error model from Gaussian elimination and pair wise pivoting to LU with partial pivoting, and we provide a practical numerical bound for error detection and a scalable check pointing algorithm to protect the left factor that is needed for recovering x from soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solving and scales well on such systems.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2011.38","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36

Abstract

As the scale of modern high end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors are still under-studied especially for modern HPC systems, and in some scientific applications C/R is not applicable for soft error at all due to error propagation and lack of error awareness. In this work, we propose an algorithm based fault tolerance (ABFT) for high performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats soft error during LU factorization as rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contribution includes extending error model from Gaussian elimination and pair wise pivoting to LU with partial pivoting, and we provide a practical numerical bound for error detection and a scalable check pointing algorithm to protect the left factor that is needed for recovering x from soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solving and scales well on such systems.
具有软误差弹性的高性能密集线性系统求解器
随着现代高端计算系统的规模持续快速增长,系统故障已经成为一个问题,需要比常用的检查点和重启(C/R)方案更好的解决方案。近年来,人们对硬误差进行了广泛的研究,而对软误差的研究仍然不足,特别是对于现代HPC系统,在一些科学应用中,由于错误传播和缺乏错误意识,C/R根本不适用于软误差。在这项工作中,我们提出了一种基于容错(ABFT)的算法,用于具有软错误弹性的高性能密集线性系统求解器。通过采用将LU分解过程中的软误差视为一级扰动的数学模型,可以用Sherman-Morrison公式恢复Ax=b的解。我们的贡献包括将误差模型从高斯消除和对旋转扩展到具有部分旋转的LU,我们提供了一个实用的错误检测数值边界和一个可扩展的校验点算法来保护从软误差中恢复x所需的左因子。利用ScaLAPACK在集群系统上的实验结果表明,该容错功能对线性系统求解增加的开销很小,并且在此类系统上具有良好的可扩展性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信