A Generic Strategy for Node-Failure Resilience for Certain Iterative Linear Algebra Methods

C. Pachajoa, Robert Ernstbrunner, W. Gansterer
Published in: 2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 2020-11-01
DOI: 10.1109/FTXS51974.2020.00010 (https://doi.org/10.1109/FTXS51974.2020.00010)
Citations: 2

Abstract

Resilience is an important research topic in HPC. As computer clusters grow to extreme scales, work in this area is necessary to keep these machines reliable. In this work, we introduce a generic method to endow iterative linear algebra algorithms based on sparse matrix-vector products, such as linear system solvers, eigensolvers and similar methods, with resilience to node failures. This generic method traverses the dependency graph of the variables of the iterative algorithm. If the iterative method exhibits certain properties, it is possible to produce an exact state reconstruction (ESR) algorithm, which recovers the state of the iterative method in the event of a node failure. This reconstruction is exact, except for small perturbations caused by floating-point arithmetic. The generic method exploits redundancy in the matrix-vector product to protect the vector that is the argument of the product. We illustrate the use of this generic approach on three iterative methods: the conjugate gradient method, the BiCGStab method and the Lanczos algorithm. The resulting ESR algorithms reconstruct the state of each method after a node failure from a few redundantly stored vectors. Unlike previous work on the preconditioned conjugate gradient method, this generic method produces ESR algorithms that work with general matrices. Consequently, we can no longer assume that the local diagonal submatrices used to reconstruct vectors are nonsingular. Thus, we also propose an approach for deriving nonsingular local linear systems with reduced condition numbers for the reconstruction process, based on a communication-avoiding rank-revealing QR factorization with column pivoting.
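To make the reconstruction idea concrete, the following is a minimal single-process sketch for the plain (unpreconditioned) conjugate gradient method. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`cg_steps`, `reconstruct`, `solve`) are hypothetical, the node failure is simulated by discarding a set of vector entries, and the matrix is assumed symmetric positive definite so that the local diagonal block is nonsingular. It relies only on two standard CG identities, p = r + beta * p_prev and r = b - A x, and assumes the lost entries of the two most recent search directions are available from redundant copies. The general-matrix case, where the diagonal block may be singular and the abstract proposes a QR-based construction, is not covered here.

```python
# Sketch only: simulated exact state reconstruction for plain CG.
# All names are hypothetical; A is assumed symmetric positive definite.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_steps(A, b, n_steps):
    """Run n_steps of CG from x = 0; also return the previous direction and beta."""
    x = [0.0] * len(b)
    r = list(b)  # r = b - A x with x = 0
    p = list(r)
    p_prev, beta = None, None
    for _ in range(n_steps):
        Ap = matvec(A, p)
        rr = dot(r, r)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        p_prev, p = p, [ri + beta * pi for ri, pi in zip(r, p)]
    return x, r, p, p_prev, beta

def solve(M, rhs):
    """Dense Gaussian elimination with partial pivoting (small local systems only)."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol

def reconstruct(A, b, x, lost, p_lost, p_prev_lost, beta):
    """Rebuild the lost entries of r and x; x is only read outside `lost`."""
    keep = [i for i in range(len(b)) if i not in lost]
    # p = r + beta * p_prev, so r on the failed node follows from two copies of p.
    r_lost = [pl - beta * pp for pl, pp in zip(p_lost, p_prev_lost)]
    # r = b - A x restricted to the lost rows gives a local system for x_lost.
    rhs = [b[i] - rl - sum(A[i][j] * x[j] for j in keep)
           for i, rl in zip(lost, r_lost)]
    A_local = [[A[i][j] for j in lost] for i in lost]
    return solve(A_local, rhs), r_lost

# Usage: run two CG steps, "lose" entries 1 and 2, and rebuild them.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x, r, p, p_prev, beta = cg_steps(A, b, 2)
lost = [1, 2]
x_lost, r_lost = reconstruct(A, b, x, lost,
                             [p[i] for i in lost],
                             [p_prev[i] for i in lost], beta)
```

In this sketch the rebuilt entries agree with the originals up to floating-point rounding, which matches the sense in which the abstract calls the reconstruction exact. In a distributed setting the entries of `p` and `p_prev` would come from copies held on surviving nodes; here they are simply read from the full vectors.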