2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS): Latest Publications

From tasks graphs to asynchronous distributed checkpointing with local restart

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/FTXS51974.2020.00009

Romain Lion, Samuel Thibault

Citations: 5

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/ftxs51974.2020.00001

Citations: 0

Checkpointing OpenSHMEM Programs Using Compiler Analysis

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/FTXS51974.2020.00011

Md Abdullah Shahneous Bari, Debasmita Basu, Wenbin Lu, Tony Curtis, B. Chapman

Abstract: The importance of fault tolerance continues to increase for HPC applications. The continued growth in size and complexity of HPC systems, and of the applications themselves, is leading to an increased likelihood of failures during execution. However, most HPC programming models do not have a built-in fault tolerance mechanism. Instead, application developers usually rely on external support such as application-level checkpoint-restart (C/R) libraries to make their codes fault tolerant. However, this increases the burden on the application developer, who must use the libraries carefully to ensure correct behavior and to minimize the overheads. The C/R routines will be employed to save the values of all needed program variables at the places in the code where they are invoked. It is important for correctness that the program data is in a consistent state at these places. It is non-trivial to determine such points in OpenSHMEM, which relies upon single-sided communications to provide high performance. The amount of data to be collected, and the frequency with which this is performed, must also be carefully tuned, as the overheads introduced by C/R calls can be extremely high. There is very little prior work on checkpoint-restart support in the context of the OpenSHMEM programming interface. In this paper, we introduce OpenSHMEM and describe the challenges it poses for checkpointing. We identify the safest places for inserting C/R calls in an OpenSHMEM program and describe a straightforward approach for identifying the data that needs to be checkpointed at these positions in the code. We provide these two functionalities in a tool that exploits compiler analyses to propose checkpoints and the sets of data for saving at them, to the application developer.

Citations: 0

A Generic Strategy for Node-Failure Resilience for Certain Iterative Linear Algebra Methods

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/FTXS51974.2020.00010

C. Pachajoa, Robert Ernstbrunner, W. Gansterer

Abstract: Resilience is an important research topic in HPC. As computer clusters go to extreme scales, work in this area is necessary to keep these machines reliable. In this work, we introduce a generic method to endow iterative algorithms in linear algebra based on sparse matrix-vector products, such as linear system solvers, eigensolvers and similar, with resilience to node failures. This generic method traverses the dependency graph of the variables of the iterative algorithm. If the iterative method exhibits certain properties, it is possible to produce an exact state reconstruction (ESR) algorithm, enabling the recovery of the state of the iterative method in the event of a node failure. This reconstruction is exact, except for small perturbations caused by floating point arithmetic. The generic method exploits redundancy in the matrix-vector product to protect the vector that is the argument of the product. We illustrate the use of this generic approach on three iterative methods: the conjugate gradient method, the BiCGStab method and the Lanczos algorithm. The resulting ESR algorithms enable the reconstruction of their state after a node failure from a few redundantly stored vectors. Unlike previous work in preconditioned conjugate gradient, this generic method produces ESR algorithms that work with general matrices. Consequently, we can no longer assume that local diagonal submatrices used to reconstruct vectors are nonsingular. Thus, we also propose an approach for deriving nonsingular local linear systems for the reconstruction process with reduced condition numbers, based on a communication-avoiding rank-revealing QR factorization with column pivoting.
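
As a hedged illustration of the kind of local reconstruction step the abstract refers to (this is not the paper's derivation): suppose a linear solver maintains the residual $r = b - Ax$, the indices are split into the entries lost on the failed node ($f$) and the surviving entries ($s$), and $r_f$ has already been recovered from redundantly stored data. The symbols and the partitioning are assumptions introduced here for illustration only.

```latex
\begin{aligned}
r_f &= b_f - A_{ff}\,x_f - A_{fs}\,x_s \\
\Longrightarrow\quad A_{ff}\,x_f &= b_f - r_f - A_{fs}\,x_s
\end{aligned}
```

Under these assumptions the lost block $x_f$ can be obtained from a local solve only when the diagonal block $A_{ff}$ is nonsingular, which is the property the abstract says can no longer be taken for granted for general matrices.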

Citations: 2

Models for Resilience Design Patterns

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/FTXS51974.2020.00008

M. Kumar, C. Engelmann

Citations: 0

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-11-01 DOI: 10.1109/FTXS51974.2020.00006

H. Kolla, J. Mayo, K. Teranishi, R. Armstrong

Citations: 2

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2020-10-19 DOI: 10.1109/FTXS51974.2020.00007

Nikunj Gupta, J. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Abstract: Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.
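
The two patterns named in the abstract, task replay and task replication with a user-provided validation predicate, can be sketched in a language-neutral way. The Python below is only a minimal illustration under stated assumptions: it does not use HPX or reproduce its API, and the names `replay`, `replicate`, and `make_flaky` are hypothetical helpers invented for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def replay(task, n, validate, *args):
    """Run task sequentially up to n times; return the first valid result."""
    for _ in range(n):
        try:
            result = task(*args)
        except Exception:
            continue  # an exception counts as a failed attempt
        if validate(result):
            return result
    raise RuntimeError(f"no valid result after {n} attempts")


def replicate(task, n, validate, *args):
    """Launch n copies of task concurrently; return the first valid result
    in submission order once all copies have finished."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(task, *args) for _ in range(n)]
    # Leaving the with-block waits for every replica to complete.
    for future in futures:
        try:
            result = future.result()
        except Exception:
            continue  # skip replicas that raised
        if validate(result):
            return result
    raise RuntimeError(f"none of the {n} replicas produced a valid result")


def make_flaky(bad_runs):
    """Hypothetical test task: returns -1 on its first bad_runs calls,
    then the square of its argument. Also returns the call counter."""
    calls = {"count": 0}
    lock = threading.Lock()

    def task(x):
        with lock:  # keep the counter consistent across threads
            calls["count"] += 1
            attempt = calls["count"]
        return -1 if attempt <= bad_runs else x * x

    return task, calls


if __name__ == "__main__":
    def is_valid(result):
        return result >= 0  # stand-in for a checksum-style predicate

    task, calls = make_flaky(bad_runs=2)
    print(replay(task, 5, is_valid, 7), "after", calls["count"], "attempts")
```

In this sketch the predicate plays the role the abstract assigns to user-provided checks such as checksums: replay pays for extra attempts only when a result is rejected, while replication always pays for all n copies but hides the retries behind concurrency.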

Citations: 2

Workshop Organization

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) Pub Date : 2018-09-01 DOI: 10.1109/PERCOMW.2005.96

R. Badia

Citations: 0