{"title":"Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods","authors":"Evan Coleman, Erik J. Jensen, M. Sosonkina","doi":"10.1109/CAHPC.2018.8645942","DOIUrl":null,"url":null,"abstract":"This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. The implementations make use of hybrid parallelism where the computational work is distributed over multiple nodes using MPI and parallelized on each node using openMP. A series of experiments is conducted to measure the impact of an undetected soft fault on an asynchronous iterative method, and to compare and contrast several techniques for simulating the occurrence of a fault and then recovering from the effects of the faults. The data shows that the two numerical soft-fault models tested here more consistently than a “bit-flip” model produce bad enough behavior to test a variety of recovery strategies, such as those based on partial checkpointing.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CAHPC.2018.8645942","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. The implementations make use of hybrid parallelism where the computational work is distributed over multiple nodes using MPI and parallelized on each node using openMP. A series of experiments is conducted to measure the impact of an undetected soft fault on an asynchronous iterative method, and to compare and contrast several techniques for simulating the occurrence of a fault and then recovering from the effects of the faults. The data shows that the two numerical soft-fault models tested here more consistently than a “bit-flip” model produce bad enough behavior to test a variety of recovery strategies, such as those based on partial checkpointing.