Understanding a program's resiliency through error propagation

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI:10.1145/3437801.3441589

Zhimin Li, Harshitha Menon, K. Mohror, P. Bremer, Yarden Livant, Valerio Pascucci


Citations: 11

Abstract

Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), a serious problem in which a fault introduces an error into an HPC simulation that is not readily detected. Due to the insidious nature of SDCs, researchers have worked to understand their impact on applications. Previous studies have relied on expensive fault injection campaigns with uniform sampling to provide overall SDC rates, but this approach provides no feedback on code regions that receive no samples. In this research, we develop a method to systematically analyze all fault injection sites in an application using a small number of fault injection experiments. We use fault propagation data from a fault injection experiment to predict the resiliency of other, untested fault sites and to obtain an approximate fault tolerance threshold value for each site, which represents the largest error that can be introduced at the site without causing incorrect simulation results. We define the collection of threshold values over all fault sites in the program as a fault tolerance boundary and propose a simple but efficient method to approximate the boundary. In our experiments, we show that our method reduces the number of fault injection samples required to understand a program's resiliency by several orders of magnitude compared with a traditional fault injection study.
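The idea of a per-site threshold and a fault tolerance boundary can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each fault injection run records the error magnitude observed at every fault site the error propagates through, and that a run with a correct final result shows all of those magnitudes to be tolerable at their sites. The names `update_boundary`, `predicted_tolerable`, the integer site ids, and the error values are hypothetical.

```python
from typing import Dict

# Hypothetical representation: fault site id -> largest error known to be tolerated there.
Boundary = Dict[int, float]


def update_boundary(boundary: Boundary, propagated: Dict[int, float], output_ok: bool) -> None:
    """Fold the propagation data of one fault injection run into the boundary.

    Assumption of this sketch: if the run still produced a correct result,
    every error magnitude seen along the propagation path is a lower bound
    on the tolerance threshold of the site where it was observed.
    """
    if not output_ok:
        return  # a corrupted run yields no lower bound in this sketch
    for site, err in propagated.items():
        boundary[site] = max(boundary.get(site, 0.0), err)


def predicted_tolerable(boundary: Boundary, site: int, err: float) -> bool:
    """Predict, without a new experiment, whether an error at a site is tolerated."""
    return err <= boundary.get(site, 0.0)


# Three made-up runs: two with correct output, one with corrupted output.
boundary: Boundary = {}
update_boundary(boundary, {3: 1e-6, 4: 5e-7, 5: 2e-9}, output_ok=True)
update_boundary(boundary, {4: 2e-6, 5: 1e-9}, output_ok=True)
update_boundary(boundary, {5: 1e-2, 6: 3e-1}, output_ok=False)
```

In this sketch a single run contributes information about every site its error reaches, which is one way a small number of experiments could cover sites that were never injected directly. Sites with no recorded threshold are conservatively treated as tolerating no error.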

