A Comprehensive Evaluation of the Effects of Input Data on the Resilience of GPU Applications
Authors: Fritz G. Previlon, Charu Kalra, D. Kaeli, P. Rech
Venue: 2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Published: 2019-10-01
DOI: https://doi.org/10.1109/DFT.2019.8875269
Citations: 8
Abstract
GPUs are being aggressively deployed in a growing number of computing domains, yet their resilience to transient faults remains a concern. To better understand the inherent vulnerability of GPU applications to transient faults, researchers perform extensive fault injection experiments. However, the conclusions drawn from these experiments tend to depend on the specific input used. The dependence of program resilience on program input has not been thoroughly studied for GPU workloads. This paper addresses this gap with an extensive analysis of how changes in program input affect GPU reliability. Our work extends and challenges previous studies that reported that input data values do not affect reliability. Our analysis demonstrates that input sizes, as well as biased input values (inputs with a small set of dominant values), can have a significant impact on application reliability. For the applications studied, the probability that a fault causes a failure can change by as much as 30%. Furthermore, we provide guidance on how to predict changes in resilience without repeating exhaustive fault injection experiments.
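
To make the idea of input-dependent resilience concrete, here is a minimal CPU-side sketch in Python. It is a toy model, not the paper's methodology: the dot-product "kernel", the fault model (one bit flip in the float32 encoding of a single input element), and the names `flip_bit`, `dot`, and `sdc_fraction` are all assumptions made for illustration. It injects every possible single-bit fault into one operand and reports the fraction of injections that change the output, so the same program can be compared under inputs with different value distributions.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) of value's float32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return faulty


def dot(a, b):
    """Toy stand-in for a GPU kernel (an assumption): a plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def sdc_fraction(a, b):
    """Fraction of single-bit faults in `a` that change the kernel output."""
    golden = dot(a, b)  # fault-free reference result
    corrupted = 0
    total = 0
    for i in range(len(a)):
        for bit in range(32):
            faulty_a = list(a)
            faulty_a[i] = flip_bit(a[i], bit)
            # NaN never compares equal, so a NaN output counts as corrupted.
            if dot(faulty_a, b) != golden:
                corrupted += 1
            total += 1
    return corrupted / total


if __name__ == "__main__":
    a = [1.0] * 4
    print("b all ones :", sdc_fraction(a, [1.0] * 4))
    print("b all zeros:", sdc_fraction(a, [0.0] * 4))
```

In this toy setting, an input dominated by zeros masks most injected faults because the corrupted operand is multiplied by zero, whereas an input of ones propagates every fault to the output. Real GPU fault injection targets registers, instructions, or memory during kernel execution, so this sketch only illustrates why conclusions from one input may not carry over to another.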