Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation
F. Santos, L. Carro, P. Rech
2023 IEEE European Test Symposium (ETS), published 2023-05-22
DOI: 10.1109/ETS56758.2023.10174206 (https://doi.org/10.1109/ETS56758.2023.10174206)
Citations: 0
Abstract
Graphics Processing Units (GPUs) are being employed in High Performance Computing (HPC) and in safety-critical applications such as autonomous vehicles. This market shift has led to significant improvements in programming frameworks and performance evaluation tools, and has raised concerns about GPU reliability. Evaluating GPU reliability is extremely challenging because of the parallel nature and high complexity of GPU architectures. We conducted the first cross-layer GPU reliability evaluation to unveil (and mitigate) GPU vulnerabilities. The evaluation compares and combines extensive high-energy neutron beam experiments, massive fault simulation campaigns at both the Register-Transfer Level (RTL) and the software level, and application profiling. Based on this analysis, we propose a novel methodology to accurately estimate the Failure In Time (FIT) rate of GPU applications. Moreover, using the knowledge obtained from the cross-layer evaluation, we propose two novel hardening solutions for HPC and safety-critical applications: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes a redundant copy of the computation in reduced precision and delivers fault coverage of up to 86% with low execution time and energy consumption overheads (13% and 24%, respectively); and (2) dedicated software solutions for hardening Convolutional Neural Networks (CNNs) that can correct up to 98% of CNN errors.
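The idea behind RP-DWC, a full-precision computation checked against a cheaper reduced-precision copy, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dot-product workload, the half-precision emulation through `struct`, and the 1% relative tolerance are all assumptions chosen for illustration.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (emulated)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_full(a, b):
    """Main computation in double precision (Python floats)."""
    return sum(x * y for x, y in zip(a, b))


def dot_half(a, b):
    """Redundant copy: every operand and intermediate is rounded to half precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_half(acc + to_half(to_half(x) * to_half(y)))
    return acc


def outputs_agree(full: float, reduced: float, rel_tol: float = 1e-2) -> bool:
    """Comparison step: the copies agree if they differ by less than rel_tol.

    rel_tol is a hypothetical threshold; it must absorb the rounding error
    of the reduced-precision copy while still exposing large corruptions.
    """
    if not (math.isfinite(full) and math.isfinite(reduced)):
        return False
    return abs(full - reduced) <= rel_tol * max(abs(full), abs(reduced))


def flip_bit(x: float, bit: int) -> float:
    """Emulate a single bit flip in the 64-bit encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]


a = [0.1, 0.2, 0.3]
b = [1.1, 2.2, 3.3]

full = dot_full(a, b)
reduced = dot_half(a, b)

print("fault-free agree:", outputs_agree(full, reduced))
print("exponent-bit flip agree:", outputs_agree(flip_bit(full, 62), reduced))
print("lowest-mantissa-bit flip agree:", outputs_agree(flip_bit(full, 0), reduced))
```

In this sketch, a flip in a high exponent bit changes the result far beyond the tolerance and is flagged, while a flip in the lowest mantissa bit stays inside the tolerance and goes unnoticed. That trade-off is one plausible reason a reduced-precision check reports coverage below 100%: corruptions smaller than the precision gap between the two copies cannot be distinguished from rounding error.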