Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation

F. Santos, L. Carro, P. Rech
{"title":"Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation","authors":"F. Santos, L. Carro, P. Rech","doi":"10.1109/ETS56758.2023.10174206","DOIUrl":null,"url":null,"abstract":"Graphics Processing Units (GPUs) are being employed in High Performance Computing (HPC) and safety-critical applications, such as autonomous vehicles. This market shift led to significant improvements in the programming frameworks and performance evaluation tools and concerns about their reliability. GPU reliability evaluation is extremely challenging due to the parallel nature and high complexity of GPU architectures. We conducted the first cross-layer GPU reliability evaluation to unveil (and mitigate) GPU vulnerabilities. The proposed evaluation is achieved by comparing and combining extensive high-energy neutron beam experiments, massive fault simulation campaigns at both Register-Transfer Level (RTL) and software levels, and application profiling. Based on this extensive and detailed analysis, a novel accurate methodology to accurately estimate GPUs application FIT rate is proposed. Moreover, by employing the knowledge obtained from the cross-layer reliability evaluation, two novel hardening solutions for HPC and safety-critical applications are proposed: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes a redundant copy in a reduced precision. RP-DWC delivers excellent fault coverage, up to 86%, with minimal execution time and energy consumption overheads (13% and 24%, respectively). (2) Dedicated software solutions for hardening Convolutional Neural Networks (CNNs) that can correct up to 98% of the CNN errors.","PeriodicalId":211522,"journal":{"name":"2023 IEEE European Test Symposium (ETS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE European Test Symposium (ETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ETS56758.2023.10174206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Graphics Processing Units (GPUs) are being employed in High Performance Computing (HPC) and safety-critical applications, such as autonomous vehicles. This market shift led to significant improvements in the programming frameworks and performance evaluation tools and concerns about their reliability. GPU reliability evaluation is extremely challenging due to the parallel nature and high complexity of GPU architectures. We conducted the first cross-layer GPU reliability evaluation to unveil (and mitigate) GPU vulnerabilities. The proposed evaluation is achieved by comparing and combining extensive high-energy neutron beam experiments, massive fault simulation campaigns at both Register-Transfer Level (RTL) and software levels, and application profiling. Based on this extensive and detailed analysis, a novel accurate methodology to accurately estimate GPUs application FIT rate is proposed. Moreover, by employing the knowledge obtained from the cross-layer reliability evaluation, two novel hardening solutions for HPC and safety-critical applications are proposed: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes a redundant copy in a reduced precision. RP-DWC delivers excellent fault coverage, up to 86%, with minimal execution time and energy consumption overheads (13% and 24%, respectively). (2) Dedicated software solutions for hardening Convolutional Neural Networks (CNNs) that can correct up to 98% of the CNN errors.
波束实验与故障仿真相结合对gpu可靠性的认识与改进
图形处理单元(gpu)被用于高性能计算(HPC)和安全关键应用,如自动驾驶汽车。这种市场转变导致了编程框架和性能评估工具的重大改进以及对其可靠性的关注。由于GPU架构的并行性和高复杂性,GPU可靠性评估极具挑战性。我们进行了第一次跨层GPU可靠性评估,以揭示(并减轻)GPU漏洞。所提出的评估是通过比较和结合大量的高能中子束实验、在注册-传输级别(RTL)和软件级别的大规模故障模拟活动以及应用分析来实现的。在此基础上,提出了一种准确估计gpu应用FIT率的新方法。此外,利用从跨层可靠性评估中获得的知识,针对高性能计算和安全关键应用,提出了两种新的强化方案:(1)降低精度重复与比较(RP-DWC),以降低精度执行冗余副本。RP-DWC提供了出色的故障覆盖率,高达86%,执行时间和能耗开销最小(分别为13%和24%)。(2)强化卷积神经网络(CNN)的专用软件解决方案,可以纠正高达98%的CNN错误。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信