Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs

L. Sterpone, S. Azimi, C. D. Sio, Filippo Parisi
{"title":"Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs","authors":"L. Sterpone, S. Azimi, C. D. Sio, Filippo Parisi","doi":"10.1109/ISPDC55340.2022.00022","DOIUrl":null,"url":null,"abstract":"Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of soft- errors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.","PeriodicalId":389334,"journal":{"name":"2022 21st International Symposium on Parallel and Distributed Computing (ISPDC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 21st International Symposium on Parallel and Distributed Computing (ISPDC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPDC55340.2022.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of soft- errors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.
高性能嵌入式gpu的软错误分析与缓解
嵌入式gpu等多处理器片上系统在自动驾驶和半自动驾驶汽车等安全关键应用中变得非常流行。然而,这些设备可能受到软误差的影响,例如由辐射效应产生的那些误差。这些影响会产生不可预测的不良行为。面向多线程软件的容错由于冗余、投票和纠错线程操作导致了严重的性能下降。本文提出了一种新的NVIDIA GPGPU设备故障注入环境,并提出了一种基于嵌入式gpu数据传输过程中执行的错误检测和纠错线程的容错方法。故障注入环境能够通过检测CUDA二进制可执行文件自动将故障注入SASS级别的指令中。缓解方法基于与内存流设备同时运行以承载数据传输操作的并发错误检测线程。通过几个基准测试应用程序,我们评估了软错误对静默数据损坏、检测、不可恢复错误和挂起的影响。最后,在由四核A57 ARM处理器(JETSON TX2)控制的NVIDIA Pascal架构GPU上,通过软错误故障注入活动验证了所提出的缓解方法,表明相对于最先进的解决方案,该方法的优势超过37%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信