A Reliability-aware Environment for Design Exploration for GPU Devices

Robert Limas Sierra, Juan-David Guerrero-Balaguera, J. E. R. Condia, M. Reorda
{"title":"A Reliability-aware Environment for Design Exploration for GPU Devices","authors":"Robert Limas Sierra, Juan-David Guerrero-Balaguera, J. E. R. Condia, M. Reorda","doi":"10.1109/DDECS57882.2023.10139643","DOIUrl":null,"url":null,"abstract":"1 Nowadays, GPU platforms have gained wide importance in applications that require high processing power. Unfortunately, the advanced semiconductor technologies used for their manufacturing are prone to different types of faults. Hence, solutions are required to support the exploration of the resilience to faults of different architectures. Based on this motivation, this work presents an environment dedicated to the analysis of the impact of permanent faults on GPU platforms. This environment is based on GPGPU-Sim, with the objective of exploiting the configuration features of this tool and, thus, analyzing the effects of faults when changing the target architecture. To validate the environment and show its usability, a fault campaign has been carried out where three different GPU architectures (Kepler, Volta, and Turing) were used. In addition, each GPU has been modified with an arbitrary number of parallel processing cores (or SMs). Three representative applications (Vector Add, Scalar Product, and Matrix Multiply) were executed on each GPU, and the behavior of each architecture in the presence of permanent faults in the functional (i.e., integer unit and floating-point) units was analyzed. This fault campaign shows the usability of the environment and demonstrates its potential use to support decisions on the best architectural parameters for a given application.","PeriodicalId":220690,"journal":{"name":"2023 26th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 26th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DDECS57882.2023.10139643","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

1 Nowadays, GPU platforms have gained wide importance in applications that require high processing power. Unfortunately, the advanced semiconductor technologies used for their manufacturing are prone to different types of faults. Hence, solutions are required to support the exploration of the resilience to faults of different architectures. Based on this motivation, this work presents an environment dedicated to the analysis of the impact of permanent faults on GPU platforms. This environment is based on GPGPU-Sim, with the objective of exploiting the configuration features of this tool and, thus, analyzing the effects of faults when changing the target architecture. To validate the environment and show its usability, a fault campaign has been carried out where three different GPU architectures (Kepler, Volta, and Turing) were used. In addition, each GPU has been modified with an arbitrary number of parallel processing cores (or SMs). Three representative applications (Vector Add, Scalar Product, and Matrix Multiply) were executed on each GPU, and the behavior of each architecture in the presence of permanent faults in the functional (i.e., integer unit and floating-point) units was analyzed. This fault campaign shows the usability of the environment and demonstrates its potential use to support decisions on the best architectural parameters for a given application.
面向GPU设备设计探索的可靠性感知环境
如今,GPU平台在需要高处理能力的应用中得到了广泛的重视。不幸的是,用于制造它们的先进半导体技术容易出现不同类型的故障。因此,需要解决方案来支持对不同架构的故障弹性的探索。基于这一动机,本工作提出了一个专门用于分析GPU平台上永久故障影响的环境。该环境基于GPGPU-Sim,目的是利用该工具的配置特性,从而分析更改目标体系结构时故障的影响。为了验证环境并展示其可用性,在使用了三种不同的GPU架构(Kepler, Volta和Turing)的情况下进行了故障活动。此外,每个GPU都被修改为任意数量的并行处理内核(或SMs)。在每个GPU上执行了三个代表性的应用程序(向量加法、标量乘积和矩阵乘法),并分析了在功能单元(即整数单元和浮点单元)存在永久故障时每种架构的行为。此错误活动显示了环境的可用性,并演示了它在支持给定应用程序的最佳体系结构参数决策方面的潜在用途。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信