J. E. R. Condia, Juan-David Guerrero-Balaguera, Edward Javier Patiño Nuñez, Robert Limas Sierra, M. Reorda
2023 IEEE 24th Latin American Test Symposium (LATS), published 2023-03-21. DOI: 10.1109/LATS58125.2023.10154504 (https://doi.org/10.1109/LATS58125.2023.10154504)
Analyzing the Architectural Impact of Transient Fault Effects in SFUs of GPUs
This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.

Graphics Processing Units (GPUs) are crucial in modern safety-critical systems, where they implement complex and dense algorithms, so their reliability plays an essential role in several domains (e.g., automotive and autonomous machines). Reliability evaluations of GPUs and their internal units are of special interest because of their high parallelism and because they help identify vulnerable structures. In particular, Special Function Unit (SFU) cores inside GPUs are heavily used in multimedia, scientific computing, and the training of neural networks. However, the reliability of SFUs has remained largely unexplored. This work evaluates the impact of transient faults in the hardware structures of SFUs for GPUs. We evaluate and analyze two SFU architectures (‘fused’ and ‘modular’) and how they relate to energy, area, and reliability impact on GPU workloads. The evaluation relies on a fine-grained analysis, with experiments on an open-source RTL GPU model (FlexGripPlus) instrumented with both SFUs. The experimental results indicate that modular SFUs are less vulnerable to transient faults (by up to 47% for the analyzed workloads) and more power efficient (by up to 36.6%), but incur an additional area cost (about 27%) compared with a fused SFU architecture (the basis for commercial devices), which appears more vulnerable to faults but is more area efficient.
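To make the notion of a transient fault in an SFU result concrete, the following is a minimal software-level sketch, not the RTL fault-injection flow used with FlexGripPlus. It assumes a single-bit-flip fault model applied to the IEEE 754 binary32 output of a sine evaluation; the abstract does not state the fault model, and the names `flip_bit`, `to_f32`, and `bit_flip_errors` are hypothetical.

```python
import math
import struct


def to_f32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as binary32, with one bit inverted.

    Assumption: a transient fault is modeled as a single bit flip
    in the 32-bit result word (bit 31 = sign, bits 30..23 = exponent).
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return faulty


def bit_flip_errors(x: float) -> list[float]:
    """Absolute error on sin(x) for a flip in each of the 32 result bits."""
    golden = to_f32(math.sin(x))
    return [abs(flip_bit(golden, bit) - golden) for bit in range(32)]


if __name__ == "__main__":
    for bit, err in enumerate(bit_flip_errors(1.0)):
        print(f"bit {bit:2d}: absolute error = {err:.6e}")
```

Running the script lists the error caused by each bit position. Flips in the low mantissa bits perturb the result only slightly, while flips in the exponent or sign bits change it drastically, which is one reason fault effects are usually analyzed per hardware structure and per workload rather than as a single aggregate number.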