Investigating and Reducing the Architectural Impact of Transient Faults in Special Function Units for GPUs

Journal of Electronic Testing. Pub Date: 2024-03-21. DOI: 10.1007/s10836-024-06107-9

Josie E. Rodriguez Condia, Juan-David Guerrero-Balaguera, Edwar J. Patiño Núñez, Robert Limas, Matteo Sonza Reorda



Abstract

Ensuring the reliability of GPUs and their internal components is paramount, especially in safety-critical domains such as autonomous machines and self-driving cars. These applications rely heavily on GPUs to implement complex algorithms because of their programming flexibility and parallelism, which are crucial for efficient operation. However, as integration technologies advance, there is growing concern about a potential increase in the fault sensitivity of the internal components of current GPU generations. In particular, Special Function Unit (SFU) cores inside GPUs are used in multimedia, High-Performance Computing, and neural network training. Despite their frequent usage and critical role in several domains, the reliability of SFUs and effective mitigation solutions for them remain unexplored. This work evaluates the impact of transient faults in the main hardware structures of SFUs in GPUs. In addition, we analyze the main overhead costs and benefits of developing selective-hardening mechanisms for SFUs. We evaluate and analyze two SFU architectures for GPUs ('fused' and 'modular') and their energy, area, and reliability impact on parallel applications. The experiments use fine-grain fault injection campaigns on an RTL GPU model (FlexGripPlus) instrumented with both SFUs. The results indicate that fused SFUs (in commercial-grade devices) require lower area overhead (about 27%) for their integration in GPUs but are more vulnerable to transient faults (by up to 47% for the analyzed cases) and less power efficient (by up to 36.6%) than modular SFUs. Moreover, the reliability estimation shows that modular SFUs are structurally more resilient than fused ones by up to one order of magnitude. Similarly, the evaluation of selective-hardening mechanisms based on Triple-Modular Redundancy (TMR) shows that coarse-grain strategies might increase the reliability of the overall SFUs at feasible overhead costs.
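As a rough software-level illustration of two ideas named in the abstract, the sketch below models a transient fault as a single bit flip in a 32-bit floating-point result and masks it with a bitwise majority vote over three copies, which is the basic principle behind TMR. This is not the authors' method: the paper injects faults at RTL level in FlexGripPlus, whereas this example works on plain Python values, and all function names here are hypothetical.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def flip_bit(x: float, bit: int) -> float:
    """Model a transient fault: flip one bit (0..31) of a float32 result."""
    return bits_to_float(float_to_bits(x) ^ (1 << bit))


def tmr_vote(a: float, b: float, c: float) -> float:
    """Bitwise majority vote over three redundant results (TMR principle)."""
    x, y, z = float_to_bits(a), float_to_bits(b), float_to_bits(c)
    return bits_to_float((x & y) | (y & z) | (x & z))


# Assumed example value: 0.5 is exactly representable in float32.
golden = 0.5
faulty = flip_bit(golden, 30)                # corrupt one exponent bit
voted = tmr_vote(golden, faulty, golden)     # two fault-free copies outvote it
```

The vote recovers the fault-free value as long as at most one of the three copies is corrupted in any given bit position. The cost is the tripled hardware plus the voter, which is the kind of overhead the selective-hardening analysis weighs against reliability gains.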


