Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2021-05-01. DOI: 10.1109/IPDPS49936.2021.00037

F. Santos, S. Hari, P. M. Basso, L. Carro, P. Rech


Citations: 10

Abstract

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5 \times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.
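The "differences lower than $5 \times$" criterion can be read as a fold-difference check between a predicted and a measured rate. The following is a minimal sketch of that reading only, not the authors' prediction method: the function names `fold_difference` and `within_factor` are hypothetical, and the example rates are invented placeholders rather than values from the paper.

```python
def fold_difference(predicted: float, measured: float) -> float:
    """Return the symmetric ratio (>= 1) between two positive rates."""
    if predicted <= 0 or measured <= 0:
        raise ValueError("rates must be positive")
    # Take whichever ratio is >= 1 so over- and under-prediction are treated alike.
    return max(predicted / measured, measured / predicted)


def within_factor(predicted: float, measured: float, factor: float = 5.0) -> bool:
    """True when the two rates differ by less than `factor` times."""
    return fold_difference(predicted, measured) < factor


# Hypothetical rates in arbitrary units; these are NOT values from the paper.
example_predicted = 12.0
example_measured = 30.0
print(fold_difference(example_predicted, example_measured))  # 2.5
print(within_factor(example_predicted, example_measured))    # True
```

Using the symmetric ratio means a prediction that is 2.5 times too low and one that is 2.5 times too high are scored identically, which matches the way a "within $N \times$" bound is usually stated.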


