揭秘GPU可靠性:波束实验、故障模拟与分析的比较与结合

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2021-05-01 DOI:10.1109/IPDPS49936.2021.00037

F. Santos, S. Hari, P. M. Basso, L. Carro, P. Rech

{"title":"揭秘GPU可靠性:波束实验、故障模拟与分析的比较与结合","authors":"F. Santos, S. Hari, P. M. Basso, L. Carro, P. Rech","doi":"10.1109/IPDPS49936.2021.00037","DOIUrl":null,"url":null,"abstract":"Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5 \\times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling\",\"authors\":\"F. Santos, S. Hari, P. M. Basso, L. Carro, P. Rech\",\"doi\":\"10.1109/IPDPS49936.2021.00037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5 \\\\times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.\",\"PeriodicalId\":372234,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"52 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS49936.2021.00037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

图形处理单元(gpu)已经从多媒体和游戏应用的专用设备转变为高性能计算(HPC)和安全关键应用(如自动驾驶汽车)中使用的通用加速器。这种市场转变导致了GPU计算能力和效率的爆发，编程框架和性能评估工具的重大改进，以及对其硬件可靠性的担忧。在本文中，我们比较并结合了超过1300万年的自然陆地暴露的高能中子束实验，需要超过350个GPU小时(使用SASSIFI和NVBitFI)的广泛架构级断层模拟，以及详细的应用级分析。我们的主要目标是回答GPU可靠性评估中的一个基本开放问题:故障模拟是否提供了可用于预测GPU上运行的工作负载故障率的代表性结果。我们表明，在大多数情况下，基于故障模拟的无声数据损坏预测与实验测量的速率足够接近(差异小于5 \ × $)。我们还分析了一些主要GPU功能单元(包括混合精度和张量核)的可靠性。我们发现GPU资源实例化的方式在整个系统可靠性中起着至关重要的作用，并且功能单元之外的故障会产生大多数可检测的错误。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Demystifying GPU Reliability: Comparing and Combining Beam Experiments, Fault Simulation, and Profiling

Graphics Processing Units (GPUs) have moved from being dedicated devices for multimedia and gaming applications to general-purpose accelerators employed in High-Performance Computing (HPC) and safety-critical applications such as autonomous vehicles. This market shift led to a burst in the GPU’s computing capabilities and efficiency, significant improvements in the programming frameworks and performance evaluation tools, and a concern about their hardware reliability. In this paper, we compare and combine high-energy neutron beam experiments that account for more than 13 million years of natural terrestrial exposure, extensive architectural-level fault simulations that required more than 350 GPU hours (using SASSIFI and NVBitFI), and detailed application-level profiling. Our main goal is to answer one of the fundamental open questions in GPU reliability evaluation: whether fault simulation provides representative results that can be used to predict the failure rates of workloads running on GPUs. We show that, in most cases, fault simulation-based prediction for silent data corruptions is sufficiently close (differences lower than $5 \times$) to the experimentally measured rates. We also analyze the reliability of some of the main GPU functional units (including mixed-precision and tensor cores). We find that the way GPU resources are instantiated plays a critical role in the overall system reliability and that faults outside the functional units generate most detectable errors.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

自引率

0.00%

发文量