Concurrent Detection of Failures in GPU Control Logic for Reliable Parallel Computing

2020 IEEE International Test Conference (ITC). Pub Date: 2020-11-01. DOI: 10.1109/ITC44778.2020.9325216

Hiroaki Itsuji, T. Uezono, Tadanobu Toba, Kojiro Ito, M. Hashimoto

Citations: 1

Abstract

The reliability of GPUs is becoming a major concern because the probability of failures is increasing and GPUs are highly vulnerable relative to conventional CPUs in terms of tasks per failure. While there are extensive countermeasures against failures in GPU data units, there are fewer countermeasures for failures in GPU control logic. Current software-based techniques insert signature code that detects GPU control-logic failures by comparing the expected signature value with the current signature value. However, conventional software-based techniques execute application calculations, signature calculations, and signature comparison calculations in sequence, which degrades application throughput. We have developed a software-based technique that concurrently detects GPU control-logic failures in a running application while largely maintaining its throughput. Experimental results show that when our technique concurrently executed application calculations, signature calculations, and signature comparison calculations for a matrix multiplication application, the application retained 78% of its original throughput, whereas 62% is reported in the literature. We also developed fault injection simulators specialized for injecting GPU-specific control-logic faults into GPU intermediate code and found that 100% of GPU-specific failures could be detected both during and after application execution. The proposed approach can be utilized for a wide variety of safety- and reliability-critical applications.
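The general idea of signature-based control-flow checking with a concurrent comparison step can be sketched on the CPU in Python. This is only an illustrative analogue, not the authors' GPU implementation: the block names, the signature constants, the `fold` function, and the use of a checker thread fed by a queue are all assumptions made for this sketch. Each task folds the signature of every basic block it executes into a running value, and a separate thread compares that value with the expected one while the main thread moves on to the next task. A skipped block stands in for an injected control-logic fault.

```python
import queue
import threading

# Hypothetical static signatures, one per basic block of the "kernel".
BLOCK_SIGNATURES = {"load": 1, "multiply": 2, "accumulate": 4, "store": 8}
EXPECTED_ORDER = ["load", "multiply", "accumulate", "store"]


def fold(signature, block):
    """Fold one block's signature into the running value (order-sensitive)."""
    return (signature * 31 + BLOCK_SIGNATURES[block]) & 0xFFFF


def expected_signature():
    """Signature of a fault-free run that visits every block in order."""
    sig = 0
    for block in EXPECTED_ORDER:
        sig = fold(sig, block)
    return sig


def run_task(row, col, skip_block=None):
    """Compute a dot product and the signature of the blocks actually executed.

    skip_block simulates a control-flow fault: that block never runs.
    """
    sig, acc, products = 0, 0, []
    for block in EXPECTED_ORDER:
        if block == skip_block:
            continue  # injected fault: control flow jumps over this block
        sig = fold(sig, block)
        if block == "multiply":
            products = [x * y for x, y in zip(row, col)]
        elif block == "accumulate":
            acc = sum(products)
    return acc, sig


def checker(sig_queue, expected, failures):
    """Compare signatures as they arrive, while the main thread keeps working."""
    while True:
        item = sig_queue.get()
        if item is None:  # sentinel: no more tasks
            break
        task_id, sig = item
        if sig != expected:
            failures.append(task_id)


def run_with_concurrent_check(tasks, faulty_task=None, skip_block="accumulate"):
    """Run all tasks; signature comparison happens in a separate thread."""
    sig_queue, failures = queue.Queue(), []
    worker = threading.Thread(
        target=checker, args=(sig_queue, expected_signature(), failures)
    )
    worker.start()
    results = []
    for task_id, (row, col) in enumerate(tasks):
        skip = skip_block if task_id == faulty_task else None
        value, sig = run_task(row, col, skip)
        results.append(value)
        sig_queue.put((task_id, sig))  # hand off; do not wait for the check
    sig_queue.put(None)
    worker.join()
    return results, failures
```

Calling `run_with_concurrent_check(tasks, faulty_task=1)` on a list of `(row, col)` pairs returns the computed values together with the indices of tasks whose signature did not match. The design point mirrored here is that the application loop never waits on the comparison; in the sequential scheme the abstract describes, the comparison would run inline after each task instead.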
