Title: The functional and performance tolerance of GPUs to permanent faults in registers
Authors: Sotiris Tselonis, Vasilis Dimitsas, D. Gizopoulos
Venue: 2013 IEEE 19th International On-Line Testing Symposium (IOLTS)
Published: 2013-07-08
DOI: https://doi.org/10.1109/IOLTS.2013.6604089
Citations: 14
Abstract
Massively parallel many-core Graphics Processing Unit (GPU) architectures offer significant speedups over contemporary multicore CPUs on workloads with thread-level parallelism. For this reason, general-purpose computing on GPUs (GPGPU) is a rapidly expanding research direction in many contexts. Unlike graphics processing, GPGPU computing requires reliable operation in the presence of hardware faults, whose occurrence probabilities in current and forthcoming advanced manufacturing technologies will be significant. In this paper, we focus on the tolerance of GPUs to permanent faults in their most critical storage elements: the register files. Using a comprehensive fault injection campaign on a cycle-accurate GPGPU architectural simulator, we first evaluate and classify the behavior of NVIDIA GPU CUDA kernels in the presence of permanent faults in registers. We then analyze the performance tolerance of GPUs operating in degraded mode (fewer hardware resources, less thread-level parallelism) because of multiple permanent faults in the registers of their streaming multiprocessors. Our findings confirm the intuitively expected tolerance of these architectures to faults and quantify it across different configurations and modes.
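To make the idea of a permanent-fault injection campaign on registers concrete, here is a minimal toy sketch in Python. It is not the authors' simulator or methodology. The `RegisterFile` class, the `dot_kernel` routine, the single-bit stuck-at fault model, and the outcome labels (`masked`, `sdc`, `timeout`) are all assumptions chosen for illustration; the abstract does not specify the fault model or the classification used.

```python
from collections import Counter

WIDTH = 32                      # assumed register width in bits
MASK = (1 << WIDTH) - 1


class RegisterFile:
    """Toy register file. An optional permanent stuck-at fault,
    given as (register, bit, stuck_value), is applied on every read."""

    def __init__(self, n_regs, fault=None):
        self.regs = [0] * n_regs
        self.fault = fault

    def write(self, r, value):
        self.regs[r] = value & MASK

    def read(self, r):
        value = self.regs[r]
        if self.fault is not None and self.fault[0] == r:
            _, bit, stuck = self.fault
            if stuck:
                value |= 1 << bit       # bit permanently reads as 1
            else:
                value &= ~(1 << bit)    # bit permanently reads as 0
        return value & MASK


def dot_kernel(rf, a, b):
    """Hypothetical 'kernel': dot product using r0 (accumulator),
    r1 (loop index) and r2 (temporary). Any other register is unused."""
    rf.write(0, 0)
    rf.write(1, 0)
    steps = 0
    while rf.read(1) < len(a):
        i = rf.read(1)
        rf.write(2, a[i] * b[i])
        rf.write(0, rf.read(0) + rf.read(2))
        rf.write(1, i + 1)
        steps += 1
        if steps > 10 * len(a):         # watchdog for runs that never finish
            raise TimeoutError("kernel did not terminate")
    return rf.read(0)


def classify(fault, a, b, golden, n_regs=4):
    """Run the kernel with one injected fault and compare against the
    fault-free (golden) result."""
    try:
        out = dot_kernel(RegisterFile(n_regs, fault), a, b)
    except TimeoutError:
        return "timeout"
    return "masked" if out == golden else "sdc"   # sdc: silent data corruption


def campaign(a, b, n_regs=4):
    """Inject every single-bit stuck-at-0/1 fault, one per run."""
    golden = dot_kernel(RegisterFile(n_regs), a, b)
    counts = Counter()
    for r in range(n_regs):
        for bit in range(WIDTH):
            for stuck in (0, 1):
                counts[classify((r, bit, stuck), a, b, golden, n_regs)] += 1
    return golden, counts


if __name__ == "__main__":
    golden, counts = campaign([1, 2, 3, 4], [5, 6, 7, 8])
    print(golden, dict(counts))
```

Even in this toy setting, faults in a register the kernel never touches (r3 here) are always masked, while a fault in the loop index can either corrupt the result or prevent termination. This mirrors, in a very simplified way, why a campaign classifies outcomes instead of assuming every fault is harmful.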