Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). Pub Date: 2021-03-04. DOI: 10.1109/ICSE43902.2021.00114

Lishan Yang, Bin Nie, Adwait Jog, E. Smirni


Citations: 9

Abstract

Graphics Processing Units (GPUs) are widely used across a broad variety of fields to accelerate computation, but they remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU (GPGPU) applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This allows partial replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only 1.63% introduced overhead (on average), which then enables selective protection via replication for those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction, reducing execution cycles on average by 20.61% and 27.15%, respectively, compared to standard duplication/triplication.
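
The remapping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-thread vulnerability profile, the function names `warps_needing_protection` and `remap_by_resilience`, and the rule "a warp needs replication if it holds at least one vulnerable thread" are all assumptions made for illustration. The only hardware fact used is the warp size of 32 threads on NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needing_protection(vulnerable, warp_size=WARP_SIZE):
    """Count warps that contain at least one vulnerable thread.

    Assumption for this sketch: such a warp must be replicated as a whole.
    """
    return sum(
        any(vulnerable[start:start + warp_size])
        for start in range(0, len(vulnerable), warp_size)
    )


def remap_by_resilience(vulnerable):
    """Return a permutation of thread ids with vulnerable threads first.

    sorted() is stable, so threads keep their relative order in each group.
    """
    return sorted(range(len(vulnerable)), key=lambda tid: not vulnerable[tid])


# Hypothetical profile: 128 threads, every 4th thread is vulnerable.
vulnerable = [tid % 4 == 0 for tid in range(128)]

order = remap_by_resilience(vulnerable)
remapped = [vulnerable[tid] for tid in order]

before = warps_needing_protection(vulnerable)
after = warps_needing_protection(remapped)
print(before, after)  # prints "4 1"
```

In this toy profile the vulnerable threads are spread over all four warps, so full protection would replicate every warp. After grouping, the 32 vulnerable threads fill a single warp and only that warp would need replication, which is the intuition behind warp-level partial protection.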

