Si Li, Vilas Sridharan, S. Gurumurthi, S. Yalamanchili
{"title":"Software-based dynamic reliability management for GPU applications","authors":"Si Li, Vilas Sridharan, S. Gurumurthi, S. Yalamanchili","doi":"10.1109/IRPS.2016.7574507","DOIUrl":null,"url":null,"abstract":"In this paper we propose a framework for dynamic reliability management (DRM) for GPU applications based on the idea of plug-n-play software-based reliability enhancement (SRE). The approach entails first assessing the vulnerability of GPU kernels to soft errors in program visible structures. This assessment is performed on a low level intermediate program representation rather than the application source. Second, this assessment guides selective injection of code implementing SRE techniques to protect the most vulnerable data. Code injection occurs transparently at runtime using a just-in-time (JIT) compiler. Thus, reliability enhancement is selective, transparent, on-demand, and customizable. This flexible, automated software-based DRM framework can provide an adaptable, cost-effective approach to scaling reliability of large systems. We present the results of a proof of concept implementation on NVIDIA GPUs demonstrating the ability to traverse a range of performance reliability tradeoffs.","PeriodicalId":172129,"journal":{"name":"2016 IEEE International Reliability Physics Symposium (IRPS)","volume":"48 6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Reliability Physics Symposium (IRPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IRPS.2016.7574507","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
In this paper we propose a framework for dynamic reliability management (DRM) for GPU applications based on the idea of plug-n-play software-based reliability enhancement (SRE). The approach entails first assessing the vulnerability of GPU kernels to soft errors in program visible structures. This assessment is performed on a low level intermediate program representation rather than the application source. Second, this assessment guides selective injection of code implementing SRE techniques to protect the most vulnerable data. Code injection occurs transparently at runtime using a just-in-time (JIT) compiler. Thus, reliability enhancement is selective, transparent, on-demand, and customizable. This flexible, automated software-based DRM framework can provide an adaptable, cost-effective approach to scaling reliability of large systems. We present the results of a proof of concept implementation on NVIDIA GPUs demonstrating the ability to traverse a range of performance reliability tradeoffs.