结合寄存器传输和软件级故障注入揭示gpu漏洞

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) Pub Date : 2021-06-01 DOI:10.1109/DSN48987.2021.00042

F. Santos, J. E. R. Condia, L. Carro, M. Reorda, P. Rech

{"title":"结合寄存器传输和软件级故障注入揭示gpu漏洞","authors":"F. Santos, J. E. R. Condia, L. Carro, M. Reorda, P. Rech","doi":"10.1109/DSN48987.2021.00042","DOIUrl":null,"url":null,"abstract":"The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register- Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.","PeriodicalId":222512,"journal":{"name":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection\",\"authors\":\"F. Santos, J. E. R. Condia, L. Carro, M. Reorda, P. Rech\",\"doi\":\"10.1109/DSN48987.2021.00042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register- Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.\",\"PeriodicalId\":222512,\"journal\":{\"name\":\"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSN48987.2021.00042\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN48987.2021.00042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 18

摘要

硬件和软件的复杂性使得gpu可靠性评估极具挑战性。在GPU模型上进行低级故障注入，尽管很准确，但会花费很长时间(数月到数年)，而软件故障注入，尽管很快，但无法访问GPU的关键资源，并且通常使用合成故障模型(例如，单比特翻转)，这可能导致不切实际的评估。本文提出将RTL故障注入的准确性与软件故障注入的效率相结合。首先，在RTL GPU模型(FlexGripPlus)上，我们在低级资源中注入了超过150万个错误，这些错误对程序员来说是不受保护和隐藏的，并描述了它们对公共指令输出的影响。我们根据指令操作码和输入特征创建了一个可能对操作输出产生影响的故障池。然后，我们使用软件框架的更新版本(NVBitFI)在应用程序级别注入这些故障效果。我们的策略将故障注入时间从RTL评估所需的数十年减少到数十小时，从而首次允许在gpu上跟踪从硬件到复杂应用程序输出的故障传播。此外，我们提供了一个更现实的故障模型，并表明单个比特翻转注入将低估六个HPC应用程序和两个卷积神经网络的错误率高达48%(平均为18%)。我们开发的RTL故障模型和注入框架可以在公共存储库中使用，以支持第三方评估并简化结果的可重复性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register- Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

自引率

0.00%

发文量