gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI:10.1109/ISPASS55109.2022.00004

Dimitris Sartzetakis, G. Papadimitriou, D. Gizopoulos

{"title":"gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs","authors":"Dimitris Sartzetakis, G. Papadimitriou, D. Gizopoulos","doi":"10.1109/ISPASS55109.2022.00004","DOIUrl":null,"url":null,"abstract":"Pre-silicon reliability evaluation of processors is usually performed at the microarchitecture or at the software level. Recent studies on CPUs have, however, shown that software level approaches can mislead the soft error vulnerability assessment process and drive designers towards wrong error protection decisions. To avoid such pitfalls in the GPUs domain, the availability of microarchitecture level reliability assessment tools is of paramount importance. Although there are several publicly available frameworks for the reliability assessment of GPUs, they only operate at the software level, and do not consider the microarchitecture. This paper aims at accurate microarchitecture level GPU soft error vulnerability assessment. We introduce gpuFI-4: a detailed microarchitecture-level fault injection framework to assess the cross-layer vulnerability of hardware structures and entire GPU chips for single and multiple bit faults, built on top of the state-of-the-art simulator GPGPU-Sim 4.0. We employ gpuFI-4 for fault injection of soft errors on CUDA-enabled Nvidia GPU architectures. The target hardware structures that our framework analyzes are the register file, the shared memory, the LI data and texture caches and the L2 cache, altogether accounting for tens of MBs of on-chip GPU storage. We showcase the features of the tool reporting the vulnerability of three Nvidia GPU chip models: two different modem GPU architectures – RTX 2060 (Turing) and Quadro GV100 (Volta) – and an older generation – GTX Titan (Kepler), for both single-bit and triple-bit fault injections and for twelve different CUDA benchmarks that are simulated on the actual physical instruction set (SASS). Our experiments report the Architectural Vulnerability Factor (AVF) of the GPU chips (which can be only measured at the microarchitecture level) as well as their predicted Failures in Time (FIT) rate when technology information is incorporated in the assessment.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPASS55109.2022.00004","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Pre-silicon reliability evaluation of processors is usually performed at the microarchitecture or at the software level. Recent studies on CPUs have, however, shown that software level approaches can mislead the soft error vulnerability assessment process and drive designers towards wrong error protection decisions. To avoid such pitfalls in the GPUs domain, the availability of microarchitecture level reliability assessment tools is of paramount importance. Although there are several publicly available frameworks for the reliability assessment of GPUs, they only operate at the software level, and do not consider the microarchitecture. This paper aims at accurate microarchitecture level GPU soft error vulnerability assessment. We introduce gpuFI-4: a detailed microarchitecture-level fault injection framework to assess the cross-layer vulnerability of hardware structures and entire GPU chips for single and multiple bit faults, built on top of the state-of-the-art simulator GPGPU-Sim 4.0. We employ gpuFI-4 for fault injection of soft errors on CUDA-enabled Nvidia GPU architectures. The target hardware structures that our framework analyzes are the register file, the shared memory, the LI data and texture caches and the L2 cache, altogether accounting for tens of MBs of on-chip GPU storage. We showcase the features of the tool reporting the vulnerability of three Nvidia GPU chip models: two different modem GPU architectures – RTX 2060 (Turing) and Quadro GV100 (Volta) – and an older generation – GTX Titan (Kepler), for both single-bit and triple-bit fault injections and for twelve different CUDA benchmarks that are simulated on the actual physical instruction set (SASS). Our experiments report the Architectural Vulnerability Factor (AVF) of the GPU chips (which can be only measured at the microarchitecture level) as well as their predicted Failures in Time (FIT) rate when technology information is incorporated in the assessment.

查看原文本刊更多论文

gpuFI-4:用于评估Nvidia gpu跨层弹性的微架构级框架

处理器的预硅可靠性评估通常在微体系结构或软件级别进行。然而，最近对cpu的研究表明，软件级方法可能会误导软错误漏洞评估过程，并促使设计人员做出错误的错误保护决策。为了避免gpu领域的此类缺陷，微架构级可靠性评估工具的可用性至关重要。虽然有几个公开可用的框架用于gpu的可靠性评估，但它们只在软件级别上运行，而不考虑微体系结构。本文旨在精确的进行微架构级GPU软错误漏洞评估。我们介绍了gpuFI-4:一个详细的微架构级故障注入框架，用于评估硬件结构和整个GPU芯片的单比特和多比特故障的跨层漏洞，建立在最先进的模拟器GPGPU-Sim 4.0之上。我们使用GPU -4在支持cuda的Nvidia GPU架构上进行软错误的故障注入。我们的框架分析的目标硬件结构是寄存器文件、共享内存、LI数据和纹理缓存以及L2缓存，总共占片上GPU存储的数十mb。我们展示了报告三种Nvidia GPU芯片模型漏洞的工具的功能:两种不同的调制解调器GPU架构- RTX 2060 (Turing)和Quadro GV100 (Volta) -以及老一代- GTX Titan (Kepler)，用于单比特和三比特故障注入以及在实际物理指令集(SASS)上模拟的12种不同的CUDA基准测试。我们的实验报告了GPU芯片的架构漏洞因子(AVF)(只能在微架构级别进行测量)以及当技术信息被纳入评估时它们的预测失败率(FIT)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

自引率

0.00%

发文量