Dimitris Sartzetakis, G. Papadimitriou, D. Gizopoulos
gpuFI-4: A Microarchitecture-Level Framework for Assessing the Cross-Layer Resilience of Nvidia GPUs
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), May 2022
DOI: 10.1109/ISPASS55109.2022.00004
Citations: 5
Abstract
Pre-silicon reliability evaluation of processors is usually performed at the microarchitecture level or at the software level. Recent studies on CPUs have shown, however, that software-level approaches can mislead the soft error vulnerability assessment process and drive designers towards wrong error protection decisions. To avoid such pitfalls in the GPU domain, the availability of microarchitecture-level reliability assessment tools is of paramount importance. Although there are several publicly available frameworks for the reliability assessment of GPUs, they operate only at the software level and do not consider the microarchitecture. This paper aims at accurate microarchitecture-level GPU soft error vulnerability assessment. We introduce gpuFI-4, a detailed microarchitecture-level fault injection framework built on top of the state-of-the-art simulator GPGPU-Sim 4.0, which assesses the cross-layer vulnerability of hardware structures and entire GPU chips for single-bit and multiple-bit faults. We employ gpuFI-4 for fault injection of soft errors on CUDA-enabled Nvidia GPU architectures. The target hardware structures that our framework analyzes are the register file, the shared memory, the L1 data and texture caches, and the L2 cache, altogether accounting for tens of MBs of on-chip GPU storage. We showcase the features of the tool by reporting the vulnerability of three Nvidia GPU chip models: two modern GPU architectures, RTX 2060 (Turing) and Quadro GV100 (Volta), and an older generation, GTX Titan (Kepler), for both single-bit and triple-bit fault injections and for twelve CUDA benchmarks that are simulated on the actual physical instruction set (SASS). Our experiments report the Architectural Vulnerability Factor (AVF) of the GPU chips (which can only be measured at the microarchitecture level) as well as their predicted Failures in Time (FIT) rate when technology information is incorporated in the assessment.
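To make the two reported metrics concrete, the following minimal Python sketch shows how an AVF estimate and a FIT rate are commonly derived from a statistical fault injection campaign. It is not gpuFI-4 code and does not reflect the tool's interface. The helper names (`flip_bits`, `estimate_avf`, `fit_rate`), the outcome labels (`"masked"`, `"sdc"`, `"crash"`), and all numeric inputs are assumptions chosen for illustration. The sketch assumes the widely used definitions: AVF is the fraction of injected faults that are not masked, and the FIT of a structure is AVF multiplied by the raw per-bit FIT rate and the number of bits.

```python
from collections import Counter


def flip_bits(word: int, positions: list[int], width: int = 32) -> int:
    """Return `word` with the given bit positions inverted.

    One position models a single-bit fault; three positions model a
    triple-bit fault. Hypothetical helper, not part of gpuFI-4.
    """
    for pos in positions:
        if not 0 <= pos < width:
            raise ValueError(f"bit position {pos} is outside a {width}-bit word")
        word ^= 1 << pos
    return word & ((1 << width) - 1)


def estimate_avf(outcomes: list[str]) -> float:
    """Estimate AVF as the fraction of injections that were not masked.

    The outcome labels are assumed; any label other than "masked"
    counts as a failure here.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one injection outcome is required")
    return 1.0 - counts["masked"] / total


def fit_rate(avf: float, raw_fit_per_bit: float, num_bits: int) -> float:
    """FIT of a structure = AVF * raw per-bit FIT * number of bits."""
    return avf * raw_fit_per_bit * num_bits


# Illustrative campaign with made-up outcome counts.
campaign = ["masked"] * 90 + ["sdc"] * 7 + ["crash"] * 3
avf = estimate_avf(campaign)

# Placeholder technology numbers; a real raw FIT rate comes from technology data.
structure_fit = fit_rate(avf, raw_fit_per_bit=1e-4, num_bits=1000)
print(f"AVF = {avf:.3f}, FIT = {structure_fit:.4f}")
```

In a full campaign, each outcome would come from one simulated run in which a randomly chosen bit of the target structure is flipped at a random cycle and the program result is compared against a fault-free run. The chip-level FIT would then be obtained by summing the per-structure values.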