Amir Taherin, Tirthak Patel, G. Georgakoudis, I. Laguna, Devesh Tiwari
{"title":"Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes","authors":"Amir Taherin, Tirthak Patel, G. Georgakoudis, I. Laguna, Devesh Tiwari","doi":"10.1109/DSN48987.2021.00043","DOIUrl":null,"url":null,"abstract":"Understanding the reliability characteristics of supercomputers has been a key focus of the HPC and dependability communities. However, there is no current study that analyzes both the failure and recovery characteristics over multiple generations of a GPU-based supercomputer with multiple GPUs on the same node. This paper bridges that gap and reveals surprising insights based on monitoring and analyzing the failures and repairs on the Tsubame-2 and Tsubame-3 supercomputers.","PeriodicalId":222512,"journal":{"name":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN48987.2021.00043","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Understanding the reliability characteristics of supercomputers has been a key focus of the HPC and dependability communities. However, there is no current study that analyzes both the failure and recovery characteristics over multiple generations of a GPU-based supercomputer with multiple GPUs on the same node. This paper bridges that gap and reveals surprising insights based on monitoring and analyzing the failures and repairs on the Tsubame-2 and Tsubame-3 supercomputers.