Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2015-03-09. DOI: 10.1109/HPCA.2015.7056044

Devesh Tiwari, Saurabh Gupta, James H. Rogers, Don E. Maxwell, P. Rech, Sudharshan S. Vazhkudai, Daniel Oliveira, Dave Londo, Nathan Debardeleben, P. Navaux, L. Carro, Arthur S. Bland

Pages: 331-342

Citations: 138

Abstract

Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from graphics-specific accelerators into general-purpose computing devices. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while the performance efficiency of GPUs is well understood, their resilience characteristics in a large-scale computing system have not been fully evaluated. We present a detailed study to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system. Our data was collected from the Titan supercomputer at the Oak Ridge Leadership Computing Facility and a GPU cluster at the Los Alamos National Laboratory. We also present results from our extensive neutron-beam tests, conducted at Los Alamos Neutron Science Center (LANSCE) and at ISIS (Rutherford Appleton Laboratory, UK), to measure the resilience of different generations of GPUs. We present several findings from our field data and neutron-beam experiments, and discuss the implications of our results for future GPU architects, current and future HPC computing facilities, and researchers focusing on GPU resilience.
