Nathan Debardeleben, S. Blanchard, D. Kaeli, P. Rech
2015 IEEE 33rd VLSI Test Symposium (VTS), published 2015-04-27. DOI: 10.1109/VTS.2015.7116295 (https://doi.org/10.1109/VTS.2015.7116295)
Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design
Reliability is a concern for the designers, producers, and users of today's large-scale computing systems. As we approach exascale, the resilience challenge will become critical because of the increase in system scale. It is therefore fundamental to understand the nature of errors, evaluate their probability of occurrence, and improve system design to reduce their impact on the overall system. In this paper we present experimental, field, and analytical data to characterize and quantify errors on accelerators, providing a thorough understanding of the impact of errors on today's and future large-scale systems.