Daniel Oliveira, Vinicius Fratin, P. Navaux, I. Koren, P. Rech
{"title":"CAROL-FI:现代高性能计算并行加速器漏洞评估的高效故障注入工具","authors":"Daniel Oliveira, Vinicius Fratin, P. Navaux, I. Koren, P. Rech","doi":"10.1145/3075564.3075598","DOIUrl":null,"url":null,"abstract":"Transient faults are a major problem for large scale HPC systems, and the mitigation of adverse fault effects need to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights to design efficient mitigation techniques, like selective hardening of critical portions of the code. We performed a fault injection campaign injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot while LavaMD and NW may require a complete code hardening.","PeriodicalId":398898,"journal":{"name":"Proceedings of the Computing Frontiers Conference","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators\",\"authors\":\"Daniel Oliveira, Vinicius Fratin, P. Navaux, I. Koren, P. Rech\",\"doi\":\"10.1145/3075564.3075598\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transient faults are a major problem for large scale HPC systems, and the mitigation of adverse fault effects need to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights to design efficient mitigation techniques, like selective hardening of critical portions of the code. We performed a fault injection campaign injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot while LavaMD and NW may require a complete code hardening.\",\"PeriodicalId\":398898,\"journal\":{\"name\":\"Proceedings of the Computing Frontiers Conference\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Computing Frontiers Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3075564.3075598\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Computing Frontiers Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3075564.3075598","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators
Transient faults are a major problem for large scale HPC systems, and the mitigation of adverse fault effects need to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights to design efficient mitigation techniques, like selective hardening of critical portions of the code. We performed a fault injection campaign injecting more than 67,000 faults into an Intel Xeon Phi executing six representative HPC programs. We show that selective hardening can be successfully applied to DGEMM and Hotspot while LavaMD and NW may require a complete code hardening.