Omer Subasi, J. Moreno, O. Unsal, Jesús Labarta, A. Cristal
{"title":"Programmer-directed partial redundancy for resilient HPC","authors":"Omer Subasi, J. Moreno, O. Unsal, Jesús Labarta, A. Cristal","doi":"10.1145/2742854.2742903","DOIUrl":null,"url":null,"abstract":"In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all application tasks can be prohibitive due to resource costs, we introduce programmer-directed selective replication mechanism to provide fault-tolerance while decreasing costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2742854.2742903","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24
Abstract
In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all application tasks can be prohibitive due to resource costs, we introduce programmer-directed selective replication mechanism to provide fault-tolerance while decreasing costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.