{"title":"A Programming Language Approach to Fault Tolerance for Fork-Join Parallelism","authors":"M. Zengin, Viktor Vafeiadis","doi":"10.1109/TASE.2013.22","DOIUrl":null,"url":null,"abstract":"When running big parallel computations on thousands of processors, the probability that an individual processor will fail during the execution cannot be ignored. Computations should be replicated, or else failures should be detected at runtime and failed subcomputations reexecuted. We follow the latter approach and propose a high-level operational semantics that detects computation failures, and allows failed computations to be restarted from the point of failure. We implement this high-level semantics with a lower-level operational semantics that provides a more accurate account of processor failures, and prove in Coq the correspondence between the high- and low-level semantics.","PeriodicalId":346899,"journal":{"name":"2013 International Symposium on Theoretical Aspects of Software Engineering","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 International Symposium on Theoretical Aspects of Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASE.2013.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
When running big parallel computations on thousands of processors, the probability that an individual processor will fail during the execution cannot be ignored. Computations should be replicated, or else failures should be detected at runtime and failed subcomputations reexecuted. We follow the latter approach and propose a high-level operational semantics that detects computation failures, and allows failed computations to be restarted from the point of failure. We implement this high-level semantics with a lower-level operational semantics that provides a more accurate account of processor failures, and prove in Coq the correspondence between the high- and low-level semantics.