Transparent fault tolerance middleware at user level
Marcela Castro-León, Dolores Rexachs, E. Luque
2012 International Conference on High Performance Computing & Simulation (HPCS), published 2012-07-02
DOI: https://doi.org/10.1109/HPCSim.2012.6266974
Citations: 4
Abstract
We present the design of a transparent fault tolerance middleware for message-passing applications. The approach consists of transforming the interconnections used by the application into reliable ones and supporting a log-based rollback-recovery protocol. When one of the nodes of the cluster fails, its processes are recovered on a new node and their connections are reestablished. All of this is done automatically and transparently to the application. The service can optionally be activated at runtime, at user level. The models used to protect and recover the application and to detect failures are based on the RADIC architecture. We have tested the middleware by executing master-worker (M/W) and SPMD applications, which follow different communication patterns.
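To illustrate the idea of log-based recovery behind a transparent interface, the following is a minimal sketch and not the authors' middleware. It assumes a receiver-side message log kept outside the protected process, a deterministic application handler, and no checkpointing, so recovery replays the whole log. The names `MessageLog`, `LoggedReceiver`, and `run_demo` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Hypothetical stable log, assumed to survive the failure of the receiver."""
    entries: list = field(default_factory=list)

    def append(self, seq, payload):
        self.entries.append((seq, payload))


class LoggedReceiver:
    """Wraps delivery so each message is logged before the application sees it."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler  # application callback, unaware of the logging
        self.next_seq = 0

    def deliver(self, payload):
        # Log first, then hand the message to the application.
        self.log.append(self.next_seq, payload)
        self.next_seq += 1
        self.handler(payload)

    @classmethod
    def recover(cls, log, handler):
        """Rebuild a failed receiver by replaying logged messages in order."""
        receiver = cls(log, handler)
        for seq, payload in sorted(log.entries, key=lambda entry: entry[0]):
            handler(payload)  # replay without logging the message again
            receiver.next_seq = seq + 1
        return receiver


def run_demo():
    log = MessageLog()

    # Normal operation: the application accumulates state from messages.
    state = []
    receiver = LoggedReceiver(log, state.append)
    receiver.deliver("a")
    receiver.deliver("b")

    # Simulated node failure: the in-memory state is lost, the log is not.
    recovered_state = []
    recovered = LoggedReceiver.recover(log, recovered_state.append)

    # The recovered process continues receiving where the failed one stopped.
    recovered.deliver("c")
    return state, recovered_state


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the application only supplies a handler, which mirrors the transparency goal described in the abstract: logging and replay happen in the wrapper. A practical protocol would typically pair the log with checkpoints so that only messages received after the last checkpoint need to be replayed.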