{"title":"A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery","authors":"Jorge Villamayor, Dolores Rexachs, E. Luque","doi":"10.1109/HPCS.2017.73","DOIUrl":null,"url":null,"abstract":"Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a well- known architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS.2017.73","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a well- known architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.