A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery

2017 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2017-07-01. DOI: 10.1109/HPCS.2017.73

Jorge Villamayor, Dolores Rexachs, E. Luque


Citations: 2

Abstract

The number of components in High Performance Computing (HPC) systems is continuously increasing to deliver more performance and meet the demands of scientific application users. To reduce the Mean Time To Repair in these systems and increase availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most widely used techniques for taking checkpoints in parallel applications implemented with the Message Passing Interface (MPI) is coordinated checkpointing. This paper presents a Fault Tolerance Manager (FTM) for coordinated checkpoint files that gives users automatic recovery from failures when computing nodes are lost. The proposal makes FT configuration simpler and transparent for users who have no knowledge of their application's implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. FTM takes advantage of node-local storage to save checkpoints and distributes copies of them across all the computation nodes, avoiding the bottleneck of central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. The work is based on RADIC, a well-known architecture that provides fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results show the benefits of the presented approach in a private cluster and in a well-known cloud computing environment, Amazon EC2.
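The idea of keeping each checkpoint on node-local storage and distributing copies across the computation nodes can be illustrated with a small sketch. The abstract does not state how copies are placed, so the ring rule below (each node's copy goes to the next node) is purely an assumption made for illustration, and the names `replica_holder`, `place_checkpoints` and `recovery_source` are hypothetical rather than part of FTM or RADIC.

```python
from typing import Dict, List, Optional, Set


def replica_holder(node: int, num_nodes: int) -> int:
    """Assumed placement rule: the copy of a node's checkpoint
    is stored on the next node in a logical ring."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes to keep a remote copy")
    return (node + 1) % num_nodes


def place_checkpoints(num_nodes: int) -> Dict[int, List[int]]:
    """Return, for each node, the owners of the checkpoints held
    on its local storage (its own first, then any remote copies)."""
    stored: Dict[int, List[int]] = {n: [n] for n in range(num_nodes)}
    for owner in range(num_nodes):
        stored[replica_holder(owner, num_nodes)].append(owner)
    return stored


def recovery_source(failed: int, num_nodes: int, alive: Set[int]) -> Optional[int]:
    """Pick the surviving node that holds the failed node's checkpoint copy,
    or None if that holder has also been lost."""
    holder = replica_holder(failed, num_nodes)
    return holder if holder in alive else None
```

Under this assumed rule, no central storage service is involved: losing one node still leaves its checkpoint available on a neighbour, while losing a node and its copy holder at the same time would not be recoverable in this simplified model.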

