具有用于自动恢复的分布式协调检查点的容错管理器

Jorge Villamayor, Dolores Rexachs, E. Luque
{"title":"具有用于自动恢复的分布式协调检查点的容错管理器","authors":"Jorge Villamayor, Dolores Rexachs, E. Luque","doi":"10.1109/HPCS.2017.73","DOIUrl":null,"url":null,"abstract":"Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a well- known architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery\",\"authors\":\"Jorge Villamayor, Dolores Rexachs, E. Luque\",\"doi\":\"10.1109/HPCS.2017.73\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a well- known architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.\",\"PeriodicalId\":115758,\"journal\":{\"name\":\"2017 International Conference on High Performance Computing & Simulation (HPCS)\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 International Conference on High Performance Computing & Simulation (HPCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCS.2017.73\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS.2017.73","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

高性能计算组件不断增加,以实现更高的性能,满足科学应用用户的需求。为了减少这些系统的平均修复时间并提高高可用性,需要容错(FT)解决方案。检查点/重启方法是FT解决方案中广泛使用的机制。在使用消息传递接口实现的并行应用程序中获取检查点的最常用技术之一是协调检查点。本文提出了一种用于协调检查点文件的容错管理器(FTM),可以使用户在丢失计算节点时自动从故障中恢复。该方案使得FT的配置对于不了解其应用程序实现的用户来说更简单和透明。此外,系统管理员不需要在其集群中安装库来支持FTM。它利用节点本地存储来保存检查点,并沿所有计算节点分发检查点的副本,避免了中央稳定存储的瓶颈。这种方法在IaaS云环境中特别有用,因为用户必须为集中稳定的存储服务付费。这项工作是基于RADIC,一个著名的架构,以分布式、灵活、自动和可扩展的方式提供容错。实验结果表明了该方法在私有集群和知名云计算环境Amazon EC2中的优势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Fault Tolerance Manager with Distributed Coordinated Checkpoints for Automatic Recovery
Components for High Performance Computing are continuously increasing to achieve more performance and satisfy scientific application users demands. To reduce the Mean Time To Repair in these systems and increment high availability, Fault Tolerance (FT) solutions are required. The checkpoint/restart approach is a widely used mechanism in FT solutions. One of the most used technique to take checkpoints in parallel applications implemented using Message Passing Interface is the coordinated checkpoints. In this paper a Fault Tolerance Manager (FTM) for coordinated checkpoint files is presented, to provide users automatic recovery from failures when losing computing nodes. This proposal makes the configuration of FT simpler and transparent for users without knowledge of their application implementation. Furthermore, system administrators are not required to install libraries in their cluster to support FTM. It takes advantage of node local storage to save checkpoints, and it distributes copies of them along all the computation nodes, avoiding the bottleneck of a central stable storage. This approach is particularly useful in IaaS cloud environments, where users have to pay for centralized stable storage services. This work is based on RADIC, a well- known architecture to provide fault tolerance in a distributed, flexible, automatic and scalable way. Experimental results shows the benefits of the presented approach in a private cluster and a well-known cloud computing environment, Amazon EC2.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信