Composing resilience techniques: ABFT, periodic and incremental checkpointing

G. Bosilca, Aurélien Bouteiller, T. Hérault, Y. Robert, J. Dongarra
Journal: Int. J. Netw. Comput.
Publication date: 2015-01-10
Publication type: Journal Article
DOI: 10.15803/IJNC.5.1_2 (https://doi.org/10.15803/IJNC.5.1_2)
Citations: 24

Abstract

Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. Thanks to recent advances in the understanding of the mechanisms involved, a growing number of important algorithms (including all widely used factorizations) have been proven ABFT-capable. Within a larger application, these algorithms provide a temporal section of the execution in which the data is protected by its own intrinsic properties and can therefore be algorithmically recomputed without checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave these calls with sections that are difficult or even impossible to protect with ABFT. As a consequence, the only practical fault-tolerance approach for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart to protect an iterative application composed of ABFT-aware and ABFT-unaware sections. We also consider an incremental checkpointing composite approach in which the algorithmic knowledge is leveraged by a novel optimal dynamic programming algorithm to compute checkpoint dates. We validate these models using a simulator. The model and simulator show that the composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing the means to increase the interval between checkpoints while simultaneously decreasing the volume of each checkpoint.
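To make the trade-off described in the abstract concrete, the following is a minimal back-of-the-envelope sketch. It is not the model from the paper. It assumes the classical first-order approximation for periodic checkpointing (Young's period, sqrt(2 * C * mu), valid when the checkpoint cost C is much smaller than the platform MTBF mu) and a deliberately simplified composite in which a fraction `alpha` of the execution runs inside an ABFT-protected library. All function names and parameters (`alpha`, `slowdown`, `recon_cost`) are hypothetical and chosen for illustration; the sketch ignores downtime, restart cost, and the checkpoints taken when entering or leaving the library.

```python
import math


def young_period(ckpt_cost, mtbf):
    """Young's first-order optimal checkpoint period: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def periodic_waste(ckpt_cost, mtbf):
    """First-order waste C/T + T/(2*mu), evaluated at Young's period.

    Assumption: C << mu, and downtime/restart costs are neglected.
    """
    period = young_period(ckpt_cost, mtbf)
    return ckpt_cost / period + period / (2.0 * mtbf)


def abft_waste(slowdown, recon_cost, mtbf):
    """Toy waste inside an ABFT-protected section (illustrative assumption).

    slowdown   -- ABFT overhead factor (>= 1) applied to the computation
    recon_cost -- time to rebuild lost data algorithmically after a failure
    """
    return (1.0 - 1.0 / slowdown) + recon_cost / mtbf


def composite_waste(alpha, ckpt_cost, slowdown, recon_cost, mtbf):
    """Time-weighted mix: periodic checkpointing outside the library,
    ABFT protection (no periodic checkpoints) inside it."""
    general = periodic_waste(ckpt_cost, mtbf)
    library = abft_waste(slowdown, recon_cost, mtbf)
    return (1.0 - alpha) * general + alpha * library


if __name__ == "__main__":
    # Hypothetical values, in seconds; not taken from the paper.
    C, MU = 50.0, 10_000.0
    print("Young period:", young_period(C, MU))
    print("periodic waste:", periodic_waste(C, MU))
    print("composite waste:", composite_waste(0.8, C, 1.03, 100.0, MU))
```

With `alpha = 0` the composite reduces to plain periodic checkpointing. As `alpha` grows, more of the execution avoids periodic checkpoints, which is the intuition behind the longer effective interval between checkpoints mentioned in the abstract.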