A proposal and evaluation of a coordinated checkpointing technique using incremental snapshots
Authors: Mamoru Ohara, Masayuki Arai, Satoshi Fukumoto, Kazuhiko Iwasaki
Journal: Electronics and Communications in Japan (Part III: Fundamental Electronic Science), vol. 90, no. 8, pp. 39-53
Published: 2007-04-23
DOI: 10.1002/ecjc.20296
URL: https://onlinelibrary.wiley.com/doi/10.1002/ecjc.20296
Citations: 1
Abstract
Coordinated checkpointing techniques maintain a consistent global state by coordinating processes. The approach requires that the exchange of application messages be suspended temporarily, but in return the rollback procedure for recovering from a fault is simplified and recovery costs are small. As communication costs fall, coordinated techniques are arguably becoming more important. In large-scale systems, however, frequently halting the exchange of messages may seriously impair performance. In this paper we propose a method in which coordination is performed at only a subset of the periodically visited checkpoint generation points, while at the remaining points each process independently generates an incremental snapshot. The method aims both to alleviate the performance degradation caused by coordination and to achieve relatively fast recovery. To evaluate its effectiveness, we estimate checkpointing overheads and recovery costs using a probabilistic model and simulations, and compare the method with existing coordination methods. The results show that, in environments with a relatively low message frequency, the proposed method is more effective than existing coordination methods in terms of both performance and reliability. In addition, we compare two delta schemes for representing the incremental snapshots and discuss the environments to which each is suited. © 2007 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 90(8): 39–53, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20296
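The schedule described in the abstract, with coordination at only some checkpoint generation points and independent incremental snapshots at the rest, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `Process`, `checkpoint_kind`, `run_checkpoint_point`, and `coordination_interval` are hypothetical, process state is modeled as a dictionary whose keys are only added or updated, and each delta is taken relative to the previous snapshot (the abstract does not say which two delta schemes the authors compare). The coordination protocol itself, including the suspension of application messages, is reduced to a comment.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process whose state is a dict (keys are only added or updated)."""
    state: dict = field(default_factory=dict)
    base: dict = field(default_factory=dict)        # last full checkpoint
    deltas: list = field(default_factory=list)      # snapshots since base
    last_saved: dict = field(default_factory=dict)  # state at the last save

    def full_checkpoint(self):
        # Save the whole state and discard the deltas it supersedes.
        self.base = dict(self.state)
        self.last_saved = dict(self.state)
        self.deltas.clear()

    def incremental_snapshot(self):
        # Assumed delta scheme: record only entries changed since the last save.
        delta = {
            k: v for k, v in self.state.items()
            if k not in self.last_saved or self.last_saved[k] != v
        }
        self.deltas.append(delta)
        self.last_saved = dict(self.state)

    def recover(self):
        # Rebuild this process's state: full checkpoint plus deltas in order.
        restored = dict(self.base)
        for delta in self.deltas:
            restored.update(delta)
        return restored


def checkpoint_kind(point_index, coordination_interval):
    """Every coordination_interval-th point is coordinated (assumed policy)."""
    if point_index % coordination_interval == 0:
        return "coordinated"
    return "incremental"


def run_checkpoint_point(processes, point_index, coordination_interval):
    if checkpoint_kind(point_index, coordination_interval) == "coordinated":
        # A real system would pause application messages and run a
        # coordination protocol here; this sketch only saves full states.
        for p in processes:
            p.full_checkpoint()
    else:
        # No coordination: each process snapshots independently.
        for p in processes:
            p.incremental_snapshot()


if __name__ == "__main__":
    procs = [Process(), Process()]
    for i in range(5):
        procs[0].state[f"k{i}"] = i
        procs[1].state["x"] = i
        run_checkpoint_point(procs, i, coordination_interval=3)
    print(procs[0].recover() == procs[0].state)
```

A larger `coordination_interval` means fewer coordinated points but more deltas to replay in `recover`. This loosely mirrors the trade-off between checkpointing overhead and recovery cost that the abstract says is evaluated, though the sketch models neither message consistency nor timing.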