Mamoru Ohara, M. Arai, S. Fukumoto, K. Iwasaki
Title: A proposal and evaluation of a coordinated checkpointing technique using incremental snapshots

Abstract: Coordinated checkpointing techniques ensure that a consistent global state is maintained by means of coordination between processes. The approach requires that application messages temporarily cease to be exchanged but the rollback procedure when recovering from a fault is consequently simplified and the recovery costs are small. With current reductions in communications costs, the importance of coordinated techniques may be seen to be growing. However, in large-scale systems there is a possibility that performance will be seriously impaired due to the frequent halting of the exchange of messages. In this paper we propose a method whereby coordination is performed at only a subset of the checkpoint generation points that are periodically visited while at the remaining points each process independently generates an incremental snapshot. This method aims to both alleviate the performance degradation incurred from coordination and to realize relatively high-speed recovery. In evaluating the effectiveness of this method we estimate the checkpointing overheads and recovery costs using a probabilistic model and simulations and compare it with existing coordination methods. The results show that the proposed method is more effective than existing coordination methods from the perspective of both performance and reliability in environments with a relatively low frequency of messages. In addition, we perform comparisons of two different delta schemes for representing the incremental snapshots and discuss which environments they are each respectively suited to.

© 2007 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 90(8): 39–53, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.20296

Journal: Electronics and Communications in Japan (Part III: Fundamental Electronic Science), pages 39-53
Publication date: 2007-04-23
Publication type: Journal Article
URL: https://doi.org/10.1002/ECJC.20296
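The scheduling idea in the abstract (coordinate at only a periodic subset of checkpoint points, take independent incremental snapshots at the rest) can be illustrated with a small sketch. This is not the authors' implementation. The `Process` class, the `coordinate_every` parameter, the dict-based state, and the key-level delta format are all hypothetical choices made for illustration. The abstract does not say how deltas are replayed during recovery, so the sketch simply assumes they are applied in order on top of the last coordinated checkpoint. It also ignores in-flight messages and global consistency.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model: the whole state is one plain dict."""
    state: dict = field(default_factory=dict)
    base: dict = field(default_factory=dict)    # last coordinated (full) checkpoint
    deltas: list = field(default_factory=list)  # incremental snapshots since `base`
    last: dict = field(default_factory=dict)    # state at the previous checkpoint point

    def full_checkpoint(self):
        # Save the complete state and discard older deltas.
        self.base = dict(self.state)
        self.deltas = []
        self.last = dict(self.state)

    def incremental_snapshot(self):
        # Record only what changed since the previous checkpoint point
        # (assumed delta format: changed keys plus a list of removed keys).
        changed = {k: v for k, v in self.state.items()
                   if k not in self.last or self.last[k] != v}
        removed = [k for k in self.last if k not in self.state]
        self.deltas.append((changed, removed))
        self.last = dict(self.state)

    def recover(self):
        # Assumed recovery: replay the deltas over the last full checkpoint.
        restored = dict(self.base)
        for changed, removed in self.deltas:
            restored.update(changed)
            for key in removed:
                restored.pop(key, None)
        self.state = restored
        self.last = dict(restored)
        return restored


def checkpoint_point(processes, index, coordinate_every):
    """Coordinate at every `coordinate_every`-th point, snapshot otherwise."""
    if index % coordinate_every == 0:
        # Stand-in for a coordinated checkpoint: a real system would halt
        # application messages here while all processes checkpoint together.
        for p in processes:
            p.full_checkpoint()
        return "coordinated"
    # No coordination: each process takes its own incremental snapshot.
    for p in processes:
        p.incremental_snapshot()
    return "incremental"


def demo():
    procs = [Process(), Process()]
    kinds = []
    for i in range(5):
        procs[0].state[f"k{i}"] = i  # simulate work between checkpoint points
        kinds.append(checkpoint_point(procs, i, coordinate_every=3))
    return procs, kinds
```

With `coordinate_every=3`, points 0 and 3 are coordinated and the others are incremental, so message exchange would be halted at only one point in three. A larger interval reduces coordination overhead at the price of more deltas to replay on recovery, which is the kind of trade-off the abstract says is evaluated.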
Citations: 1