一种使用准同步检查点的低开销恢复技术

Proceedings of 16th International Conference on Distributed Computing Systems Pub Date : 1996-05-27 DOI:10.1109/ICDCS.1996.507906

D. Manivannan, M. Singhal

{"title":"一种使用准同步检查点的低开销恢复技术","authors":"D. Manivannan, M. Singhal","doi":"10.1109/ICDCS.1996.507906","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easiness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. There is no extra message overhead involved during checkpointing and the additional checkpointing overhead is nominal. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect and a failed process needs only to rollback to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid domino effect, it uses selective pessimistic message logging at the receiver end. The recovery is asynchronous for single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps for determining dependency between checkpoints since vector timestamps generally result in high message overhead during failure-free operation.","PeriodicalId":159322,"journal":{"name":"Proceedings of 16th International Conference on Distributed Computing Systems","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1996-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"140","resultStr":"{\"title\":\"A low-overhead recovery technique using quasi-synchronous checkpointing\",\"authors\":\"D. Manivannan, M. Singhal\",\"doi\":\"10.1109/ICDCS.1996.507906\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easiness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. There is no extra message overhead involved during checkpointing and the additional checkpointing overhead is nominal. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect and a failed process needs only to rollback to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid domino effect, it uses selective pessimistic message logging at the receiver end. The recovery is asynchronous for single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps for determining dependency between checkpoints since vector timestamps generally result in high message overhead during failure-free operation.\",\"PeriodicalId\":159322,\"journal\":{\"name\":\"Proceedings of 16th International Conference on Distributed Computing Systems\",\"volume\":\"73 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1996-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"140\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of 16th International Conference on Distributed Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCS.1996.507906\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of 16th International Conference on Distributed Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.1996.507906","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 140

摘要

本文提出了一种准同步检查点算法和基于此算法的低开销恢复算法。检查点算法通过允许进程异步获取检查点来保持进程的自主性，并使用通信诱导的检查点协调来推进恢复行，这有助于在恢复期间绑定回滚传播。因此，它具有异步检查点的简单性和低开销以及同步检查点的恢复时间优势。在检查点期间不涉及额外的消息开销，并且额外的检查点开销是象征性的。该算法保证任何进程始终存在一条与最新检查点一致的恢复线。恢复算法利用这一特性将系统恢复到与失败进程的最新检查点一致的状态。恢复算法没有多米诺骨牌效应，失败的进程只需要回滚到最新的检查点，并请求其他进程回滚到一致的检查点。为了避免多米诺骨牌效应，它在接收端使用选择性悲观消息日志记录。对于单进程故障，恢复是异步的。无论是恢复算法还是检查点算法都不要求通道是FIFO。我们不使用矢量时间戳来确定检查点之间的依赖关系，因为矢量时间戳通常会在无故障操作期间导致较高的消息开销。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A low-overhead recovery technique using quasi-synchronous checkpointing

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easiness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. There is no extra message overhead involved during checkpointing and the additional checkpointing overhead is nominal. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. The recovery algorithm exploits this feature to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect and a failed process needs only to rollback to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid domino effect, it uses selective pessimistic message logging at the receiver end. The recovery is asynchronous for single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps for determining dependency between checkpoints since vector timestamps generally result in high message overhead during failure-free operation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of 16th International Conference on Distributed Computing Systems

自引率

0.00%

发文量