Quantifying rollback propagation in distributed checkpointing
A. Agbaria, H. Attiya, R. Friedman, R. Vitenberg
Proceedings 20th IEEE Symposium on Reliable Distributed Systems, 28 October 2001
DOI: 10.1109/RELDIS.2001.969737
Citations: 21
Abstract
This paper proposes a new classification of checkpointed executions based on the notion of k-rollback, which indicates the maximum number of checkpoints that may need to be rolled back during recovery. It explores the relations among known execution classes and shows that coordinated checkpointing, SZPF (strictly Z-path free) and ZPF (Z-path free) are 1-rollback mechanisms, while ZCF (Z-cycle free) is (n-1)-rollback, where n is the number of participants in an execution. A new class of executions, called d-BC (d-bounded cycles), is introduced and shown to be an [(n-1)·d]-rollback mechanism (ZCF is the special case of d-BC with d=1). Finally, a d-BC protocol is presented. The protocol adds no control information to application messages and sends only a few control messages of its own. It also maintains information about recovery lines, which enables very efficient discovery of the most recent recovery line that existed shortly before a failure.
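The Python sketch below makes "number of checkpoints rolled back" concrete. It implements the classic rollback-propagation computation of a recovery line. It is not the authors' d-BC protocol, which the abstract does not specify.

The sketch rests on several assumptions:
- Every process restarts from its latest checkpoint.
- Checkpoint 0 is the initial state.
- Each message is described by the checkpoint interval in which it was sent and received. Interval i lies between checkpoints i and i+1.
- A process's rollback count is the distance from its latest checkpoint to the checkpoint chosen for the recovery line. This is an assumed reading of k, not a definition taken from the paper.

The names `Message`, `recovery_line`, `rollback_counts` and `dbc_rollback_bound` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical encoding: interval i of a process spans checkpoint i to
    # checkpoint i+1; the interval after the latest checkpoint is unsaved.
    sender: int
    send_interval: int
    receiver: int
    recv_interval: int


def recovery_line(last_ckpt: list[int], messages: list[Message]) -> list[int]:
    """Roll back from each process's latest checkpoint until no message is orphan."""
    line = list(last_ckpt)  # assumption: every process restarts from its latest checkpoint
    changed = True
    while changed:
        changed = False
        for m in messages:
            received = m.recv_interval < line[m.receiver]  # receipt saved in receiver's checkpoint
            sent = m.send_interval < line[m.sender]  # send saved in sender's checkpoint
            if received and not sent:
                # Orphan message: move the receiver back to the checkpoint that
                # opens the interval of the receipt, i.e. before the receive event.
                line[m.receiver] = m.recv_interval
                changed = True
    return line


def rollback_counts(last_ckpt: list[int], line: list[int]) -> list[int]:
    """Checkpoints each process discards beyond its latest one (assumed reading of k)."""
    return [last - chosen for last, chosen in zip(last_ckpt, line)]


def dbc_rollback_bound(n: int, d: int) -> int:
    """The abstract's bound for d-BC executions; d = 1 gives the ZCF bound n - 1."""
    return (n - 1) * d


if __name__ == "__main__":
    last = [2, 2]  # two processes, each with checkpoints 0..2
    msgs = [
        Message(0, 2, 1, 1),  # sent after P0's latest checkpoint, receipt saved by P1
        Message(1, 1, 0, 0),  # sent by P1 in interval 1, received by P0 in interval 0
        Message(0, 0, 1, 0),  # sent by P0 in interval 0, received by P1 in interval 0
    ]
    line = recovery_line(last, msgs)
    print(line, rollback_counts(last, line))  # → [0, 0] [2, 2]
```

Consider the first message alone. The computation then stops at [2, 1], so only P1 discards one checkpoint. Adding the two older messages propagates the rollback all the way back to the initial states (a domino effect), which is the kind of propagation that k-rollback quantifies. Each update strictly lowers one process's checkpoint index, so the loop always terminates.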