Stable checkpointing in distributed systems without shared disks

Proceedings International Parallel and Distributed Processing Symposium. Pub Date: 2003-04-22. DOI: 10.1109/IPDPS.2003.1213392

P. Sobe

Citations: 18

Abstract

Interacting processes in distributed systems save their checkpoints on local disks for efficiency reasons. However, because local checkpoints become unavailable when their hosts fail, redundancy schemes similar to RAID storage have to be used. In such systems, checkpoints are stable under a particular fault model because they can be reconstructed within the distributed system. In this paper, two variants of stable checkpoint storage are compared: (i) parity grouping over local checkpoints and (ii) RAID-like distribution of each checkpoint using a software-based distributed storage system. An analysis compares the costs of collective checkpoint creation, recovery of a single process, and rollback of all processes. The results show that, despite differences in detail, checkpointing using a distributed storage system is a reasonable solution.
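
To make the idea of parity grouping over local checkpoints concrete, the following is a minimal sketch, not the paper's implementation. It assumes single-failure tolerance via bytewise XOR parity, and the names `xor_parity` and `recover_checkpoint` as well as the sample checkpoint contents are hypothetical. Each process keeps its own checkpoint locally, one parity block is stored on a further host, and a checkpoint lost with a failed host is rebuilt from the survivors plus the parity.

```python
from functools import reduce


def xor_parity(blocks):
    """Bytewise XOR of byte blocks; shorter blocks are zero-padded (assumption)."""
    size = max(len(b) for b in blocks)
    padded = [b.ljust(size, b"\0") for b in blocks]
    # zip(*padded) yields one tuple of ints per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*padded))


def recover_checkpoint(survivors, parity, original_length):
    """Rebuild the single missing checkpoint from the survivors and the parity block."""
    rebuilt = xor_parity(list(survivors) + [parity])
    # Trim the zero padding that xor_parity introduced.
    return rebuilt[:original_length]


# Hypothetical local checkpoints of three processes in one parity group.
checkpoints = [b"state-of-p0", b"p1", b"process-2-state"]
lengths = [len(c) for c in checkpoints]
parity = xor_parity(checkpoints)

# Simulate the failure of the host that held checkpoint 1.
lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = recover_checkpoint(survivors, parity, lengths[lost])
```

In this sketch, recovering one process requires reading every surviving checkpoint in the group along with the parity block. This is only an illustration of the single-process recovery case named in the abstract, not a reproduction of the paper's cost analysis.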



