Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters

2005 IEEE International Conference on Cluster Computing. Pub Date: 2005-09-01. DOI: 10.1109/CLUSTR.2005.347039

Oren Laadan, Dan B. Phung, Jason Nieh


Citations: 48

Abstract

We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner, such that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. ZapC checkpoint-restart operations execute in parallel across different cluster nodes, providing faster checkpoint-restart performance. ZapC uniquely supports network state in a transport-protocol-independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a ZapC Linux prototype and demonstrate that it provides low virtualization overhead and fast checkpoint-restart times for distributed network applications without any application, library, kernel, or network protocol modifications.
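The idea of a coordinated, parallel checkpoint followed by a restart on different nodes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ZapC's implementation: the `Node` class, the three-phase ordering (quiesce network, save, resume), and all function names are hypothetical and are not taken from the paper. Threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one cluster node hosting part of the application."""
    name: str
    app_state: dict
    network_blocked: bool = False

    def block_network(self) -> None:
        # Assumption: traffic is frozen so nothing is in flight while state is saved.
        self.network_blocked = True

    def save_state(self) -> dict:
        # Copy the node's local state into a checkpoint image.
        assert self.network_blocked, "save only while the network is quiesced"
        return {"node": self.name, "state": dict(self.app_state)}

    def unblock_network(self) -> None:
        self.network_blocked = False


def coordinated_checkpoint(nodes):
    """Checkpoint all nodes in parallel, synchronizing between phases."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Phase 1: every node quiesces its network before any node saves.
        list(pool.map(Node.block_network, nodes))
        # Phase 2: nodes save their local state concurrently.
        images = list(pool.map(Node.save_state, nodes))
        # Phase 3: resume only after every image is complete.
        list(pool.map(Node.unblock_network, nodes))
    return images


def restart(images, new_node_names):
    """Rebuild the application on a (possibly different) set of nodes."""
    return [Node(name, dict(img["state"]))
            for name, img in zip(new_node_names, images)]
```

For example, `restart(coordinated_checkpoint(nodes), ["c", "d"])` would recreate the saved state of two nodes on nodes named `c` and `d`. Consuming each `pool.map` with `list(...)` acts as a barrier between phases, which is the "coordinated" part of this sketch.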

