Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935). Pub Date: 2004-09-20. DOI: 10.1109/CLUSTR.2004.1392609

Pierre Lemarinier, Aurélien Bouteiller, T. Hérault, Géraud Krawezik, F. Cappello


Citations: 82

Abstract

Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message passing systems, with different impacts on application performance and on the capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are: 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing incurs a higher performance penalty than message logging because it puts a higher stress on the checkpoint server. We extend this study to improved versions of the message logging and coordinated checkpointing protocols, which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpointing. We detail the protocols and their implementation in the new MPICH-V fault tolerant framework. We compare their performance against the previous versions, and we compare the novel message logging protocols against the improved coordinated checkpointing protocol using the NAS benchmark on a typical high performance cluster equipped with a high speed network. The contribution of this work is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.
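To make the idea of sender-based message logging more concrete, here is a minimal toy sketch in Python. It is not the MPICH-V implementation and not the protocol proposed in the paper; it only illustrates the general technique under simplifying assumptions. The names `EventLogger`, `Process`, and `recover` are hypothetical, everything runs in a single process, the "checkpoint" is simply the initial state, and the blocking control message is modelled as a synchronous call that returns an acknowledgement.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stable store for reception events (who sent what, in which order)."""
    events: dict = field(default_factory=dict)  # receiver name -> [(sender name, seq)]

    def log(self, receiver, sender, seq):
        self.events.setdefault(receiver, []).append((sender, seq))
        return True  # acknowledgement; waiting for it models the blocking control message


@dataclass
class Process:
    name: str
    logger: EventLogger
    state: int = 0       # toy application state, sensitive to delivery order
    next_seq: int = 0    # per-sender sequence number
    sent_log: dict = field(default_factory=dict)  # (dest name, seq) -> payload

    def send(self, dest, payload):
        seq = self.next_seq
        self.next_seq += 1
        # Sender-based logging: the payload is kept in the sender's memory.
        self.sent_log[(dest.name, seq)] = payload
        dest.deliver(self, seq, payload)

    def deliver(self, sender, seq, payload):
        # Pessimistic step (simplified): record the reception event and wait
        # for the acknowledgement before the message affects the state.
        acked = self.logger.log(self.name, sender.name, seq)
        assert acked
        self.state = self.state * 31 + payload


def recover(name, logger, peers):
    """Restart from the initial state and replay logged receptions in order."""
    fresh = Process(name, logger)
    for sender_name, seq in logger.events.get(name, []):
        payload = peers[sender_name].sent_log[(name, seq)]
        fresh.state = fresh.state * 31 + payload
    return fresh


logger = EventLogger()
a, b, c = Process("a", logger), Process("b", logger), Process("c", logger)
a.send(c, 5)
b.send(c, 7)
a.send(c, 11)
state_before_crash = c.state

# Simulate a crash of "c": only the event log and the senders' logs survive.
c_recovered = recover("c", logger, {"a": a, "b": b})
```

In this sketch only the failed process is rebuilt, by replaying payloads from the surviving senders in the logged order, whereas a coordinated checkpointing scheme would roll every process back to the last global checkpoint. The extra synchronous call in `deliver` stands in for the additional blocking control messages that the abstract identifies as the source of the latency overhead.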
