Mailbox-based non blocking minimum-process coordinated checkpointing with message logging for hierarchical computational grid (MNMCCP)

2012 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA). Pub Date: 2012-12-01. DOI: 10.1109/ICTEA.2012.6462910

G. A. El-Sayed, A. M. Abdullah


Citations: 1

Abstract

Running MPI applications on Grid deployments exposes them to node and network failures, which motivates the use of fault-tolerant MPI implementations. To achieve good performance on the Grid, fault tolerance must therefore be scalable. Research in fault-tolerant MPI has produced several fault-tolerant MPI environments, and the most popular approach is coordinated checkpointing. Traditional coordinated checkpointing suffers from high storage and communication overhead because all processes must take part in every checkpoint; in addition, all processes, including those that did not fail, must roll back to the latest consistent global checkpoint. In this paper, we propose a mailbox-based non-blocking minimum-process coordinated checkpointing protocol (MNMCCP) for hierarchical Grids, in which processes on different processors communicate indirectly: messages are sent over the network to mailboxes held at a shared node. In this protocol, only the processes that have communicated since the last committed checkpoint participate in a checkpoint, which reduces stable-storage accesses and improves performance. The protocol also uses each process's mailbox as an event logger, since the mailbox logs the messages sent to the process in strict FIFO order. Combining the coordinated protocol with message logging offers several additional advantages, including limited loss of computation, a simplified recovery procedure, and bounded recovery time, because the effects of a failure are confined to the processes that fail. In addition, the mailbox technique ensures reliable message delivery: messages sent to a migrating or faulty process are neither lost nor retransmitted. They are kept in the process's mailbox, and the process retrieves them after restarting on the new resource. Together, these properties make the protocol well suited to highly dynamic environments where processes frequently migrate from one node to another, helping jobs complete within their deadlines and making such environments trustworthy.
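The two mechanisms named in the abstract, a mailbox that doubles as a FIFO message log and a checkpoint that involves only processes that have communicated, can be illustrated with a small Python sketch. This is not the authors' algorithm, which the abstract does not specify in detail. The names `Mailbox`, `Grid`, `checkpoint_set`, and `rollback` are hypothetical, and the sketch assumes that a process depends on every sender whose message it has received since its last committed checkpoint, with dependencies followed transitively. Non-blocking coordination, the hierarchical layout, and stable storage are all omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Hypothetical per-process mailbox kept at a shared node."""
    log: list = field(default_factory=list)  # every message, in arrival (FIFO) order
    next_unread: int = 0                     # index of the next message to deliver
    checkpoint_mark: int = 0                 # next_unread as of the last checkpoint

    def deposit(self, sender, payload):
        # Messages are appended, never dropped, so the mailbox is also a log.
        self.log.append((sender, payload))

    def fetch(self):
        # Deliver the oldest undelivered message, or None if there is none.
        if self.next_unread >= len(self.log):
            return None
        msg = self.log[self.next_unread]
        self.next_unread += 1
        return msg

    def commit_checkpoint(self):
        self.checkpoint_mark = self.next_unread

    def rollback(self):
        # After a failure or migration, replay from the checkpointed position.
        self.next_unread = self.checkpoint_mark


class Grid:
    """Toy model: processes talk only through mailboxes (an assumption)."""

    def __init__(self, names):
        self.mailboxes = {n: Mailbox() for n in names}
        # Senders each process has received from since its last checkpoint.
        self.peers = {n: set() for n in names}

    def send(self, src, dst, payload):
        self.mailboxes[dst].deposit(src, payload)

    def receive(self, dst):
        msg = self.mailboxes[dst].fetch()
        if msg is not None:
            self.peers[dst].add(msg[0])
        return msg

    def checkpoint_set(self, initiator):
        # Transitive closure of "received from" starting at the initiator.
        seen = {initiator}
        stack = [initiator]
        while stack:
            p = stack.pop()
            for q in self.peers[p]:
                if q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    def checkpoint(self, initiator):
        # Only the dependent processes checkpoint; the rest are untouched.
        participants = self.checkpoint_set(initiator)
        for p in participants:
            self.mailboxes[p].commit_checkpoint()
            self.peers[p].clear()
        return participants
```

In this toy model, an idle process is never drawn into a checkpoint, and a restarted process re-reads from its mailbox the messages it had consumed after its last checkpoint, so senders do not need to retransmit them.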
