Job-Site Level Fault Tolerance for Cluster and Grid environments

2005 IEEE International Conference on Cluster Computing. Pub Date: 2005-09-01. DOI: 10.1109/CLUSTR.2005.347043

K. Limaye, C. Leangsuksun, Z. Greenwood, S. Scott, C. Engelmann, Richard Libby, K. Chanchio


Citations: 29

Abstract

Fault tolerance is a necessity for adopting high-performance clusters and grid computing in mission-critical applications. In distributed systems, fault tolerance during a system outage is normally achieved with checkpoint-recovery or with job replication on alternative resources. The first approach depends on the system's mean time to repair (MTTR), while the second depends on the availability of alternative sites to run replicas. These approaches need to be complemented by proactive failure handling at the job-site level, which ensures high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Results from an implementation of our method in Globus-enabled HA-OSCAR suggest a sizable aggregate performance improvement. The technique, called "smart failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server both periodically and on critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
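The save-and-transfer cycle described in the abstract can be illustrated with a small in-memory sketch. This is not the HA-OSCAR or Globus implementation: the class names (`JobState`, `PrimaryServer`, `BackupServer`), the `progress` counter, the tick-based sync interval, and the choice to treat job submission as a "critical event" are all assumptions made for illustration only.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class JobState:
    job_id: str
    status: str        # hypothetical states: "queued" or "running"
    progress: int = 0  # hypothetical counter for completed units of work


@dataclass
class BackupServer:
    saved: dict = field(default_factory=dict)

    def receive(self, snapshot):
        # Keep only the most recent snapshot of the primary's job queue.
        self.saved = snapshot

    def take_over(self):
        # On failover, requeue every job from its last saved state.
        return [JobState(j.job_id, "queued", j.progress)
                for j in self.saved.values()]


class PrimaryServer:
    def __init__(self, backup, sync_interval=3):
        self.queue = {}                  # local job-manager queue
        self.backup = backup
        self.sync_interval = sync_interval
        self.ticks = 0

    def submit(self, job_id):
        self.queue[job_id] = JobState(job_id, "queued")
        self.sync()  # assumption: a submission counts as a critical event

    def advance(self, job_id, work=1):
        job = self.queue[job_id]
        job.status = "running"
        job.progress += work

    def tick(self):
        # Periodic transfer: sync every `sync_interval` ticks.
        self.ticks += 1
        if self.ticks % self.sync_interval == 0:
            self.sync()

    def sync(self):
        # Deep copy so later changes on the primary do not leak to the backup.
        self.backup.receive(copy.deepcopy(self.queue))
```

In this sketch, a job that advances four times with a sync interval of three ticks would be restarted by the backup from the state saved at the third tick, so the work done after the last transfer is repeated rather than the whole job being lost. A shorter interval reduces repeated work at the cost of more frequent transfers.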
