集群和网格环境的工作站点级容错

K. Limaye, C. Leangsuksun, Z. Greenwood, S. Scott, C. Engelmann, Richard Libby, K. Chanchio
{"title":"集群和网格环境的工作站点级容错","authors":"K. Limaye, C. Leangsuksun, Z. Greenwood, S. Scott, C. Engelmann, Richard Libby, K. Chanchio","doi":"10.1109/CLUSTR.2005.347043","DOIUrl":null,"url":null,"abstract":"In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Common fault tolerance techniques in distributed systems are normally achieved with checkpoint-recovery and job replication on alternative resources, in cases of a system outage. The first approach depends on the system's MTTR while the latter approach depends on the availability of alternative sites to run replicas. There is a need for complementing these approaches by proactively handling failures at a job-site level, ensuring the system high availability with no loss of user submitted jobs. This paper discusses a novel fault tolerance technique that enables the job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up a failed system by seeking alternative resources. Our results suggest sizable aggregate performance improvement during an implementation of our method in Globus-enabled HA-OSCAR. The technique called ''smart failover\" provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically, and in critical system events. Thus whenever a failover occurs, the backup server is able to restart the jobs from their last saved state","PeriodicalId":255312,"journal":{"name":"2005 IEEE International Conference on Cluster Computing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Job-Site Level Fault Tolerance for Cluster and Grid environments\",\"authors\":\"K. Limaye, C. Leangsuksun, Z. Greenwood, S. Scott, C. Engelmann, Richard Libby, K. Chanchio\",\"doi\":\"10.1109/CLUSTR.2005.347043\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Common fault tolerance techniques in distributed systems are normally achieved with checkpoint-recovery and job replication on alternative resources, in cases of a system outage. The first approach depends on the system's MTTR while the latter approach depends on the availability of alternative sites to run replicas. There is a need for complementing these approaches by proactively handling failures at a job-site level, ensuring the system high availability with no loss of user submitted jobs. This paper discusses a novel fault tolerance technique that enables the job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up a failed system by seeking alternative resources. Our results suggest sizable aggregate performance improvement during an implementation of our method in Globus-enabled HA-OSCAR. The technique called ''smart failover\\\" provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically, and in critical system events. Thus whenever a failover occurs, the backup server is able to restart the jobs from their last saved state\",\"PeriodicalId\":255312,\"journal\":{\"name\":\"2005 IEEE International Conference on Cluster Computing\",\"volume\":\"51 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2005 IEEE International Conference on Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLUSTR.2005.347043\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2005 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTR.2005.347043","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

摘要

为了在关键任务应用程序中采用高性能集群和网格计算,容错是必要的。在系统中断的情况下,分布式系统中的常见容错技术通常通过检查点恢复和替代资源上的作业复制来实现。第一种方法取决于系统的MTTR,而后一种方法取决于运行副本的备选站点的可用性。有必要通过在作业站点级别主动处理故障来补充这些方法,确保系统的高可用性,而不会丢失用户提交的作业。本文讨论了一种新的容错技术,该技术能够在基于Beowulf集群的网格环境中实现作业现场恢复,而现有技术通过寻找替代资源来放弃故障系统。我们的结果表明,在启用globus的HA-OSCAR中实现我们的方法期间,总体性能得到了相当大的提高。这种称为“智能故障转移”的技术提供了一种透明而优雅的恢复机制,它将作业状态保存在本地作业管理器队列中,并在关键系统事件中定期将这些状态传输到备份服务器。因此,无论何时发生故障转移,备份服务器都能够从上次保存的状态重新启动作业
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Job-Site Level Fault Tolerance for Cluster and Grid environments
In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Common fault tolerance techniques in distributed systems are normally achieved with checkpoint-recovery and job replication on alternative resources, in cases of a system outage. The first approach depends on the system's MTTR while the latter approach depends on the availability of alternative sites to run replicas. There is a need for complementing these approaches by proactively handling failures at a job-site level, ensuring the system high availability with no loss of user submitted jobs. This paper discusses a novel fault tolerance technique that enables the job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up a failed system by seeking alternative resources. Our results suggest sizable aggregate performance improvement during an implementation of our method in Globus-enabled HA-OSCAR. The technique called ''smart failover" provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically, and in critical system events. Thus whenever a failover occurs, the backup server is able to restart the jobs from their last saved state
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信