Reliability-aware resource management for computational grid/cluster environments

K. Limaye, C. Leangsuksun, Yudan Liu, Z. Greenwood, S. Scott, Richard Libby, K. Chanchio
Venue: The 6th IEEE/ACM International Workshop on Grid Computing, 2005
Publication date: 2005-11-13
DOI: 10.1109/GRID.2005.1542744 (https://doi.org/10.1109/GRID.2005.1542744)
Citations: 13

Abstract

The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in an existing environment where job sites are Beowulf cluster systems, the failure of a service node may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in computational grid environments. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid, along with preliminary results obtained while implementing a smart failover and transparent job-queue replication mechanism and an automated grid installation package. Our report shows that, after implementing our proof-of-concept framework, the benefits outweigh the acceptable overhead.
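The abstract names failover and job-queue replication but gives no implementation detail, so the following is only a minimal sketch of the general idea, not the authors' mechanism. It assumes a single active/standby pair of head nodes and an in-memory queue; the names `HeadNode`, `ReplicatedQueue`, and `check_and_failover` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HeadNode:
    """Hypothetical service (head) node that holds a copy of the job queue."""
    name: str
    alive: bool = True
    queue: List[str] = field(default_factory=list)


class ReplicatedQueue:
    """Toy active/standby pair: each queue change is mirrored to the standby."""

    def __init__(self, primary: HeadNode, standby: HeadNode) -> None:
        self.primary = primary
        self.standby = standby

    def check_and_failover(self) -> bool:
        """Promote the standby if the primary is down. Returns True on failover."""
        if self.primary.alive:
            return False
        if not self.standby.alive:
            raise RuntimeError("no healthy head node left")
        # Swap roles; the failed node becomes the (dead) standby.
        self.primary, self.standby = self.standby, self.primary
        return True

    def submit(self, job_id: str) -> None:
        # Fail over first so the caller never needs to know which node is active.
        self.check_and_failover()
        self.primary.queue.append(job_id)
        if self.standby.alive:
            self.standby.queue.append(job_id)  # replicate the submission

    def next_job(self) -> Optional[str]:
        self.check_and_failover()
        if not self.primary.queue:
            return None
        job = self.primary.queue.pop(0)
        # Keep the standby copy in step with the active queue.
        if self.standby.alive and job in self.standby.queue:
            self.standby.queue.remove(job)
        return job
```

In this sketch, a job submitted before the primary fails is still present on the promoted standby, which is the "no loss of user-submitted jobs" property the abstract asks for. A real deployment would also need failure detection, persistent storage, and re-synchronization of a repaired node, none of which is modeled here.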