计算网格的分布式故障管理

M. Affaan, M.A Ansari
{"title":"计算网格的分布式故障管理","authors":"M. Affaan, M.A Ansari","doi":"10.1109/GCC.2006.39","DOIUrl":null,"url":null,"abstract":"Grid resources having heterogeneous architectures, being geographically distributed and interconnected via unreliable network media, are at the risk of failure. Grid environment consists of unreliable resources; therefore, fault tolerant mechanisms can not be ignored. Some scientific jobs require long commitments of grid resources whose failures may not be overlooked. We need a flexible management of these failures by considering the failure of fault manager itself. In this paper we propose the concept of distributed management of failures without engaging the resources for this particular task exclusively. Resources performing the fault management may also participate in serving the long running user jobs. Each sub-job of the main user job is inspected by an individual resource. In case of failure inspector resource takes over in place of inspected resource. Contributions of this paper are: elimination of single point of failure and proposed concept's ability to be integrated with variety of grid middleware","PeriodicalId":280249,"journal":{"name":"2006 Fifth International Conference on Grid and Cooperative Computing (GCC'06)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Distributed Fault Management for Computational Grids\",\"authors\":\"M. Affaan, M.A Ansari\",\"doi\":\"10.1109/GCC.2006.39\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Grid resources having heterogeneous architectures, being geographically distributed and interconnected via unreliable network media, are at the risk of failure. Grid environment consists of unreliable resources; therefore, fault tolerant mechanisms can not be ignored. Some scientific jobs require long commitments of grid resources whose failures may not be overlooked. We need a flexible management of these failures by considering the failure of fault manager itself. In this paper we propose the concept of distributed management of failures without engaging the resources for this particular task exclusively. Resources performing the fault management may also participate in serving the long running user jobs. Each sub-job of the main user job is inspected by an individual resource. In case of failure inspector resource takes over in place of inspected resource. Contributions of this paper are: elimination of single point of failure and proposed concept's ability to be integrated with variety of grid middleware\",\"PeriodicalId\":280249,\"journal\":{\"name\":\"2006 Fifth International Conference on Grid and Cooperative Computing (GCC'06)\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2006 Fifth International Conference on Grid and Cooperative Computing (GCC'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/GCC.2006.39\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 Fifth International Conference on Grid and Cooperative Computing (GCC'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GCC.2006.39","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

摘要

具有异构体系结构的网格资源,在地理上分布并且通过不可靠的网络媒介相互连接,有失败的风险。网格环境由不可靠的资源组成;因此,容错机制是不容忽视的。一些科学工作需要长期使用网格资源,这些资源的失败可能不容忽视。我们需要通过考虑故障管理器本身的故障来灵活地管理这些故障。在本文中,我们提出了分布式故障管理的概念,而不需要专门为这一特定任务使用资源。执行故障管理的资源也可能参与为长期运行的用户作业提供服务。主用户作业的每个子作业由单个资源检查。如果发生故障,检查员资源将取代被检查的资源。本文的贡献是:消除了单点故障,提出的概念能够与各种网格中间件集成
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Distributed Fault Management for Computational Grids
Grid resources having heterogeneous architectures, being geographically distributed and interconnected via unreliable network media, are at the risk of failure. Grid environment consists of unreliable resources; therefore, fault tolerant mechanisms can not be ignored. Some scientific jobs require long commitments of grid resources whose failures may not be overlooked. We need a flexible management of these failures by considering the failure of fault manager itself. In this paper we propose the concept of distributed management of failures without engaging the resources for this particular task exclusively. Resources performing the fault management may also participate in serving the long running user jobs. Each sub-job of the main user job is inspected by an individual resource. In case of failure inspector resource takes over in place of inspected resource. Contributions of this paper are: elimination of single point of failure and proposed concept's ability to be integrated with variety of grid middleware
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信