Distributed Fault Management for Computational Grids

2006 Fifth International Conference on Grid and Cooperative Computing (GCC'06). Pub Date: 2006-10-21. DOI: 10.1109/GCC.2006.39

M. Affaan, M.A Ansari


Citations: 21

Abstract


Distributed Fault Management for Computational Grids

Grid resources have heterogeneous architectures, are geographically distributed, and are interconnected via unreliable network media, so they are at risk of failure. Because a grid environment consists of unreliable resources, fault-tolerance mechanisms cannot be ignored. Some scientific jobs require long commitments of grid resources, and failures of those resources cannot be overlooked. Managing these failures flexibly requires accounting for the failure of the fault manager itself. In this paper we propose the concept of distributed failure management that does not dedicate resources exclusively to this task: resources performing fault management may also participate in serving long-running user jobs. Each sub-job of the main user job is inspected by an individual resource. In case of failure, the inspector resource takes over in place of the inspected resource. The contributions of this paper are the elimination of the single point of failure and the proposed concept's ability to be integrated with a variety of grid middleware.
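The inspector/inspected relationship described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how inspectors are chosen or how failures are detected, so the ring pairing (each resource is inspected by the next one), the `alive` flag, and the names `Resource`, `assign_inspectors`, and `handle_failures` are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A grid resource that runs sub-jobs and may also inspect a peer."""
    name: str
    alive: bool = True  # hypothetical stand-in for real failure detection
    sub_jobs: list = field(default_factory=list)


def assign_inspectors(resources):
    """Map each resource name to its inspector.

    Assumption: a simple ring, where resource i is inspected by
    resource i + 1. The paper may pair resources differently.
    """
    n = len(resources)
    return {resources[i].name: resources[(i + 1) % n] for i in range(n)}


def handle_failures(resources, inspectors):
    """Move the sub-jobs of each failed resource onto its inspector."""
    takeovers = []
    for res in resources:
        if res.alive or not res.sub_jobs:
            continue
        inspector = inspectors[res.name]
        if not inspector.alive:
            # Simultaneous failure of a resource and its inspector
            # is outside the scope of this sketch.
            continue
        moved = list(res.sub_jobs)
        inspector.sub_jobs.extend(moved)
        res.sub_jobs.clear()
        takeovers.append((res.name, inspector.name, moved))
    return takeovers


# Three resources, each serving one sub-job of the same user job.
resources = [
    Resource("r0", sub_jobs=["job.part0"]),
    Resource("r1", sub_jobs=["job.part1"]),
    Resource("r2", sub_jobs=["job.part2"]),
]
inspectors = assign_inspectors(resources)

resources[1].alive = False  # simulate the failure of r1
takeovers = handle_failures(resources, inspectors)
print(takeovers)
```

In this sketch no resource is reserved for fault management: every inspector also runs its own sub-job, and because inspection is spread across all participants there is no central fault manager whose loss would stop recovery.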

