Towards Optimal Fault Tolerant Scheduling in Computational Grid

2007 International Conference on Emerging Technologies Pub Date : 2007-11-01 DOI:10.1109/ICET.2007.4516335

M. Imran, I. A. Niaz, S. Haider, N. Hussain, M. Ansari


Citations: 8

Abstract

Grid environments face significant challenges because of the diverse failures encountered during job execution. Computational grids provide the main execution platform for long-running jobs, which require a long commitment of grid resources, so fault tolerance in such an environment cannot be ignored. Most grid middleware has either ignored failure issues or developed ad hoc solutions. Most existing fault tolerance techniques are application dependent and cause a cognitive problem. This paper examines existing fault detection and tolerance techniques in various middleware. We propose a fault-tolerant layered grid architecture with a cross-layered design. In our approach, the Hybrid Particle Swarm Optimization (HPSO) algorithm and the Anycast technique are used in conjunction with the Globus middleware. We adopt a proactive and reactive fault management strategy for centralized and distributed environments. The proposed strategy helps identify the root cause of failures and resolve the cognitive problem. Our strategy minimizes computation and communication, thus achieving higher reliability. Anycast limits the effect of Denial of Service and Distributed Denial of Service (DoS/DDoS) attacks to the area nearest the source of the attack, thus achieving better security. Significant performance improvement is achieved by using Anycast before HPSO. The selection of more reliable nodes results in less checkpointing overhead.
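
The closing claim, that choosing more reliable nodes lowers checkpointing overhead, can be illustrated with a rough sketch. The abstract does not give the authors' checkpointing model, so this example assumes Young's classical first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF), which is not taken from the paper. The function names, the node dictionaries, and the numbers are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval.

    Assumption: this formula is used only for illustration; the paper's
    own checkpointing policy is not described in the abstract.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(job_length_s, checkpoint_cost_s, mtbf_s):
    """Total time spent writing checkpoints over a failure-free run."""
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    n_checkpoints = math.floor(job_length_s / interval)
    return n_checkpoints * checkpoint_cost_s


def pick_most_reliable(nodes, k):
    """Return the k nodes with the highest mean time between failures."""
    return sorted(nodes, key=lambda n: n["mtbf_s"], reverse=True)[:k]


# Hypothetical nodes: a 10-hour job with a 60-second checkpoint cost.
nodes = [
    {"name": "node-a", "mtbf_s": 3000},
    {"name": "node-b", "mtbf_s": 12000},
]
best = pick_most_reliable(nodes, 1)[0]
flaky = checkpoint_overhead(36000, 60, 3000)
reliable = checkpoint_overhead(36000, 60, best["mtbf_s"])
```

Under these assumed numbers, quadrupling the MTBF doubles the checkpoint interval and halves the time spent on checkpoints, which is the direction of the effect the abstract describes.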


