Incorporating fault tolerance in GA-based scheduling in grid environment

2011 World Congress on Information and Communication Technologies. Pub Date: 2011-12-01. DOI: 10.1109/WICT.2011.6141344

Neeraj Upadhyay, M. Misra


Citations: 7


Incorporating fault tolerance in GA-based scheduling in grid environment

Grid systems differ from traditional distributed systems in their large scale, heterogeneity, and dynamism. These factors make faults more frequent: large scale lowers the Mean Time To Failure (MTTF); heterogeneity causes interaction faults (protocol mismatches) between dissimilar communicating nodes; and dynamism, in which resource availability varies as resources autonomously enter and leave the grid, affects the execution of jobs. Applications are also more likely to fail because grid applications are typically long-running computations that take days to finish. Incorporating fault tolerance into scheduling algorithms is one approach to handling faults in a grid environment. Genetic Algorithms (GAs), a popular class of meta-heuristic algorithms for grid scheduling, are stochastic search algorithms based on the natural process of fitness-based selection and reproduction. This paper combines GA-based scheduling with fault tolerance techniques such as dynamic checkpointing by modifying the fitness function. The fitness functions also account for scenarios such as checkpointing without migration on resources with different downtimes, and for the autonomous nature of grid resource providers. The motivation is that scheduling-assisted fault tolerance helps find a schedule that completes the jobs in the minimum possible time even when resources are prone to failure, and thus helps meet job deadlines. Simulation results for the proposed techniques are reported in terms of the makespan, flowtime, and fitness value of the resulting schedule. The adaptive checkpointing approaches improve makespan and flowtime over the static checkpointing approach, and the approach that takes into account the last failure times of resources performs better than the approach based only on their mean failure times.
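The abstract does not give the fitness functions themselves, so the following is only a minimal sketch of how a failure-aware fitness could be put together; it is not the authors' formulation. The `Resource` fields, the per-failure cost model (downtime plus half a checkpoint interval of redone work), the weighted-sum fitness, and the `weight` parameter are all assumptions made for illustration. `young_interval` uses Young's well-known first-order approximation as a stand-in for an adaptive checkpoint interval; the paper may adapt the interval differently. Makespan is taken as the completion time of the last job and flowtime as the sum of all job completion times.

```python
import math
from dataclasses import dataclass


@dataclass
class Resource:
    speed: float     # work units processed per time unit (hypothetical model)
    mttf: float      # mean time to failure
    downtime: float  # time the resource stays down after a failure


def young_interval(ckpt_cost: float, mttf: float) -> float:
    """Young's first-order checkpoint interval; not taken from the paper."""
    return math.sqrt(2.0 * ckpt_cost * mttf)


def expected_runtime(work, res, ckpt_interval, ckpt_cost):
    """Rough expected run time with periodic checkpoints and no migration."""
    base = work / res.speed
    n_ckpt = int(base // ckpt_interval)   # checkpoints taken during the run
    busy = base + n_ckpt * ckpt_cost      # failure-free time incl. overhead
    expected_failures = busy / res.mttf
    # Assumed cost per failure: wait out the downtime, then redo on average
    # half a checkpoint interval of lost work.
    return busy + expected_failures * (res.downtime + ckpt_interval / 2.0)


def evaluate(assignment, works, resources, ckpt_interval, ckpt_cost):
    """Return (makespan, flowtime) for assignment[job] = resource index."""
    ready = [0.0] * len(resources)  # time at which each resource becomes free
    flowtime = 0.0
    for job, r in enumerate(assignment):
        ready[r] += expected_runtime(works[job], resources[r],
                                     ckpt_interval, ckpt_cost)
        flowtime += ready[r]        # completion time of this job
    return max(ready), flowtime


def fitness(assignment, works, resources, ckpt_interval, ckpt_cost, weight=0.75):
    """Higher is better; the weighted-sum form is an assumption."""
    makespan, flowtime = evaluate(assignment, works, resources,
                                  ckpt_interval, ckpt_cost)
    cost = weight * makespan + (1.0 - weight) * flowtime / len(assignment)
    return 1.0 / cost
```

A GA would call `fitness` on each chromosome (here, the `assignment` list) during selection. Under this model, a schedule that places jobs on a resource with a low MTTF or a long downtime gets a longer expected run time and therefore a lower fitness, which is the effect the abstract attributes to scheduling-assisted fault tolerance.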

