Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Yang Zhang, A. Mandal, C. Koelbel, K. Cooper
Published in: 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Publication date: 2009-05-18
DOI: 10.1109/CCGRID.2009.59 (https://doi.org/10.1109/CCGRID.2009.59)
Citations: 68

Abstract

Complex scientific workflows are now increasingly executed on computational grids. Beyond the challenges of managing and scheduling these workflows, reliability problems arise because large-scale grid infrastructure is inherently unreliable. Current grid application management systems address these problems with fault tolerance mechanisms such as over-provisioning and checkpoint-recovery. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We study the effectiveness of the combined approaches by analyzing their impact on the reliability of workflow execution, workflow performance, and resource usage under different reliability models, failure prediction accuracies, and workflow application types.
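To make the reliability versus resource-usage trade-off of over-provisioning concrete, here is a minimal sketch. It is not the paper's model: it assumes each replica of a task succeeds independently with the same probability, and the function names and the `max_replicas` cap are hypothetical, chosen only for illustration.

```python
def replicated_success_probability(p_success: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies of a task succeeds.

    Assumption (not from the paper): replicas fail independently and each
    succeeds with the same probability `p_success`.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The task fails only if every replica fails.
    return 1.0 - (1.0 - p_success) ** replicas


def min_replicas(p_success: float, target: float, max_replicas: int = 64) -> int:
    """Smallest replica count whose combined success probability meets `target`.

    `max_replicas` is a hypothetical safety cap for this sketch.
    """
    for k in range(1, max_replicas + 1):
        if replicated_success_probability(p_success, k) >= target:
            return k
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # Each extra replica raises reliability but also consumes another resource.
    for k in range(1, 4):
        print(k, replicated_success_probability(0.9, k))
    print(min_replicas(0.9, 0.995))
```

Under these assumptions, each added replica shrinks the failure probability geometrically while resource usage grows linearly, which is the kind of tension a combined scheduling and fault tolerance approach has to balance.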