Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. Pub Date: 2009-05-18. DOI: 10.1109/CCGRID.2009.59

Yang Zhang, A. Mandal, C. Koelbel, K. Cooper

Citations: 68

Abstract

Complex scientific workflows are now increasingly executed on computational grids. In addition to the challenges of managing and scheduling these workflows, reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms such as over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We present a study on the effectiveness of the combined approaches by analyzing their impact on the reliability of workflow execution, workflow performance, and resource usage under different reliability models, failure prediction accuracies, and workflow application types.
