Fault-Tolerant Scheduling for Scientific Workflows in Cloud Environments

2017 IEEE 7th International Advance Computing Conference (IACC) Pub Date : 1900-01-01 DOI:10.1109/IACC.2017.0043

K. Vinay, S.M. Dilip Kumar

Citations: 15

Abstract

Executing clustered tasks has proven to be an efficient way to improve the computation of Scientific Workflows (SWf) on clouds. However, clustered tasks have a higher probability of suffering from failures than a single task. Therefore, fault tolerance in cloud computing is essential when running large-scale scientific applications. In this paper, a new heuristic called the Cluster based Heterogeneous Earliest Finish Time (CHEFT) algorithm is proposed to enhance the scheduling and fault tolerance mechanism for SWf in highly distributed cloud environments. To mitigate the failure of clustered tasks, the algorithm uses the idle time of the provisioned resources to resubmit failed clustered tasks so that the SWf executes successfully. Experimental results show that the proposed algorithm has a convincing impact on SWf executions and drastically reduces resource waste compared to existing task replication techniques. A trace-based simulation of five real SWf shows that the algorithm is able to sustain unexpected task failures with minimal cost and makespan.
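The abstract does not spell out how CHEFT chooses where to resubmit a failed clustered task, so the following is only a minimal sketch of the general idea it describes: reuse idle time on already provisioned resources and prefer the earliest finish time. The names `IdleSlot` and `pick_resubmission_slot`, and the assumption that the task's runtime is the same on every VM, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IdleSlot:
    """An idle interval on an already provisioned VM (hypothetical model)."""
    vm: str
    start: float
    end: float


def pick_resubmission_slot(
    slots: List[IdleSlot], runtime: float, ready_time: float
) -> Optional[Tuple[str, float, float]]:
    """Return (vm, start, finish) with the earliest finish among idle slots
    that can hold the failed clustered task, or None if none fits.

    Assumption: runtime is identical on all VMs; a heterogeneous model
    would use a per-VM runtime instead.
    """
    best = None
    for slot in slots:
        # The task cannot start before it is ready or before the slot opens.
        start = max(slot.start, ready_time)
        finish = start + runtime
        # Keep the slot only if the task fits entirely inside the idle interval.
        if finish <= slot.end and (best is None or finish < best[2]):
            best = (slot.vm, start, finish)
    return best


slots = [IdleSlot("vm1", 0, 5), IdleSlot("vm2", 3, 20), IdleSlot("vm3", 10, 30)]
choice = pick_resubmission_slot(slots, runtime=6, ready_time=4)
no_fit = pick_resubmission_slot(slots, runtime=100, ready_time=4)
```

When no idle slot fits, the function returns `None`; a scheduler built on this sketch would then have to fall back to some other policy, such as provisioning extra capacity, which the abstract does not discuss.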
