Efficient and Fault-Tolerant Static Scheduling for Grids

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI:10.1109/IPDPSW.2013.94

Patrick Cichowski, J. Keller

{"title":"Efficient and Fault-Tolerant Static Scheduling for Grids","authors":"Patrick Cichowski, J. Keller","doi":"10.1109/IPDPSW.2013.94","DOIUrl":null,"url":null,"abstract":"Static task graphs model a variety of parallel applications, and are used to schedule such applications in grid platforms. While the scheduling is static, i.e. done prior to execution, processors might fail or not deliver their performance, especially if the grid comprises nodes with donated time, that may be used or shutdown by their owner at any time. We extend a prior proposal for fault-tolerant grid scheduling with task duplication to also cover situations where tasks take much longer than expected from the schedule as a special kind of fault. Furthermore, we consider the time for communication between dependent tasks when placing duplicates. We evaluate both scenarios with a simulator that injects faults and slowdowns to processors, and workloads from a benchmark suite of task graph with a variety of structures. Our results indicate that the overhead in the fault-free case is negligible, that a processor failure mostly increases the schedule make span only moderately because duplicates can use gapsin the original schedule, and that the effects of a processors lowdown can partly be mitigated by aborting a (slow) task and running its duplicate.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2013.94","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Static task graphs model a variety of parallel applications, and are used to schedule such applications in grid platforms. While the scheduling is static, i.e. done prior to execution, processors might fail or not deliver their performance, especially if the grid comprises nodes with donated time, that may be used or shutdown by their owner at any time. We extend a prior proposal for fault-tolerant grid scheduling with task duplication to also cover situations where tasks take much longer than expected from the schedule as a special kind of fault. Furthermore, we consider the time for communication between dependent tasks when placing duplicates. We evaluate both scenarios with a simulator that injects faults and slowdowns to processors, and workloads from a benchmark suite of task graph with a variety of structures. Our results indicate that the overhead in the fault-free case is negligible, that a processor failure mostly increases the schedule make span only moderately because duplicates can use gapsin the original schedule, and that the effects of a processors lowdown can partly be mitigated by aborting a (slow) task and running its duplicate.

查看原文本刊更多论文

网格的高效和容错静态调度

静态任务图为各种并行应用程序建模，并用于调度网格平台中的此类应用程序。虽然调度是静态的，即在执行之前完成，但处理器可能会失败或无法交付其性能，特别是如果网格包含具有捐赠时间的节点，这些节点可能随时被其所有者使用或关闭。我们扩展了先前的带有任务复制的容错网格调度建议，以涵盖任务花费比计划预期时间长得多的情况，这是一种特殊类型的故障。此外，在放置副本时，我们考虑了依赖任务之间的通信时间。我们使用一个模拟器来评估这两种场景，该模拟器向处理器注入故障和减速，并从具有各种结构的任务图基准套件中评估工作负载。我们的结果表明，在无故障情况下，开销可以忽略不计，处理器故障通常只会适度地增加调度的时间跨度，因为副本可以使用原始调度中的间隙，并且处理器故障的影响可以通过终止(缓慢的)任务并运行其副本来部分减轻。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum

自引率

0.00%

发文量