Efficient and Fault-Tolerant Static Scheduling for Grids
Patrick Cichowski, J. Keller
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum
Published: 2013-05-20
DOI: 10.1109/IPDPSW.2013.94 (https://doi.org/10.1109/IPDPSW.2013.94)
Citations: 5
Abstract
Static task graphs model a variety of parallel applications and are used to schedule such applications on grid platforms. Although the scheduling is static, i.e. done prior to execution, processors might fail or not deliver their expected performance, especially if the grid comprises nodes with donated time, which may be used or shut down by their owner at any time. We extend a prior proposal for fault-tolerant grid scheduling with task duplication to also cover situations where tasks take much longer than expected from the schedule, treating such slowdowns as a special kind of fault. Furthermore, we consider the time for communication between dependent tasks when placing duplicates. We evaluate both scenarios with a simulator that injects faults and slowdowns into processors, using workloads from a benchmark suite of task graphs with a variety of structures. Our results indicate that the overhead in the fault-free case is negligible, that a processor failure mostly increases the schedule makespan only moderately because duplicates can use gaps in the original schedule, and that the effects of a processor slowdown can partly be mitigated by aborting a (slow) task and running its duplicate.
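
To illustrate the idea of duplicates using gaps in the original schedule, the following is a minimal sketch and not the authors' algorithm. It assumes each processor's schedule is a list of busy intervals, that a duplicate must run on a processor other than the one holding the original task, and that a fixed communication delay applies when the duplicate's predecessor ran on a different processor. The names `earliest_gap`, `place_duplicate`, `pred_proc`, `pred_finish`, and `comm_delay` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) of a task already placed on a processor


def earliest_gap(busy: List[Interval], ready: float, duration: float) -> float:
    """Earliest start time >= ready at which `duration` fits between busy intervals."""
    start = ready
    for b_start, b_end in sorted(busy):
        if start + duration <= b_start:
            return start  # the duplicate fits before this busy interval
        start = max(start, b_end)  # otherwise try right after it
    return start  # no gap found: append after the last busy interval


def place_duplicate(
    schedules: Dict[str, List[Interval]],
    original_proc: str,
    pred_proc: str,
    pred_finish: float,
    duration: float,
    comm_delay: float,
) -> Optional[Tuple[str, float]]:
    """Pick the processor (other than the original's) with the earliest feasible start."""
    best: Optional[Tuple[str, float]] = None
    for proc, busy in schedules.items():
        if proc == original_proc:
            continue  # assumption: a duplicate on the same processor gives no fault tolerance
        # Assumption: data arrives instantly on the predecessor's own processor,
        # and after a fixed delay everywhere else.
        ready = pred_finish + (0.0 if proc == pred_proc else comm_delay)
        start = earliest_gap(busy, ready, duration)
        if best is None or start < best[1]:
            best = (proc, start)
    return best


schedules = {
    "P0": [(0, 4), (4, 9)],
    "P1": [(0, 2), (6, 8)],
    "P2": [(0, 5)],
}
choice = place_duplicate(schedules, original_proc="P0", pred_proc="P1",
                         pred_finish=2, duration=3, comm_delay=1)
print(choice)
```

In this toy input the duplicate lands in the idle slot on `P1` between its two tasks, so it does not lengthen that processor's schedule. The slowdown case from the abstract would build on the same placement: once a task overruns its expected time, it is aborted and the already-placed duplicate is run instead.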