A Hybrid Scheduling Scheme for Parallel Loops
A. Handleman, Arthur G. Rattew, I. Lee, T. Schardl
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021
DOI: 10.1109/IPDPS49936.2021.00067
Citations: 1
Abstract
Parallel loops are commonly used constructs for parallelizing high-performance scientific applications. In the task-parallelism paradigm, the parallel loop construct expresses the logical parallelism of a loop, indicating that its iterations are logically parallel, and lets an underlying runtime scheduler determine how best to map the parallel iterations onto available processing cores. Researchers have investigated multiple schemes for scheduling parallel loops, with static partitioning and dynamic partitioning being the most prevalent. Static partitioning incurs low scheduling overhead while potentially retaining locality benefits in iterative applications that perform a sequence of parallel loops accessing the same data repeatedly. But static partitioning may perform poorly relative to dynamic partitioning if the loop iterations have unbalanced workloads or if cores arrive at the loop at different times. We propose a hybrid scheduling scheme, which first schedules loops using static partitioning and then employs dynamic partitioning when load balancing becomes necessary. Moreover, work distribution employs a claiming heuristic that lets a core check for partitions to work on in a semi-deterministic fashion, allowing the scheduler to better retain data locality for iterative applications. Unlike prior work that optimizes for iterative applications, our scheme does not require programmer annotations and can provide provably efficient execution time. In this paper, we discuss the hybrid scheme, prove its correctness, and analyze its scheduling bound. We have also implemented the proposed scheme in a Cilk-based work-stealing platform and experimentally verified that it balances load well and retains locality for such iterative applications.
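To make the idea concrete, the following is a minimal sketch, not the authors' Cilk-based implementation, of a hybrid static-then-dynamic loop scheduler in C++17. Iterations are first split into per-worker chunks (static partitioning); a worker that exhausts its own chunk then scans the other chunks in a fixed order starting from its own index and claims leftover iterations through atomic counters, loosely mimicking the semi-deterministic claiming heuristic described in the abstract. The function name `hybrid_parallel_for`, the per-iteration claiming granularity, and the chunk layout are illustrative assumptions, not details from the paper.

```cpp
// Hedged sketch of hybrid static/dynamic loop scheduling (illustrative only).
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Chunk {
    std::atomic<long> next{0};  // next unclaimed iteration in this chunk
    long begin = 0, end = 0;    // half-open iteration range [begin, end)
};

void hybrid_parallel_for(long n, int workers,
                         const std::function<void(long)>& body) {
    std::vector<Chunk> chunks(workers);
    for (int w = 0; w < workers; ++w) {           // static partitioning
        chunks[w].begin = n * w / workers;
        chunks[w].end   = n * (w + 1) / workers;
        chunks[w].next.store(chunks[w].begin);
    }
    auto run = [&](int self) {
        // Visit chunks in a fixed order starting from this worker's own chunk,
        // so repeated loops tend to reuse the same data on the same core.
        for (int k = 0; k < workers; ++k) {
            Chunk& c = chunks[(self + k) % workers];
            // Claim iterations one at a time via an atomic counter
            // (dynamic partitioning kicks in only for leftover work).
            for (long i = c.next.fetch_add(1); i < c.end;
                 i = c.next.fetch_add(1)) {
                body(i);
            }
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(run, w);
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<double> a(1000, 1.0);
    hybrid_parallel_for(1000, 4, [&](long i) { a[i] *= 2.0; });
    std::printf("a[0] = %f\n", a[0]);
    return 0;
}
```

In this sketch, a perfectly balanced loop is executed almost entirely under the static partition (each worker drains its own chunk), while imbalance or late-arriving workers is absorbed dynamically as idle workers claim iterations from other chunks; the paper's actual scheme additionally provides provable bounds and integrates with work stealing, which this toy example does not attempt to capture.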