A Hybrid Scheduling Scheme for Parallel Loops
A. Handleman, Arthur G. Rattew, I. Lee, T. Schardl
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021
DOI: 10.1109/IPDPS49936.2021.00067
Citations: 1
Abstract
Parallel loops are commonly used constructs for parallelizing high-performance scientific applications. In the task-parallelism paradigm, the parallel loop construct expresses the logical parallelism of a loop, indicating that its iterations are logically parallel, and lets an underlying runtime scheduler determine how best to map the parallel iterations onto available processing cores. Researchers have investigated multiple schemes for scheduling parallel loops, with static partitioning and dynamic partitioning being the most prevalent. Static partitioning incurs low scheduling overhead while potentially retaining locality benefits in iterative applications that perform a sequence of parallel loops accessing the same data repeatedly. But static partitioning may perform poorly relative to dynamic partitioning if the loop iterations have unbalanced workloads or if cores arrive at the loop at different times. We propose a hybrid scheduling scheme, which first schedules loops using static partitioning and then employs dynamic partitioning when load balancing becomes necessary. Moreover, work distribution employs a claiming heuristic that lets a core check for partitions to work on in a semi-deterministic fashion, allowing the scheduler to better retain data locality for iterative applications. Unlike prior work that optimizes for iterative applications, our scheme does not require programmer annotations and can provide provably efficient execution time. In this paper, we discuss the hybrid scheme, prove its correctness, and analyze its scheduling bound. We have also implemented the proposed scheme in a Cilk-based work-stealing platform and experimentally verified that it balances load well and retains locality for such iterative applications.
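To make the idea concrete, the following is a minimal sketch, not the authors' Cilk-based implementation, of a hybrid static-then-dynamic loop scheduler in C++17. Iterations are first split into per-worker chunks (static partitioning); a worker that exhausts its own chunk then scans the other chunks in a fixed order starting from its own index and claims leftover iterations through atomic counters, loosely mimicking the semi-deterministic claiming heuristic described in the abstract. The function name `hybrid_parallel_for`, the per-iteration claiming granularity, and the chunk layout are illustrative assumptions, not details from the paper.

```cpp
// Hedged sketch of hybrid static/dynamic loop scheduling (illustrative only).
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Chunk {
    std::atomic<long> next{0};  // next unclaimed iteration in this chunk
    long begin = 0, end = 0;    // half-open iteration range [begin, end)
};

void hybrid_parallel_for(long n, int workers,
                         const std::function<void(long)>& body) {
    std::vector<Chunk> chunks(workers);
    for (int w = 0; w < workers; ++w) {           // static partitioning
        chunks[w].begin = n * w / workers;
        chunks[w].end   = n * (w + 1) / workers;
        chunks[w].next.store(chunks[w].begin);
    }
    auto run = [&](int self) {
        // Visit chunks in a fixed order starting from this worker's own chunk,
        // so repeated loops tend to reuse the same data on the same core.
        for (int k = 0; k < workers; ++k) {
            Chunk& c = chunks[(self + k) % workers];
            // Claim iterations one at a time via an atomic counter
            // (dynamic partitioning kicks in only for leftover work).
            for (long i = c.next.fetch_add(1); i < c.end;
                 i = c.next.fetch_add(1)) {
                body(i);
            }
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(run, w);
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<double> a(1000, 1.0);
    hybrid_parallel_for(1000, 4, [&](long i) { a[i] *= 2.0; });
    std::printf("a[0] = %f\n", a[0]);
    return 0;
}
```

In this sketch, a perfectly balanced loop is executed almost entirely under the static partition (each worker drains its own chunk), while imbalance or late-arriving workers is absorbed dynamically as idle workers claim iterations from other chunks; the paper's actual scheme additionally provides provable bounds and integrates with work stealing, which this toy example does not attempt to capture.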