Optimized Execution of Parallel Loops via User-Defined Scheduling Policies

Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar
{"title":"Optimized Execution of Parallel Loops via User-Defined Scheduling Policies","authors":"Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar","doi":"10.1145/3337821.3337913","DOIUrl":null,"url":null,"abstract":"On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337913","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.
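
To make the idea concrete, the sketch below illustrates the general approach in C with standard OpenMP, not the paper's actual runtime extension or API: a user-supplied function (here the hypothetical `make_cost_aware_chunks`) groups the rows of a sparse matrix into iteration chunks of roughly equal nonzero count, and each chunk is then scheduled onto worker threads as an OpenMP task. The paper instead extends the runtime itself to invoke such user functions when creating and scheduling chunks, and to rebalance and reuse chunk groups across invocations; all names and the task-based emulation here are illustrative assumptions.

```c
/* Illustrative sketch only: the paper extends the OpenMP runtime so that
 * user-supplied functions decide how iterations are grouped into chunks.
 * The names below (chunk_t, make_cost_aware_chunks, spmv_row_range) are
 * hypothetical; standard OpenMP tasks emulate the chunk scheduling. */
#include <omp.h>
#include <stdio.h>

typedef struct { int begin, end; } chunk_t;   /* half-open range [begin, end) */

/* Hypothetical user function: split rows of a CSR matrix into chunks whose
 * nonzero counts are roughly equal, so each chunk carries similar work. */
static int make_cost_aware_chunks(const int *row_ptr, int n_rows,
                                  int n_chunks, chunk_t *chunks) {
    long total_nnz = row_ptr[n_rows];
    long target = (total_nnz + n_chunks - 1) / n_chunks;
    int c = 0, start = 0;
    long acc = 0;
    for (int i = 0; i < n_rows && c < n_chunks; i++) {
        acc += row_ptr[i + 1] - row_ptr[i];
        if (acc >= target || i == n_rows - 1) {
            chunks[c].begin = start;
            chunks[c].end = i + 1;
            c++; start = i + 1; acc = 0;
        }
    }
    return c;  /* actual number of chunks produced */
}

/* Work on one chunk: a sparse matrix-vector product over a row range. */
static void spmv_row_range(const int *row_ptr, const int *col_idx,
                           const double *val, const double *x,
                           double *y, chunk_t ck) {
    for (int i = ck.begin; i < ck.end; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* Tiny 4x4 CSR matrix with an uneven nonzero distribution. */
    int row_ptr[] = {0, 1, 2, 5, 8};
    int col_idx[] = {0, 1, 0, 1, 2, 1, 2, 3};
    double val[]  = {2, 3, 1, 1, 4, 2, 1, 5};
    double x[] = {1, 1, 1, 1}, y[4];
    int n_rows = 4;

    chunk_t chunks[4];
    int n_chunks = make_cost_aware_chunks(row_ptr, n_rows, 2, chunks);

    /* Each user-defined chunk becomes a task; the OpenMP runtime places the
     * tasks on worker threads, approximating the paper's chunk scheduling. */
    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < n_chunks; c++) {
        #pragma omp task firstprivate(c)
        spmv_row_range(row_ptr, col_idx, val, x, y, chunks[c]);
    }

    for (int i = 0; i < n_rows; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```

In this emulation the cost-aware grouping happens once per call; the paper's runtime additionally balances chunks concurrently and stores the balanced groups so that later invocations of the same loop start from an already-balanced partition.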