高性能奶牛:不规则并行循环的成本意识工作窃取:ACM架构和代码优化事务:Vol 0, No ja

IF 1.8 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-11-18 DOI:10.1145/3633331

Prasoon Mishra, V. Krishna Nandivada

{"title":"高性能奶牛:不规则并行循环的成本意识工作窃取:ACM架构和代码优化事务:Vol 0, No ja","authors":"Prasoon Mishra, V. Krishna Nandivada","doi":"10.1145/3633331","DOIUrl":null,"url":null,"abstract":"Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"59 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops: ACM Transactions on Architecture and Code Optimization: Vol 0, No ja\",\"authors\":\"Prasoon Mishra, V. Krishna Nandivada\",\"doi\":\"10.1145/3633331\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.\",\"PeriodicalId\":50920,\"journal\":{\"name\":\"ACM Transactions on Architecture and Code Optimization\",\"volume\":\"59 1\",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2023-11-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Architecture and Code Optimization\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3633331\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3633331","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

像OpenMP这样的并行库使用程序员指定的调度策略，在线程之间分发Parallel -for循环的迭代。虽然现有的调度策略在平衡工作负载的上下文中执行得相当好，但在涉及高度不平衡工作负载的计算中，获得有效的工作分配是非常重要的(即使使用动态和引导等非静态调度方法)。在本文中，我们提出了一种称为成本感知工作窃取(COst - aware Work Stealing, COWS)的方案，以有效地将工作窃取的思想扩展到OpenMP。与传统的偷取工作的调度器相比，COWS考虑到(i)并不是一个parallel-for-loop的所有迭代都可能花费相同的时间。(ii)确定合适的窃取对象对于负载平衡很重要，(iii)队列在传统的窃取工作中会导致很大的开销，应该避免。我们提出了奶牛的两种变体:wsrri(基于剩余迭代数量的简单工作窃取方案)和WSRW(基于剩余工作负载数量的工作窃取方案)。由于在图形分析中发现的不规则循环中，不可能静态地计算并行for循环迭代的成本，因此我们使用编译时+运行时组合方法，其中循环的剩余工作负载在运行时通过利用编译时组件生成的代码有效地计算。我们对七个不同的基准测试程序进行了评估，使用五个不同的输入数据集，在两个不同的硬件上，在不同数量的线程上;导致总共275个数的配置。我们显示，在275个配置中的225个配置中，与该配置的最佳OpenMP调度方案相比，我们的方法实现了明显的性能提升。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loops: ACM Transactions on Architecture and Code Optimization: Vol 0, No ja

Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static scheduling methods like dynamic and guided). In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP.

In contrast to the traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loops may take the same amount of time. (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (work-stealing scheme based on the amount of remaining workload). Since in irregular loops like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware across a varying number of threads; leading to a total of 275 number of configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Architecture and Code Optimization 工程技术-计算机：理论方法

CiteScore

3.60

自引率

6.20%

发文量

审稿时长

6-12 weeks

期刊介绍： ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.