{"title":"Understanding Parallelization Tradeoffs for Linear Pipelines","authors":"Aristeidis Mastoras, T. Gross","doi":"10.1145/3178442.3178443","DOIUrl":null,"url":null,"abstract":"Pipelining techniques execute some loops with cross-iteration dependences in parallel, by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging and current techniques, e.g., PS-DSWP and LBPP, have difficulties handling load-imbalanced loops. Particularly, for loop iterations that differ substantially in execution time, these techniques achieve load-balancing by assigning work to threads using round-robin scheduling. Algorithms that rely on work-stealing e.g., Piper, efficiently handle load-imbalanced loops, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model to allow tradeoffs between load-balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings to handle a loop's parallelism. Then, a source-to-source compiler generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve good performance according to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison to Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely-used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. 
Moreover, the results show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.","PeriodicalId":328694,"journal":{"name":"Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"13 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3178442.3178443","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Pipelining techniques execute some loops with cross-iteration dependences in parallel by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging, and current techniques, e.g., PS-DSWP and LBPP, have difficulty handling load-imbalanced loops. In particular, for loops whose iterations differ substantially in execution time, these techniques attempt load-balancing by assigning work to threads in round-robin order. Algorithms that rely on work-stealing, e.g., Piper, handle load-imbalanced loops efficiently, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model that allows exploring tradeoffs among load-balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings that handle a loop's parallelism. A source-to-source compiler then generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve performance comparable to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison with Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely-used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. Moreover, the results show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.
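To make the setting concrete, the following is a minimal sketch (not Proteas, PS-DSWP, or LBPP themselves) of stage-partitioned pipelining: the loop body is split into a sequential stage S1, which carries the cross-iteration dependence, and an independent stage S2, whose per-iteration work is handed to worker threads in round-robin order, as the round-robin-scheduled techniques discussed above do. All function and variable names here are illustrative assumptions, not from the paper.

```python
import threading
import queue

def pipeline(items, num_workers=4):
    """Run a two-stage pipeline: S1 sequential, S2 on worker threads."""
    work_queues = [queue.Queue() for _ in range(num_workers)]
    results = [None] * len(items)

    def worker(q):
        while True:
            task = q.get()
            if task is None:                # sentinel: no more work
                break
            i, value = task
            results[i] = value * value      # S2: independent per-iteration work

    threads = [threading.Thread(target=worker, args=(q,)) for q in work_queues]
    for t in threads:
        t.start()

    running = 0
    for i, x in enumerate(items):
        running += x                        # S1: cross-iteration dependence
        # Round-robin mapping of S2 work to workers; with imbalanced
        # iteration times this static assignment is what causes the
        # load-imbalance problem the paper describes.
        work_queues[i % num_workers].put((i, running))

    for q in work_queues:
        q.put(None)
    for t in threads:
        t.join()
    return running, results

total, squares = pipeline([1, 2, 3, 4])
# total is the sequential prefix sum 10; squares = [1, 9, 36, 100]
```

A work-stealing scheduler such as Piper's would instead let idle workers pull S2 tasks from a shared pool, fixing the imbalance at the cost of per-task scheduling overhead, which is the tradeoff the abstract highlights for fine-grained loops.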