{"title":"Understanding Parallelization Tradeoffs for Linear Pipelines","authors":"Aristeidis Mastoras, T. Gross","doi":"10.1145/3178442.3178443","DOIUrl":null,"url":null,"abstract":"Pipelining techniques execute some loops with cross-iteration dependences in parallel, by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging and current techniques, e.g., PS-DSWP and LBPP, have difficulties handling load-imbalanced loops. Particularly, for loop iterations that differ substantially in execution time, these techniques achieve load-balancing by assigning work to threads using round-robin scheduling. Algorithms that rely on work-stealing e.g., Piper, efficiently handle load-imbalanced loops, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model to allow tradeoffs between load-balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings to handle a loop's parallelism. Then, a source-to-source compiler generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve good performance according to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison to Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely-used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. 
Moreover, the results show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.","PeriodicalId":328694,"journal":{"name":"Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"13 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3178442.3178443","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Pipelining techniques execute some loops with cross-iteration dependences in parallel by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging, and current techniques, e.g., PS-DSWP and LBPP, have difficulty handling load-imbalanced loops. In particular, for loops whose iterations differ substantially in execution time, these techniques attempt load-balancing by assigning work to threads in round-robin order. Algorithms that rely on work-stealing, e.g., Piper, handle load-imbalanced loops efficiently, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model that allows exploring tradeoffs among load-balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings that handle a loop's parallelism. A source-to-source compiler then generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve performance comparable to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison with Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely-used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. Moreover, the results show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.
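To make the setting concrete, the following is a minimal sketch (not Proteas, PS-DSWP, or LBPP themselves) of stage-partitioned pipelining: the loop body is split into a sequential stage S1, which carries the cross-iteration dependence, and an independent stage S2, whose per-iteration work is handed to worker threads in round-robin order, as the round-robin-scheduled techniques discussed above do. All function and variable names here are illustrative assumptions, not from the paper.

```python
import threading
import queue

def pipeline(items, num_workers=4):
    """Run a two-stage pipeline: S1 sequential, S2 on worker threads."""
    work_queues = [queue.Queue() for _ in range(num_workers)]
    results = [None] * len(items)

    def worker(q):
        while True:
            task = q.get()
            if task is None:                # sentinel: no more work
                break
            i, value = task
            results[i] = value * value      # S2: independent per-iteration work

    threads = [threading.Thread(target=worker, args=(q,)) for q in work_queues]
    for t in threads:
        t.start()

    running = 0
    for i, x in enumerate(items):
        running += x                        # S1: cross-iteration dependence
        # Round-robin mapping of S2 work to workers; with imbalanced
        # iteration times this static assignment is what causes the
        # load-imbalance problem the paper describes.
        work_queues[i % num_workers].put((i, running))

    for q in work_queues:
        q.put(None)
    for t in threads:
        t.join()
    return running, results

total, squares = pipeline([1, 2, 3, 4])
# total is the sequential prefix sum 10; squares = [1, 9, 36, 100]
```

A work-stealing scheduler such as Piper's would instead let idle workers pull S2 tasks from a shared pool, fixing the imbalance at the cost of per-task scheduling overhead, which is the tradeoff the abstract highlights for fine-grained loops.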