Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

Da Li, Hancheng Wu, M. Becchi
{"title":"Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations","authors":"Da Li, Hancheng Wu, M. Becchi","doi":"10.1109/ICPP.2015.107","DOIUrl":null,"url":null,"abstract":"The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed to increase the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the Open CL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 44th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2015.107","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
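The central idea described in the abstract, offloading large irregular inner loops to nested (child) kernels via dynamic parallelism while processing small ones serially to avoid launch overhead, can be illustrated with a short CUDA sketch. This is a minimal illustration under assumptions of mine, not the authors' actual templates: the kernel names (parent_kernel, child_kernel), the CSR-style inputs (row_ptr, vals), and the CHILD_LAUNCH_THRESHOLD cutoff are hypothetical choices made for the example.

```cuda
// Minimal sketch of a dynamic-parallelism template for an irregular nested loop
// over a CSR-like structure. Names and the launch threshold are illustrative
// assumptions, not the paper's actual templates.
// Build (Kepler-class sm_35 or newer): nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Below this inner trip count, a child-kernel launch likely costs more than it saves.
// Kept small here so the toy input exercises both code paths.
#define CHILD_LAUNCH_THRESHOLD 4

// Child kernel: processes one parent iteration's variable-length inner range in parallel.
__global__ void child_kernel(const int *vals, int count, float *out, int parent)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        atomicAdd(&out[parent], (float)vals[i]);   // illustrative inner-loop body
}

// Parent kernel: one thread per outer iteration; inner trip counts vary (irregular loop).
__global__ void parent_kernel(const int *row_ptr, const int *vals, float *out, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;

    int begin = row_ptr[v];
    int count = row_ptr[v + 1] - begin;

    if (count >= CHILD_LAUNCH_THRESHOLD) {
        // Large inner loop: offload to a nested (child) grid via dynamic parallelism.
        int threads = 128;
        int blocks  = (count + threads - 1) / threads;
        child_kernel<<<blocks, threads>>>(vals + begin, count, out, v);
    } else {
        // Small inner loop: run serially to avoid kernel-launch overhead.
        for (int i = 0; i < count; ++i)
            out[v] += (float)vals[begin + i];
    }
}

int main()
{
    // Toy irregular input: 3 outer iterations with inner sizes 2, 6, and 1.
    std::vector<int> row_ptr = {0, 2, 8, 9};
    std::vector<int> vals    = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    int n = (int)row_ptr.size() - 1;

    int *d_row_ptr, *d_vals;
    float *d_out;
    cudaMalloc(&d_row_ptr, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_vals, vals.size() * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_row_ptr, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, vals.data(), vals.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));

    parent_kernel<<<1, 32>>>(d_row_ptr, d_vals, d_out, n);
    cudaDeviceSynchronize();   // child grids complete before the parent grid does

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v)
        printf("out[%d] = %.1f\n", v, out[v]);

    cudaFree(d_row_ptr); cudaFree(d_vals); cudaFree(d_out);
    return 0;
}
```

Device-side launches require separate compilation (-rdc=true, linked against cudadevrt) and a GPU of compute capability 3.5 or newer. In practice the threshold would need per-application tuning, in line with the paper's goal of maximizing the work done by nested kernels while minimizing their invocation overhead.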