CUDA-NP

Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14) | Pub Date: 2014-02-06 | DOI: 10.1145/2555243.2555254

Yi Yang, Huiyang Zhou


Citations: 21

Abstract

Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common for a thread in a parallel program, such as a GPU kernel in a CUDA program, to still contain both sequential code and parallel loops. To leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times, and by 2.18 times on average.
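The idea of enabling extra threads at launch and using control flow to activate them per code section can be illustrated with a hand-written sketch. This is not the code generated by the CUDA-NP compiler, and it does not use the paper's pragma syntax; the kernel name `row_scaled_sum_np`, the constants `MASTERS` and `SLAVES`, the helper `count_mismatches`, and the row-sum workload are all assumptions chosen for illustration. Each original ("master") thread owns one row; only one thread per master runs the sequential sections, while all `SLAVES` threads share the parallel loop and combine their partial sums through shared memory rather than global memory.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int MASTERS = 32;  // original ("master") threads per block; illustrative choice
constexpr int SLAVES  = 8;   // threads cooperating on each master's loop; illustrative choice

// Computes out[row] = sum_j (2 * scale[row]) * a[row][j], one master per row.
__global__ void row_scaled_sum_np(const float* a, const float* scale,
                                  float* out, int rows, int cols) {
    __shared__ float s_factor[MASTERS];           // value broadcast from the sequential section
    __shared__ float s_partial[MASTERS][SLAVES];  // per-thread partial sums for the reduction

    const int slave   = threadIdx.x;              // position within the master's group
    const int master  = threadIdx.y;              // original thread index within the block
    const int row     = blockIdx.x * MASTERS + master;
    const bool active = row < rows;               // no early return: all threads reach the barriers

    // Sequential section: only one thread per master is activated.
    if (active && slave == 0) s_factor[master] = 2.0f * scale[row];
    __syncthreads();

    // Parallel loop: iterations are distributed cyclically over the group.
    float sum = 0.0f;
    if (active) {
        const float factor = s_factor[master];
        for (int j = slave; j < cols; j += SLAVES) sum += factor * a[row * cols + j];
    }
    s_partial[master][slave] = sum;
    __syncthreads();

    // Sequential section again: one thread finishes the sum reduction.
    if (active && slave == 0) {
        float total = 0.0f;
        for (int k = 0; k < SLAVES; ++k) total += s_partial[master][k];
        out[row] = total;
    }
}

// Runs the kernel and returns the number of rows that differ from a CPU
// reference, or -1 if a CUDA call fails.
int count_mismatches(int rows, int cols) {
    std::vector<float> a(rows * cols), scale(rows), out(rows, 0.0f);
    for (int i = 0; i < rows * cols; ++i) a[i] = 0.5f * float(i % 7);
    for (int r = 0; r < rows; ++r) scale[r] = 1.0f + float(r % 3);

    float *d_a = nullptr, *d_scale = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_a, a.size() * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_scale, scale.size() * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, out.size() * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_a, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_scale, scale.data(), scale.size() * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 block(SLAVES, MASTERS);  // SLAVES times more threads than one-thread-per-row
    const int grid = (rows + MASTERS - 1) / MASTERS;
    row_scaled_sum_np<<<grid, block>>>(d_a, d_scale, d_out, rows, cols);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), d_out, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_scale); cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int r = 0; r < rows; ++r) {
        float ref = 0.0f;
        for (int j = 0; j < cols; ++j) ref += 2.0f * scale[r] * a[r * cols + j];
        if (std::fabs(ref - out[r]) > 1e-3f * std::fabs(ref) + 1e-3f) ++bad;
    }
    return bad;
}

int main() {
    std::printf("mismatching rows: %d\n", count_mismatches(1000, 100));
    return 0;
}
```

In this sketch the threads of one group are adjacent in `threadIdx.x`, so they read consecutive elements of a row; this is only one of several possible ways to distribute loop iterations into threads. The reduction here is a simple serial sum over the shared-memory partials, chosen for readability rather than speed.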

