CUDA-NP

Yi Yang, Huiyang Zhou
{"title":"CUDA-NP","authors":"Yi Yang, Huiyang Zhou","doi":"10.1145/2555243.2555254","DOIUrl":null,"url":null,"abstract":"Parallel programs consist of series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both se-quential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture intro-duces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these bench-marks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such par-allel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically gen-erates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and effi-ciently manages on-chip resource. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our pro-posed CUDA-NP framework further improves the perfor-mance by up to 6.69 times and 2.18 times on average.","PeriodicalId":447086,"journal":{"name":"Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2555243.2555254","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. To leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.18 times on average.
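To make the core idea concrete, below is a minimal CUDA sketch of the transformation the abstract describes: the kernel is launched with many threads up front, one "master" thread per logical task executes the sequential section, and control flow activates the whole thread group for the parallel loop. This is an illustration under stated assumptions, not the paper's generated code; the names kernel_np and NP_SLAVES, the group size, and the pragma shown in a comment are all hypothetical.

```cuda
#include <cuda_runtime.h>

#define NP_SLAVES 32  // hypothetical number of threads assigned per task

__global__ void kernel_np(const float *in, float *out, int n) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int task  = tid / NP_SLAVES;   // which logical task this thread serves
    int slave = tid % NP_SLAVES;   // position within the task's thread group

    // Sequential section: only the master (slave == 0) of each group is active.
    float base = 0.0f;
    if (slave == 0) {
        base = in[task];           // per-task sequential work (bounds checks omitted)
    }
    // Share the master's result with its group; with NP_SLAVES == 32 the
    // group is exactly one warp, so a warp shuffle broadcasts it cheaply.
    base = __shfl_sync(0xffffffff, base, 0, NP_SLAVES);

    // Parallel loop: in the annotated source this is the section a
    // "#pragma np parallel for"-style directive would mark. Its iterations
    // are cyclically distributed over the group's NP_SLAVES threads.
    for (int i = slave; i < n; i += NP_SLAVES) {
        out[task * n + i] = base + in[task * n + i];
    }
}
```

A kernel like this would be launched with NP_SLAVES times as many threads as the original one-thread-per-task version. The paper's compiler additionally handles reduction and scan loops, alternative iteration-distribution schemes, and on-chip resource management, which this sketch omits.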