Taming irregular applications via advanced dynamic parallelism on GPUs

Jing Zhang, Ashwin M. Aji, Michael L. Chu, Hao Wang, Wu-chun Feng
Proceedings of the 15th ACM International Conference on Computing Frontiers, published 2018-05-08
DOI: https://doi.org/10.1145/3203217.3203243
Citations: 9

Abstract

On recent GPU architectures, dynamic parallelism, which enables kernels to be launched from the GPU without CPU involvement, provides a way to improve the performance of irregular applications by generating child kernels dynamically to reduce workload imbalance and improve GPU utilization. However, in practice, dynamic parallelism does not improve performance because of high kernel launch overhead and low child kernel occupancy. Consequently, most existing studies focus on mitigating the kernel launch overhead. As the kernel launch overhead has decreased due to algorithmic redesigns and hardware architectural innovations, the organization of subtasks into child kernels becomes a new performance bottleneck. We present an in-depth characterization of existing software approaches for optimizing dynamic parallelism on the latest GPUs. We observe that current approaches to subtask aggregation, which use a "one-size-fits-all" method that treats all subtasks equally, can under-utilize resources and degrade overall performance, as different subtasks require different configurations for optimal performance. To address this problem, we leverage statistical and machine-learning techniques and propose a performance modeling and task scheduling tool that can (1) analyze the performance characteristics of subtasks to identify the critical performance factors, (2) predict the performance of new subtasks, and (3) generate the optimal aggregation strategy for new subtasks. Experimental results show that our approach with the optimal subtask aggregation strategy can achieve up to a 1.8-fold speedup over the existing task aggregation approach for dynamic parallelism.
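
The workload imbalance that motivates child kernels can be illustrated with a small toy model. The sketch below is not the authors' tool or performance model; it assumes a warp of 32 threads running in lockstep, ignores kernel launch overhead and occupancy entirely, and uses a hypothetical `threshold` parameter to decide which subtasks would be handed to a child kernel. It only shows why a single oversized subtask leaves most lanes of the parent warp idle, and how offloading it changes that.

```python
WARP_SIZE = 32  # threads that execute in lockstep within one warp (assumed)


def warp_utilization(work_per_thread):
    """Fraction of lane-steps doing useful work when threads run in lockstep.

    The warp stays busy until its slowest thread finishes, so it occupies
    len(work) * max(work) lane-steps, of which sum(work) are useful.
    """
    longest = max(work_per_thread)
    if longest == 0:
        return 1.0  # nothing to do, so nothing is wasted
    return sum(work_per_thread) / (len(work_per_thread) * longest)


def split_for_child_kernels(work_per_thread, threshold):
    """Keep small subtasks in the parent thread and offload large ones.

    Returns (work left in the parent threads, sizes of offloaded subtasks).
    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    kept = [w if w <= threshold else 0 for w in work_per_thread]
    offloaded = [w for w in work_per_thread if w > threshold]
    return kept, offloaded


# One irregular warp: 31 threads with 4 work items each, one with 1000.
work = [4] * (WARP_SIZE - 1) + [1000]
kept, offloaded = split_for_child_kernels(work, threshold=64)

print(f"parent utilization without offload: {warp_utilization(work):.3f}")
print(f"parent utilization with offload:    {warp_utilization(kept):.3f}")
print(f"subtasks left for child kernels:    {offloaded}")
```

In this toy setting the parent warp's utilization rises from roughly 0.035 to roughly 0.969 once the large subtask is offloaded. How the offloaded subtasks are then grouped into child kernels, and with which launch configuration, is the aggregation question the abstract raises; this sketch does not attempt to model it.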