Taming irregular applications via advanced dynamic parallelism on GPUs

Proceedings of the 15th ACM International Conference on Computing Frontiers. Pub Date: 2018-05-08. DOI: 10.1145/3203217.3203243

Jing Zhang, Ashwin M. Aji, Michael L. Chu, Hao Wang, Wu-chun Feng

Citations: 9

Abstract

On recent GPU architectures, dynamic parallelism, which enables the launching of kernels from the GPU without CPU involvement, provides a way to improve the performance of irregular applications by generating child kernels dynamically to reduce workload imbalance and improve GPU utilization. However, in practice, dynamic parallelism does not improve performance due to high kernel launch overhead and low child kernel occupancy. Consequently, most existing studies focus on mitigating the kernel launch overhead. As the kernel launch overhead has decreased due to algorithmic redesigns and hardware architectural innovations, the organization of subtasks to child kernels becomes a new performance bottleneck. We present an in-depth characterization of existing software approaches for dynamic parallelism optimizations on the latest GPUs. We observe that current approaches of subtask aggregation, which use the "one-size-fits-all" method by treating all subtasks equally, can under-utilize resources and degrade overall performance, as different subtasks require various configurations for optimal performance. To address this problem, we leverage statistical and machine-learning techniques and propose a performance modeling and task scheduling tool that can (1) analyze the performance characteristics of subtasks to identify the critical performance factors, (2) predict the performance of new subtasks, and (3) generate the optimal aggregation strategy for new subtasks. Experimental results show that our approach with the optimal subtask aggregation strategy can achieve up to a 1.8-fold speedup over the existing task aggregation approach for dynamic parallelism.
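
The contrast the abstract draws between "one-size-fits-all" aggregation and per-subtask configuration can be illustrated with a toy sketch. This is not the authors' tool: the paper derives its strategy from statistical and machine-learning models, whereas the hand-written threshold rule below merely stands in for such a predictor. The names `WIDTHS`, `pick_granularity`, and `idle_lanes`, the thresholds, and the sample sizes are all hypothetical, and the idle-lane count is a simplified proxy that ignores launch overhead, memory behavior, and execution time.

```python
from collections import defaultdict

# Hypothetical number of GPU lanes assigned to one subtask at each
# aggregation granularity (illustrative values, not taken from the paper).
WIDTHS = {"thread": 1, "warp": 32, "block": 256}


def pick_granularity(size):
    """Toy stand-in for a learned predictor: match lane width to subtask size."""
    if size < 32:
        return "thread"
    if size < 256:
        return "warp"
    return "block"


def idle_lanes(size, width):
    """Lanes allocated but left without work when `size` items run `width` at a time."""
    rounds = -(-size // width)  # ceiling division
    return rounds * width - size


def aggregate_uniform(sizes, granularity="block"):
    """One-size-fits-all: every subtask gets the same granularity."""
    return {granularity: list(sizes)}


def aggregate_by_size(sizes):
    """Size-aware: group subtasks by the granularity the toy rule picks."""
    groups = defaultdict(list)
    for s in sizes:
        groups[pick_granularity(s)].append(s)
    return dict(groups)


def total_idle(groups):
    """Sum idle lanes over all subtasks in all groups."""
    return sum(
        idle_lanes(s, WIDTHS[g]) for g, members in groups.items() for s in members
    )


# Hypothetical irregular workload, e.g. per-vertex neighbor counts in a graph.
sizes = [3, 10, 40, 70, 300, 1000]
uniform = aggregate_uniform(sizes)
binned = aggregate_by_size(sizes)

print("uniform idle lanes:", total_idle(uniform))
print("size-aware idle lanes:", total_idle(binned))
```

Under this simplified metric, treating every subtask as block-sized leaves many lanes idle for the small subtasks, while grouping by size reduces that waste. A real strategy would also have to weigh factors the metric omits, which is where a performance model of the kind the abstract describes would come in.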
