Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS). Pub Date: 2016-09-01. DOI: 10.1109/MASCOTS.2016.58

Islam Harb, W. Feng


Citations: 3

Abstract

The lack of support for explicit synchronization between streaming multiprocessors (SMs) in GPUs adversely impacts their ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit and implicit CPU-based mechanisms and dynamic parallelism (DP). Although this topic has been addressed in previous research, there has been neither a solid quantification of the synchronization overhead nor guidance on when to use each approach. Therefore, we quantify the synchronization overhead relative to the number of kernel launches and the input data size. This quantification, in turn, provides insight into when to use each of these synchronization mechanisms in a target application. Our results show that implicit CPU synchronization incurs a significant overhead that hurts application performance for medium to large data sizes with a relatively large number of kernel launches (i.e., ~1100-5000). Hence, we recommend explicit CPU synchronization for these configurations. In addition, among the three approaches, we conclude that DP is the most efficient for small data sizes (i.e., ~128K bytes), regardless of the number of kernel launches. DP also implicitly performs inter-block (i.e., global) synchronization with no CPU intervention, and therefore significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces power consumption by ~8-10%. However, DP-based synchronization is a trade-off: the power savings are accompanied by a ~2-5% performance loss.
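The power/performance trade-off quoted above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper. It assumes that energy equals average power times runtime, that the ~8-10% power reduction applies to the average power of the whole run, and that a ~2-5% performance loss means the runtime grows by that fraction. The function name `relative_energy` and its parameters are hypothetical.

```python
def relative_energy(power_reduction: float, slowdown: float) -> float:
    """Energy of a DP run relative to the baseline (1.0 means equal energy).

    Assumptions (not stated in the abstract):
      - energy = average power x runtime
      - `power_reduction` applies to average power over the whole run
      - a performance loss of `slowdown` lengthens the runtime by that fraction
    """
    return (1.0 - power_reduction) * (1.0 + slowdown)


# Corner cases of the ranges quoted in the abstract
# (~8-10% power reduction, ~2-5% performance loss).
worst = relative_energy(power_reduction=0.08, slowdown=0.05)  # smallest saving
best = relative_energy(power_reduction=0.10, slowdown=0.02)   # largest saving

print(f"relative energy, smallest saving: {worst:.3f}")
print(f"relative energy, largest saving:  {best:.3f}")
```

Under these assumptions the relative energy stays below 1.0 across the quoted ranges, which would correspond to an energy saving of roughly 3% to 8%. If the reported power reduction covers only the CPU and PCIe rather than the whole system, the net saving would be smaller.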

