Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, W. Feng
{"title":"Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels","authors":"Islam Harb, W. Feng","doi":"10.1109/MASCOTS.2016.58","DOIUrl":null,"url":null,"abstract":"There is a lack of support for explicit synchronization in GPUs between the streaming multiprocessors (SMs) adversely impacts the performance of the GPUs to efficiently perform inter-block communication. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous research studies, there has been neither a solid quantification of such overhead, nor guidance on when to use each of the different approaches. Therefore, we quantify the synchronization overhead relative to the number of kernel launches and the input data sizes. The quantification, in turn, provides insight as to when to use each of the aforementioned synchronization mechanisms in a target application. Our results show that implicit CPU synchronization has a significant overhead that hurts the application performance when using medium to large data sizes with relatively large number of kernel launches (i.e. ~1100-5000). Hence, it is recommended to use explicit CPU synchronization with these configurations. In addition, among the three different approaches, we conclude that dynamic parallelism (DP) is the most efficient with small data sizes (i.e., ~128k bytes), regardless of the number of kernel launches. Also, Dynamic Parallelism (DP), implicitly, performs inter-block (i.e. global) synchronization with no CPU intervention. Therefore, DP significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces the power consumption by ~8-10%. However, DP-based synchronization is a trade-off, in which it is accompanied by ~2-5% performance loss.","PeriodicalId":129389,"journal":{"name":"2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOTS.2016.58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

There is a lack of support for explicit synchronization in GPUs between the streaming multiprocessors (SMs) adversely impacts the performance of the GPUs to efficiently perform inter-block communication. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous research studies, there has been neither a solid quantification of such overhead, nor guidance on when to use each of the different approaches. Therefore, we quantify the synchronization overhead relative to the number of kernel launches and the input data sizes. The quantification, in turn, provides insight as to when to use each of the aforementioned synchronization mechanisms in a target application. Our results show that implicit CPU synchronization has a significant overhead that hurts the application performance when using medium to large data sizes with relatively large number of kernel launches (i.e. ~1100-5000). Hence, it is recommended to use explicit CPU synchronization with these configurations. In addition, among the three different approaches, we conclude that dynamic parallelism (DP) is the most efficient with small data sizes (i.e., ~128k bytes), regardless of the number of kernel launches. Also, Dynamic Parallelism (DP), implicitly, performs inter-block (i.e. global) synchronization with no CPU intervention. Therefore, DP significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces the power consumption by ~8-10%. However, DP-based synchronization is a trade-off, in which it is accompanied by ~2-5% performance loss.
GPU内核高效同步的性能和功耗特征
gpu缺乏对流多处理器(SMs)之间显式同步的支持,这对gpu有效执行块间通信的性能产生了不利影响。在本文中,我们提出了几种使用显式/隐式基于cpu和动态并行(DP)机制的块间同步方法。虽然这一主题在以前的研究中已经得到了解决,但既没有对这种开销进行可靠的量化,也没有关于何时使用每种不同方法的指导。因此,我们量化了与内核启动次数和输入数据大小相关的同步开销。反过来,量化提供了在目标应用程序中何时使用上述每种同步机制的洞察力。我们的结果表明,当使用中大型数据量和相对大量的内核启动(例如~1100-5000)时,隐式CPU同步具有显著的开销,会损害应用程序的性能。因此,建议对这些配置使用显式CPU同步。此外,在三种不同的方法中,我们得出结论,动态并行(DP)对于小数据大小(即~128k字节)是最有效的,而不管内核启动的次数。此外,动态并行(DP)隐式地执行块间(即全局)同步,而无需CPU干预。因此,DP可以显著降低CPU和PCIe的功耗,实现全局同步。我们的研究结果表明,DP降低了约8-10%的功耗。然而,基于dp的同步是一种权衡,伴随着~2-5%的性能损失。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信