Parallelizing general histogram application for CUDA architectures

Ugljesa Milic, Isaac Gelado, Nikola Puzovic, Alex Ramírez, M. Tomasevic
{"title":"Parallelizing general histogram application for CUDA architectures","authors":"Ugljesa Milic, Isaac Gelado, Nikola Puzovic, Alex Ramírez, M. Tomasevic","doi":"10.1109/SAMOS.2013.6621100","DOIUrl":null,"url":null,"abstract":"Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This especially holds in case of platforms that contain one or several massively parallel devices like CUDA-capable GPUs due to issues with domain decomposition, use of global memory and similar. In this paper we compare two approaches for implementing general purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second one uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of input collection, number of bins, and the type and distribution of input elements. We also implement overlapping of data transfers between host CPU and CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, privatization strategy can be up to 2x faster than sort-search with realistic inputs, but can only support a limited number of bins. On the other hand, sort-search strategy has about 50% higher speedup than privatization when we use characters as input and can support unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of input parameters.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SAMOS.2013.6621100","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This especially holds in case of platforms that contain one or several massively parallel devices like CUDA-capable GPUs due to issues with domain decomposition, use of global memory and similar. In this paper we compare two approaches for implementing general purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second one uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of input collection, number of bins, and the type and distribution of input elements. We also implement overlapping of data transfers between host CPU and CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, privatization strategy can be up to 2x faster than sort-search with realistic inputs, but can only support a limited number of bins. On the other hand, sort-search strategy has about 50% higher speedup than privatization when we use characters as input and can support unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of input parameters.
CUDA架构下的并行化通用直方图应用
直方图是数据分析中常用的一种工具。尽管它的串行版本很容易实现,但提供一种高效且可扩展的方式来并行化它可能是一项挑战。这尤其适用于包含一个或多个大规模并行设备(如CUDA-capable gpu)的平台,因为存在域分解、全局内存使用等问题。在本文中,我们比较了在gpu上实现通用直方图的两种方法。第一种算法基于存储在共享内存中的每个线程块的bin计数器的私有副本。第二个使用Thrust库对输入元素进行排序,然后根据bin宽度搜索上限。对于这两种算法,我们分析了顺序版本的加速如何取决于输入集合的大小、箱子的数量以及输入元素的类型和分布。我们还通过内核执行实现了主机CPU和CUDA设备之间数据传输的重叠。对于这两种算法,我们详细分析了它们的优缺点。例如,私有化策略可以比实际输入的排序搜索快2倍,但只能支持有限数量的箱子。另一方面,当我们使用字符作为输入时,排序搜索策略的加速速度比私有化高50%左右,并且可以支持无限数量的箱子。最后,我们根据输入参数的特征和值进行了探索,以确定最优算法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信