Parallelizing general histogram application for CUDA architectures

2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). Pub Date: 2013-07-15. DOI: 10.1109/SAMOS.2013.6621100

Ugljesa Milic, Isaac Gelado, Nikola Puzovic, Alex Ramírez, M. Tomasevic


Citations: 13

Abstract

Histogramming is a tool commonly used in data analysis. Although the serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory, and similar concerns. In this paper we compare two approaches for implementing general-purpose histogramming on GPUs. The first algorithm keeps a private copy of the bin counters in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then searches for the upper bound of each bin according to the bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also overlap data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can only support a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when the input consists of characters, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
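The two strategies compared in the abstract can be illustrated with a minimal host-side C++ sketch. This is not the authors' CUDA or Thrust code. It assumes unsigned integer inputs and equal-width bins, and the names `histogram_privatized` and `histogram_sort_search` are hypothetical. In the privatized version, each input chunk stands in for one thread block with its own counters. In the sort-search version, the standard-library calls correspond to the same-named Thrust algorithms (`thrust::sort`, `thrust::upper_bound`, `thrust::adjacent_difference`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Histogram = std::vector<std::size_t>;

// Strategy 1 (privatization), emulated on the host: each chunk of the input
// plays the role of one thread block and fills its own private counters,
// which are then merged into the final histogram.
// Assumption: values >= num_bins * bin_width are ignored.
Histogram histogram_privatized(const std::vector<unsigned>& data,
                               unsigned bin_width, std::size_t num_bins,
                               std::size_t num_chunks) {
    assert(bin_width > 0 && num_bins > 0 && num_chunks > 0);
    Histogram result(num_bins, 0);
    const std::size_t chunk = (data.size() + num_chunks - 1) / num_chunks;
    for (std::size_t c = 0; c < num_chunks; ++c) {
        Histogram priv(num_bins, 0);  // stands in for shared-memory counters
        const std::size_t begin = std::min(c * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            const std::size_t b = data[i] / bin_width;
            if (b < num_bins) ++priv[b];
        }
        // Merge the private copy into the global result.
        for (std::size_t b = 0; b < num_bins; ++b) result[b] += priv[b];
    }
    return result;
}

// Strategy 2 (sort-search): sort the input, find for every bin the position
// just past its largest value, then difference the cumulative counts.
Histogram histogram_sort_search(std::vector<unsigned> data,  // sorted copy
                                unsigned bin_width, std::size_t num_bins) {
    assert(bin_width > 0 && num_bins > 0);
    std::sort(data.begin(), data.end());
    Histogram cumulative(num_bins, 0);
    for (std::size_t b = 0; b < num_bins; ++b) {
        // Largest value that still belongs to bin b.
        const unsigned long long last = (b + 1ULL) * bin_width - 1ULL;
        cumulative[b] = static_cast<std::size_t>(
            std::upper_bound(data.begin(), data.end(), last) - data.begin());
    }
    Histogram hist(num_bins, 0);
    std::adjacent_difference(cumulative.begin(), cumulative.end(), hist.begin());
    return hist;
}
```

The sketch also hints at the trade-off the abstract reports: the privatized version needs one full set of counters per chunk, which on a GPU is bounded by shared-memory capacity, whereas the sort-search version only needs storage proportional to the number of bins for the final result.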

