Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

2012 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2012-07-02. DOI: 10.1109/HPCSim.2012.6266966

Pedro J. Martín, Luis F. Ayuso, Roberto Torres, Antonio Gavilanes


Citations: 31

Abstract

Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. The host usually invokes these primitives many times, so their throughput has a large impact on overall system performance. Novel algorithmic strategies that optimize their implementation on current devices are therefore of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Our first contribution is therefore an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations and integrate them into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
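As an illustration of the terms used above, the following is a minimal CPU-side sketch in Python, not the paper's CUDA implementation. It models a tree-based reduction (pairwise combination in passes), a sequential-based reduction (each worker folds a contiguous chunk serially, then the partial results are combined), and a segmented reduction. All function names, the `num_workers` parameter, and the representation of segments as a list of lengths are assumptions made for this example; non-empty segments are assumed.

```python
from operator import add
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_reduce(data: Sequence[T], op: Callable[[T, T], T]) -> T:
    """Tree-based model: every pass combines adjacent pairs, so one pass
    stands in for one parallel step and n items need about log2(n) passes."""
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    level = list(data)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd count: carry the last element unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def sequential_reduce(data: Sequence[T], op: Callable[[T, T], T],
                      num_workers: int = 4) -> T:
    """Sequential-based model: each hypothetical worker folds one contiguous
    chunk serially; the per-worker partial results are then combined."""
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    chunk = -(-len(data) // num_workers)  # ceiling division
    partials: List[T] = []
    for start in range(0, len(data), chunk):
        acc = data[start]
        for x in data[start + 1:start + chunk]:
            acc = op(acc, x)
        partials.append(acc)
    return tree_reduce(partials, op)


def segmented_reduce(data: Sequence[T], seg_lengths: Sequence[int],
                     op: Callable[[T, T], T], reducer=tree_reduce) -> List[T]:
    """Segmented model: reduce each independent segment on its own.
    Segments are given as lengths (an assumption of this sketch)."""
    if sum(seg_lengths) != len(data):
        raise ValueError("segment lengths must cover the input exactly")
    out: List[T] = []
    start = 0
    for length in seg_lengths:
        out.append(reducer(data[start:start + length], op))
        start += length
    return out


if __name__ == "__main__":
    values = list(range(1, 11))
    print(tree_reduce(values, add))
    print(sequential_reduce(values, add, num_workers=3))
    print(segmented_reduce([1, 2, 3, 4, 5, 6], [1, 2, 3], add))
```

Both models combine elements in their original order, so they only rely on the operator being associative, as in the definition above, and not on commutativity. The sketch says nothing about throughput; the performance comparison between the two families is the paper's experimental contribution.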

