Padding free bank conflict resolution for CUDA-based matrix transpose algorithm

15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)
Pub Date: 2014-08-01
DOI: 10.1109/SNPD.2014.6888709

A. Khan, M. Al-Mouhamed, Allam Fatayar, A. Almousa, A. Baqais, M. Assayony


Citations: 6

Abstract

Matrix transposition is an important linear algebra procedure with a deep impact on many computational science and engineering applications. Several factors hinder the expected performance of large matrix transposes on Graphics Processing Units (GPUs). The performance degradation stems from memory access patterns, namely uncoalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed that address both issues by ensuring coalesced global-memory access and conflict-free shared-memory bank access. The proposed algorithms have execution times comparable to the NVIDIA SDK bank-conflict-free matrix transpose implementation. The main advantage of the proposed algorithms is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T × T) of the problem space. By contrast, to the best of our knowledge, previously published approaches need to allocate a padded tile of T × (T + 1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% improvement in performance.

