In-place transposition of rectangular matrices on accelerators

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming. Pub Date: 2014-02-06. DOI: 10.1145/2555243.2555266

I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil Mata, Wen-mei W. Hwu


Citations: 24

Abstract

Matrix transposition is an important building block for many numeric algorithms, such as FFT. It is also used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures because of their limited on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the parallelism and locality that GPUs require to achieve good performance. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm in which each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows hiding the transposition overhead by overlapping it with the PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves a 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded implementations of in-place transposition on the CPU.
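For readers unfamiliar with the problem, the following is a minimal sketch of the classic cycle-following approach to in-place transposition of a rectangular matrix, written in Python. It is not the 3-stage algorithm described in the abstract; it only illustrates the kind of sequential CPU technique the abstract contrasts against. The function name `transpose_in_place` is hypothetical, and the sketch keeps a one-byte-per-element visited flag for simplicity, so it is in-place with respect to the matrix data but not strictly constant in extra space.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored in the flat list `a`.

    After the call, `a` holds the cols x rows transpose in row-major order.
    Illustrative sketch only (hypothetical helper, not the paper's method).
    """
    n = rows * cols
    if len(a) != n:
        raise ValueError("len(a) must equal rows * cols")
    if n < 2:
        return
    last = n - 1
    # The element at flat index i = r*cols + c must move to c*rows + r,
    # which equals (i * rows) % last for every i < last. The first and
    # last elements never move.
    visited = bytearray(n)  # simplification: one flag per element
    for start in range(1, last):
        if visited[start]:
            continue
        idx = start
        carried = a[start]
        while True:
            dest = (idx * rows) % last
            # Drop the carried value at its destination, pick up the
            # value that was stored there.
            a[dest], carried = carried, a[dest]
            visited[dest] = 1
            idx = dest
            if idx == start:
                break


m = [1, 2, 3,
     4, 5, 6]          # 2 x 3, row-major
transpose_in_place(m, 2, 3)
print(m)               # 3 x 2 result, row-major
```

The cycles of this permutation vary in length and touch memory at scattered addresses, which gives some intuition for why a direct port to a GPU would offer little parallelism or locality.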

