Small Discrete Fourier Transforms on GPUs

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. Pub Date: 2011-05-23. DOI: 10.1109/CCGrid.2011.14

S. Mitra, A. Srinivasan


Citations: 9

Abstract

Efficient implementations of the Discrete Fourier Transform (DFT) for GPUs provide good performance with large data sizes, but are not competitive with CPU code for small data sizes. On the other hand, several applications perform multiple DFTs on small data sizes. In fact, even algorithms for large data sizes use a divide-and-conquer approach, where eventually small DFTs need to be performed. We discuss our DFT implementation, which is efficient for multiple small DFTs. One feature of our implementation is the use of the asymptotically slow matrix multiplication approach for small data sizes, which improves performance on the GPU due to its regular memory access and computational patterns. We combine this algorithm with the mixed radix algorithm for 1-D, 2-D, and 3-D complex DFTs. We also demonstrate the effect of different optimization techniques. When GPUs are used to accelerate a component of an application running on the host, it is important that decisions taken to optimize the GPU performance not affect the performance of the rest of the application on the host. One feature of our implementation is that we use a data layout that is not optimal for the GPU so that the overall effect on the application is better. Our implementation performs up to two orders of magnitude faster than cuFFT on an NVIDIA GeForce 9800 GTX GPU and up to one to two orders of magnitude faster than FFTW on a CPU for multiple small DFTs. Furthermore, we show that our implementation can accelerate the performance of a Quantum Monte Carlo application for which cuFFT is not effective. The primary contributions of this work lie in demonstrating the utility of the matrix multiplication approach and also in providing an implementation that is efficient for small DFTs when a GPU is used to accelerate an application running on the host.
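The abstract's two algorithmic ideas, computing small DFTs as dense matrix products and combining that with a mixed-radix decomposition, can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation: the function names (dft_matrix, dft_matmul, batched_dft, mixed_radix_dft) and the two-factor split n = n1 * n2 are assumptions chosen for illustration, and the code only shows why the matrix form has a regular, data-independent access pattern that can be reused across many small transforms.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix W with W[j][k] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def dft_matmul(w, x):
    """O(n^2) DFT of one vector as a dense matrix-vector product."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) for row in w]


def batched_dft(batch):
    """Transform many equal-length vectors, reusing a single DFT matrix."""
    w = dft_matrix(len(batch[0]))
    return [dft_matmul(w, x) for x in batch]


def mixed_radix_dft(x, n1, n2):
    """Length n1*n2 DFT via a Cooley-Tukey split; small DFTs use matmul."""
    n = n1 * n2
    assert len(x) == n
    w1, w2 = dft_matrix(n1), dft_matrix(n2)
    # Step 1: n1 DFTs of length n2 over the strided inputs x[a], x[a+n1], ...
    inner = [dft_matmul(w2, x[a::n1]) for a in range(n1)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*a*k2/n).
    for a in range(n1):
        for k2 in range(n2):
            inner[a][k2] *= cmath.exp(-2j * cmath.pi * a * k2 / n)
    # Step 3: n2 DFTs of length n1 across a; output index is k2 + n2*k1.
    out = [0j] * n
    for k2 in range(n2):
        col = dft_matmul(w1, [inner[a][k2] for a in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out
```

For example, mixed_radix_dft(x, 3, 4) on a length-12 input should agree with dft_matmul(dft_matrix(12), x) up to floating-point rounding. In the sketch, every transform of a given size performs the same sequence of multiply-adds regardless of the data, which is the regularity the abstract credits for the GPU speedup.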
