CUDA gpu的自动调优3-D FFT库

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI:10.1145/1654059.1654090

Akira Nukada, S. Matsuoka

{"title":"CUDA gpu的自动调优3-D FFT库","authors":"Akira Nukada, S. Matsuoka","doi":"10.1145/1654059.1654090","DOIUrl":null,"url":null,"abstract":"Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance that has been applied comprehensively to bandwidth intensive and complex kernels such as 3-D FFTs. Bandwidth intensive optimizations such as selecting the number of threads and inserting padding to avoid bank conflicts on shared memory are systematically applied. Our resulting autotuner is fast and results in performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover exhibits stable performance irrespective of problem sizes or the underlying GPU hardware.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"8 9","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"140","resultStr":"{\"title\":\"Auto-tuning 3-D FFT library for CUDA GPUs\",\"authors\":\"Akira Nukada, S. Matsuoka\",\"doi\":\"10.1145/1654059.1654090\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance that has been applied comprehensively to bandwidth intensive and complex kernels such as 3-D FFTs. Bandwidth intensive optimizations such as selecting the number of threads and inserting padding to avoid bank conflicts on shared memory are systematically applied. Our resulting autotuner is fast and results in performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover exhibits stable performance irrespective of problem sizes or the underlying GPU hardware.\",\"PeriodicalId\":371415,\"journal\":{\"name\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"volume\":\"8 9\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"140\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1654059.1654090\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1654059.1654090","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 140

摘要

gpu上现有的fft实现针对特定的变换大小(如2的幂)进行了优化，并且表现出不稳定和峰值性能，即在实践中出现的其他大小中表现不佳。我们在CUDA上的新自动调整3-D FFT为不同变换大小的FFT生成高性能的CUDA内核，缓解了这个问题。虽然自动调优已经在gpu上实现了密集内核(如DGEMM和stencils)，但这是第一个全面应用于带宽密集型和复杂内核(如3-D fft)的实例。系统地应用了带宽密集型优化，例如选择线程数量和插入填充以避免共享内存上的银行冲突。我们得到的自动调谐器速度很快，其性能基本上超过了迄今为止在单个处理器上实现的所有3-D FFT，而且无论问题大小或底层GPU硬件如何，都表现出稳定的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Auto-tuning 3-D FFT library for CUDA GPUs

Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance that has been applied comprehensively to bandwidth intensive and complex kernels such as 3-D FFTs. Bandwidth intensive optimizations such as selecting the number of threads and inserting padding to avoid bank conflicts on shared memory are systematically applied. Our resulting autotuner is fast and results in performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover exhibits stable performance irrespective of problem sizes or the underlying GPU hardware.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

自引率

0.00%

发文量