Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2008-11-15. DOI: 10.1145/1413370.1413376

Akira Nukada, Y. Ogata, Toshio Endo, S. Matsuoka

Citations: 129

Abstract

Most GPU performance "hypes" have focused around tightly-coupled applications with small memory bandwidth requirements, e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, being more than three times faster than any existing FFT implementations on GPUs including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed stride memory accesses. Our kernel applied to real applications achieves orders of magnitude boost in power & cost vs. performance metrics. The off-card bandwidth limitation is still an issue, which could be alleviated somewhat with application kernels confinement within the card, while ideal solution being facilitation of faster GPU interfaces.
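To put the "nearly 80 GFLOPS" figure in context, the sketch below converts a GFLOPS rate into time per transform. It assumes the widely used nominal count of 5·N·log2(N) floating-point operations for a complex FFT over N points; the abstract does not state which convention or transform size the authors used, so the 256 × 256 × 256 size and the helper names `fft3d_flops` and `seconds_at_rate` are hypothetical and chosen only for illustration.

```python
import math


def fft3d_flops(n: int) -> float:
    """Nominal flop count for a complex n x n x n FFT.

    Assumes the common 5 * N * log2(N) convention, where N = n**3 is the
    total number of points. This is an assumed convention, not one stated
    in the abstract.
    """
    total_points = n ** 3
    return 5.0 * total_points * math.log2(total_points)


def seconds_at_rate(flops: float, gflops: float) -> float:
    """Time in seconds to execute `flops` operations at `gflops` GFLOPS."""
    return flops / (gflops * 1e9)


if __name__ == "__main__":
    n = 256  # hypothetical transform size, not taken from the abstract
    flops = fft3d_flops(n)
    t = seconds_at_rate(flops, 80.0)  # 80 GFLOPS is the figure quoted above
    print(f"{flops / 1e9:.2f} GFLOP per transform, {t * 1e3:.1f} ms at 80 GFLOPS")
```

Under these assumptions a single transform costs about 2 GFLOP of arithmetic, so timings are on the order of tens of milliseconds. That is consistent with the abstract's concern that off-card data transfer can dominate unless the application keeps its data on the card.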
