Accelerating non-power-of-2 size Fourier transforms with GPU Tensor Cores

Louis Pisha, Lukasz Ligowski
{"title":"使用GPU张量内核加速非2次幂大小的傅里叶变换","authors":"Louis Pisha, Lukasz Ligowski","doi":"10.1109/IPDPS49936.2021.00059","DOIUrl":null,"url":null,"abstract":"Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes—which may be prime or have large prime factors—is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units included in modern GPUs which perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. We present key insights in this approach, including complex number representation, efficient mapping of odd sizes to Tensor Cores (whose dimensions are all powers of 2), and adding a size 2 or size 4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation which meets the accuracy and exceeds the speed of the state of the art.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Accelerating non-power-of-2 size Fourier transforms with GPU Tensor Cores\",\"authors\":\"Louis Pisha, Lukasz Ligowski\",\"doi\":\"10.1109/IPDPS49936.2021.00059\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes—which may be prime or have large prime factors—is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units included in modern GPUs which perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. 
We present key insights in this approach, including complex number representation, efficient mapping of odd sizes to Tensor Cores (whose dimensions are all powers of 2), and adding a size 2 or size 4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation which meets the accuracy and exceeds the speed of the state of the art.\",\"PeriodicalId\":372234,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS49936.2021.00059\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes—which may be prime or have large prime factors—is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units included in modern GPUs which perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. We present key insights in this approach, including complex number representation, efficient mapping of odd sizes to Tensor Cores (whose dimensions are all powers of 2), and adding a size 2 or size 4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation which meets the accuracy and exceeds the speed of the state of the art.
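Since only the abstract is available here, the following NumPy sketch is offered purely as an illustration of two ideas it mentions, not as the authors' CUDA/Tensor Core implementation: (1) expressing a DFT of an awkward (here, prime) size as a dense matrix multiply whose operands are zero-padded up to a multiple of a power-of-2 tile size, and (2) recovering FP32-like accuracy from lower-precision multiplies by splitting each operand into a high part (exactly representable at the lower precision) and a low-order remainder. The function names (`padded_dft`, `split_matmul`), the tile size of 16, and the two-way split scheme are assumptions chosen for clarity; the paper's actual method (including its complex-number representation and the fused size-2/4 epilogue) may differ.

```python
# Illustrative NumPy sketch (assumed for exposition; not the paper's Tensor Core code).
import numpy as np


def dft_matrix(n):
    """Dense DFT matrix F with F[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)


def padded_dft(x, tile=16):
    """DFT of each row of x via a dense matrix multiply, zero-padding the DFT
    matrix and the data up to a multiple of `tile`, mimicking how an odd or
    prime size could be mapped onto power-of-2 Tensor Core tiles. (On real
    hardware the complex product would also be decomposed into real multiplies.)"""
    batch, n = x.shape
    n_pad = -(-n // tile) * tile                       # round n up to a multiple of tile
    F = np.zeros((n_pad, n_pad), dtype=np.complex64)
    F[:n, :n] = dft_matrix(n)                          # padded rows/columns stay zero
    x_pad = np.zeros((batch, n_pad), dtype=np.complex64)
    x_pad[:, :n] = x
    return (x_pad @ F.T)[:, :n]                        # drop the padded outputs


def split_matmul(a, b):
    """Real-valued matmul with FP32-like accuracy from FP16-representable operands:
    split each FP32 matrix into a high part (rounded to FP16) and a low-order
    remainder, then sum three partial products. The FP32 partial products here
    stand in for Tensor Core FP16-input / FP32-accumulate multiplies."""
    a_hi = a.astype(np.float16).astype(np.float32)
    a_lo = a - a_hi
    b_hi = b.astype(np.float16).astype(np.float32)
    b_lo = b - b_hi
    # a_lo @ b_lo is dropped: its contribution is below FP32 round-off.
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Prime transform size 83: awkward for factorized FFTs, trivial for a matmul DFT.
    x = (rng.standard_normal((4, 83)) + 1j * rng.standard_normal((4, 83))).astype(np.complex64)
    err = np.abs(padded_dft(x) - np.fft.fft(x, axis=1)).max()
    print(f"padded-DFT max abs error vs np.fft.fft: {err:.2e}")

    # The split recovers most of the accuracy lost by rounding operands to FP16.
    a = rng.standard_normal((64, 64)).astype(np.float32)
    b = rng.standard_normal((64, 64)).astype(np.float32)
    ref = a.astype(np.float64) @ b.astype(np.float64)
    hi_only = a.astype(np.float16).astype(np.float32) @ b.astype(np.float16).astype(np.float32)
    print(f"split error:     {np.abs(split_matmul(a, b) - ref).max():.2e}")
    print(f"high-only error: {np.abs(hi_only - ref).max():.2e}")
```

The point of the padding is that, once the operands are rounded up to tile multiples, the entire transform runs as uniform dense matrix multiply-adds, which is what lets Tensor Cores' higher throughput outweigh the extra arithmetic for sizes that are difficult to factorize.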