Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuan-Xi Peng, Cui Wang
ACM Transactions on Architecture and Code Optimization, pp. 1-23. Published 2023-06-19. DOI: 10.1145/3605148 (https://doi.org/10.1145/3605148)
MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework
The fast Fourier transform (FFT) is widely used in large-scale parallel applications, where data communication is the main performance bottleneck and seriously limits parallel efficiency. To tackle this problem, we propose MFFT, a new large-scale FFT framework that optimizes parallel FFT with a mixed-precision technique following a “high-precision computation, low-precision communication” strategy. To enable low-precision communication, we propose a shared-exponent floating-point compression technique, which reduces the volume of communicated data while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce round-off error. On top of the mixed-precision framework, we apply several performance optimizations, including streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is on average 1.23× faster than double-precision MFFT, and that double-precision MFFT achieves on average 3.53× and 9.48× higher performance than the open-source libraries 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. Double-precision MFFT raises parallel efficiency from the 53.2% of 2Decomp&FFT to 78.1%, and shared-exponent MFFT further increases it to 83.8%.
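The abstract does not spell out the shared-exponent format, so the following is only a minimal sketch of the general idea, not MFFT's actual encoding. It assumes that a block of double-precision values is sent as one exponent (taken from the largest magnitude in the block) plus one fixed-width signed integer mantissa per value. The function names `compress_block` and `decompress_block` and the `mantissa_bits` parameter are hypothetical, and non-finite inputs are not handled.

```python
import math
from typing import List, Tuple


def compress_block(values: List[float], mantissa_bits: int = 24) -> Tuple[int, List[int]]:
    """Encode a block as one shared exponent plus signed integer mantissas.

    Illustrative assumption: the shared exponent is that of the largest
    magnitude in the block, so smaller values lose low-order bits.
    """
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return 0, [0] * len(values)

    # frexp gives peak == m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(peak)

    # Scale every value so that the peak fills the available mantissa bits.
    scale_exp = (mantissa_bits - 1) - shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1

    mantissas = []
    for v in values:
        q = round(math.ldexp(v, scale_exp))
        # Clamp the rare case where rounding pushes the peak one step too far.
        mantissas.append(max(-limit, min(limit, q)))
    return shared_exp, mantissas


def decompress_block(shared_exp: int, mantissas: List[int], mantissa_bits: int = 24) -> List[float]:
    """Rebuild doubles from the shared exponent and integer mantissas."""
    scale_exp = shared_exp - (mantissa_bits - 1)
    return [math.ldexp(float(m), scale_exp) for m in mantissas]


if __name__ == "__main__":
    block = [1.0, -0.5, 0.123456789, 3.75, 0.0]
    exp, ints = compress_block(block, mantissa_bits=16)
    restored = decompress_block(exp, ints, mantissa_bits=16)
    for original, approx in zip(block, restored):
        print(f"{original:+.9f} -> {approx:+.9f}")
```

In this sketch, a block of n doubles (64n bits) shrinks to roughly `mantissa_bits` × n bits plus one exponent, and the absolute error per value is bounded by 2^(shared_exp − mantissa_bits + 1). The computation before and after the exchange stays in double precision, which is the sense in which the strategy is "high-precision computation, low-precision communication".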
About the journal:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.