Semyon Khokhriakov, Ravi Reddy Manumachu, Alexey L. Lastovetsky
2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), published 2018-12-01. DOI: 10.1109/HIPCW.2018.8634318 (https://doi.org/10.1109/HIPCW.2018.8634318)
Performance Optimization of Multithreaded 2D FFT on Multicore Processors: Challenges and Solution Approaches
Fast Fourier transform (FFT) is a key routine in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on the latest multicore platforms is therefore of paramount concern to the high performance computing community. However, the inherent complexities of these platforms, such as severe resource contention and non-uniform memory access (NUMA), pose formidable challenges. In this work, we study the performance profiles of the multithreaded 2D fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, on a modern Intel Haswell multicore processor with thirty-six cores. First, we show that all three routines exhibit drastic performance variations, so their average performance is considerably lower than their peak performance. The ratios of average to peak performance for the 2D FFT routines from the three packages are 40%, 30%, and 24%. We demonstrate that both the average and the peak performance of FFTW-2.1.5, last updated in 1999, are better than those of FFTW-3.3.7, suggesting that extensive machine optimization using architecture-specific techniques can be harmful in the long run, since hardware platforms undergo drastic changes. We also show that while the average performance of Intel MKL FFT is better than that of FFTW-3.3.7, it is outperformed by FFTW-3.3.7 for many problem sizes. In addition, the width of the performance variations for Intel MKL FFT is severe compared to that for FFTW-3.3.7. Based on our study, we conclude that improving the average performance of FFT by removing performance variations on modern multicore processors constitutes a tremendous research challenge. We propose three possible solution approaches to remove the performance variations and suggest future directions.
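To make the average-to-peak metric concrete, the following is a minimal sketch of how such a ratio could be computed from a performance profile, that is, a list of measured speeds across problem sizes. The function name `average_to_peak_ratio` and the sample speeds are hypothetical and chosen only for illustration; they are not measurements from the study, and the study's exact averaging procedure may differ.

```python
from statistics import mean


def average_to_peak_ratio(speeds):
    """Return mean(speeds) / max(speeds) for a non-empty list of positive speeds.

    `speeds` is assumed to hold one measured speed per problem size,
    all in the same unit (for example MFLOPs).
    """
    if not speeds:
        raise ValueError("speeds must be non-empty")
    peak = max(speeds)
    if peak <= 0:
        raise ValueError("peak speed must be positive")
    return mean(speeds) / peak


# Hypothetical profile: one speed per problem size (illustrative values only).
profile = [1000.0, 250.0, 500.0, 250.0]
ratio = average_to_peak_ratio(profile)
print(f"average/peak = {ratio:.0%}")  # prints "average/peak = 50%"
```

A profile with no variation gives a ratio of 100%, so the further the ratio falls below that, the more the routine's typical speed is being pulled down by slow problem sizes.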