Performance Optimization of Multithreaded 2D FFT on Multicore Processors: Challenges and Solution Approaches

Semyon Khokhriakov, Ravi Reddy Manumachu, Alexey L. Lastovetsky
Published in: 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)
Publication date: 2018-12-01
DOI: 10.1109/HIPCW.2018.8634318 (https://doi.org/10.1109/HIPCW.2018.8634318)
Citations: 4

Abstract

The fast Fourier transform (FFT) is a key routine in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on the latest multicore platforms is therefore of paramount concern to the high performance computing community. However, the inherent complexities of these platforms, such as severe resource contention and non-uniform memory access (NUMA), pose formidable challenges. In this work, we study the performance profiles of the multithreaded 2D fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, on a modern Intel Haswell multicore processor with thirty-six cores. First, we show that all three routines exhibit drastic performance variations, and therefore their average performance is considerably lower than their peak performance. The ratios of average to peak performance for the 2D FFT routines from the three packages are 40%, 30%, and 24%, respectively. We demonstrate that the average and peak performance of FFTW-2.1.5, last updated in 1999, are better than those of FFTW-3.3.7, suggesting that extensive machine optimization using architecture-specific techniques can be harmful in the long run, since hardware platforms undergo drastic changes. We also show that while the average performance of Intel MKL FFT is better than that of FFTW-3.3.7, it is outperformed by FFTW-3.3.7 for many problem sizes. In addition, the width of the performance variations for Intel MKL FFT is severe compared to that for FFTW-3.3.7. Based on our study, we conclude that improving the average performance of FFT by removing performance variations on modern multicore processors constitutes a tremendous research challenge. We propose three possible solution approaches to remove the performance variations and suggest future directions.
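The average-to-peak ratio quoted in the abstract can be illustrated with a minimal sketch. It assumes that "average" means the arithmetic mean of the measured speeds over all problem sizes in a performance profile and that "peak" means the maximum; the abstract does not state the exact definition. The function name `average_to_peak_ratio` and the sample speeds are hypothetical and are not data from the paper.

```python
from statistics import mean


def average_to_peak_ratio(speeds):
    """Return mean(speeds) / max(speeds) for a performance profile.

    `speeds` holds one measured speed per problem size (any consistent unit).
    Assumption: average = arithmetic mean, peak = maximum over the profile.
    """
    if not speeds:
        raise ValueError("performance profile is empty")
    peak = max(speeds)
    if peak <= 0:
        raise ValueError("peak speed must be positive")
    return mean(speeds) / peak


# Hypothetical speeds (arbitrary units), one per problem size; not measured data.
profile = [10.0, 2.0, 8.0, 1.0, 4.0]
ratio = average_to_peak_ratio(profile)
print(f"average/peak = {ratio:.0%}")  # prints "average/peak = 50%"
```

Under this definition, a profile with large dips between neighboring problem sizes yields a low ratio even when the peak is high, which is the effect the abstract describes.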