FFT data distribution in plane-waves DFT codes. A case study from Quantum ESPRESSO

F. Affinito, C. Cavazzoni
Proceedings of the 23rd European MPI Users' Group Meeting, published 2016-09-25. DOI: 10.1145/2966884.2966892 (https://doi.org/10.1145/2966884.2966892)

Abstract

Density Functional Theory (DFT) calculations with plane waves and pseudopotentials are among the most important simulation techniques in high-performance computing. Together with parallel linear algebra (ZGEMM and matrix diagonalization), the main bottleneck is the Fast Fourier Transform (FFT), which is required, for example, when the local potential is applied to the wavefunction. In these calculations, the cutoff on the plane waves translates into a spherical domain for the FFT. After a 1D FFT is performed on pencils distributed among processors, the data is transposed with an MPI_Alltoall and a 2D FFT is executed [2]. Typically, the workload of the FFT is not particularly high, since grid sizes do not exceed $(10^2\text{–}10^3)^3$. However, the load distribution is crucial, and the resulting cost of collective communications becomes a critical factor in achieving high parallel efficiency. Quantum ESPRESSO [3] is one of the most widely used plane-wave DFT codes in the materials science community. It has been successfully ported to and optimized on a large number of HPC infrastructures around the world. The parallel structure of Quantum ESPRESSO is based mainly on several layers of MPI communicators, plus a finer-grained OpenMP parallelization. Recently, the parallelization structure of the FFT was extensively refactored. The combination of two different data distributions, i.e. bands and taskgroups, allows the underlying hardware to be filled hierarchically and two different layers of communication to be tuned. In particular, given sufficient memory, by tuning the number of taskgroups one can fit all the data required to perform a single 3D FFT, reducing the impact of the MPI_Alltoall between the 1D and 2D FFTs. To better check the results of the parametrization of the parallel distributions, a miniapp [1] containing only the FFT kernel was extracted from the Quantum ESPRESSO distribution. This miniapp is also important for future code design activity targeting novel architectures. We present and discuss the profiling data obtained from the QE-FFT miniapp and the impact that the choice of parallelization parameters has on the communication pattern.
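The 1D-then-2D decomposition mentioned in the abstract can be illustrated with a small serial sketch. It is not taken from Quantum ESPRESSO or the QE-FFT miniapp: the function names (`dft1d`, `dft3d_staged`, `dft3d_direct`, `make_grid`, `max_abs_diff`) are hypothetical, a naive O(n^2) DFT stands in for a real FFT library, the grid is a full cube rather than the spherical domain, and the regrouping step only marks where an MPI_Alltoall would occur in a distributed run. Under those assumptions, it checks that transforming every z pencil and then every xy plane gives the same result as the full 3D transform.

```python
import cmath
import random


def dft1d(v):
    """Naive O(n^2) 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft3d_staged(a, n):
    """3D DFT of an n*n*n grid a[x][y][z] as 1D (z pencils) + 2D (xy planes)."""
    # Stage 1: 1D transform along z for every (x, y) pencil.
    pencils = [[dft1d(a[x][y]) for y in range(n)] for x in range(n)]
    # Regroup pencils into planes of constant kz. In a distributed code this
    # data exchange is where a collective such as MPI_Alltoall would sit.
    planes = [
        [[pencils[x][y][kz] for y in range(n)] for x in range(n)]
        for kz in range(n)
    ]
    # Stage 2: 2D transform of each plane, done as rows (y) then columns (x).
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for kz, plane in enumerate(planes):
        rows = [dft1d(row) for row in plane]  # rows[x][ky]
        cols = [dft1d([rows[x][ky] for x in range(n)]) for ky in range(n)]
        for kx in range(n):
            for ky in range(n):
                out[kx][ky][kz] = cols[ky][kx]
    return out


def dft3d_direct(a, n):
    """Reference: evaluate the full triple sum of the 3D DFT directly."""
    w = [cmath.exp(-2j * cmath.pi * m / n) for m in range(n)]
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for kx in range(n):
        for ky in range(n):
            for kz in range(n):
                out[kx][ky][kz] = sum(
                    a[x][y][z] * w[(kx * x + ky * y + kz * z) % n]
                    for x in range(n) for y in range(n) for z in range(n)
                )
    return out


def make_grid(n, seed=0):
    """Random complex test data on an n*n*n grid."""
    rng = random.Random(seed)
    return [
        [[complex(rng.random(), rng.random()) for _ in range(n)]
         for _ in range(n)]
        for _ in range(n)
    ]


def max_abs_diff(p, q):
    """Largest element-wise absolute difference between two 3D grids."""
    return max(
        abs(u - v)
        for pp, qp in zip(p, q) for pr, qr in zip(pp, qp)
        for u, v in zip(pr, qr)
    )


if __name__ == "__main__":
    n = 4
    grid = make_grid(n)
    diff = max_abs_diff(dft3d_staged(grid, n), dft3d_direct(grid, n))
    print("max |staged - direct| =", diff)
```

In this sketch the regrouping is a local list rearrangement. In a parallel run each process would own only some pencils before the exchange and only some planes after it, which is why the cost of the collective depends on how the data is distributed.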