GPU-enabled extreme-scale turbulence simulations: Fourier pseudo-spectral algorithms at the exascale using OpenMP offloading

IF 7.2 · CAS Zone 2 (Physics and Astronomy) · JCR Q1 (Computer Science, Interdisciplinary Applications)
P.K. Yeung, Kiran Ravikumar, Stephen Nichols, Rohini Uma-Vaideswaran
{"title":"GPU-enabled extreme-scale turbulence simulations: Fourier pseudo-spectral algorithms at the exascale using OpenMP offloading","authors":"P.K. Yeung ,&nbsp;Kiran Ravikumar ,&nbsp;Stephen Nichols ,&nbsp;Rohini Uma-Vaideswaran","doi":"10.1016/j.cpc.2024.109364","DOIUrl":null,"url":null,"abstract":"<div><p>Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a new record resolution up to 35 trillion grid points, on the world's first exascale computer, <em>Frontier</em>, comprising AMD MI250x GPUs with HPE's Slingshot interconnect and operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture involve performing almost all computations on the GPU which has the same memory capacity as the CPU, performing all-to-all communication among sets of parallel processes directly on the GPU, and targeting GPUs efficiently using OpenMP offloading for intensive number-crunching including 1-D Fast Fourier Transforms (FFT) performed using AMD ROCm library calls. With 99% of computing power on Frontier being on the GPU, leaving the CPU idle leads to a net performance gain via avoiding the overhead of data movement between host and device except when needed for some I/O purposes. Memory footprint including the size of communication buffers for MPI_ALLTOALL is managed carefully to maximize the largest problem size possible for a given node count.</p><p>Detailed performance data including separate contributions from different categories of operations to the elapsed wall time per step are reported for five grid resolutions, from 2048<sup>3</sup> on a single node to 32768<sup>3</sup> on 4096 or 8192 nodes out of 9408 on the system. Both 1D and 2D domain decompositions which divide a 3D periodic domain into slabs and pencils respectively are implemented. The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048<sup>3</sup> where communication is entirely intra-node, and for 32768<sup>3</sup>, where a challenge due to small message sizes does arise. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected by both small message size due to high node counts as well as dragonfly network topology features on the machine, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768<sup>3</sup> (and up to 8192 nodes — which corresponds to hardware rated at just under 1 exaflop/sec of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (<em>Summit</em>) at OLCF. 
New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.</p></div>","PeriodicalId":285,"journal":{"name":"Computer Physics Communications","volume":"306 ","pages":"Article 109364"},"PeriodicalIF":7.2000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Physics Communications","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S001046552400287X","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Fourier pseudo-spectral methods for nonlinear partial differential equations are of wide interest in many areas of advanced computational science, including direct numerical simulation of three-dimensional (3-D) turbulence governed by the Navier-Stokes equations in fluid dynamics. This paper presents a new capability for simulating turbulence at a record resolution of up to 35 trillion grid points on the world's first exascale computer, Frontier, comprising AMD MI250X GPUs with HPE's Slingshot interconnect and operated by the US Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). Key programming strategies designed to take maximum advantage of the machine architecture involve performing almost all computations on the GPU, which has the same memory capacity as the CPU; performing all-to-all communication among sets of parallel processes directly on the GPU; and targeting GPUs efficiently using OpenMP offloading for intensive number-crunching, including 1-D Fast Fourier Transforms (FFTs) performed using AMD ROCm library calls. With 99% of the computing power on Frontier residing on the GPU, leaving the CPU idle leads to a net performance gain by avoiding the overhead of data movement between host and device, except when needed for some I/O purposes. The memory footprint, including the size of communication buffers for MPI_ALLTOALL, is managed carefully to allow the largest possible problem size for a given node count.
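The interoperability pattern described above, with field data resident on the GPU under OpenMP offloading and 1-D FFTs dispatched to the ROCm FFT library, can be illustrated with a minimal sketch. The sketch below is not the GESTS production code: the transform length, batch count, variable names, and the placeholder pointwise kernel are illustrative assumptions, and C++ is used here only to show the pattern of mapping a buffer once, computing on it inside OpenMP target regions, and handing its device address to hipFFT for batched 1-D transforms.

// Minimal sketch (not the GESTS source): keep one pencil of data resident on the
// GPU, run an OpenMP-offloaded pointwise kernel on it, then perform batched 1-D
// complex-to-complex FFTs along the contiguous dimension with hipFFT (ROCm).
// Sizes are illustrative; error checking is omitted for brevity.
#include <hipfft/hipfft.h>
#include <hip/hip_runtime.h>
#include <complex>
#include <vector>

int main() {
  const int n     = 2048;   // length of each 1-D transform (illustrative)
  const int batch = 4096;   // number of lines handled by this rank (illustrative)
  const size_t nelem = static_cast<size_t>(n) * batch;

  std::vector<std::complex<double>> u(nelem);
  // ... fill u with one pencil of the field in physical space ...
  auto *uptr = reinterpret_cast<hipfftDoubleComplex *>(u.data());

  hipfftHandle plan;
  hipfftPlan1d(&plan, n, HIPFFT_Z2Z, batch);   // batched 1-D transforms

  // Map the buffer once; it stays on the device for the whole step.
  #pragma omp target data map(tofrom: uptr[0:nelem])
  {
    // OpenMP-offloaded "number crunching" on the resident data
    // (placeholder for forming part of the nonlinear term, dealiasing, etc.).
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < nelem; ++i) {
      uptr[i].x *= 2.0;
      uptr[i].y *= 2.0;
    }

    // Expose the device address of the mapped buffer to the ROCm FFT library.
    #pragma omp target data use_device_ptr(uptr)
    {
      hipfftExecZ2Z(plan, uptr, uptr, HIPFFT_FORWARD);  // in-place 1-D FFTs
      hipDeviceSynchronize();  // make sure the FFT finishes before any copy-back
    }
  } // data returns to the host only here (in practice, only when needed for I/O)

  hipfftDestroy(plan);
  return 0;
}

In a full pseudo-spectral step this pattern would be repeated for each transform direction, with transposes (the all-to-alls discussed below) interleaved between the 1-D FFT stages; the sketch shows only one stage.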

Detailed performance data, including separate contributions from different categories of operations to the elapsed wall time per step, are reported for five grid resolutions, from 2048³ on a single node to 32768³ on 4096 or 8192 nodes out of the 9408 on the system. Both 1D and 2D domain decompositions, which divide the 3D periodic domain into slabs and pencils respectively, are implemented. The present code suite (labeled by the acronym GESTS, GPUs for Extreme Scale Turbulence Simulations) achieves a figure of merit (in grid points per second) exceeding the goals set in the Center for Accelerated Application Readiness (CAAR) program for Frontier. The performance attained is highly favorable in both weak scaling and strong scaling, with notable departures only for 2048³, where communication is entirely intra-node, and for 32768³, where a challenge due to small message sizes does arise. Communication performance is addressed further using a lightweight test code that performs all-to-all communication in a manner matching the full turbulence simulation code. Performance at large problem sizes is affected both by the small message sizes that arise at high node counts and by dragonfly network topology features on the machine, but is consistent with official expectations of sustained performance on Frontier. Overall, although not perfect, the scalability achieved at the extreme problem size of 32768³ (and up to 8192 nodes, which corresponds to hardware rated at just under 1 exaflop/sec of theoretical peak computational performance) is arguably better than the scalability observed using prior state-of-the-art algorithms on Frontier's predecessor machine (Summit) at OLCF. New science results for the study of intermittency in turbulence enabled by this code and its extensions are to be reported separately in the near future.
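As a companion to the communication discussion above, the following is a minimal sketch, not the authors' lightweight test code, of the basic measurement it describes: timing MPI_ALLTOALL with send and receive buffers that live in GPU memory, so that messages move directly between devices over the interconnect. It assumes a GPU-aware MPI (as with Cray MPICH on Frontier, where GPU support typically has to be enabled through the environment, e.g. MPICH_GPU_SUPPORT_ENABLED=1) and, for simplicity, uses MPI_COMM_WORLD rather than the row/column sub-communicators of a 2D pencil decomposition; the message size is a command-line parameter so that the small-message regime reached at high node counts can be probed.

// Minimal sketch (not the authors' test code): time an all-to-all whose buffers
// reside in GPU memory, relying on a GPU-aware MPI implementation.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Doubles exchanged with each partner rank (illustrative default).
  const int  count = (argc > 1) ? std::atoi(argv[1]) : 1 << 16;
  const long total = static_cast<long>(count) * nranks;
  const int  dev   = omp_get_default_device();

  // Allocate the communication buffers directly in GPU memory.
  double *sendbuf = static_cast<double*>(omp_target_alloc(total * sizeof(double), dev));
  double *recvbuf = static_cast<double*>(omp_target_alloc(total * sizeof(double), dev));
  if (!sendbuf || !recvbuf) MPI_Abort(MPI_COMM_WORLD, 1);

  // Fill the send buffer on the device; no host staging is involved.
  #pragma omp target teams distribute parallel for is_device_ptr(sendbuf)
  for (long i = 0; i < total; ++i) sendbuf[i] = static_cast<double>(rank);

  const int ntrials = 10;
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int t = 0; t < ntrials; ++t) {
    // A GPU-aware MPI accepts the device pointers directly.
    MPI_Alltoall(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  double dt = (MPI_Wtime() - t0) / ntrials;

  if (rank == 0)
    printf("ranks=%d  message=%ld bytes  avg alltoall time=%.3e s\n",
           nranks, count * (long)sizeof(double), dt);

  omp_target_free(sendbuf, dev);
  omp_target_free(recvbuf, dev);
  MPI_Finalize();
  return 0;
}

Shrinking the per-pair message size while increasing the number of ranks qualitatively reproduces the trend noted above: at the largest node counts the messages become small, and the achieved all-to-all bandwidth then depends on network characteristics such as the dragonfly topology.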

Source journal

Computer Physics Communications (Physics / Computer Science: Interdisciplinary Applications)
CiteScore: 12.10
Self-citation rate: 3.20%
Number of articles: 287
Review time: 5.3 months

Journal description: The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper. Computer Programs in Physics (CPiP): these papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged. Computational Physics Papers (CP): these are research papers in, but not limited to, the following themes across computational physics and related disciplines: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special-purpose computers on computing in the physical sciences, and software topics related to, and of importance in, the physical sciences may be considered.