Performance of a high-order MPI-Kokkos accelerated fluid solver

IF 3.4; Zone 2 (Physics and Astrophysics); Q1 (Computer Science, Interdisciplinary Applications)

Computer Physics Communications; Pub Date: 2025-09-25; DOI: 10.1016/j.cpc.2025.109873

Filipp Sporykhin, Holger Homann

Volume 318, Article 109873. Source: https://www.sciencedirect.com/science/article/pii/S0010465525003753

Citations: 0

Abstract



This work discusses the performance of a modern numerical scheme for fluid dynamical problems on modern high-performance computing (HPC) architectures. Our code implements a spatial nodal discontinuous Galerkin (NDG) scheme, which we test up to eighth order of convergence. For time integration it is coupled to a set of Runge-Kutta (RK) methods of order up to six. The code integrates the linear advection equations as well as the isothermal Euler equations in one, two, and three dimensions. To target modern hardware with many-core central processing units (CPUs) and accelerators such as graphics processing units (GPUs), we use the Kokkos library together with the Message Passing Interface (MPI) to run our single-source code on various NVIDIA and AMD GPU systems.
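As a rough illustration of the kind of problem described above, the following sketch advances the one-dimensional linear advection equation u_t + a u_x = 0 on a periodic grid with the classical fourth-order Runge-Kutta method. It is not the authors' implementation: a first-order upwind difference stands in for the nodal discontinuous Galerkin operator, plain Python stands in for Kokkos and MPI, and all names (`initial_profile`, `upwind_rhs`, `rk4_step`, `advect`) and parameter values are hypothetical.

```python
import math


def initial_profile(n_cells):
    """Gaussian bump centred at x = 0.5, sampled at cell centres of [0, 1)."""
    dx = 1.0 / n_cells
    return [math.exp(-100.0 * ((i + 0.5) * dx - 0.5) ** 2) for i in range(n_cells)]


def upwind_rhs(u, a, dx):
    """Spatial operator -a * du/dx with a first-order upwind difference.

    Assumes a > 0; the index i - 1 wraps around for i = 0, which makes
    the grid periodic.
    """
    return [-a * (u[i] - u[i - 1]) / dx for i in range(len(u))]


def rk4_step(u, dt, rhs):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = rhs(u)
    k2 = rhs([ui + 0.5 * dt * ki for ui, ki in zip(u, k1)])
    k3 = rhs([ui + 0.5 * dt * ki for ui, ki in zip(u, k2)])
    k4 = rhs([ui + dt * ki for ui, ki in zip(u, k3)])
    return [
        ui + dt / 6.0 * (a1 + 2.0 * a2 + 2.0 * a3 + a4)
        for ui, a1, a2, a3, a4 in zip(u, k1, k2, k3, k4)
    ]


def advect(n_cells=64, a=1.0, cfl=0.5, t_end=0.25):
    """Advect the initial profile up to t_end and return the final state."""
    dx = 1.0 / n_cells
    u = initial_profile(n_cells)
    n_steps = max(1, round(t_end / (cfl * dx / a)))
    dt = t_end / n_steps  # adjusted so the last step lands exactly on t_end
    for _ in range(n_steps):
        u = rk4_step(u, dt, lambda v: upwind_rhs(v, a, dx))
    return u
```

Calling `advect()` returns the profile after it has travelled a quarter of the periodic domain. The separation between a spatial operator (`upwind_rhs`) and a generic time stepper (`rk4_step`) mirrors the coupling of a spatial scheme to RK methods described in the abstract, but only in structure, not in accuracy.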

Using one- and two-dimensional simulations of simple test equations, we find that the higher the order, the faster the code: eighth-order simulations attain a given global error with much less computing time than third- or fourth-order simulations. The choice of RK scheme has a smaller impact on performance, and a classical fourth-order scheme generally seems to be a good choice.
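The idea that a higher-order method reaches a given error with less work can be illustrated on a much simpler problem than the one studied in the paper. The sketch below is an analogy only: it integrates the scalar test equation y' = -y with the first-order explicit Euler method and the classical fourth-order RK method, estimates the observed order of convergence from two step sizes, and counts right-hand-side evaluations needed to reach a tolerance. The test equation, the tolerance, and all function names are assumptions chosen for illustration; they are not taken from the paper, whose statement concerns the spatial order of the NDG scheme.

```python
import math


def euler_step(f, y, h):
    """Explicit Euler step (first order, one evaluation of f)."""
    return y + h * f(y)


def rk4_scalar_step(f, y, h):
    """Classical Runge-Kutta step (fourth order, four evaluations of f)."""
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def global_error(step, n_steps):
    """Error at t = 1 for y' = -y, y(0) = 1, whose exact solution is exp(-t)."""
    h = 1.0 / n_steps
    y = 1.0
    for _ in range(n_steps):
        y = step(lambda v: -v, y, h)
    return abs(y - math.exp(-1.0))


def observed_order(step, n_steps=20):
    """Estimate the convergence order from the error ratio when h is halved."""
    return math.log2(global_error(step, n_steps) / global_error(step, 2 * n_steps))


def evaluations_to_reach(step, evals_per_step, tol):
    """Double the step count until the error drops below tol; return the
    number of right-hand-side evaluations used by that final run."""
    n = 1
    while global_error(step, n) > tol:
        n *= 2
    return n * evals_per_step
```

Comparing `evaluations_to_reach(euler_step, 1, 1e-4)` with `evaluations_to_reach(rk4_scalar_step, 4, 1e-4)` shows the fourth-order method needing far fewer evaluations, even though each of its steps costs four times as much.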

The code performs very well on all HPC GPUs considered. We observe very good scaling up to 64 AMD MI250X GPUs and show that the scaling properties are the same in two and three dimensions. The many-CPU performance is also very good, and perfect weak scaling is observed up to many hundreds of CPU cores using MPI. Small-grid simulations are faster on CPUs than on GPUs, while GPUs significantly outperform CPUs for simulations involving more than 10⁷ degrees of freedom (≈ 3100² grid points). Regarding the environmental impact of numerical simulations, we estimate that GPUs consume less energy than CPUs for large-grid simulations but more energy on small grids. Further, we observe a tendency that the more modern the GPU, the larger the grid needs to be to use it efficiently. This yields a rebound effect: larger simulations need longer computing times and, in turn, more energy, which is not compensated by the energy-efficiency gain of the newer GPUs.
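The correspondence between 10⁷ degrees of freedom and roughly 3100² grid points quoted above is plain arithmetic and can be checked in a few lines. The helper below assumes a uniform grid with the same number of points in every direction and one degree of freedom per point; `grid_side` is a hypothetical name, and the three-dimensional figure is only an extrapolation of the same arithmetic, not a number from the abstract.

```python
def grid_side(dof, dim):
    """Points per direction of a uniform grid holding about `dof` points in total."""
    return round(dof ** (1.0 / dim))


def relative_gap(side, dim, dof):
    """Relative difference between side**dim and the target count dof."""
    return abs(side ** dim - dof) / dof


side_2d = grid_side(10**7, 2)  # points per direction in two dimensions
side_3d = grid_side(10**7, 3)  # same count spread over three dimensions
gap_3100 = relative_gap(3100, 2, 10**7)  # how far 3100^2 is from 10^7
```

Since 3100² = 9,610,000, the rounded value quoted in the abstract lies within about four percent of 10⁷.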


Source journal

Computer Physics Communications (Physics; Computer Science, Interdisciplinary Applications)

CiteScore: 12.10
Self-citation rate: 3.20%
Articles published: 287
Review time: 5.3 months

About the journal: The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.

Computer Programs in Physics (CPiP): These papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.

Computational Physics Papers (CP): These are research papers in, but not limited to, the following themes across computational physics and related disciplines: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences, and on software topics related to, and of importance in, the physical sciences, may be considered.