Node-Level optimization of a 3D Block-Based Multiresolution Compressible Flow Solver with Emphasis on Performance Portability

2019 International Conference on High Performance Computing & Simulation (HPCS) Pub Date : 2019-07-01 DOI:10.1109/HPCS48598.2019.9188088

N. Hoppe, S. Adami, N. Adams, I. Pasichnyk, M. Allalen

{"title":"Node-Level optimization of a 3D Block-Based Multiresolution Compressible Flow Solver with Emphasis on Performance Portability","authors":"N. Hoppe, S. Adami, N. Adams, I. Pasichnyk, M. Allalen","doi":"10.1109/HPCS48598.2019.9188088","DOIUrl":null,"url":null,"abstract":"Despite the enormous increase in computational power in the last decades, the numerical study of complex flows remains challenging. State-of-the-art techniques to simulate hyperbolic flows with discontinuities rely on computationally demanding nonlinear schemes, such as Riemann solvers with weighted essentially non-oscillatory (WENO) stencils and characteristic decompositioning. To handle this complexity the numerical load can be reduced via a multiresolution (MR) algorithm with local time stepping (LTS) running on modern high-performance computing (HPC) systems. Eventually, the main challenge lies in an efficitent utilization of the available HPC hardware. In this work, we evaluate the performance improvement for a Message Passing Interface (MPI)-parallelized MR solver using single instruction multiple data (SIMD) optimizations. We present straight-forward code modifications that allow for auto-vectorization by the compiler, while maintaining the modularity of the code at comparable performance. We demonstrate performance improvements for representative Euler flow examples on both Intel Haswell and Intel Knights Landing Xeon Phi microarchitecture (KNL) clusters. The tests show single-core speedups of 1.7 (1.9) and average speedups of 1.4 (1.6) for the Haswell (KNL).","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS48598.2019.9188088","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

Abstract

Despite the enormous increase in computational power in the last decades, the numerical study of complex flows remains challenging. State-of-the-art techniques to simulate hyperbolic flows with discontinuities rely on computationally demanding nonlinear schemes, such as Riemann solvers with weighted essentially non-oscillatory (WENO) stencils and characteristic decompositioning. To handle this complexity the numerical load can be reduced via a multiresolution (MR) algorithm with local time stepping (LTS) running on modern high-performance computing (HPC) systems. Eventually, the main challenge lies in an efficitent utilization of the available HPC hardware. In this work, we evaluate the performance improvement for a Message Passing Interface (MPI)-parallelized MR solver using single instruction multiple data (SIMD) optimizations. We present straight-forward code modifications that allow for auto-vectorization by the compiler, while maintaining the modularity of the code at comparable performance. We demonstrate performance improvements for representative Euler flow examples on both Intel Haswell and Intel Knights Landing Xeon Phi microarchitecture (KNL) clusters. The tests show single-core speedups of 1.7 (1.9) and average speedups of 1.4 (1.6) for the Haswell (KNL).

查看原文本刊更多论文

基于3D块的多分辨率可压缩流求解器的节点级优化，重点是性能可移植性

尽管近几十年来计算能力有了巨大的提高，但复杂流动的数值研究仍然具有挑战性。目前最先进的模拟不连续双曲流的技术依赖于计算要求很高的非线性格式，例如带有加权本质非振荡(WENO)模板的黎曼解算器和特征分解。为了处理这种复杂性，可以通过在现代高性能计算(HPC)系统上运行具有本地时间步进(LTS)的多分辨率(MR)算法来减少数值负载。最后，主要的挑战在于有效地利用可用的HPC硬件。在这项工作中，我们评估了使用单指令多数据(SIMD)优化的消息传递接口(MPI)并行MR求解器的性能改进。我们提供了直接的代码修改，允许编译器自动向量化，同时在相当的性能下保持代码的模块化。我们在英特尔Haswell和英特尔Knights Landing Xeon Phi微架构(KNL)集群上演示了具有代表性的欧拉流示例的性能改进。测试显示Haswell (KNL)的单核加速为1.7(1.9)，平均加速为1.4(1.6)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2019 International Conference on High Performance Computing & Simulation (HPCS)

自引率

0.00%

发文量