Performance Improvement of a Scalable High-Order Compressible Flow Solver on Unstructured Hexahedral Grids

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. Pub Date: 2020-01-15. DOI: 10.1145/3368474.3368480

Kazuma Tago, T. Haga, S. Tsutsumi, R. Takaki

Citation count: 0

Abstract

This paper describes LS-FLOW-HO, a high-order compressible flow solver based on the Flux Reconstruction (FR) method, and its performance optimization. The FR method achieves arbitrarily high-order accuracy on unstructured grids and is well suited to many-core architectures because the spatial discretization involves only local data sets (stencils). This study focuses on performance optimization on the PRIMEHPC FX100, a Fujitsu scalar supercomputer. First, the execution time of sample code that uses the BLAS library is compared with that of code that uses a sparse matrix multiplication that computes only the non-zero entries. For hexahedral elements, the sparse matrix multiplication is found to take less time than DGEMM when the degree of the interpolation polynomial is higher than 2. Using the sparse matrix multiplication, hot-spot tuning was then carried out on individual subroutines extracted from LS-FLOW-HO. Speedups were confirmed from three changes: restructuring the arrays at cell boundaries, reducing memory/cache access latency through sequential memory access, and increasing loop length through loop collapsing. Applying these tunings to LS-FLOW-HO reduced execution time by up to 40%, and the solver reached 10.23% of the theoretical peak FLOPS using 16 OpenMP threads on a single node. On Intel Haswell, execution time was reduced by about 49%, confirming that the proposed techniques are also effective on other processors. Finally, sustained strong-scaling performance for a real application to supersonic jets is demonstrated using 32 to 3200 nodes (1024 to 102400 cores).
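The comparison between DGEMM and a multiplication restricted to non-zero entries can be illustrated with a small sketch. It is not taken from LS-FLOW-HO. It assumes the element operator is a one-direction derivative on a tensor-product hexahedral element with p+1 points per direction, so that the full (p+1)^3 by (p+1)^3 matrix has only p+1 non-zeros per row. The function names, the equispaced nodes, and the index layout are all hypothetical choices made for illustration.

```python
def diff_matrix(nodes):
    """1D Lagrange differentiation matrix, D[i][j] = l_j'(x_i)."""
    n = len(nodes)
    w = [1.0] * n  # barycentric weights
    for j in range(n):
        for k in range(n):
            if k != j:
                w[j] /= nodes[j] - nodes[k]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                D[i][j] = (w[j] / w[i]) / (nodes[i] - nodes[j])
        D[i][i] = -sum(D[i])  # each row sums to zero (diagonal is still 0 here)
    return D


def dense_apply(D, u):
    """Build the full N x N operator (N = n^3) and multiply, as a dense GEMM would."""
    n = len(D)
    N = n ** 3
    A = [[0.0] * N for _ in range(N)]
    for k in range(n):
        for j in range(n):
            base = n * (j + n * k)  # first index of this line of points
            for i in range(n):
                for a in range(n):
                    A[base + i][base + a] = D[i][a]
    y, macs = [0.0] * N, 0
    for r in range(N):
        s = 0.0
        for c in range(N):
            s += A[r][c] * u[c]  # mostly multiplications by zero
            macs += 1
        y[r] = s
    return y, macs


def sparse_apply(D, u):
    """Same operator, touching only the n non-zero entries of each row."""
    n = len(D)
    y, macs = [0.0] * n ** 3, 0
    for k in range(n):
        for j in range(n):
            base = n * (j + n * k)
            for i in range(n):
                s = 0.0
                for a in range(n):
                    s += D[i][a] * u[base + a]
                    macs += 1
                y[base + i] = s
    return y, macs


if __name__ == "__main__":
    p = 3                                        # polynomial degree (assumed)
    nodes = [-1.0 + 2.0 * m / p for m in range(p + 1)]
    n = len(nodes)
    u = [nodes[i] ** 2 for k in range(n) for j in range(n) for i in range(n)]
    _, dense_macs = dense_apply(diff_matrix(nodes), u)
    _, sparse_macs = sparse_apply(diff_matrix(nodes), u)
    print(dense_macs, sparse_macs)               # 4096 vs 256 multiply-adds
```

Under these assumptions the dense product performs N^2 multiply-adds per element and the sparse one N(p+1), a ratio of (p+1)^2. Operation counts alone do not fix the crossover point, since a tuned DGEMM runs much closer to peak than a hand-written loop. The threshold of degree higher than 2 is the measured result reported in the abstract, not something this sketch derives.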
