GPU Optimization of ILU-Preconditioned GCR for Solving 19-Diagonal Linear Equations in GRAPES

IF 1.5 | Zone 4 (Computer Science) | JCR Q3: COMPUTER SCIENCE, SOFTWARE ENGINEERING

Concurrency and Computation-Practice & Experience Pub Date : 2025-08-12 DOI:10.1002/cpe.70217

Feng Zhang, Jinrong Jiang, Junlin Wei, Xuebin Chi, Huadong Xiao, Qingu Jiang, Xiangjun Wu, Sa Xiao, Lian Zhao, Youyun Li



Abstract

This article investigates GPU optimization for solving the 19-diagonal asymmetric linear systems that arise in the numerical weather prediction model GRAPES. Such systems are commonly encountered when solving partial differential equations on 3D structured grids using finite difference methods. The five-diagonal patch-ILU preconditioner, which retains the essential connection coefficient, is well suited to GPU platforms: it accelerates convergence of the linear iteration by approximately tenfold and offers a degree of parallelism. However, the forward-backward substitution used to solve the lower and upper triangular systems generated by this preconditioner remains a major performance bottleneck on the GPU because of serial data dependencies. We designed the Shuffle Thomas algorithm, which uses the GPU's shuffle functionality to achieve coalesced memory access and data reuse, significantly improving memory throughput. Further exploiting parallelism along the diagonal direction of the substitution process, we designed the Divided Shuffle Thomas algorithm, which doubles the instruction-level parallelism. This approach achieved an $11.42\times$ to $15.11\times$ speedup over cuSPARSE-gpsv. Our GCR solver on the Hygon DCU platform delivered a 5.41 to 8.47 times performance improvement over the CPU implementation on the same number of computing nodes, achieving higher computational efficiency with fewer processes. These results have the potential to significantly improve the computational efficiency of high-resolution numerical weather forecasting.
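The serial data dependency named in the abstract can be illustrated with the classic tridiagonal Thomas algorithm. The sketch below is a textbook CPU version in plain Python, not the authors' Shuffle Thomas or Divided Shuffle Thomas kernels, and it handles three diagonals rather than the five of the patch-ILU factors. The function names `thomas_solve` and `solve_batch` and the array layout are assumptions made for illustration only.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the classic Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused) and upper[i] multiplies
    x[i+1] (upper[-1] is unused). There is no pivoting, so the matrix is
    assumed to be diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: step i reads c[i-1] and d[i-1], so it is serial in i.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Backward sweep: step i reads x[i+1], so it is serial in i as well.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve many independent systems, each given as (lower, diag, upper, rhs).

    The systems do not depend on each other, so this outer loop is the part
    that can be spread across threads; the sweeps inside each solve cannot.
    """
    return [thomas_solve(*s) for s in systems]
```

Each sweep carries a dependency from one row to the next, so the parallelism available to a GPU comes from solving many independent lines at once, which is what the outer loop in `solve_batch` stands in for.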


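Source journal: Concurrency and Computation-Practice & Experience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 5.00

Self-citation rate: 10.00%

Articles published: 664

Review time: 9.6 months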

Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.