GPU Optimization of ILU-Preconditioned GCR for Solving 19-Diagonal Linear Equations in GRAPES
Feng Zhang, Jinrong Jiang, Junlin Wei, Xuebin Chi, Huadong Xiao, Qingu Jiang, Xiangjun Wu, Sa Xiao, Lian Zhao, Youyun Li
Concurrency and Computation: Practice and Experience, vol. 37, no. 21-22. Published 2025-08-12. DOI: 10.1002/cpe.70217
https://onlinelibrary.wiley.com/doi/10.1002/cpe.70217
Citations: 0
Abstract
This article investigates GPU optimization for solving the 19-diagonal asymmetric linear systems that arise in the numerical weather prediction model GRAPES. Such systems are commonly encountered when solving partial differential equations on 3D structured grids using finite difference methods. The five-diagonal patch-ILU preconditioner, which retains the essential connection coefficient, is well suited to GPU platforms: it accelerates the convergence of the linear iteration by approximately tenfold and offers a degree of parallelism. However, the forward-backward substitution process, used to solve the lower and upper triangular systems generated by the five-diagonal patch-ILU preconditioner, remains a major performance bottleneck on the GPU because of serial data dependencies. We designed the Shuffle Thomas algorithm, which uses the GPU's shuffle functionality to reuse data and achieve efficient memory coalescing, significantly enhancing memory throughput. Further exploiting the parallelism along the diagonal direction in the substitution process, we designed the Divided Shuffle Thomas algorithm, which doubles the instruction-level parallelism. This approach achieved an 11.42× to 15.11× speedup compared to cuSPARSE-gpsv. Our GCR solver on the Hygon DCU platform demonstrated a 5.41× to 8.47× performance improvement over the CPU implementation with the same number of computing nodes, achieving higher computational efficiency with fewer processes. This has the potential to significantly enhance the computational efficiency of high-resolution numerical weather forecasting.
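The interplay between the GCR iteration and the forward-backward substitution described in the abstract can be illustrated with a small CPU sketch. It is not the authors' implementation. It assumes a toy five-band matrix instead of the 19-diagonal GRAPES system, and a plain tridiagonal Thomas solve as a stand-in for the five-diagonal patch-ILU preconditioner. It does not model the Shuffle Thomas or Divided Shuffle Thomas kernels. All names (`thomas_solve`, `banded_matvec`, `gcr`, `bands`) are hypothetical and chosen only for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by a forward sweep and a backward sweep.

    lower[i] multiplies x[i-1] (lower[0] is unused) and upper[i] multiplies
    x[i+1] (upper[-1] is unused). Every step reads the result of the previous
    step, which is the serial data dependency mentioned in the abstract.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward sweep
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # backward sweep
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def banded_matvec(bands, x):
    """Return A @ x for constant-coefficient diagonals given as {offset: value}."""
    n = len(x)
    y = [0.0] * n
    for offset, value in bands.items():
        for i in range(max(0, -offset), min(n, n - offset)):
            y[i] += value * x[i + offset]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gcr(matvec, precond, b, tol=1e-10, max_iter=50):
    """Right-preconditioned GCR; returns (solution, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)  # residual for the zero initial guess
    b_norm = dot(b, b) ** 0.5
    directions = []  # triples (z_j, q_j, <q_j, q_j>) with q_j = A z_j
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = precond(r)  # preconditioned search direction
        q = matvec(z)
        for z_j, q_j, qq_j in directions:  # orthogonalize q against earlier q_j
            beta = dot(q, q_j) / qq_j
            q = [a - beta * s for a, s in zip(q, q_j)]
            z = [a - beta * s for a, s in zip(z, z_j)]
        qq = dot(q, q)
        alpha = dot(r, q) / qq  # minimizes the residual norm along q
        x = [a + alpha * s for a, s in zip(x, z)]
        r = [a - alpha * s for a, s in zip(r, q)]
        directions.append((z, q, qq))
    return x, max_iter


# Toy asymmetric, diagonally dominant five-band system (an assumption).
n = 16
bands = {0: 4.0, -1: -1.2, 1: -0.8, -4: -0.5, 4: -0.5}
lower = [0.0] + [bands[-1]] * (n - 1)
diag = [bands[0]] * n
upper = [bands[1]] * (n - 1) + [0.0]
x_true = [float(i % 5) - 2.0 for i in range(n)]
b = banded_matvec(bands, x_true)

x_pre, iters_pre = gcr(lambda v: banded_matvec(bands, v),
                       lambda v: thomas_solve(lower, diag, upper, v), b)
x_plain, iters_plain = gcr(lambda v: banded_matvec(bands, v),
                           lambda v: list(v), b, max_iter=200)
print("iterations with / without preconditioner:", iters_pre, iters_plain)
```

Each GCR iteration calls the preconditioner once, so the cost of the two sweeps in `thomas_solve` is paid on every iteration. On a GPU, the usual source of parallelism is that many such independent line solves can run side by side, while each individual sweep stays sequential.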
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.