分布式内存三对角线求解器，基于针对CPU和GPU架构优化的专用数据结构

IF 3.4 2区物理与天体物理 Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computer Physics Communications Pub Date : 2025-07-10 DOI:10.1016/j.cpc.2025.109747

Semih Akkurt , Sébastien Lemaire , Paul Bartholomew , Sylvain Laizet

{"title":"分布式内存三对角线求解器，基于针对CPU和GPU架构优化的专用数据结构","authors":"Semih Akkurt , Sébastien Lemaire , Paul Bartholomew , Sylvain Laizet","doi":"10.1016/j.cpc.2025.109747","DOIUrl":null,"url":null,"abstract":"<div><div>Various numerical methods used for solving partial differential equations (PDE) result in tridiagonal systems. Solving tridiagonal systems on distributed-memory environments is not straightforward, and often requires significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce the communications in distributed-memory environments based on an established strategy. The underlying data structure plays a crucial role for the performance of the algorithm on individual ranks. First, the data structure improves data localities and makes it possible to minimise data movements via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available memory bandwidth. Finally, the data layout supports vectorisation on CPUs and thread level parallelisation on GPUs for improved performance. In order to demonstrate the robustness of the algorithm, we implemented and benchmarked the algorithm on CPUs and GPUs. We investigated the single rank performance and compared against existing algorithms. Furthermore, we analysed the strong scaling of the implementation up to 384 NVIDIA H100 GPUs and up to 8192 AMD EPYC 7742 CPUs. Finally, we demonstrated a practical use case of the algorithm by using compact finite difference schemes to solve a 3D non-linear PDE. The results demonstrate that DistD2 algorithm can sustain around 66% of the theoretical peak bandwidth at scale on CPU and GPU based supercomputers.</div></div>","PeriodicalId":285,"journal":{"name":"Computer Physics Communications","volume":"315 ","pages":"Article 109747"},"PeriodicalIF":3.4000,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A distributed-memory tridiagonal solver based on a specialised data structure optimised for CPU and GPU architectures\",\"authors\":\"Semih Akkurt , Sébastien Lemaire , Paul Bartholomew , Sylvain Laizet\",\"doi\":\"10.1016/j.cpc.2025.109747\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Various numerical methods used for solving partial differential equations (PDE) result in tridiagonal systems. Solving tridiagonal systems on distributed-memory environments is not straightforward, and often requires significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce the communications in distributed-memory environments based on an established strategy. The underlying data structure plays a crucial role for the performance of the algorithm on individual ranks. First, the data structure improves data localities and makes it possible to minimise data movements via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available memory bandwidth. Finally, the data layout supports vectorisation on CPUs and thread level parallelisation on GPUs for improved performance. In order to demonstrate the robustness of the algorithm, we implemented and benchmarked the algorithm on CPUs and GPUs. We investigated the single rank performance and compared against existing algorithms. Furthermore, we analysed the strong scaling of the implementation up to 384 NVIDIA H100 GPUs and up to 8192 AMD EPYC 7742 CPUs. Finally, we demonstrated a practical use case of the algorithm by using compact finite difference schemes to solve a 3D non-linear PDE. The results demonstrate that DistD2 algorithm can sustain around 66% of the theoretical peak bandwidth at scale on CPU and GPU based supercomputers.</div></div>\",\"PeriodicalId\":285,\"journal\":{\"name\":\"Computer Physics Communications\",\"volume\":\"315 \",\"pages\":\"Article 109747\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Physics Communications\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0010465525002498\",\"RegionNum\":2,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Physics Communications","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0010465525002498","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

用于求解偏微分方程（PDE）的各种数值方法导致三对角线系统。在分布式内存环境中解决三对角线系统并不简单，通常需要大量的通信。在本文中，我们提出了一种新的基于特殊数据结构的分布式内存三对角求解算法DistD2-TDS。DistD2-TDS算法利用三对角线系统中的对角线优势，根据既定策略减少分布式内存环境中的通信。底层数据结构对算法在个体排名上的性能起着至关重要的作用。首先，数据结构改善了数据位置，并通过缓存阻塞和核融合策略使数据移动最小化成为可能。其次，数据连续性使连续的数据访问模式成为可能，从而有效地利用可用的内存带宽。最后，数据布局支持cpu上的矢量化和gpu上的线程级并行化，以提高性能。为了证明算法的鲁棒性，我们在cpu和gpu上实现并对算法进行了基准测试。我们研究了单秩性能，并与现有算法进行了比较。此外，我们分析了实现的强大扩展，最多可达384个NVIDIA H100 gpu和多达8192个AMD EPYC 7742 cpu。最后，我们通过使用紧凑有限差分格式来解决三维非线性PDE，展示了该算法的实际用例。结果表明，在基于CPU和GPU的超级计算机上，DistD2算法可以维持约66%的理论峰值带宽。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A distributed-memory tridiagonal solver based on a specialised data structure optimised for CPU and GPU architectures

Various numerical methods used for solving partial differential equations (PDE) result in tridiagonal systems. Solving tridiagonal systems on distributed-memory environments is not straightforward, and often requires significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. DistD2-TDS algorithm takes advantage of the diagonal dominance in tridiagonal systems to reduce the communications in distributed-memory environments based on an established strategy. The underlying data structure plays a crucial role for the performance of the algorithm on individual ranks. First, the data structure improves data localities and makes it possible to minimise data movements via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available memory bandwidth. Finally, the data layout supports vectorisation on CPUs and thread level parallelisation on GPUs for improved performance. In order to demonstrate the robustness of the algorithm, we implemented and benchmarked the algorithm on CPUs and GPUs. We investigated the single rank performance and compared against existing algorithms. Furthermore, we analysed the strong scaling of the implementation up to 384 NVIDIA H100 GPUs and up to 8192 AMD EPYC 7742 CPUs. Finally, we demonstrated a practical use case of the algorithm by using compact finite difference schemes to solve a 3D non-linear PDE. The results demonstrate that DistD2 algorithm can sustain around 66% of the theoretical peak bandwidth at scale on CPU and GPU based supercomputers.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Physics Communications 物理-计算机：跨学科应用

CiteScore

12.10

自引率

3.20%

发文量

287

审稿时长

5.3 months

期刊介绍： The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper. Computer Programs in Physics (CPiP) These papers describe significant computer programs to be archived in the CPC Program Library which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged. Computational Physics Papers (CP) These are research papers in, but are not limited to, the following themes across computational physics and related disciplines. mathematical and numerical methods and algorithms; computational models including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository.In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences and software topics related to, and of importance in, the physical sciences may be considered.