目标性能和用户友好性:gpu加速有限元计算与fenic中的自动代码生成

IF 2.1 4区计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing Pub Date : 2023-10-06 DOI:10.1016/j.parco.2023.103051

James D. Trotter , Johannes Langguth , Xing Cai

{"title":"目标性能和用户友好性:gpu加速有限元计算与fenic中的自动代码生成","authors":"James D. Trotter , Johannes Langguth , Xing Cai","doi":"10.1016/j.parco.2023.103051","DOIUrl":null,"url":null,"abstract":"<div><p>This paper studies the use of automated code generation to provide user-friendly GPU acceleration for solving partial differential equations (PDEs) with finite element methods. By extending the FEniCS framework and its automated compiler, we have achieved that a high-level description of finite element computations written in the Unified Form Language is auto-translated to parallelised CUDA C++ code. The auto-generated code provides GPU offloading for the finite element assembly of linear equation systems which are then solved by a GPU-supported linear algebra backend.</p><p>Specifically, we explore several auto-generated optimisations of the resulting CUDA C++ code. Numerical experiments show that GPU-based linear system assembly for a typical PDE with first-order elements can benefit from using a lookup table to avoid repeatedly carrying out numerous binary searches, and that further performance gains can be obtained by assembling a sparse matrix row by row. More importantly, the extended FEniCS compiler is able to seamlessly couple the assembly and solution phases for GPU acceleration, so that all unnecessary CPU–GPU data transfers are eliminated. Detailed experiments are used to quantify the negative impact of these data transfers, which can entirely destroy the potential of GPU acceleration if the assembly and solution phases are offloaded to GPU separately. Finally, a complete, auto-generated GPU-based PDE solver for a nonlinear solid mechanics application is used to demonstrate a substantial speedup over running on dual-socket multi-core CPUs, including GPU acceleration of algebraic multigrid as the preconditioner.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"118 ","pages":"Article 103051"},"PeriodicalIF":2.1000,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Targeting performance and user-friendliness: GPU-accelerated finite element computation with automated code generation in FEniCS\",\"authors\":\"James D. Trotter , Johannes Langguth , Xing Cai\",\"doi\":\"10.1016/j.parco.2023.103051\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>This paper studies the use of automated code generation to provide user-friendly GPU acceleration for solving partial differential equations (PDEs) with finite element methods. By extending the FEniCS framework and its automated compiler, we have achieved that a high-level description of finite element computations written in the Unified Form Language is auto-translated to parallelised CUDA C++ code. The auto-generated code provides GPU offloading for the finite element assembly of linear equation systems which are then solved by a GPU-supported linear algebra backend.</p><p>Specifically, we explore several auto-generated optimisations of the resulting CUDA C++ code. Numerical experiments show that GPU-based linear system assembly for a typical PDE with first-order elements can benefit from using a lookup table to avoid repeatedly carrying out numerous binary searches, and that further performance gains can be obtained by assembling a sparse matrix row by row. More importantly, the extended FEniCS compiler is able to seamlessly couple the assembly and solution phases for GPU acceleration, so that all unnecessary CPU–GPU data transfers are eliminated. Detailed experiments are used to quantify the negative impact of these data transfers, which can entirely destroy the potential of GPU acceleration if the assembly and solution phases are offloaded to GPU separately. Finally, a complete, auto-generated GPU-based PDE solver for a nonlinear solid mechanics application is used to demonstrate a substantial speedup over running on dual-socket multi-core CPUs, including GPU acceleration of algebraic multigrid as the preconditioner.</p></div>\",\"PeriodicalId\":54642,\"journal\":{\"name\":\"Parallel Computing\",\"volume\":\"118 \",\"pages\":\"Article 103051\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2023-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Parallel Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167819123000571\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819123000571","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

本文研究了使用自动代码生成为用有限元方法求解偏微分方程(PDEs)提供用户友好的GPU加速。通过扩展FEniCS框架及其自动编译器，我们已经实现了用统一形式语言编写的有限元计算的高级描述自动翻译为并行CUDA c++代码。自动生成的代码为线性方程系统的有限元装配提供GPU卸载，然后由GPU支持的线性代数后端进行求解。具体来说，我们探索了CUDA c++代码的几个自动生成的优化。数值实验表明，对于典型的一阶元素PDE，基于gpu的线性系统装配可以通过使用查找表来避免重复进行大量的二进制搜索，并且通过逐行组装稀疏矩阵可以获得进一步的性能提升。更重要的是，扩展的FEniCS编译器能够无缝地耦合GPU加速的组装和解决方案阶段，从而消除了所有不必要的CPU-GPU数据传输。详细的实验用于量化这些数据传输的负面影响，如果将组装和解决阶段分别卸载到GPU，则可能完全破坏GPU加速的潜力。最后，使用一个完整的、自动生成的基于GPU的非线性固体力学PDE求解器来演示在双插槽多核cpu上运行时的显著加速，包括GPU对代数多网格的加速作为前置条件。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Targeting performance and user-friendliness: GPU-accelerated finite element computation with automated code generation in FEniCS

This paper studies the use of automated code generation to provide user-friendly GPU acceleration for solving partial differential equations (PDEs) with finite element methods. By extending the FEniCS framework and its automated compiler, we have achieved that a high-level description of finite element computations written in the Unified Form Language is auto-translated to parallelised CUDA C++ code. The auto-generated code provides GPU offloading for the finite element assembly of linear equation systems which are then solved by a GPU-supported linear algebra backend.

Specifically, we explore several auto-generated optimisations of the resulting CUDA C++ code. Numerical experiments show that GPU-based linear system assembly for a typical PDE with first-order elements can benefit from using a lookup table to avoid repeatedly carrying out numerous binary searches, and that further performance gains can be obtained by assembling a sparse matrix row by row. More importantly, the extended FEniCS compiler is able to seamlessly couple the assembly and solution phases for GPU acceleration, so that all unnecessary CPU–GPU data transfers are eliminated. Detailed experiments are used to quantify the negative impact of these data transfers, which can entirely destroy the potential of GPU acceleration if the assembly and solution phases are offloaded to GPU separately. Finally, a complete, auto-generated GPU-based PDE solver for a nonlinear solid mechanics application is used to demonstrate a substantial speedup over running on dual-socket multi-core CPUs, including GPU acceleration of algebraic multigrid as the preconditioner.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Parallel Computing 工程技术-计算机：理论方法

CiteScore

3.50

自引率

7.10%

发文量

审稿时长

4.5 months

期刊介绍： Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogenous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to: -System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing). -Enabling software including debuggers, performance tools, and system and numeric libraries. -General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems -Software engineering and productivity as it relates to parallel computing -Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism -Performance measurement results on state-of-the-art systems -Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures. -Parallel I/O systems both hardware and software -Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications