Fused GEMMs towards an efficient GPU implementation of the ADER-DG method in SeisSol

Concurrency and Computation: Practice and Experience | Pub Date: 2024-02-13 | DOI: 10.1002/cpe.8037

Ravil Dorozhinskii, G. B. Gadeschi, Michael Bader

Citations: 0

Abstract

This study shows how the GPU performance of the ADER discontinuous Galerkin method in SeisSol (an earthquake simulation software) can be further improved while preserving its original design, which ensures high CPU performance. We introduce a new code generator (“ChainForge”) that fuses subsequent batched matrix multiplications (“GEMMs”) into a single GPU kernel, holding intermediate results in shared memory as long as necessary. The generator operates as an external module linked against SeisSol's domain-specific language YATeTo and, as a result, the original SeisSol source code remains mainly unchanged. In this paper, we discuss several challenges related to automatic fusion of GPU kernels and provide solutions to them. Overall, we gain 60% in performance of SeisSol's wave propagation solver using Fused‐GEMMs compared to the original GPU implementation. We demonstrate this on benchmarks as well as on a real production scenario simulating the Northridge 1994 earthquake.
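The idea of fusing a chain of batched GEMMs can be illustrated with a minimal sketch. It is a pure-Python stand-in, not ChainForge output or SeisSol code. The function names (`matmul`, `unfused`, `fused`), the two-step chain D = (A·B)·C, and the tiny matrices are all assumptions made for illustration. The local variable `scratch` plays the role that shared memory plays in a fused GPU kernel.

```python
# Illustrative sketch only: all names and sizes are hypothetical, and plain
# Python lists stand in for GPU memory. One chain D[e] = (A[e] @ B) @ C is
# evaluated per batch element e.

def matmul(x, y):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def unfused(batch_a, b, c):
    """Two separate batched passes. The intermediates for the whole batch are
    stored in between, the analogue of a round trip through global memory."""
    tmp = [matmul(a, b) for a in batch_a]   # pass 1: store every intermediate
    return [matmul(t, c) for t in tmp]      # pass 2: read them all back

def fused(batch_a, b, c):
    """One pass per element. The intermediate lives only in a local scratch
    variable, the analogue of keeping it in shared memory inside one kernel."""
    out = []
    for a in batch_a:
        scratch = matmul(a, b)              # never stored for the whole batch
        out.append(matmul(scratch, c))
    return out

batch_a = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
b = [[1.0, 0.0], [0.0, 2.0]]
c = [[1.0, 1.0], [0.0, 1.0]]

# Both variants compute the same result; only the lifetime of the
# intermediate differs.
assert fused(batch_a, b, c) == unfused(batch_a, b, c)
```

Both variants perform the same arithmetic. The difference is that the unfused version keeps a batch-sized temporary alive between the two passes, whereas the fused version needs only one element-sized scratch buffer at a time.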
