Fused GEMMs towards an efficient GPU implementation of the ADER‐DG method in SeisSol

Ravil Dorozhinskii, G. B. Gadeschi, Michael Bader
{"title":"Fused GEMMs towards an efficient GPU implementation of the ADER‐DG method in SeisSol","authors":"Ravil Dorozhinskii, G. B. Gadeschi, Michael Bader","doi":"10.1002/cpe.8037","DOIUrl":null,"url":null,"abstract":"This study shows how GPU performance of the ADER discontinuous Galerkin method in SeisSol (an earthquake simulation software) can be further improved while preserving its original design that ensures high CPU performance. We introduce a new code generator (“ChainForge”) that fuses subsequent batched matrix multiplications (“GEMMs”) into a single GPU kernel, holding intermediate results in shared memory as long as necessary. The generator operates as an external module linked against SeisSol's domain specific language YATeTo and, as a result, the original SeisSol source code remains mainly unchanged. In this paper, we discuss several challenges related to automatic fusion of GPU kernels and provide solutions to them. By and large, we gain 60% in performance of SeisSol's wave propagation solver using Fused‐GEMMs compared to the original GPU implementation. We demonstrated this on benchmarks as well as on a real production scenario simulating the Northridge 1994 earthquake.","PeriodicalId":10584,"journal":{"name":"Concurrency and Computation: Practice and Experience","volume":"39 9","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Concurrency and Computation: Practice and Experience","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/cpe.8037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This study shows how GPU performance of the ADER discontinuous Galerkin method in SeisSol (an earthquake simulation software) can be further improved while preserving its original design that ensures high CPU performance. We introduce a new code generator (“ChainForge”) that fuses subsequent batched matrix multiplications (“GEMMs”) into a single GPU kernel, holding intermediate results in shared memory as long as necessary. The generator operates as an external module linked against SeisSol's domain specific language YATeTo and, as a result, the original SeisSol source code remains mainly unchanged. In this paper, we discuss several challenges related to automatic fusion of GPU kernels and provide solutions to them. By and large, we gain 60% in performance of SeisSol's wave propagation solver using Fused‐GEMMs compared to the original GPU implementation. We demonstrated this on benchmarks as well as on a real production scenario simulating the Northridge 1994 earthquake.
在 SeisSol 中融合 GEMMs 以实现 ADER-DG 方法的高效 GPU 实施
本研究展示了如何进一步提高 SeisSol(一款地震模拟软件)中 ADER 非连续伽勒金方法的 GPU 性能,同时保留其确保 CPU 高性能的原始设计。我们引入了一种新的代码生成器("ChainForge"),可将后续的分批矩阵乘法("GEMM")融合到一个 GPU 内核中,必要时将中间结果保留在共享内存中。生成器作为外部模块与 SeisSol 的特定领域语言 YATeTo 相链接,因此,SeisSol 的原始源代码基本保持不变。在本文中,我们讨论了与 GPU 内核自动融合相关的几个挑战,并提供了解决方案。总的来说,与最初的GPU实现相比,使用Fused-GEMMs的SeisSol波传播求解器的性能提高了60%。我们在基准测试以及模拟 1994 年北岭地震的实际生产场景中证明了这一点。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信