Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries

arXiv - CS - Performance Pub Date : 2024-08-13 DOI:arxiv-2408.06880

Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde



Abstract


Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries

We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method (LBM) that supports efficient compute kernels for fluid dynamics problems with many non-fluid nodes in the domain, such as porous media flows. The data structure is integrated into a code generation pipeline, which enables sparse LBM kernels with a variety of stencils and collision operators and generates efficient kernel code for CPUs as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to reduce memory accesses and memory consumption, and we implement a communication hiding technique to demonstrate scalability. On a single GPU, the kernels reach up to 99% of maximal bandwidth utilization. We integrate the optimized generated kernels into the high-performance framework waLBerla and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three applications to test the sparse data structure on realistic demonstrator problems: flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. For these applications, the sparse / indirect-addressing data structure achieves a speed-up of up to a factor of 2 and reduces memory consumption by up to 75% compared to the direct-addressing data structure.
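To make the idea of indirect addressing concrete, the following is a minimal NumPy sketch, not the paper's generated kernels and not waLBerla's API. It stores populations only for fluid cells and precomputes an index list that tells each cell where to pull its populations from. The D2Q9 stencil, the periodic domain, the half-way bounce-back treatment of solid neighbours, and the names `build_sparse_lists` and `stream` are all assumptions made for illustration. For simplicity it streams into a second array rather than using the in-place streaming pattern mentioned in the abstract.

```python
import numpy as np

# D2Q9 lattice directions (assumed ordering: rest, axis-aligned, diagonals)
DIRS = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
                 (1, 1), (-1, 1), (-1, -1), (1, -1)])
# OPPOSITE[q] is the index of the direction pointing opposite to DIRS[q]
OPPOSITE = np.array([0, 3, 4, 1, 2, 7, 8, 5, 6])


def build_sparse_lists(fluid_mask):
    """Build the indirect-addressing lists for a 2D boolean fluid mask.

    Returns
    -------
    coords   : (n_fluid, 2) grid coordinates of the fluid cells
    pull_idx : (Q, n_fluid) flat indices into a (Q, n_fluid) population
               array; entry [q, i] is where cell i pulls direction q from
    """
    nx, ny = fluid_mask.shape
    coords = np.argwhere(fluid_mask)          # row-major order
    n_fluid = len(coords)

    # Dense lookup grid position -> compact cell index (-1 marks non-fluid).
    # It is only needed while building the lists, not during time stepping.
    cell_id = -np.ones((nx, ny), dtype=np.int64)
    cell_id[fluid_mask] = np.arange(n_fluid)  # same row-major order

    own = np.arange(n_fluid)
    pull_idx = np.empty((len(DIRS), n_fluid), dtype=np.int64)
    for q, (dx, dy) in enumerate(DIRS):
        # Upstream neighbour of every fluid cell, with periodic wrap-around
        src = (coords - (dx, dy)) % (nx, ny)
        src_id = cell_id[src[:, 0], src[:, 1]]
        # Fluid neighbour: pull direction q from that neighbour.
        # Solid neighbour: half-way bounce-back, i.e. pull the opposite
        # direction from the cell itself.
        pull_idx[q] = np.where(src_id >= 0,
                               q * n_fluid + src_id,
                               OPPOSITE[q] * n_fluid + own)
    return coords, pull_idx


def stream(pdfs, pull_idx):
    """One pull-streaming step over the compact (Q, n_fluid) array."""
    return pdfs.ravel()[pull_idx]


if __name__ == "__main__":
    mask = np.ones((4, 4), dtype=bool)
    mask[1, 1] = mask[2, 3] = False           # two solid cells
    coords, pull_idx = build_sparse_lists(mask)
    pdfs = np.random.default_rng(0).random((len(DIRS), len(coords)))
    streamed = stream(pdfs, pull_idx)
    print(streamed.shape, np.isclose(streamed.sum(), pdfs.sum()))
```

In this sketch the population array holds only `Q * n_fluid` values instead of `Q * nx * ny`, at the cost of one extra index per population. That trade-off is why this kind of layout tends to pay off when a large fraction of the domain is non-fluid; the break-even point depends on the index width and the streaming scheme and is not derived here.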
