Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde
Published in arXiv - CS - Performance, 2024-08-13. DOI: arxiv-2408.06880
Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries
We implement and analyse a sparse / indirect-addressing data structure for
the Lattice Boltzmann Method to support efficient compute kernels for fluid
dynamics problems with a large number of non-fluid nodes in the domain, such as
in porous media flows. The data structure is integrated into a code generation
pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils
and collision operators, and to generate efficient kernel code for CPUs as
well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels
with an in-place streaming pattern to reduce memory accesses and memory
consumption, and we implement a communication-hiding technique to demonstrate
scalability. We present single-GPU performance results with up to 99% of
maximal bandwidth utilization. We integrate the optimized generated kernels into
the high-performance framework waLBerla and achieve a scaling efficiency of at
least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on
modern HPC systems. Further, we set up three applications to test the
sparse data structure on realistic demonstrator problems. We show performance
results for flow through porous media, free flow over a particle bed, and blood
flow in a coronary artery. For these applications, the sparse /
indirect-addressing data structure achieves a speed-up of up to a factor of 2
and reduces memory consumption by up to 75% compared to the direct-addressing
data structure.
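
To illustrate the idea of indirect addressing, the following is a minimal sketch and not the generated waLBerla kernels. It assumes a 2D D2Q9 stencil, a pull-style streaming step with two arrays (not the in-place pattern mentioned above), and a placeholder treatment of missing neighbours. All names (`build_index_list`, `stream_pull`, `memory_bytes`, `NO_NEIGHBOR`) and the byte sizes are hypothetical choices made for this example.

```python
# Hypothetical sketch of a sparse / indirect-addressing lattice layout.
# Names and layout are illustrative assumptions, not taken from waLBerla.

# D2Q9 lattice directions (dx, dy), including the rest direction.
D2Q9 = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
        (1, 1), (-1, -1), (1, -1), (-1, 1)]

NO_NEIGHBOR = -1  # marks a neighbour that is solid or outside the domain


def build_index_list(fluid_mask):
    """Give every fluid cell a compact index and record its pull neighbours.

    fluid_mask is a list of rows; True marks a fluid cell.
    Returns (cells, neighbors): cells[i] is the (x, y) position of compact
    cell i, and neighbors[i][q] is the compact index of the cell at
    (x - dx_q, y - dy_q), or NO_NEIGHBOR if that cell is not fluid.
    """
    compact = {}
    cells = []
    for y, row in enumerate(fluid_mask):
        for x, is_fluid in enumerate(row):
            if is_fluid:
                compact[(x, y)] = len(cells)
                cells.append((x, y))
    neighbors = [
        [compact.get((x - dx, y - dy), NO_NEIGHBOR) for (dx, dy) in D2Q9]
        for (x, y) in cells
    ]
    return cells, neighbors


def stream_pull(pdfs, neighbors):
    """Pull streaming over the compact list only (no non-fluid cells visited).

    Where a neighbour is missing, the local value is kept. This is only a
    placeholder, not a physical boundary condition.
    """
    return [
        [pdfs[j][q] if j != NO_NEIGHBOR else pdfs[i][q]
         for q, j in enumerate(row)]
        for i, row in enumerate(neighbors)
    ]


def memory_bytes(n_total, n_fluid, q=9, value_size=8, index_size=4):
    """Rough footprint of one distribution array, under assumed sizes.

    Dense: q values for every cell. Sparse: q values plus q neighbour
    indices for fluid cells only.
    """
    dense = n_total * q * value_size
    sparse = n_fluid * q * (value_size + index_size)
    return dense, sparse


# Usage: a 3x3 domain whose middle row is a fluid channel.
mask = [[False, False, False],
        [True, True, True],
        [False, False, False]]
cells, neighbors = build_index_list(mask)
pdfs = [[10 * i + q for q in range(len(D2Q9))] for i in range(len(cells))]
streamed = stream_pull(pdfs, neighbors)
dense_bytes, sparse_bytes = memory_bytes(n_total=9, n_fluid=len(cells))
```

The sketch shows the trade-off behind the reported memory savings: the sparse layout skips non-fluid cells entirely but pays for an extra index list. With the assumed 8-byte values and 4-byte indices, it needs less memory than the dense layout whenever fewer than two thirds of the cells are fluid.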