Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes

Proceedings of the Platform for Advanced Scientific Computing Conference, Pub Date: 2023-06-26, DOI: 10.1145/3592979.3593407

Jennifer Faj, Tobias Kenter, S. Faghih-Naini, Christian Plessl, V. Aizinger


Citations: 2

Abstract


FPGAs are attracting interest as energy-efficient accelerators for scientific simulations, including methods operating on unstructured meshes. Considering the potential impact on high-performance computing, specific attention needs to be given to the scalability of such approaches. In this context, the networking capabilities of FPGA hardware and software stacks can play a crucial role in enabling solutions that go beyond a traditional host-MPI and accelerator-offload model. In this work, we present the multi-FPGA scaling of a discontinuous Galerkin shallow water model using direct low-latency streaming communication between the FPGAs. To this end, the unstructured mesh defining the spatial domain of the simulation is partitioned, the inter-FPGA network is configured to match the topology of neighboring partitions, and halo communication is overlapped with the dataflow computation pipeline. With this approach, we demonstrate strong scaling on up to eight FPGAs with a parallel efficiency of >80% and execution times per time step as low as 7.6 μs. At the same time, with weak scaling, the approach makes it possible to simulate larger meshes that would exceed the local memory limits of a single FPGA, now supporting meshes of more than 100,000 elements and reaching an aggregated performance of up to 6.5 TFLOPs. Finally, a hierarchical partitioning approach allows for better utilization of the FPGA compute resources in some designs and, by mitigating limitations posed by the communication topology, enables simulations with up to 32 partitions on 8 FPGAs.
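To make the partitioning and halo-exchange idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the mesh is given as a per-element list of edge neighbors and that a partition id per element is already known. The function names `halo_and_topology` and `parallel_efficiency`, and the six-element toy mesh, are hypothetical and chosen only for illustration. The sketch derives which remote elements each partition must receive (its halo) and which pairs of partitions need a communication link, which is the information a network topology would have to match. It also states the usual strong-scaling efficiency definition, T1 / (p * Tp).

```python
from collections import defaultdict


def halo_and_topology(neighbors, part):
    """Derive halo elements and partition links from a partitioned mesh.

    neighbors[e] lists the elements sharing an edge with element e;
    part[e] is the partition id assigned to element e (both assumed inputs).
    """
    halo = defaultdict(set)  # partition id -> remote elements it must receive
    links = set()            # unordered pairs of partitions that exchange data
    for elem, adjacent in enumerate(neighbors):
        for other in adjacent:
            if part[elem] != part[other]:
                # 'other' lives on a different partition, so elem's
                # partition needs its data every time step.
                halo[part[elem]].add(other)
                links.add(frozenset((part[elem], part[other])))
    return dict(halo), links


def parallel_efficiency(t_single, t_parallel, n_devices):
    """Strong-scaling efficiency: T1 / (p * Tp)."""
    return t_single / (n_devices * t_parallel)


# Hypothetical toy mesh: six elements in a row (0-1-2-3-4-5), three partitions.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
part = [0, 0, 1, 1, 2, 2]
halo, links = halo_and_topology(neighbors, part)
```

In this toy case partition 1 sits between the other two, so it needs halo data from both sides, while partitions 0 and 2 never communicate directly. A real unstructured mesh would produce a more irregular link set.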

