Accelerating Particle-in-Cell Monte Carlo simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming

IF 3.1 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Jeremy J. Williams, Felix Liu, Jordy Trilaksono, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, Pratibha Hegde, Marta Garcia-Gasulla, Valentin Seitz, Frank Jenko, Erwin Laure, Stefano Markidis
{"title":"Accelerating Particle-in-Cell Monte Carlo simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming","authors":"Jeremy J. Williams ,&nbsp;Felix Liu ,&nbsp;Jordy Trilaksono ,&nbsp;David Tskhakaya ,&nbsp;Stefan Costea ,&nbsp;Leon Kos ,&nbsp;Ales Podolnik ,&nbsp;Jakub Hromadka ,&nbsp;Pratibha Hegde ,&nbsp;Marta Garcia-Gasulla ,&nbsp;Valentin Seitz ,&nbsp;Frank Jenko ,&nbsp;Erwin Laure ,&nbsp;Stefano Markidis","doi":"10.1016/j.jocs.2025.102590","DOIUrl":null,"url":null,"abstract":"<div><div>As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modeling plasma-material interactions and analyzing power load distributions on tokamak divertors. Previous work (Williams, 2024) introduced hybrid parallelization in BIT1 using MPI and OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the “nowait” and “depend” clauses, and OpenACC Parallel with the “async(n)” clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC’s effectiveness, with the “async(n)” configuration delivering notable performance gains. However, OpenMP asynchronous configurations outperform OpenACC at larger node counts, particularly for extreme scaling runs. As BIT1 scales asynchronously to 128 GPUs, OpenMP asynchronous multi-GPU configurations outperformed OpenACC in runtime, demonstrating superior scalability, which continues up to 400 GPUs, further improving runtime. Speedup and parallel efficiency (PE) studies reveal OpenMP asynchronous multi-GPU achieving an 8.77<span><math><mo>×</mo></math></span> speedup (54.81% PE) and OpenACC achieving an 8.14<span><math><mo>×</mo></math></span> speedup (50.87% PE) on MareNostrum 5, surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs. However, the asynchronous multi-GPU versions maintained better PE, demonstrating the benefits of asynchronous multi-GPU execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP’s asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfer and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1’s overall efficiency for large-scale plasma simulations, leveraging current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicated GPU assignments to MPI ranks enhance performance, with OpenMP’s asynchronous multi-GPU implementation utilizing OpenMP Target Tasks with “nowait” and “depend” clauses outperforming other configurations. 
This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical. This enables BIT1 to fully exploit modern supercomputing architectures, advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.</div></div>","PeriodicalId":48907,"journal":{"name":"Journal of Computational Science","volume":"88 ","pages":"Article 102590"},"PeriodicalIF":3.1000,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Science","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877750325000675","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modeling plasma-material interactions and analyzing power load distributions on tokamak divertors. Previous work (Williams, 2024) introduced hybrid parallelization in BIT1 using MPI and OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the “nowait” and “depend” clauses, and OpenACC Parallel with the “async(n)” clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC’s effectiveness, with the “async(n)” configuration delivering notable performance gains; however, the OpenMP asynchronous configurations outperform OpenACC at larger node counts, particularly in extreme-scaling runs. When BIT1 scaled asynchronously to 128 GPUs, the OpenMP asynchronous multi-GPU configuration outperformed OpenACC in runtime and demonstrated superior scalability, an advantage that persisted up to 400 GPUs, where runtime improved further. Speedup and parallel efficiency (PE) studies show the OpenMP asynchronous multi-GPU implementation reaching an 8.77× speedup (54.81% PE) and OpenACC an 8.14× speedup (50.87% PE) on MareNostrum 5, both surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs; however, the asynchronous multi-GPU versions maintained better PE, demonstrating the benefit of asynchronous multi-GPU execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP’s asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfers and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1’s overall efficiency for large-scale plasma simulations on current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicating a GPU to each MPI rank enhance performance, with OpenMP’s asynchronous multi-GPU implementation, which uses OpenMP Target Tasks with the “nowait” and “depend” clauses, outperforming the other configurations. This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical, and it enables BIT1 to fully exploit modern supercomputing architectures, advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.
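To make the OpenMP strategy concrete, here is a minimal sketch, assuming a simplified particle mover, of asynchronous target offload with OpenMP Target Tasks, the “nowait” and “depend” clauses, and a dedicated GPU per MPI rank. The function push_particles, the arrays pos, vel, and efield, and the update formula are illustrative assumptions, not BIT1’s actual code.

/* A minimal sketch, assuming a simplified particle mover: asynchronous
 * OpenMP target offload with Target Tasks, "nowait"/"depend" clauses, and a
 * dedicated GPU per MPI rank. Identifiers are illustrative, not BIT1's. */
#include <mpi.h>
#include <omp.h>

void push_particles(double *pos, double *vel, const double *efield, long n)
{
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ndev = omp_get_num_devices();
    int dev = (ndev > 0) ? rank % ndev : omp_get_default_device(); /* one GPU per MPI rank */

    /* Allocate device storage (done once per setup phase in practice). */
    #pragma omp target enter data map(alloc: efield[0:n], pos[0:n], vel[0:n]) device(dev)

    /* Deferred host-to-device transfers: "nowait" turns each update into a task,
     * "depend(out: ...)" records what it produces on the device. */
    #pragma omp target update to(efield[0:n]) device(dev) nowait depend(out: efield[0:n])
    #pragma omp target update to(pos[0:n], vel[0:n]) device(dev) nowait depend(out: pos[0:n], vel[0:n])

    /* The mover kernel is another deferred target task; it starts only after
     * the transfers it depends on, so the host thread stays free for CPU work. */
    #pragma omp target teams distribute parallel for device(dev) nowait \
            depend(in: efield[0:n]) depend(inout: pos[0:n], vel[0:n])
    for (long i = 0; i < n; ++i) {
        vel[i] += 0.5 * efield[i];  /* placeholder update, not BIT1's physics */
        pos[i] += vel[i];
    }

    /* Deferred device-to-host copy of the results, then synchronize. */
    #pragma omp target update from(pos[0:n], vel[0:n]) device(dev) nowait depend(in: pos[0:n], vel[0:n])
    #pragma omp taskwait
    #pragma omp target exit data map(delete: efield[0:n], pos[0:n], vel[0:n]) device(dev)
}

The deferred transfers and the kernel are ordered only through their “depend” clauses, so the host thread can schedule CPU work (for example, Monte Carlo collisions) between them; this is the overlap the abstract credits with preserving data consistency while avoiding race conditions.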
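For comparison, the OpenACC variant can be sketched with the “async(n)” clause: placing the transfers and the parallel loop on the same asynchronous queue orders them with respect to each other without blocking the host, and acc_set_device_num() dedicates one GPU to each MPI rank. As above, all identifiers are assumptions for illustration, not BIT1 routines.

/* A comparable sketch, assuming the same simplified mover: OpenACC with
 * "async(n)" queues and one GPU dedicated per MPI rank. */
#include <mpi.h>
#include <openacc.h>

void push_particles_acc(double *pos, double *vel, const double *efield, long n)
{
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ndev = acc_get_num_devices(acc_device_nvidia);
    if (ndev > 0)
        acc_set_device_num(rank % ndev, acc_device_nvidia);  /* one GPU per MPI rank */

    const int q = 1;  /* asynchronous queue id used by "async(n)" */

    /* Device allocation plus asynchronous host-to-device transfers on queue q. */
    #pragma acc enter data create(pos[0:n], vel[0:n], efield[0:n]) async(q)
    #pragma acc update device(pos[0:n], vel[0:n], efield[0:n]) async(q)

    /* The mover runs on the same queue, so it is ordered after the transfers
     * but does not block the host. */
    #pragma acc parallel loop present(pos[0:n], vel[0:n], efield[0:n]) async(q)
    for (long i = 0; i < n; ++i) {
        vel[i] += 0.5 * efield[i];  /* placeholder update, not BIT1's physics */
        pos[i] += vel[i];
    }

    /* Asynchronous copy-back, then a single wait on the queue before the host
     * uses the results. */
    #pragma acc update self(pos[0:n], vel[0:n]) async(q)
    #pragma acc wait(q)
    #pragma acc exit data delete(pos[0:n], vel[0:n], efield[0:n])
}

Because all work submitted to queue q executes in order, a single “wait(q)” suffices to synchronize before the host consumes the results, mirroring the ordering that the “depend” clauses provide in the OpenMP sketch.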
Source Journal

Journal of Computational Science (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS)
CiteScore: 5.50
Self-citation rate: 3.00%
Articles per year: 227
Review time: 41 days
Journal introduction: Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory. Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation. This new discipline combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods. Computational science typically unifies three distinct elements:
• Modeling, algorithms and simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve problems in science (e.g. biological, physical, and social), engineering, medicine, and the humanities;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g. problem solving environments).