Accelerating Particle-in-Cell Monte Carlo simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming

Jeremy J. Williams, Felix Liu, Jordy Trilaksono, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, Pratibha Hegde, Marta Garcia-Gasulla, Valentin Seitz, Frank Jenko, Erwin Laure, Stefano Markidis

Journal of Computational Science, Volume 88, Article 102590. Published 2025-04-23. DOI: 10.1016/j.jocs.2025.102590. https://www.sciencedirect.com/science/article/pii/S1877750325000675
Abstract
As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modeling plasma-material interactions and analyzing power load distributions on tokamak divertors. Previous work (Williams, 2024) introduced hybrid parallelization in BIT1 using MPI and OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the “nowait” and “depend” clauses, and OpenACC Parallel with the “async(n)” clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC’s effectiveness, with the “async(n)” configuration delivering notable performance gains. However, OpenMP asynchronous configurations outperform OpenACC at larger node counts, particularly for extreme scaling runs. When BIT1 scales asynchronously to 128 GPUs, the OpenMP asynchronous multi-GPU configuration outperforms OpenACC in runtime and demonstrates superior scalability, a trend that continues up to 400 GPUs with further runtime improvements. Speedup and parallel efficiency (PE) studies show the OpenMP asynchronous multi-GPU version achieving an 8.77× speedup (54.81% PE) and OpenACC achieving an 8.14× speedup (50.87% PE) on MareNostrum 5, surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs; however, the asynchronous multi-GPU versions maintained better PE, demonstrating the benefits of asynchronous multi-GPU execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP’s asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfers and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1’s overall efficiency for large-scale plasma simulations on current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicated GPU assignments to MPI ranks enhance performance, with OpenMP’s asynchronous multi-GPU implementation, using OpenMP Target Tasks with “nowait” and “depend” clauses, outperforming the other configurations. This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical. This enables BIT1 to fully exploit modern supercomputing architectures, advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.
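To make the directive-based approach concrete, below is a minimal, self-contained C sketch (illustrative only, not BIT1 source code) of the two asynchronous offload styles named in the abstract: an OpenMP target task using the “nowait” and “depend” clauses, and an OpenACC parallel loop on an “async(n)” queue, with each MPI rank bound to its own GPU. The particle-push kernel, the array names (pos, vel, efield), and the device-selection logic are assumptions made for this example.

```c
/*
 * Minimal sketch (illustrative, not BIT1 source): one GPU per MPI rank,
 * with an asynchronous "particle push" offloaded in the two styles the
 * abstract names:
 *   (a) an OpenMP target task using the "nowait" and "depend" clauses,
 *   (b) an OpenACC parallel loop on an "async(n)" queue.
 * A real build would enable only one of the two models; both are shown
 * side by side here for comparison.
 */
#include <mpi.h>
#include <omp.h>
#include <openacc.h>

#define N (1 << 20)

static double pos[N], vel[N], efield[N];

/* (a) OpenMP: deferred target task. The host thread continues immediately
   (nowait); any later task declaring depend(in: pos, vel) is ordered after
   this update by the runtime. */
static void push_openmp(int n, double qm_dt)
{
    #pragma omp target teams distribute parallel for \
            map(to: efield[0:n]) map(tofrom: pos[0:n], vel[0:n]) \
            depend(inout: pos[0:n], vel[0:n]) nowait
    for (int i = 0; i < n; ++i) {
        vel[i] += qm_dt * efield[i];
        pos[i] += vel[i];
    }
}

/* (b) OpenACC: same kernel launched on async queue 1; "#pragma acc wait(1)"
   drains the queue before the host reuses the data. */
static void push_openacc(int n, double qm_dt)
{
    #pragma acc parallel loop async(1) \
            copyin(efield[0:n]) copy(pos[0:n], vel[0:n])
    for (int i = 0; i < n; ++i) {
        vel[i] += qm_dt * efield[i];
        pos[i] += vel[i];
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Dedicated GPU per MPI rank (round-robin), as described in the abstract;
       NVIDIA devices are assumed since the paper profiles with Nsight. */
    int ngpu_omp = omp_get_num_devices();
    if (ngpu_omp > 0) omp_set_default_device(rank % ngpu_omp);
    int ngpu_acc = acc_get_num_devices(acc_device_nvidia);
    if (ngpu_acc > 0) acc_set_device_num(rank % ngpu_acc, acc_device_nvidia);

    push_openmp(N, 0.5);
    #pragma omp taskwait   /* wait for the deferred OpenMP target task */

    push_openacc(N, 0.5);
    #pragma acc wait(1)    /* drain OpenACC async queue 1 */

    MPI_Finalize();
    return 0;
}
```

The “depend” clauses are what allow consecutive offloads (e.g., a push followed by boundary handling) to overlap safely: the runtime orders them by the declared data dependences rather than a global barrier, which is the data-consistency and race-avoidance benefit the abstract attributes to the OpenMP asynchronous approach.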
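For readers unfamiliar with the metric, parallel efficiency here is speedup divided by the relative resource count. The reported MareNostrum 5 figures are mutually consistent if the GPU runs are assumed to use 16 times the baseline resources (our illustrative assumption; the abstract does not state the device count for these runs):

```latex
\mathrm{PE} = \frac{S}{N}, \qquad
\frac{8.77}{16} \approx 54.8\%, \qquad
\frac{8.14}{16} \approx 50.9\%
```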
Journal Introduction:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed, data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g., numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem solving environments).