Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2022-05-01. DOI: 10.1109/ipdps53621.2022.00024

Aditya Kashi, Pratik Nayak, Dhruva Kulkarni, A. Scheinberg, Paul Lin, H. Anzt


Citations: 6

Abstract

Batched linear solvers, which solve many small, related but independent problems, are important in several applications. This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to operate efficiently, so solving small problems one by one is not an option. Because each problem is small, devising a parallel partitioning scheme and mapping the problem to hardware is not trivial. Recently, significant attention has been given to batched dense linear algebra. However, there is also interest in using sparse iterative solvers in batched form, which presents further challenges. An example use case is found in XGC, a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created to facilitate optimization and porting to GPUs. The current linear solver in the collision kernel does not run on the GPU, which is a major bottleneck. As these matrices are well-conditioned, batched sparse iterative solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs are compared with those on one dual-socket Intel Xeon Skylake CPU node with 40 OpenMP threads, for matrices representative of those required in the collision kernel of XGC. The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and that the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution on exascale-oriented heterogeneous architectures at the various leadership supercomputing facilities.
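To make the idea of a batched iterative solve concrete, the following is a minimal NumPy sketch, not Ginkgo's API and not the solver used in the paper. It assumes a Jacobi-scaled Richardson iteration (chosen here only for brevity), dense storage in place of a shared sparse format, and strictly diagonally dominant test matrices. The function name `batched_jacobi_richardson` and all parameters are hypothetical. What it illustrates is the batching pattern itself: all systems advance together, and each system stops updating as soon as its own residual meets the tolerance.

```python
import numpy as np


def batched_jacobi_richardson(A, b, tol=1e-10, max_iters=500):
    """Solve A[k] @ x[k] = b[k] for every system k in the batch.

    A has shape (batch, n, n) and b has shape (batch, n). Dense storage is
    an assumption made for brevity; a real batched sparse solver would
    store one sparsity pattern shared by all systems.
    """
    diag = np.einsum("kii->ki", A)              # per-system diagonals
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b, axis=1)
    active = np.ones(b.shape[0], dtype=bool)    # systems not yet converged
    iters = np.zeros(b.shape[0], dtype=int)
    for _ in range(max_iters):
        r = b - np.einsum("kij,kj->ki", A, x)   # residuals of all systems
        active &= np.linalg.norm(r, axis=1) > tol * b_norm
        if not active.any():
            break
        # Only systems that are still active are updated, so each system
        # has its own iteration count.
        x[active] += r[active] / diag[active]
        iters[active] += 1
    return x, iters


# Hypothetical test batch: 8 tridiagonal, strictly diagonally dominant
# systems of size 16 that differ only in their diagonal values.
batch, n = 8, 16
rng = np.random.default_rng(0)
idx = np.arange(n)
A = np.zeros((batch, n, n))
A[:, idx, idx] = 4.0 + 0.1 * np.arange(batch)[:, None]
A[:, idx[:-1], idx[1:]] = -1.0
A[:, idx[1:], idx[:-1]] = -1.0
b = rng.standard_normal((batch, n))

x, iters = batched_jacobi_richardson(A, b)
print("iterations per system:", iters)
```

On a GPU, each system in the batch would typically be assigned to its own group of threads, which is where the partitioning and hardware-mapping questions raised in the abstract come in; the vectorized NumPy loop above only mimics that independence.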

