Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations

Aditya Kashi, Pratik Nayak, Dhruva Kulkarni, A. Scheinberg, Paul Lin, H. Anzt
Published in: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022-05-01
DOI: https://doi.org/10.1109/ipdps53621.2022.00024
Citations: 6

Abstract

Batched linear solvers, which solve many small, related but independent problems, are important in several applications. This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to operate efficiently and for which solving small problems one by one is not an option. Because each problem is small, devising a parallel partitioning scheme and mapping the problem to hardware is not trivial. In recent years, significant attention has been given to batched dense linear algebra. However, there is also interest in using sparse iterative solvers in batched form, and this presents further challenges. An example use case is found in XGC, a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created to facilitate optimization and porting to GPUs. The current linear solver in the collision kernel does not run on the GPU, which is a major bottleneck. Because these matrices are well-conditioned, batched sparse iterative solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Solve times on NVIDIA V100 and A100 GPUs and on AMD MI100 GPUs are compared with those on one dual-socket Intel Xeon Skylake CPU node running 40 OpenMP threads, for matrices representative of those required in the collision kernel of XGC. The results suggest that Ginkgo's batched sparse iterative solvers are well suited to efficient utilization of the GPU for this problem, and that the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution on exascale-oriented heterogeneous architectures at the various leadership supercomputing facilities.
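To make the idea of a batched sparse iterative solve concrete, the following is a minimal sketch in plain Python. It is not Ginkgo's API and not the method used in the paper: a basic Jacobi iteration stands in for whatever Krylov solver a production library would apply, and the function name `batched_jacobi`, the tolerance, and the small example matrices are all assumptions chosen for illustration. What it does show is the structure the abstract describes: every system in the batch shares one sparsity pattern, only the nonzero values and right-hand sides differ, and each system converges independently with its own iteration count.

```python
def batched_jacobi(row_ptr, col_idx, values_batch, rhs_batch,
                   tol=1e-10, max_iters=500):
    """Solve a batch of small sparse systems with Jacobi iteration.

    All systems share one CSR sparsity pattern (row_ptr, col_idx);
    values_batch[k] and rhs_batch[k] hold the nonzeros and right-hand
    side of system k. Returns (solutions, iteration_counts).
    """
    n = len(row_ptr) - 1
    solutions, iteration_counts = [], []
    # Each pass of this loop is independent of the others. On a GPU,
    # one system would typically be assigned to one group of threads.
    for vals, b in zip(values_batch, rhs_batch):
        x = [0.0] * n
        count = max_iters  # stays at max_iters if the system never converges
        for k in range(max_iters):
            # Residual r = b - A x, computed row by row from the CSR data.
            r = []
            for i in range(n):
                s = b[i]
                for p in range(row_ptr[i], row_ptr[i + 1]):
                    s -= vals[p] * x[col_idx[p]]
                r.append(s)
            # Per-system stopping test on the infinity norm of the residual.
            if max(abs(v) for v in r) <= tol:
                count = k
                break
            # Jacobi update: x <- x + D^{-1} r, with D the diagonal of A.
            for i in range(n):
                diag = next(vals[p]
                            for p in range(row_ptr[i], row_ptr[i + 1])
                            if col_idx[p] == i)
                x[i] += r[i] / diag
        solutions.append(x)
        iteration_counts.append(count)
    return solutions, iteration_counts


# Hypothetical example: two 3x3 tridiagonal, diagonally dominant systems
# that share a sparsity pattern. Both right-hand sides are chosen so that
# the exact solution is [1, 1, 1].
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values_batch = [
    [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0],
    [5.0, -2.0, -1.0, 6.0, -2.0, -1.0, 3.0],
]
rhs_batch = [
    [3.0, 2.0, 3.0],
    [3.0, 3.0, 2.0],
]
solutions, iteration_counts = batched_jacobi(row_ptr, col_idx,
                                             values_batch, rhs_batch)
```

Storing the pattern once and the values per system reduces memory traffic, and tracking convergence per system lets well-conditioned systems stop early instead of iterating until the slowest one finishes. Jacobi is used here only because it is short; it converges for the diagonally dominant matrices in this example but is not a general-purpose choice.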