Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX

Gregor Daiß, Mikael Simberg, Auriane Reverdell, J. Biddiscombe, Theresa Pollinger, H. Kaiser, D. Pflüger
Published in: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021-06-01
DOI: 10.1109/IPDPSW52791.2021.00066 (https://doi.org/10.1109/IPDPSW52791.2021.00066)
Citations: 8

Abstract

With a widening range of GPU vendors and a trend toward more GPUs per compute node in supercomputers such as Summit, Perlmutter, Frontier, and Aurora, developing performant yet portable distributed HPC applications is becoming ever more challenging. Leveraging existing solutions like Kokkos for platform-independent code and HPX for distributing the application in a task-based fashion can alleviate these challenges. However, using such frameworks in the same application requires them to work together seamlessly. In this work we present an HPX Kokkos integration that works both ways: we can integrate CPU and GPU Kokkos kernels as HPX tasks and, conversely, use HPX worker threads to execute Kokkos kernels. Using HPX futures makes launching and synchronizing Kokkos kernels from multiple threads easy, allowing us to move away from the more traditional fork-join model. To evaluate our integrations, we ported existing Vc and CUDA kernels within an existing HPX application, Octo-Tiger, to use Kokkos instead. We achieve performance comparable to, or better than, that of the previous Vc and CUDA kernels, which both demonstrates the viability of our HPX Kokkos integration and future-proofs Octo-Tiger for a wider range of potential machines. Furthermore, we introduce event polling for synchronizing CUDA kernels (or Kokkos kernels on the respective backend), achieving speedups over the previous callback-based solution.
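The event-polling idea mentioned in the abstract can be illustrated with a minimal sketch. It uses only the C++ standard library and does not reproduce the HPX, Kokkos, or CUDA APIs: `FakeEvent`, `EventPoller`, `launch_fake_kernel`, and `run_demo` are hypothetical names introduced for illustration, and a helper thread stands in for the GPU. The assumption is that a runtime records one event per asynchronous kernel and that the scheduler checks pending events without blocking, marking the associated future as ready once an event has completed, instead of relying on a completion callback.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event: the "device" sets the flag
// once the kernel it was recorded after has finished.
struct FakeEvent {
    std::atomic<bool> done{false};
};

// Hypothetical poller (not an HPX class). A real runtime would call poll()
// from its scheduling loop, in between running other tasks.
class EventPoller {
    struct Entry {
        std::shared_ptr<FakeEvent> event;
        std::promise<void> promise;
    };
    std::vector<Entry> pending_;

public:
    // Register an event; the returned future becomes ready after a later
    // poll() observes that the event has completed.
    std::future<void> watch(std::shared_ptr<FakeEvent> ev) {
        assert(ev != nullptr);
        pending_.push_back(Entry{std::move(ev), std::promise<void>{}});
        return pending_.back().promise.get_future();
    }

    // Non-blocking check of every pending event. Finished events fulfil
    // their promise and are dropped. Returns how many are still pending.
    std::size_t poll() {
        for (auto it = pending_.begin(); it != pending_.end();) {
            if (it->event->done.load(std::memory_order_acquire)) {
                it->promise.set_value();
                it = pending_.erase(it);
            } else {
                ++it;
            }
        }
        return pending_.size();
    }
};

// Simulated asynchronous kernel: a helper thread writes the result and
// then signals the event (release pairs with the acquire in poll()).
std::thread launch_fake_kernel(std::shared_ptr<FakeEvent> ev, int& out, int value) {
    return std::thread([ev, &out, value] {
        out = value * 2;
        ev->done.store(true, std::memory_order_release);
    });
}

// Launch one fake kernel, poll until it completes, and return its result.
int run_demo() {
    EventPoller poller;
    auto ev = std::make_shared<FakeEvent>();
    int result = 0;
    std::thread device = launch_fake_kernel(ev, result, 21);
    std::future<void> ready = poller.watch(ev);
    while (poller.poll() != 0) {
        std::this_thread::yield();  // a real scheduler would run other tasks here
    }
    ready.get();  // already ready, so this does not block
    device.join();
    return result;
}
```

Calling `run_demo()` launches the simulated kernel with the input 21 and returns the doubled value once the poller has observed completion. The design point of the sketch is that the thread waiting on the result never blocks inside a device-synchronization call; it only performs cheap, repeated status checks and is free to do other work between them.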