Beyond Fork-Join: Integration of Performance Portable Kokkos Kernels with HPX

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Pub Date: 2021-06-01, DOI: 10.1109/IPDPSW52791.2021.00066

Gregor Daiß, Mikael Simberg, Auriane Reverdell, J. Biddiscombe, Theresa Pollinger, H. Kaiser, D. Pflüger

Citations: 8

Abstract

Between a widening range of GPU vendors and the trend of having more GPUs per compute node in supercomputers such as Summit, Perlmutter, Frontier and Aurora, developing performant yet portable distributed HPC applications becomes ever more challenging. Leveraging existing solutions like Kokkos for platform-independent code and HPX for distributing the application in a task-based fashion can alleviate these challenges. However, using such frameworks in the same application requires them to work together seamlessly. In this work we present an HPX Kokkos integration that works both ways: we can integrate CPU and GPU Kokkos kernels as HPX tasks and inversely use HPX worker threads to work on Kokkos kernels. Using HPX futures makes launching and synchronizing Kokkos kernels from multiple threads easy, allowing us to move away from the more traditional fork-join model. To evaluate our integrations we ported existing Vc and CUDA kernels within an existing HPX application, Octo-Tiger, to use Kokkos instead. We achieve comparable, or better, performance than with previous Vc and CUDA kernels, showing both the viability of our HPX Kokkos integration, as well as future-proofing Octo-Tiger for a wider range of potential machines. Furthermore, we introduce event polling for synchronizing CUDA kernels (or Kokkos kernels on the respective backend) achieving speedups over the previous solution using callbacks.
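The move away from fork-join that the abstract describes can be illustrated with a minimal sketch. It uses only the C++ standard library (`std::async`, `std::future`) as a stand-in and is not the HPX or Kokkos API; the names `sum_chunk`, `square`, and `pipeline_total` are hypothetical and do not come from the paper. Each chunk's second stage waits only on the future of its own first stage, so there is no global barrier between the two stages.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for a compute kernel: sums data[begin, end).
double sum_chunk(const std::vector<double>& data, std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                           data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
}

// Hypothetical follow-up kernel that needs the result of one chunk only.
double square(double x) { return x * x; }

// Runs a two-stage pipeline per chunk and returns the sum of all stage-2 results.
// Assumption: plain std::async tasks model the kernels; a task-based runtime
// would attach continuations to futures instead of blocking inside a task.
double pipeline_total(const std::vector<double>& data, std::size_t num_chunks) {
    std::vector<std::future<double>> results;
    const std::size_t n = data.size();
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = c * n / num_chunks;
        const std::size_t end = (c + 1) * n / num_chunks;

        // Stage 1: launch the chunk kernel and get a future immediately.
        std::future<double> partial = std::async(
            std::launch::async,
            [&data, begin, end] { return sum_chunk(data, begin, end); });

        // Stage 2: depends only on this chunk's stage-1 future, so it can
        // start as soon as that result is ready, independent of other chunks.
        results.push_back(std::async(
            std::launch::async,
            [p = std::move(partial)]() mutable { return square(p.get()); }));
    }

    // The only synchronization with the caller happens here, at the very end.
    double total = 0.0;
    for (auto& r : results) {
        total += r.get();
    }
    return total;
}
```

For a vector of 100 ones split into 4 chunks, each chunk sums to 25, each second stage yields 625, and `pipeline_total` returns 2500. In a fork-join version, all first-stage tasks would have to finish before any second-stage task could begin.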
