Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences

Chen-Chun Chen, Kawthar Shafie Khorassani, Goutham Kalikrishna Reddy Kuncham, Rahul Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda
{"title":"Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences","authors":"Chen-Chun Chen, Kawthar Shafie Khorassani, Goutham Kalikrishna Reddy Kuncham, Rahul Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/CCGrid57682.2023.00022","DOIUrl":null,"url":null,"abstract":"As the demand for computing power from High-Performance Computing (HPC) and Deep Learning (DL) applications increase, there is a growing trend of equipping modern exascale clusters with accelerators, such as NVIDIA and AMD GPUs. GPU-aware MPI libraries allow the applications to communicate between GPUs in a parallel environment with high productivity and performance. Although NVIDIA and AMD GPUs have dominated the accelerator market for top supercomputers over the past several years, Intel has recently developed and released its GPUs and associated software stack, and provided a unified programming model to program their GPUs, referred to as oneAPI. The emergence of Intel GPUs drives the need for initial MPI-level GPU-aware support that utilizes the underlying software stack specific to these GPUs and a thorough evaluation of communication. In this paper, we propose a GPU-aware MPI library for Intel GPUs using oneAPI and an SYCL backend. We delve into our experiments using Intel GPUs and the challenges to consider at the MPI layer when adding GPU-aware support using the software stack provided by Intel for their GPUs. We explore different memory allocation approaches and benchmark the memory copy performance with Intel GPUs. We propose implementations based on our experiments on Intel GPUs to support point-to-point GPU-aware MPI operations and show the high adaptability of our approach by extending the implementations to MPI collective operations, such as MPI_Bcast and MPI_Reduce. We evaluate the benefits of our implementations at the benchmark level by extending support for Intel GPU buffers over OSU Micro-Benchmarks. Our implementations provide up to 1.8x and 2.2x speedups on point-to-point latency using device buffers at small messages compared to Intel MPI and a naive benchmark, respectively; and have up to 1.3x and 1.5x speedups at large message sizes. At collective MPI operations, our implementations show 8x and 5x speedups for MPI_Allreduce and MPI_Allgather at large messages. At the application-level evaluation, our implementations provide up to 40% improvement for 3DStencil compared to Intel MPI.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

As the demand for computing power from High-Performance Computing (HPC) and Deep Learning (DL) applications increases, there is a growing trend of equipping modern exascale clusters with accelerators, such as NVIDIA and AMD GPUs. GPU-aware MPI libraries allow applications to communicate between GPUs in a parallel environment with high productivity and performance. Although NVIDIA and AMD GPUs have dominated the accelerator market for top supercomputers over the past several years, Intel has recently developed and released its GPUs and associated software stack, and provided a unified programming model, referred to as oneAPI, to program them. The emergence of Intel GPUs drives the need for initial MPI-level GPU-aware support that utilizes the underlying software stack specific to these GPUs, along with a thorough evaluation of communication performance. In this paper, we propose a GPU-aware MPI library for Intel GPUs using oneAPI and a SYCL backend. We delve into our experiments using Intel GPUs and the challenges to consider at the MPI layer when adding GPU-aware support using the software stack Intel provides for its GPUs. We explore different memory allocation approaches and benchmark memory copy performance on Intel GPUs. Based on these experiments, we propose implementations that support point-to-point GPU-aware MPI operations and show the high adaptability of our approach by extending the implementations to MPI collective operations, such as MPI_Bcast and MPI_Reduce. We evaluate the benefits of our implementations at the benchmark level by extending support for Intel GPU buffers in the OSU Micro-Benchmarks. Our implementations provide up to 1.8x and 2.2x speedups in point-to-point latency with device buffers for small messages compared to Intel MPI and a naive benchmark, respectively, and up to 1.3x and 1.5x speedups at large message sizes. For collective MPI operations, our implementations show 8x and 5x speedups for MPI_Allreduce and MPI_Allgather at large message sizes. In the application-level evaluation, our implementations provide up to 40% improvement for 3DStencil compared to Intel MPI.
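To make the "memory allocation approaches" and "memory copy performance" discussion concrete, the sketch below shows one common pattern on Intel GPUs with oneAPI: allocating unified shared memory (USM) on the device and timing a host-to-device copy through a SYCL queue. This is a minimal illustration written for this summary, not code from the paper; the buffer size and timing approach are arbitrary choices.

```cpp
// Minimal sketch (not from the paper): allocate SYCL USM device memory and
// time a host-to-device copy, roughly the kind of allocation/copy path the
// abstract refers to when benchmarking memory copies on Intel GPUs.
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    sycl::queue q{sycl::gpu_selector_v};            // select a GPU device
    const size_t n = 1 << 24;                       // 16 Mi floats (~64 MiB)

    std::vector<float> host(n, 1.0f);               // host source buffer
    float *dev = sycl::malloc_device<float>(n, q);  // USM device allocation

    auto t0 = std::chrono::steady_clock::now();
    q.memcpy(dev, host.data(), n * sizeof(float)).wait();  // H2D copy
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("H2D copy of %zu bytes took %.3f ms\n", n * sizeof(float), ms);

    sycl::free(dev, q);
    return 0;
}
```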
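From the application's point of view, GPU-aware point-to-point communication means passing device-resident buffers directly to MPI calls and letting the library handle the transfer. The sketch below illustrates that usage pattern with SYCL USM device pointers; it is an assumed example rather than the paper's implementation, and it presumes the underlying MPI library (such as the one described here, or a GPU-aware Intel MPI build) detects device memory and stages or pipelines the copy internally.

```cpp
// Minimal sketch (assumed usage, not the paper's implementation): a GPU-aware
// MPI exchange where both ranks hand SYCL USM device pointers straight to
// MPI_Send/MPI_Recv.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sycl::queue q{sycl::gpu_selector_v};
    const int count = 1 << 20;                        // 1 Mi ints per message
    int *buf = sycl::malloc_device<int>(count, q);    // device-resident buffer

    if (rank == 0) {
        q.fill(buf, 42, count).wait();                // initialize on the GPU
        MPI_Send(buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int first;                                    // read one element back to verify
        q.memcpy(&first, buf, sizeof(int)).wait();
        std::printf("rank 1 received, first element = %d\n", first);
    }

    sycl::free(buf, q);
    MPI_Finalize();
    return 0;
}
```

A naive benchmark, by contrast, would copy the device buffer to a host staging buffer before every MPI call; the abstract's 1.8x-2.2x small-message speedups are measured against exactly those two baselines (Intel MPI and a naive staging scheme).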