Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences

Chen-Chun Chen, Kawthar Shafie Khorassani, Goutham Kalikrishna Reddy Kuncham, Rahul Vaidya, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 2023. DOI: 10.1109/CCGrid57682.2023.00022
As the demand for computing power from High-Performance Computing (HPC) and Deep Learning (DL) applications increases, there is a growing trend of equipping modern exascale clusters with accelerators such as NVIDIA and AMD GPUs. GPU-aware MPI libraries allow applications to communicate between GPUs in a parallel environment with high productivity and performance. Although NVIDIA and AMD GPUs have dominated the accelerator market for top supercomputers over the past several years, Intel has recently developed and released its own GPUs and associated software stack, along with a unified programming model for them, referred to as oneAPI. The emergence of Intel GPUs drives the need for initial MPI-level GPU-aware support that utilizes the underlying software stack specific to these GPUs, as well as a thorough evaluation of communication performance. In this paper, we propose a GPU-aware MPI library for Intel GPUs using oneAPI and a SYCL backend. We describe our experiments with Intel GPUs and the challenges to consider at the MPI layer when adding GPU-aware support on top of the software stack Intel provides for its GPUs. We explore different memory allocation approaches and benchmark memory copy performance on Intel GPUs. Based on these experiments, we propose implementations that support point-to-point GPU-aware MPI operations, and we show the adaptability of our approach by extending the implementations to MPI collective operations such as MPI_Bcast and MPI_Reduce. We evaluate the benefits of our implementations at the benchmark level by extending the OSU Micro-Benchmarks to support Intel GPU buffers. Compared to Intel MPI and a naive benchmark, our implementations provide up to 1.8x and 2.2x speedups, respectively, in point-to-point latency with device buffers at small message sizes, and up to 1.3x and 1.5x speedups at large message sizes. For collective MPI operations, our implementations show 8x and 5x speedups for MPI_Allreduce and MPI_Allgather at large messages. At the application level, our implementations provide up to 40% improvement for 3DStencil compared to Intel MPI.
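To make the notion of GPU-aware point-to-point communication with Intel GPU buffers concrete, the sketch below shows a minimal OSU-style ping-pong latency loop using oneAPI unified shared memory. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes two MPI ranks, one Intel GPU per rank, and an MPI library (such as the one proposed in the paper) that accepts device pointers obtained from sycl::malloc_device directly in MPI_Send/MPI_Recv.

```cpp
// Illustrative sketch (not the authors' implementation): a minimal ping-pong
// latency test in the spirit of an OSU-style benchmark extended for Intel GPU
// buffers. Assumes a GPU-aware MPI library that can handle SYCL USM device
// pointers passed directly to MPI_Send/MPI_Recv.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sycl::queue q{sycl::gpu_selector_v};   // one Intel GPU per rank (assumed)
    const int msg_size = 1 << 20;          // 1 MiB message
    const int iters = 100;

    // Device allocation via oneAPI USM; sycl::malloc_host or sycl::malloc_shared
    // are the other allocation approaches the paper compares.
    char *buf = sycl::malloc_device<char>(msg_size, q);
    std::vector<char> host(msg_size, 1);
    q.memcpy(buf, host.data(), msg_size).wait();  // initialize device buffer

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        std::printf("avg one-way latency: %.2f us\n", elapsed * 1e6 / (2.0 * iters));

    sycl::free(buf, q);
    MPI_Finalize();
    return 0;
}
```

Swapping the sycl::malloc_device allocation for sycl::malloc_host or sycl::malloc_shared would reproduce the kind of allocation-approach comparison described in the abstract.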