Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication

S. Potluri, Hao Wang, Devendar Bureddy, A. Singh, C. Rosales, D. Panda
{"title":"Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication","authors":"S. Potluri, Hao Wang, Devendar Bureddy, A. Singh, C. Rosales, D. Panda","doi":"10.1109/IPDPSW.2012.228","DOIUrl":null,"url":null,"abstract":"Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in/out of GPUs continues to remain a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves the programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows 74% improvement in latency for 4MByte message, compared to the existing Send/Receive based implementation. Our benchmark using Get and Passive Synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"75","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.228","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 75

Abstract

Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in and out of GPUs remains a major performance bottleneck. With CUDA 4.1, NVIDIA introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes that take advantage of the IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4 MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and active synchronization shows a 74% improvement in latency for 4 MByte messages compared to the existing Send/Receive based implementation. Our benchmark using Get and passive synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of the Lattice Boltzmann Method for multiphase flows, by 16% compared to the performance using the existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node using CUDA IPC.
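For readers unfamiliar with the mechanism the paper builds on, the following minimal sketch (written for this summary, not taken from the paper or from MVAPICH2) shows the basic CUDA IPC workflow between two MPI ranks on the same node: one rank exports a handle to its device allocation, ships it to the neighbouring rank over ordinary host-memory MPI, and the receiver opens the handle so the payload moves with a single GPU-to-GPU copy driven by the DMA engine. The 4 MByte size matches the abstract's benchmark point; the rank-to-GPU mapping, buffer names, and two-rank layout are assumptions made for illustration.

```c
/*
 * Minimal sketch (not from the paper or MVAPICH2): two ranks on the same
 * node, one GPU each, exchange a CUDA IPC handle over host-memory MPI so
 * that rank 1 can pull rank 0's device buffer with a single GPU-to-GPU
 * copy.  Error checking is omitted for brevity.
 */
#include <mpi.h>
#include <cuda_runtime.h>

#define NBYTES (4 << 20)   /* 4 MByte, the message size quoted in the abstract */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);                 /* assumption: GPU index == rank */

    void *dbuf;
    cudaMalloc(&dbuf, NBYTES);

    if (rank == 0) {
        /* Export a handle to the device allocation and ship it to rank 1
         * through ordinary host-memory MPI. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, dbuf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's buffer into this address space and copy it
         * device-to-device; no staging through host memory. */
        void *remote = NULL;
        cudaIpcOpenMemHandle(&remote, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(dbuf, remote, NBYTES, cudaMemcpyDefault);
        cudaIpcCloseMemHandle(remote);
    }

    MPI_Barrier(MPI_COMM_WORLD);         /* keep rank 0 alive until the copy is done */
    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```

The one-sided results in the abstract rely on MPI window operations over GPU buffers. The fragment below is a hypothetical sketch of that pattern, Put with active (fence) synchronization; passing device pointers to MPI_Win_create and MPI_Put assumes a CUDA-aware MPI such as MVAPICH2 built with GPU support, and the helper name and parameters are illustrative.

```c
/*
 * Hypothetical sketch of the one-sided pattern discussed in the abstract:
 * a GPU buffer exposed through an MPI window and written with MPI_Put under
 * active (fence) synchronization.  Device pointers as window/origin buffers
 * assume a CUDA-aware MPI; the function must be called collectively by all
 * ranks in MPI_COMM_WORLD.
 */
#include <mpi.h>
#include <stddef.h>

void put_to_window(void *dev_src, void *dev_win_buf, size_t nbytes, int target)
{
    MPI_Win win;

    /* Expose the (device) buffer as this rank's window. */
    MPI_Win_create(dev_win_buf, (MPI_Aint)nbytes, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                          /* open the access epoch  */
    if (target != MPI_PROC_NULL)
        MPI_Put(dev_src, (int)nbytes, MPI_BYTE,     /* origin buffer on GPU   */
                target, 0, (int)nbytes, MPI_BYTE,   /* target displacement 0  */
                win);
    MPI_Win_fence(0, win);                          /* complete the transfer  */

    MPI_Win_free(&win);
}
```

Ranks that only receive data can pass MPI_PROC_NULL as the target and still participate in the collective fence calls, which is what makes the synchronization "active": both sides take part in opening and closing the epoch, while the data itself can move asynchronously via IPC and the GPU DMA engine.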