Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

Rong Shi, S. Potluri, Khaled Hamidouche, Jonathan L. Perkins, Mingzhe Li, D. Rossetti, D. Panda
{"title":"在InfiniBand GPU集群上设计高效的节点间MPI通信小消息传输机制","authors":"Rong Shi, S. Potluri, Khaled Hamidouche, Jonathan L. Perkins, Mingzhe Li, D. Rossetti, D. Panda","doi":"10.1109/HiPC.2014.7116873","DOIUrl":null,"url":null,"abstract":"Increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication has to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. Besides that, the newly introduced GPU Direct RDMA (GDR) is a promising solution to further solve this data movement bottleneck. However, existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communications due to extra synchronization message exchange. In this paper, we propose new techniques to optimize internode GPU-to-GPU communications for small message sizes. Our designs to support the eager protocol include efficient support at both sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes, using eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end-applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs shows that, compared to the best existing GDR design, our proposed designs achieve up to 23.4% latency reduction for GPULBM and 58% increase in average TPS for HOOMD-blue, respectively.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":"{\"title\":\"Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters\",\"authors\":\"Rong Shi, S. Potluri, Khaled Hamidouche, Jonathan L. Perkins, Mingzhe Li, D. Rossetti, D. Panda\",\"doi\":\"10.1109/HiPC.2014.7116873\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication has to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. Besides that, the newly introduced GPU Direct RDMA (GDR) is a promising solution to further solve this data movement bottleneck. 
However, existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communications due to extra synchronization message exchange. In this paper, we propose new techniques to optimize internode GPU-to-GPU communications for small message sizes. Our designs to support the eager protocol include efficient support at both sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes, using eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end-applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs shows that, compared to the best existing GDR design, our proposed designs achieve up to 23.4% latency reduction for GPULBM and 58% increase in average TPS for HOOMD-blue, respectively.\",\"PeriodicalId\":337777,\"journal\":{\"name\":\"2014 21st International Conference on High Performance Computing (HiPC)\",\"volume\":\"43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"31\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 21st International Conference on High Performance Computing (HiPC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HiPC.2014.7116873\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 21st International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2014.7116873","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 31

Abstract

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries such as MVAPICH2 have provided solutions that alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution for further reducing this data-movement bottleneck. However, existing designs in MPI libraries apply the rendezvous protocol for all message sizes, which incurs considerable overhead for small-message communication due to the extra synchronization message exchange. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient handling at both the sender and the receiver side. Furthermore, we propose a new data path that provides fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for small-message GPU communication using the eager protocol. Our experimental results demonstrate up to 59% and 63% reductions in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed designs with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
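
The latency and bandwidth figures above come from point-to-point measurements over GPU-resident buffers. The paper itself contains no code, so the following is only a minimal sketch of that measurement pattern: a CUDA-aware MPI ping-pong over small device buffers, written against the standard MPI and CUDA runtime APIs. It assumes an MPI library with CUDA support (for example, MVAPICH2 built with CUDA enabled) so that device pointers can be passed directly to MPI_Send/MPI_Recv; the message size and iteration count are arbitrary choices, not values from the paper.

/* Ping-pong latency sketch for small GPU-to-GPU messages (not the
 * authors' benchmark). Requires a CUDA-aware MPI; run with two ranks
 * on two nodes, e.g.: mpirun -np 2 -ppn 1 ./pingpong */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int msg_size = 64;     /* small message: candidate for the eager path */
    const int iters    = 1000;
    char *d_buf;
    cudaMalloc((void **)&d_buf, msg_size);   /* buffer lives in GPU memory */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(d_buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(d_buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(d_buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(d_buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

With MVAPICH2, device-buffer support is typically enabled at run time with MV2_USE_CUDA=1; whether a transfer of a given size takes the eager or the rendezvous path is governed by the library's eager threshold, which is exactly the decision point the paper's small-message designs target.

The "fast copies between host and GPU memories" mentioned above address the staging step that a host-based pipeline performs before handing data to the InfiniBand network. For reference only, the conventional staged copy looks roughly like the sketch below, using pinned (page-locked) host memory and an asynchronous copy; the helper name is hypothetical, and the paper's actual fast-copy path is an internal MPI-library mechanism, not this code.

/* Conventional host-staged copy of a GPU buffer, i.e. the path that GDR
 * and the proposed fast-copy design aim to shortcut. Hypothetical helper;
 * standard CUDA runtime calls only. */
#include <cuda_runtime.h>

void stage_gpu_to_host(const void *d_src, size_t bytes,
                       void **h_staging_out, cudaStream_t stream)
{
    void *h_staging = NULL;
    /* Pinned memory avoids an extra pageable-memory bounce and lets the
     * copy overlap with other work queued on the stream. */
    cudaHostAlloc(&h_staging, bytes, cudaHostAllocDefault);
    cudaMemcpyAsync(h_staging, d_src, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   /* host copy complete; ready for the NIC */
    *h_staging_out = h_staging;
}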