A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on InfiniBand clusters

Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda
{"title":"A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters","authors":"Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda","doi":"10.1109/HiPC.2014.7116875","DOIUrl":null,"url":null,"abstract":"Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB) are allowing streaming applications to scale to rapidly. A frequently used operation that constitutes to the execution of multi-node streaming applications is the broadcast operation where data from a single source is transmitted to multiple sinks, typically from a live data site. Although high performance networks like IB offer novel features like hardware based multicast to speed up the performance of the broadcast operation, their benefits have been limited to host based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access the memory of the GPGPUs. This poses a significant performance bottleneck to high performance streaming applications that rely heavily on broadcast operations from GPU memories. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to / from GPU memory (bypassing host memory). Thus, it presents an attractive alternative to designing high performance broadcast operations for GPGPU based high performance streaming applications. In this work, we propose a novel method for fully utilizing GPUDirect RDMA and hardware multicast features in tandem to design a high performance broadcast operation for streaming applications. The experiments conducted with the proposed design show up 60% decrease in latency and 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 21st International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2014.7116875","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB), is allowing streaming applications to scale rapidly. A frequently used operation in the execution of multi-node streaming applications is the broadcast operation, where data from a single source, typically a live data site, is transmitted to multiple sinks. Although high performance networks like IB offer novel features such as hardware-based multicast to speed up the broadcast operation, their benefits have been limited to host-based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access GPGPU memory. This poses a significant performance bottleneck for high performance streaming applications that rely heavily on broadcast operations from GPU memory. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to/from GPU memory, bypassing host memory. It thus presents an attractive basis for designing high performance broadcast operations for GPGPU-based streaming applications. In this work, we propose a novel method that uses GPUDirect RDMA and hardware multicast in tandem to design a high performance broadcast operation for streaming applications. Experiments with the proposed design show up to a 60% decrease in latency and a 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.
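
The broadcast pattern the abstract describes can be illustrated end-to-end at the MPI level. The sketch below is a minimal illustration, not the authors' implementation: it assumes a CUDA-aware MPI build (for example, MVAPICH2-GDR, from the same research group) that accepts GPU device pointers, allowing the library to combine GPUDirect RDMA and IB hardware multicast internally instead of staging data through host memory.

/* Minimal sketch: broadcasting a frame directly from GPU memory.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) that accepts
 * device pointers; this shows the usage pattern, not the paper's
 * internal design. Build (illustrative): mpicc bcast_gpu.c -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 1 << 20;   /* one 1 MiB frame per broadcast */
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, nbytes);    /* buffer resides in GPU memory */

    if (rank == 0) {
        /* The root (the live data site) would produce the next frame
         * here, e.g., via a kernel or a copy from a capture device. */
        cudaMemset(gpu_buf, 0xAB, nbytes);
    }

    /* Passing the device pointer directly: a CUDA-aware MPI can move
     * the data HCA <-> GPU via GPUDirect RDMA, bypassing host memory. */
    MPI_Bcast(gpu_buf, (int)nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

By contrast, the naive scheme the abstract benchmarks against would typically stage each frame through host memory: a cudaMemcpy to a host buffer on the root, a host-based broadcast, and a cudaMemcpy back to GPU memory on every sink, adding two PCIe copies to the critical path of each broadcast.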