A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on InfiniBand clusters

Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda
{"title":"A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters","authors":"Akshay Venkatesh, H. Subramoni, Khaled Hamidouche, D. Panda","doi":"10.1109/HiPC.2014.7116875","DOIUrl":null,"url":null,"abstract":"Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB) are allowing streaming applications to scale to rapidly. A frequently used operation that constitutes to the execution of multi-node streaming applications is the broadcast operation where data from a single source is transmitted to multiple sinks, typically from a live data site. Although high performance networks like IB offer novel features like hardware based multicast to speed up the performance of the broadcast operation, their benefits have been limited to host based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access the memory of the GPGPUs. This poses a significant performance bottleneck to high performance streaming applications that rely heavily on broadcast operations from GPU memories. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to / from GPU memory (bypassing host memory). Thus, it presents an attractive alternative to designing high performance broadcast operations for GPGPU based high performance streaming applications. In this work, we propose a novel method for fully utilizing GPUDirect RDMA and hardware multicast features in tandem to design a high performance broadcast operation for streaming applications. The experiments conducted with the proposed design show up 60% decrease in latency and 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 21st International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2014.7116875","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Several streaming applications in the field of high performance computing are obtaining significant speedups in execution time by leveraging the raw compute power offered by modern GPGPUs. This raw compute power, coupled with the high network throughput offered by high performance interconnects such as InfiniBand (IB), is allowing streaming applications to scale rapidly. A frequently used operation in the execution of multi-node streaming applications is the broadcast operation, where data from a single source, typically a live data site, is transmitted to multiple sinks. Although high performance networks like IB offer novel features such as hardware-based multicast to speed up the broadcast operation, their benefits have been limited to host-based applications due to the inability of IB Host Channel Adapters (HCAs) to directly access GPGPU memory. This poses a significant performance bottleneck for high performance streaming applications that rely heavily on broadcast operations from GPU memory. The recently introduced GPUDirect RDMA feature alleviates this bottleneck by enabling IB HCAs to perform data transfers directly to/from GPU memory, bypassing host memory. It thus presents an attractive basis for designing high performance broadcast operations for GPGPU-based streaming applications. In this work, we propose a novel method that uses GPUDirect RDMA and hardware multicast in tandem to design a high performance broadcast operation for streaming applications. Experiments with the proposed design show up to a 60% decrease in latency and a 3X-4X improvement in a throughput benchmark compared to the naive scheme on 64 GPU nodes.
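
The broadcast pattern the abstract describes can be illustrated end-to-end at the MPI level. The sketch below is a minimal illustration, not the authors' implementation: it assumes a CUDA-aware MPI build (for example, MVAPICH2-GDR, from the same research group) that accepts GPU device pointers, allowing the library to combine GPUDirect RDMA and IB hardware multicast internally instead of staging data through host memory.

/* Minimal sketch: broadcasting a frame directly from GPU memory.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) that accepts
 * device pointers; this shows the usage pattern, not the paper's
 * internal design. Build (illustrative): mpicc bcast_gpu.c -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 1 << 20;   /* one 1 MiB frame per broadcast */
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, nbytes);    /* buffer resides in GPU memory */

    if (rank == 0) {
        /* The root (the live data site) would produce the next frame
         * here, e.g., via a kernel or a copy from a capture device. */
        cudaMemset(gpu_buf, 0xAB, nbytes);
    }

    /* Passing the device pointer directly: a CUDA-aware MPI can move
     * the data HCA <-> GPU via GPUDirect RDMA, bypassing host memory. */
    MPI_Bcast(gpu_buf, (int)nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

By contrast, the naive scheme the abstract benchmarks against would typically stage each frame through host memory: a cudaMemcpy to a host buffer on the root, a host-based broadcast, and a cudaMemcpy back to GPU memory on every sink, adding two PCIe copies to the critical path of each broadcast.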