GPUrdma: GPU-side library for high performance networking from GPU kernels

F. Daoud, Amir Wated, M. Silberstein
{"title":"GPUrdma: GPU端库,用于GPU内核的高性能网络","authors":"F. Daoud, Amir Wated, M. Silberstein","doi":"10.1145/2931088.2931091","DOIUrl":null,"url":null,"abstract":"We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5μsec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors -- each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"19 35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":"{\"title\":\"GPUrdma: GPU-side library for high performance networking from GPU kernels\",\"authors\":\"F. Daoud, Amir Wated, M. Silberstein\",\"doi\":\"10.1145/2931088.2931091\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5μsec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. 
Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors -- each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.\",\"PeriodicalId\":262414,\"journal\":{\"name\":\"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers\",\"volume\":\"19 35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"47\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2931088.2931091\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2931088.2931091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 47

Abstract

We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on the CPU, directly accessing the Host Channel Adapter (HCA) InfiniBand hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network-adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5 μsec one-way communication latency and up to 50 Gbit/sec transfer bandwidth for messages of 16 KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms CPU RDMA for smaller packets ranging from 2 to 1024 bytes by a factor of 4.5x, thanks to the greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with a constant matrix and multiple vectors -- each running on two different machines connected by InfiniBand. Our basic ping-pong implementation achieves 5% higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap yields a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster, thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. The GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs, which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.
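
To make the GPU-side messaging model concrete, the sketch below shows what a ping-pong round trip issued entirely from a CUDA kernel might look like. It is a minimal illustration only: the RemoteBuf struct and the gpurdma_write / gpurdma_wait device functions are hypothetical placeholders (stubbed out so the file compiles), not the actual GPUrdma or GPI API, which the abstract does not specify.

#include <cstdint>
#include <cstddef>

// Hypothetical descriptor of a remote, RDMA-registered buffer.
struct RemoteBuf {
    void*    addr;  // remote virtual address
    uint32_t rkey;  // remote access key
};

// Placeholder: post an RDMA write of len bytes from src to the remote buffer.
// A real GPU-side library would build the work request and ring the HCA
// doorbell directly from device code; here it is a no-op stub.
__device__ void gpurdma_write(RemoteBuf dst, const void* src, size_t len) {
    (void)dst; (void)src; (void)len;
}

// Placeholder: spin until at least `count` incoming messages have been
// flagged in the local completion counter.
__device__ void gpurdma_wait(volatile uint32_t* completions, uint32_t count) {
    while (*completions < count) { /* busy-wait on device */ }
}

// Ping-pong driven entirely from the GPU: thread 0 of the block sends the
// local buffer to the peer, then waits for the peer's reply, for `rounds`
// iterations, with no CPU involvement on the control or data path.
__global__ void pingpong(RemoteBuf peer, const char* local_buf, size_t msg_size,
                         volatile uint32_t* completions, int rounds) {
    for (int r = 0; r < rounds; ++r) {
        if (threadIdx.x == 0) {
            gpurdma_write(peer, local_buf, msg_size);      // send "ping"
            gpurdma_wait(completions, (uint32_t)(r + 1));  // wait for "pong"
        }
        __syncthreads();  // other warps could overlap computation here
    }
}

The point of the sketch is the control flow: the kernel itself posts the send and spins on a completion counter, so no CPU thread sits on the communication path, and the rest of the thread block is free to overlap computation with the wait, as in the improved per-threadblock ping-pong variant mentioned above.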