Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters

Khaled Hamidouche, Akshay Venkatesh, A. Awan, H. Subramoni, Ching-Hsiang Chu, D. Panda
{"title":"利用GPUDirect RDMA设计NVIDIA GPU集群的高性能OpenSHMEM","authors":"Khaled Hamidouche, Akshay Venkatesh, A. Awan, H. Subramoni, Ching-Hsiang Chu, D. Panda","doi":"10.1109/CLUSTER.2015.21","DOIUrl":null,"url":null,"abstract":"GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as \"Device\"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of the GDR features leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU based systems. We propose novel designs that ensure \"truly one-sided\" communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvement in point-point communication for intra-node and inter-node, respectively. The proposed framework achieves 2.2μs for an intra-node 8 byte put operation from Host-to-Device, and 3.13μs for an inter-node 8 byte put operation from GPU to remote GPU. With Stencil2D application kernel from SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters\",\"authors\":\"Khaled Hamidouche, Akshay Venkatesh, A. Awan, H. Subramoni, Ching-Hsiang Chu, D. Panda\",\"doi\":\"10.1109/CLUSTER.2015.21\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as \\\"Device\\\"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of the GDR features leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU based systems. 
We propose novel designs that ensure \\\"truly one-sided\\\" communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvement in point-point communication for intra-node and inter-node, respectively. The proposed framework achieves 2.2μs for an intra-node 8 byte put operation from Host-to-Device, and 3.13μs for an inter-node 8 byte put operation from GPU to remote GPU. With Stencil2D application kernel from SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.\",\"PeriodicalId\":187042,\"journal\":{\"name\":\"2015 IEEE International Conference on Cluster Computing\",\"volume\":\"97 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Conference on Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLUSTER.2015.21\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2015.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 18

Abstract

GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks such as InfiniBand (IB) to GPUs (referred to as the "Device"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by offering shared memory address space abstractions along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of GDR features, leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU-based systems. We propose novel designs that ensure "truly one-sided" communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvements in point-to-point communication for intra-node and inter-node transfers, respectively. The proposed framework achieves 2.2 μs for an intra-node 8-byte put operation from Host to Device, and 3.13 μs for an inter-node 8-byte put operation from a GPU to a remote GPU. With the Stencil2D application kernel from the SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for the GPULBM application, we improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.
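
For readers unfamiliar with the put operations the latency numbers above refer to, the sketch below shows the basic OpenSHMEM one-sided 8-byte put pattern in C, using only standard OpenSHMEM 1.2 calls (shmem_malloc, shmem_putmem, shmem_quiet). It is a minimal illustration, not the paper's implementation: it assumes a host-resident symmetric heap, whereas the GDR-aware designs evaluated in the paper let the same call path target symmetric buffers residing in GPU memory; the heap-placement mechanism is implementation-specific and not shown here.

```c
/* Minimal sketch of the 8-byte one-sided put pattern benchmarked in the
 * paper. Uses only the standard OpenSHMEM 1.2 API. With a GDR-aware
 * OpenSHMEM runtime (as described in the paper), the symmetric heap may
 * reside in GPU memory, letting the NIC write into device buffers
 * directly; that placement step is implementation-specific and omitted. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation (collective): the same buffer exists on every PE.
     * The direct host dereferences below assume a host-resident heap. */
    long *buf = (long *)shmem_malloc(sizeof(long));
    *buf = -1;
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        long val = 42;
        /* Truly one-sided 8-byte put: PE 0 writes directly into PE 1's
         * symmetric buffer without any action by PE 1. */
        shmem_putmem(buf, &val, sizeof(long), 1);
        shmem_quiet();              /* wait for remote completion */
    }
    shmem_barrier_all();

    if (me == 1)
        printf("PE 1 received %ld\n", *buf);

    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```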