Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda
{"title":"为GPU集群上的流应用设计高性能异构广播","authors":"Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda","doi":"10.1109/SBAC-PAD.2016.16","DOIUrl":null,"url":null,"abstract":"High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and high network throughput offered by high performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, which is stored in the host memory, from a single source—typically live—to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of IB hardware multicast feature to boost their performance, their heterogeneous counterpart requires an explicit data movement between Host and GPU, which significantly hurts the overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications especially on emerging multi-GPU configurations. In this work, we propose novel techniques to fully and conjointly take advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC) and IB hardware multicast features to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the utilization of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to 67% improvement in execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.","PeriodicalId":361160,"journal":{"name":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters\",\"authors\":\"Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda\",\"doi\":\"10.1109/SBAC-PAD.2016.16\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and high network throughput offered by high performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, which is stored in the host memory, from a single source—typically live—to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of IB hardware multicast feature to boost their performance, their heterogeneous counterpart requires an explicit data movement between Host and GPU, which significantly hurts the overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications especially on emerging multi-GPU configurations. 
In this work, we propose novel techniques to fully and conjointly take advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC) and IB hardware multicast features to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the utilization of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to 67% improvement in execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.\",\"PeriodicalId\":361160,\"journal\":{\"name\":\"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SBAC-PAD.2016.16\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2016.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and high network throughput offered by high performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, which is stored in the host memory, from a single source—typically live—to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of IB hardware multicast feature to boost their performance, their heterogeneous counterpart requires an explicit data movement between Host and GPU, which significantly hurts the overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications especially on emerging multi-GPU configurations. In this work, we propose novel techniques to fully and conjointly take advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC) and IB hardware multicast features to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the utilization of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to 67% improvement in execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.
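The intra-node mechanism the abstract relies on is CUDA inter-process communication (IPC): once one process on a node has the broadcast payload in GPU memory (delivered there via GDR and IB hardware multicast), sibling processes on the same node can map that allocation directly and copy it device-to-device, without staging through host memory. The sketch below is purely illustrative and is not the authors' implementation; it uses only the standard CUDA IPC and MPI calls, and the rank roles, payload size, and one-GPU-per-rank device assignment are assumptions made for the example.

```c
/*
 * Illustrative sketch (not the paper's code): a "leader" rank exports its
 * GPU buffer with a CUDA IPC handle, and a sibling rank on the same node
 * maps it and pulls the data with a device-to-device copy.
 * Error checking is omitted for brevity.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define NBYTES (1 << 20)  /* 1 MiB payload, illustrative size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assumption: each rank drives its own GPU on the same node. */
    cudaSetDevice(rank);

    if (rank == 0) {
        /* Leader: owns the buffer that the GDR + IB multicast step filled. */
        void *d_src;
        cudaMalloc(&d_src, NBYTES);
        cudaMemset(d_src, 0xAB, NBYTES);

        /* Export an IPC handle so a sibling rank can map this allocation. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_src);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        /* Wait until the sibling has finished its copy before freeing. */
        MPI_Barrier(MPI_COMM_WORLD);
        cudaFree(d_src);
    } else if (rank == 1) {
        /* Sibling: map the leader's buffer and copy into local GPU memory. */
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        void *d_remote, *d_dst;
        cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMalloc(&d_dst, NBYTES);

        /* Device-to-device copy; no staging through host memory. */
        cudaMemcpy(d_dst, d_remote, NBYTES, cudaMemcpyDeviceToDevice);
        cudaDeviceSynchronize();

        cudaIpcCloseMemHandle(d_remote);
        cudaFree(d_dst);
        MPI_Barrier(MPI_COMM_WORLD);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

In the full design summarized above, this intra-node IPC copy would additionally be chunked and overlapped with the GDR + IB hardware multicast reception of subsequent chunks, which is where the reported latency and streaming-benchmark improvements come from.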