GPU-SM: shared memory multi-GPU programming

Proceedings of the 8th Workshop on General Purpose Processing using GPUs. Pub Date: 2015-02-07. DOI: 10.1145/2716282.2716286

Javier Cabezas, Marc Jordà, Isaac Gelado, N. Navarro, Wen-mei W. Hwu

Citations: 7

Abstract

Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to the local GPU memory, we show that the highly multithreaded GPU execution model can help reduce its costs. Evaluation of the finite difference and image filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
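
The decomposition described in the abstract can be illustrated with a small addressing model. The sketch below is not the paper's code and does not run on a GPU: it only assumes a contiguous block decomposition of a 1D array across GPUs and a symmetric stencil of a given radius, and counts how many reads would land in another GPU's memory. The names `owner` and `remote_fraction` are hypothetical and chosen for illustration.

```python
def owner(i, n, num_gpus):
    """Return the index of the GPU holding element i.

    Assumes a contiguous block decomposition of n elements
    (an assumption of this sketch, not a detail from the paper).
    """
    chunk = -(-n // num_gpus)  # ceiling division: elements per GPU
    return i // chunk


def remote_fraction(n, num_gpus, radius):
    """Fraction of stencil reads that touch another GPU's memory.

    Each element i reads its neighbours in [i - radius, i + radius],
    clipped to the array bounds.
    """
    total = 0
    remote = 0
    for i in range(n):
        me = owner(i, n, num_gpus)
        for j in range(i - radius, i + radius + 1):
            if 0 <= j < n:
                total += 1
                if owner(j, n, num_gpus) != me:
                    remote += 1
    return remote / total


if __name__ == "__main__":
    # Hypothetical problem size: 4096 elements, 4 GPUs, radius-4 stencil.
    print(remote_fraction(1 << 12, 4, 4))
```

In this model only the reads near partition boundaries are remote, so the remote share shrinks as the per-GPU block grows relative to the stencil radius. Real kernels would also need to account for the interconnect latency and bandwidth that the paper analyzes.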
