Multi-GPU System Design with Memory Networks

Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim
{"title":"Multi-GPU System Design with Memory Networks","authors":"Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim","doi":"10.1109/MICRO.2014.55","DOIUrl":null,"url":null,"abstract":"GPUs are being widely used to accelerate different workloads and multi-GPU systems can provide higher performance with multiple discrete GPUs interconnected together. However, there are two main communication bottlenecks in multi-GPU systems -- accessing remote GPU memory and the communication between GPU and the host CPU. Recent advances in multi-GPU programming, including unified virtual addressing and unified memory from NVIDIA, has made programming simpler but the costly remote memory access still makes multi-GPU programming difficult. In order to overcome the communication limitations, we propose to leverage the memory network based on hybrid memory cubes (HMCs) to simplify multi-GPU memory management and improve programmability. In particular, we propose scalable kernel execution (SKE) where multiple GPUs are viewed as a single virtual GPU as a single kernel can be executed across multiple GPUs without modifying the source code. To fully enable the benefits of SKE, we explore alternative memory network designs in a multi-GPU system. We propose a GPU memory network (GMN) to simplify data sharing between the discrete GPUs while a CPU memory network (CMN) is used to simplify data communication between the host CPU and the discrete GPUs. These two types of networks can be combined to create a unified memory network (UMN) where the communication bottleneck in multi-GPU can be significantly minimized as both the CPU and GPU share the memory network. We evaluate alternative network designs and propose a sliced flattened butterfly topology for the memory network that scales better than previously proposed alternative topologies by removing local HMC channels. In addition, we propose an overlay network organization for unified memory network to minimize the latency for CPU access while providing high bandwidth for the GPUs. We evaluate trade-offs between the different memory network organization and show how UMN significantly reduces the communication bottleneck in multi-GPU systems.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"96 1","pages":"484-495"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MICRO.2014.55","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 38

Abstract

GPUs are being widely used to accelerate diverse workloads, and multi-GPU systems can provide higher performance by interconnecting multiple discrete GPUs. However, there are two main communication bottlenecks in multi-GPU systems: accessing remote GPU memory and communication between the GPUs and the host CPU. Recent advances in multi-GPU programming, including NVIDIA's unified virtual addressing and unified memory, have made programming simpler, but costly remote memory accesses still make multi-GPU programming difficult. To overcome these communication limitations, we propose to leverage a memory network based on hybrid memory cubes (HMCs) to simplify multi-GPU memory management and improve programmability. In particular, we propose scalable kernel execution (SKE), in which multiple GPUs are viewed as a single virtual GPU so that a single kernel can be executed across multiple GPUs without modifying the source code. To fully realize the benefits of SKE, we explore alternative memory network designs for a multi-GPU system. We propose a GPU memory network (GMN) to simplify data sharing between the discrete GPUs, while a CPU memory network (CMN) simplifies data communication between the host CPU and the discrete GPUs. These two networks can be combined into a unified memory network (UMN), in which the communication bottlenecks of multi-GPU systems are significantly reduced because both the CPU and the GPUs share the memory network. We evaluate alternative network designs and propose a sliced flattened butterfly topology for the memory network that scales better than previously proposed topologies by removing local HMC channels. In addition, we propose an overlay network organization for the unified memory network that minimizes latency for CPU accesses while providing high bandwidth for the GPUs. We evaluate the trade-offs between the different memory network organizations and show how UMN significantly reduces the communication bottleneck in multi-GPU systems.
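To make the SKE idea concrete, the sketch below shows the kind of manual, per-GPU kernel partitioning that scalable kernel execution is meant to make unnecessary: today a programmer must select each device and offset the index space by hand, whereas under SKE a single, unmodified kernel launch would be distributed across the GPUs of the virtual GPU. The vec_add kernel, the launch_across_gpus helper, and the assumption that the buffers are reachable from every device (e.g., via unified memory) are illustrative only and are not taken from the paper.

#include <cuda_runtime.h>

// Simple element-wise add; `offset` shifts the global index so that each GPU
// covers a disjoint slice of the problem.
__global__ void vec_add(const float *a, const float *b, float *c,
                        size_t n, size_t offset) {
    size_t i = offset + (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Manual multi-GPU partitioning: split the index space evenly and launch one
// chunk per device. a, b, and c are assumed to be visible to every GPU
// (e.g., allocated with cudaMallocManaged), matching the unified virtual
// addressing / unified memory setting the abstract describes.
void launch_across_gpus(const float *a, const float *b, float *c, size_t n) {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    const int threads = 256;
    size_t chunk = (n + num_gpus - 1) / num_gpus;  // elements per GPU

    for (int dev = 0; dev < num_gpus; ++dev) {
        size_t offset = (size_t)dev * chunk;
        if (offset >= n) break;
        size_t len = (n - offset < chunk) ? (n - offset) : chunk;
        int blocks = (int)((len + threads - 1) / threads);

        cudaSetDevice(dev);
        vec_add<<<blocks, threads>>>(a, b, c, n, offset);
    }
    // Wait for every device's slice to finish before the result is consumed.
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
}

Under SKE as described in the abstract, the explicit loop over devices would disappear: a single kernel launch is executed across the GPUs of the virtual GPU, and the HMC-based memory network services the resulting remote memory accesses.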