Unlocking bandwidth for GPUs in CC-NUMA systems
Neha Agarwal, D. Nellans, Mike O'Connor, S. Keckler, T. Wenisch
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 354-365, February 2015
DOI: 10.1109/HPCA.2015.7056046
Citations: 72
Abstract
Historically, GPU-based HPC applications have had a substantial memory bandwidth advantage over CPU-based workloads due to using GDDR rather than DDR memory. However, past GPUs required a restricted programming model in which application data was allocated up front and explicitly copied into GPU memory by the programmer before launching a GPU kernel. Recently, GPUs have eased this requirement and now can employ on-demand software page migration between CPU and GPU memory to obviate explicit copying. In the near future, CC-NUMA GPU-CPU systems will appear in which software page migration is optional and hardware cache coherence can also support the GPU accessing CPU memory directly. In this work, we describe the trade-offs and considerations in relying on hardware cache-coherence mechanisms versus using software page migration to optimize the performance of memory-intensive GPU workloads. We show that page migration decisions based on page access frequency alone are a poor solution, and that a broader solution using virtual address-based program locality to enable aggressive memory prefetching, combined with bandwidth balancing, is required to maximize performance. We present a software runtime system requiring minimal hardware support that, on average, outperforms CC-NUMA-based accesses by 1.95×, performs 6% better than the legacy CPU-to-GPU memcpy regime by intelligently using both CPU and GPU memory bandwidth, and comes within 28% of oracular page placement, all while maintaining the relaxed memory semantics of modern GPUs.
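To make the abstract's policy idea concrete, the sketch below illustrates migrating a faulting page together with a window of virtually contiguous neighbours (locality-based prefetch) while deliberately leaving a fraction of the footprint in CPU memory (bandwidth balancing), instead of migrating pages only after an access-frequency counter crosses a threshold. This is a minimal illustration, not the authors' runtime: the Runtime class, the migrate_page() hook, and the kPrefetchSpan and kGpuShare constants are all hypothetical stand-ins, assuming only that some driver-level page-migration primitive exists.

```cpp
// Hypothetical sketch of a locality- and bandwidth-aware migration policy.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

constexpr uint64_t kPageSize     = 4096; // assumed OS page size
constexpr int      kPrefetchSpan = 16;   // contiguous pages pulled in per fault (assumption)
constexpr double   kGpuShare     = 0.8;  // fraction of touched pages kept GPU-resident (assumption)

class Runtime {
public:
    // Called when the GPU touches an address that currently lives in CPU memory
    // (e.g., reported by a page fault or sampled from coherent CC-NUMA traffic).
    void on_gpu_access(uint64_t vaddr) {
        const uint64_t vpn = vaddr / kPageSize;
        touched_.insert(vpn);

        // Bandwidth balancing: leave a slice of the footprint in CPU memory so
        // that DDR and GDDR bandwidth are consumed concurrently.
        if (gpu_resident_.size() >= kGpuShare * touched_.size())
            return; // keep serving this page over the coherent CPU link

        // Locality-based prefetch: migrate the faulting page plus a window of
        // virtually contiguous neighbours, rather than waiting for each page's
        // access counter to cross a hotness threshold.
        for (int d = 0; d <= kPrefetchSpan; ++d) {
            const uint64_t candidate = vpn + static_cast<uint64_t>(d);
            if (gpu_resident_.insert(candidate).second)
                migrate_page(candidate);
        }
    }

private:
    // Stand-in for a driver call that moves one page from CPU to GPU memory.
    void migrate_page(uint64_t vpn) {
        std::printf("migrate vpn %llu to GPU memory\n",
                    static_cast<unsigned long long>(vpn));
    }

    std::unordered_set<uint64_t> touched_;      // distinct pages the GPU has accessed
    std::unordered_set<uint64_t> gpu_resident_; // pages already migrated to GPU memory
};

int main() {
    Runtime rt;
    // Simulate a GPU kernel streaming through a 1 MiB buffer at 64-byte granularity.
    for (uint64_t addr = 0; addr < (1u << 20); addr += 64)
        rt.on_gpu_access(addr);
    return 0;
}
```

The split embodied by kGpuShare is the design point the abstract argues for: migrating everything to GPU memory wastes the CPU-side bandwidth that a CC-NUMA link makes available, while migrating nothing leaves GDDR bandwidth idle, so the runtime intentionally serves part of the working set from each memory.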