Towards high performance paged memory for GPUs
Tianhao Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, S. Keckler
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 12, 2016
DOI: 10.1109/HPCA.2016.7446077
Citations: 110
Abstract
Despite industrial investment in both on-die GPUs and next-generation interconnects, the highest-performing parallel accelerators shipping today continue to be discrete GPUs. Connected via PCIe, these GPUs use their own privately managed physical memory, which is optimized for high bandwidth. These separate memories force GPU programmers to manage the movement of data between the CPU and GPU, in addition to the on-chip GPU memory hierarchy. To simplify this process, GPU vendors are developing software runtimes that automatically page memory in and out of the GPU on demand, reducing programmer effort and enabling computation across datasets that exceed the GPU memory capacity. Because this memory migration occurs over a high-latency, low-bandwidth link (compared to GPU memory), these software runtimes may incur significant performance penalties. In this work, we explore the features needed in GPU hardware and software to close the performance gap between GPU paged memory and legacy programmer-directed memory management. Without modifying the GPU execution pipeline, we show it is possible to largely hide the performance overheads of GPU paged memory, converting an average 2× slowdown into a 12% speedup when compared to programmer-directed transfers. Additionally, we examine the performance impact that GPU memory oversubscription has on application run times, enabling application designers to make informed decisions on how to shard their datasets across hosts and GPU instances.
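To make the contrast in the abstract concrete, the sketch below uses CUDA's public Unified Memory API (cudaMallocManaged) as a representative of runtime-managed paged memory, side by side with legacy programmer-directed staging (cudaMalloc + cudaMemcpy). This is a minimal illustration of the two programming models the paper compares, not the paper's own instrumentation or proposed hardware/software changes; kernel name, sizes, and launch geometry are arbitrary choices for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple element-wise kernel used by both allocation strategies below.
__global__ void scale(float *x, float alpha, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

int main() {
    const size_t n = 1 << 24;                 // ~16M floats (64 MB), arbitrary
    const size_t bytes = n * sizeof(float);

    // (a) Legacy programmer-directed management: explicit bulk copies over PCIe.
    float *h = (float *)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // up-front transfer
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // explicit copy back
    cudaFree(d);

    // (b) Runtime-managed paged memory: pages migrate to the GPU on demand
    //     at first touch, no explicit cudaMemcpy calls, and the working set
    //     may exceed GPU memory capacity (at the cost of fault-driven
    //     migration over the PCIe link).
    float *m;
    cudaMallocManaged(&m, bytes);
    for (size_t i = 0; i < n; ++i) m[i] = 1.0f;        // pages resident on the CPU
    scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);       // faults pull pages to the GPU
    cudaDeviceSynchronize();
    printf("%f %f\n", h[0], m[0]);                     // touching m[] migrates pages back

    cudaFree(m);
    free(h);
    return 0;
}
```

In variant (a) the programmer decides when data crosses the PCIe link; in variant (b) the runtime paces migration page by page as faults occur, which is the on-demand paging whose overheads (and possible mitigations) the paper studies.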