H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs

Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI:10.1145/3639038

N. Akbarzadeh, Sina Darabi, A. Gheibi-Fetrat, Amir Mirzaei, Mohammad Sadrosadati, H. Sarbazi-Azad

Journal Article, pages 12:1-12:28


Abstract

Graphics Processing Units (GPUs) are widely used for modern applications with huge data sizes. However, the performance benefit of GPUs is limited by their memory capacity and bandwidth. Although GPU vendors improve memory capacity and bandwidth using 3D memory technology (HBM), many important workloads with terabytes of data still cannot fit in the provided capacity and are bound by the provided bandwidth. With limited GPU memory capacity, programmers must manage data movement between GPU and host memories themselves, which imposes a significant programming burden. To ease programming, GPUs use a unified address space with the host that allows over-subscribing GPU memory, but this approach performs poorly once GPUs encounter memory page faults. Many recent works have tried to remedy the capacity and bandwidth bottlenecks using dense non-volatile memories (NVMs) and true-3D stacking. However, these works mainly focus on one bottleneck or do not provide a scalable solution that fits future requirements. In this paper, we investigate true-3D stacking of dense, low-power, and refresh-free non-volatile phase change memory (PCM) on top of state-of-the-art GPU configurations to provide higher capacity and bandwidth within the available area and power budget. The higher density and lower power consumption of PCM provide higher capacity by integrating more cells in each 3D layer and by enabling more layers to be stacked. However, we observe that stacking more than six layers of pure-PCM memory violates the thermal constraint and severely harms performance and power efficiency due to PCM's higher write latency and energy. Further, it degrades the lifetime of the GPU to less than one year.
Hybrid architectures that leverage the benefits of both DRAM and PCM have been widely studied in prior proposals; however, true-3D integration of such a hybrid memory architecture, especially on top of a state-of-the-art GPU architecture, has not yet been investigated. We experimentally demonstrate that by covering 80% of write requests in DRAM and eliminating refresh overhead, true-3D stacking of eight 32GB layers of PCM along with two 8GB layers of DRAM is possible, resulting in a total memory capacity of 272GB. Based on the explored design requirements, we propose a 3D high-bandwidth high-capacity hybrid memory (H3DM) system that uses a hybrid-3D (H3D)-aware remapping scheme to reduce expensive PCM writes to under 20% while avoiding DRAM refresh overhead. H3DM improves performance by up to 291% compared to the baseline GPU architecture while remaining, on average, within only 3% of an ideal case with DRAM-like access latency. Moreover, when the dataset size exceeds the baseline GPU memory space, H3DM improves performance and power by up to 648% and 87%, respectively, compared to the baseline GPU architecture, since it avoids expensive data transfers through off-chip communication links.
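
The capacity and write-coverage figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the layer counts, layer sizes, and the 80% DRAM write coverage come from the abstract, while the function names and the simple "writes not absorbed by DRAM go to PCM" model are assumptions made here, not the paper's remapping scheme.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Function names and the coverage model are hypothetical, for illustration only.

PCM_LAYERS = 8            # from the abstract
PCM_GB_PER_LAYER = 32     # from the abstract
DRAM_LAYERS = 2           # from the abstract
DRAM_GB_PER_LAYER = 8     # from the abstract


def total_capacity_gb(pcm_layers, pcm_gb, dram_layers, dram_gb):
    """Total stacked capacity in GB: PCM layers plus DRAM layers."""
    return pcm_layers * pcm_gb + dram_layers * dram_gb


def pcm_write_fraction(dram_write_coverage):
    """Fraction of writes reaching PCM, assuming every write not
    absorbed by DRAM goes to PCM (a simplifying assumption)."""
    if not 0.0 <= dram_write_coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return 1.0 - dram_write_coverage


capacity = total_capacity_gb(
    PCM_LAYERS, PCM_GB_PER_LAYER, DRAM_LAYERS, DRAM_GB_PER_LAYER
)
print(f"total capacity: {capacity} GB")  # 8*32 + 2*8 = 272
print(f"PCM write fraction at 80% coverage: {pcm_write_fraction(0.8):.2f}")
```

Under this simple model, 80% DRAM write coverage corresponds to the 20% PCM write share at the boundary of the abstract's "under 20%" figure, with DRAM contributing only 16GB of the 272GB total.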
