Integrating 3D Resistive Memory Cache into GPGPU for Energy-Efficient Data Processing

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.60

Jie Zhang, D. Donofrio, J. Shalf, Myoungsoo Jung


Citation count: 1

Abstract

General-purpose graphics processing units (GPGPUs) have become a promising solution for processing massive data by taking advantage of multithreading. Thanks to thread-level parallelism, GPU-accelerated applications improve overall system performance by up to 40 times compared to a CPU-only architecture. However, data-intensive GPU applications often generate a large number of irregular data accesses, which result in cache thrashing and contention problems. Cache thrashing in turn can introduce a large number of off-chip memory accesses, which not only waste tremendous energy moving data between the on-chip cache and off-chip global memory, but also significantly limit system performance because many load/store instructions stall. In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: one employs a 2D planar resistive random-access memory (RRAM) as our baseline NVM-cache, and the other employs a 3D-stacked RRAM technology. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area; a memory die consists of eight subarrays, each of which is built from small memristor islands organized as 512x512 matrices. Since the cell area of SRAM is around 125 F^2 (while that of RRAM is around 4 F^2, where F is the feature size), the RRAM-based cache can offer around 30x larger storage capacity than the SRAM-based cache. To make our baseline NVM-cache denser, we propose a 3D-stacked NVM-cache, which stacks four memory layers, each with a single pre-decode logic.
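The density argument in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two cell areas come from the abstract, while the helper names, the 1024 KB SRAM baseline, and the assumption that capacity scales linearly with cell density and with the number of stacked layers (ignoring decoders and other peripheral circuitry) are hypothetical and not taken from the paper.

```python
# Cell areas in units of F^2 (F = process feature size), as quoted in the abstract.
SRAM_CELL_AREA_F2 = 125.0
RRAM_CELL_AREA_F2 = 4.0


def density_ratio(sram_cell_area: float = SRAM_CELL_AREA_F2,
                  rram_cell_area: float = RRAM_CELL_AREA_F2) -> float:
    """Cells per unit area of RRAM relative to SRAM (cell arrays only)."""
    return sram_cell_area / rram_cell_area


def rram_capacity_kb(sram_capacity_kb: float, layers: int = 1) -> float:
    """Idealized RRAM capacity in the same footprint as an SRAM cache.

    Assumption (not from the paper): capacity scales linearly with cell
    density and with the number of stacked layers, with no peripheral
    overhead.
    """
    return sram_capacity_kb * density_ratio() * layers


if __name__ == "__main__":
    # Hypothetical 1024 KB SRAM baseline, chosen only for illustration.
    print(density_ratio())             # 31.25, i.e. "around 30x"
    print(rram_capacity_kb(1024))      # 32000.0 KB for a planar die
    print(rram_capacity_kb(1024, 4))   # 128000.0 KB for four ideal layers
```

The ratio 125 / 4 = 31.25 matches the "around 30x" figure in the abstract; the four-layer number is an upper bound under the stated idealization, not a result reported by the authors.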


