Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs

Asit K. Mishra, Xiangyu Dong, Guangyu Sun, Yuan Xie, N. Vijaykrishnan, C. Das
{"title":"Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs","authors":"Asit K. Mishra, Xiangyu Dong, Guangyu Sun, Yuan Xie, N. Vijaykrishnan, C. Das","doi":"10.1145/2000064.2000074","DOIUrl":null,"url":null,"abstract":"Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements to existing on-chip caches or main memories for future multi-core architectures. This is due to the many attractive features these memory technologies posses: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle in their adoption. Previous works have proposed various circuit and architectural level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks for effectively hiding the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STTRAM cache banks that are currently serving long latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.","PeriodicalId":340732,"journal":{"name":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"108","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 38th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2000064.2000074","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 108

Abstract

Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories in future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overheads associated with the write operations of these emerging memories have become a major obstacle to their adoption. Previous works have proposed various circuit- and architectural-level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in a cache architecture built with STT-RAM technology. Our scheme is based on the observation that, instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks to effectively hide the latency. Thus, we prioritize cache accesses to idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router that request accesses to idle banks. Evaluations on a 3D architecture consisting of 64 cores and 64 STT-RAM cache banks show that our proposed approach provides a 14% average IPC improvement for multi-threaded benchmarks, a 19% instruction throughput benefit for multi-programmed workloads, and a 6% latency reduction compared to a recently proposed write buffering mechanism.
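The following is a minimal, simplified sketch of the idea summarized in the abstract, not the authors' implementation: a router keeps an estimated busy-until time for each STT-RAM bank in its logical partition and, when arbitrating, prefers request packets headed to banks that are currently idle, deferring packets whose target bank is still serving a long-latency write. All names (Packet, BankBusyTable, Router) and the latency constants are illustrative assumptions, not values or identifiers from the paper.

```python
# Hypothetical sketch of idle-bank-first router arbitration with per-bank
# busy-time estimation. Latencies are placeholder values, not paper data.
from dataclasses import dataclass
from collections import deque

WRITE_LATENCY = 22   # assumed STT-RAM write latency in cycles (illustrative)
READ_LATENCY = 6     # assumed STT-RAM read latency in cycles (illustrative)

@dataclass
class Packet:
    dest_bank: int
    is_write: bool
    arrival_cycle: int

class BankBusyTable:
    """Per-router estimate of when each bank in its logical partition becomes idle."""
    def __init__(self, num_banks: int):
        self.busy_until = [0] * num_banks

    def is_idle(self, bank: int, now: int) -> bool:
        return self.busy_until[bank] <= now

    def reserve(self, bank: int, now: int, is_write: bool) -> None:
        latency = WRITE_LATENCY if is_write else READ_LATENCY
        self.busy_until[bank] = max(self.busy_until[bank], now) + latency

class Router:
    """Arbitration: packets to idle banks win over packets to write-busy banks."""
    def __init__(self, num_banks: int):
        self.buffer: deque = deque()
        self.banks = BankBusyTable(num_banks)

    def enqueue(self, pkt: Packet) -> None:
        self.buffer.append(pkt)

    def arbitrate(self, now: int):
        # First pass: pick the oldest buffered packet whose destination bank is idle.
        for pkt in list(self.buffer):
            if self.banks.is_idle(pkt.dest_bank, now):
                self.buffer.remove(pkt)
                self.banks.reserve(pkt.dest_bank, now, pkt.is_write)
                return pkt
        # Otherwise fall back to the oldest packet to avoid starving busy-bank requests.
        if self.buffer:
            pkt = self.buffer.popleft()
            self.banks.reserve(pkt.dest_bank, now, pkt.is_write)
            return pkt
        return None

# Example: a write makes bank 0 busy, so a later read to idle bank 1 is
# forwarded ahead of a second request to bank 0.
router = Router(num_banks=4)
router.enqueue(Packet(dest_bank=0, is_write=True, arrival_cycle=0))
router.enqueue(Packet(dest_bank=0, is_write=False, arrival_cycle=1))
router.enqueue(Packet(dest_bank=1, is_write=False, arrival_cycle=2))
print(router.arbitrate(now=0).dest_bank)  # 0 (bank 0 idle, write accepted)
print(router.arbitrate(now=1).dest_bank)  # 1 (bank 0 now write-busy, bank 1 preferred)
```

The fallback to the oldest packet is one simple way to keep delayed requests from starving; the paper's actual prioritization and busy-time estimation logic may differ.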