BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches

Chiachen Chou, Aamer Jaleel, Moinuddin K. Qureshi
{"title":"BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches","authors":"Chiachen Chou, A. Jaleel, Moinuddin K. Qureshi","doi":"10.1145/2749469.2750387","DOIUrl":null,"url":null,"abstract":"Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as Alloy Cache, and show that even the most bandwidth-efficient proposal for DRAM cache consumes 3.8x bandwidth compared to an idealized DRAM cache that does not consume any bandwidth for secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"21 1","pages":"198-210"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"82","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2749469.2750387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 82

Abstract

Die-stacking memory technology can enable gigascale DRAM caches that operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, the bandwidth consumed by such secondary operations would be negligible, leaving almost all of the bandwidth available for transferring useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as an Alloy Cache, and show that even the most bandwidth-efficient DRAM cache proposal consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes the Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of the DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.
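
The bandwidth bloat the abstract quantifies can be made concrete with a simple accounting model. The Python sketch below tallies DRAM-cache channel traffic for an Alloy-Cache-style organization in which every lookup streams one tag-and-data (TAD) unit; the line size, tag size, hit rate, and writeback-probe rate used here are illustrative assumptions, not values reported in the paper.

    # Minimal sketch of DRAM-cache bandwidth accounting for an Alloy-Cache-style
    # design. All parameters are illustrative assumptions, not figures from the
    # paper.

    LINE_BYTES = 64   # cache-line payload delivered on a hit
    TAG_BYTES = 8     # assumed tag/metadata streamed alongside each line
    TAD_BYTES = LINE_BYTES + TAG_BYTES  # one tag-and-data (TAD) unit per lookup

    def dram_cache_traffic(accesses, hit_rate, wb_probe_rate):
        """Return (total_bytes, useful_bytes) moved over the DRAM-cache channel.

        accesses      -- demand accesses arriving from the last-level cache
        hit_rate      -- fraction of demand accesses that hit in the DRAM cache
        wb_probe_rate -- writeback probes per demand access (dirty LLC evictions)
        """
        hits = accesses * hit_rate
        misses = accesses - hits

        lookup = accesses * TAD_BYTES      # miss detection: every access reads a TAD
        fill = misses * TAD_BYTES          # miss fill: install the fetched line
        writeback = accesses * wb_probe_rate * 2 * TAD_BYTES  # probe, then update

        total = lookup + fill + writeback
        useful = hits * LINE_BYTES         # data actually delivered to the core
        return total, useful

    total, useful = dram_cache_traffic(accesses=1_000_000,
                                       hit_rate=0.55,
                                       wb_probe_rate=0.25)
    print(f"bandwidth bloat: {total / useful:.1f}x the idealized cache")

With these assumed rates the model lands in the same neighborhood as the 3.8x factor the paper measures. It also shows why BEAR has three components: the lookup, fill, and writeback terms are independent sources of secondary traffic, and each must be attacked separately to approach the idealized cache.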