Building a Low Latency, Highly Associative DRAM Cache with the Buffered Way Predictor

2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). Pub Date: 2016-10-01. DOI: 10.1109/SBAC-PAD.2016.22

Zhe Wang, Daniel A. Jiménez, Zhang Tao, G. Loh, Yuan Xie


Citations: 5

Abstract

Emerging die-stacked DRAM technology allows computer architects to design a last-level cache (LLC) with high memory bandwidth and large capacity. DRAM cache design has four key requirements: minimizing on-chip tag storage overhead, optimizing access latency, improving hit rate, and reducing off-chip traffic. These requirements seem mutually incompatible. For example, to reduce tag storage overhead, the recently proposed LH-cache co-locates tags and data in the same DRAM cache row, and the Alloy Cache alloys data and tags in the same cache line in a direct-mapped design. However, these designs either incur significant tag lookup latency or sacrifice hit rate for hit latency. To optimize all four key requirements, we propose the Buffered Way Predictor (BWP). The BWP predicts the way ID of a DRAM cache request with high accuracy and coverage, allowing the tag and data to be fetched back to back. The read latency for the data can thus be completely hidden, so requests that hit in the DRAM cache have low access latency. The BWP technique is designed for highly associative block-based DRAM caches and achieves a low miss rate and low off-chip traffic. Our evaluation with multi-programmed workloads and a 128MB DRAM cache shows that a 128KB BWP achieves a 76.2% hit rate. The BWP improves performance by 8.8% and 12.3% compared to LH-cache and Alloy Cache, respectively.
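The latency argument in the abstract can be illustrated with a toy model of generic way prediction. This is a minimal sketch and not the BWP design itself: the abstract does not describe how the predictor is organized, so the class `WayPredictedCache`, the unbounded dictionary used as a predictor, and the cycle counts `TAG_READ`, `DATA_READ`, and `BACK_TO_BACK` are all hypothetical values chosen only to show why a correct way prediction lets the data read overlap with the tag check.

```python
from dataclasses import dataclass, field

# Hypothetical latencies in cycles; illustrative only, not taken from the paper.
TAG_READ = 20      # read the tags of one set from the DRAM cache
DATA_READ = 20     # separate data read issued only after the tag check
BACK_TO_BACK = 4   # extra cycles when data is streamed right after the tags


@dataclass
class WayPredictedCache:
    """Toy set-associative cache with a way predictor (assumed structure)."""
    num_sets: int
    num_ways: int
    # tags[set][way] holds the block address cached there, or None.
    tags: list = field(init=False)
    # Predictor: block address -> last known way (unbounded here for brevity).
    predictor: dict = field(default_factory=dict)

    def __post_init__(self):
        self.tags = [[None] * self.num_ways for _ in range(self.num_sets)]

    def fill(self, block, way):
        """Install a block in the given way and train the predictor."""
        s = block % self.num_sets
        self.tags[s][way] = block
        self.predictor[block] = way

    def access(self, block):
        """Return (hit, latency_in_cycles) for a read of `block`."""
        s = block % self.num_sets
        predicted = self.predictor.get(block)
        actual = self.tags[s].index(block) if block in self.tags[s] else None
        if actual is None:
            # Miss: known once the tags have been checked.
            return False, TAG_READ
        if predicted == actual:
            # Correct prediction: data follows the tags back to back.
            return True, TAG_READ + BACK_TO_BACK
        # No or wrong prediction: serialized tag check, then data read.
        self.predictor[block] = actual
        return True, TAG_READ + DATA_READ


cache = WayPredictedCache(num_sets=4, num_ways=2)
cache.fill(5, 1)
print(cache.access(5))   # predicted hit
cache.predictor.clear()
print(cache.access(5))   # unpredicted hit, retrains the predictor
```

In this model a predicted hit costs the tag read plus a short streaming overhead, while an unpredicted hit pays for the tag read and the data read in sequence. A real predictor has bounded storage (the abstract cites 128KB), so coverage and accuracy determine how often the fast path is taken.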
