Zhe Wang, Daniel A. Jiménez, Zhang Tao, G. Loh, Yuan Xie
Building a Low Latency, Highly Associative DRAM Cache with the Buffered Way Predictor
2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Published: 2016-10-01
DOI: 10.1109/SBAC-PAD.2016.22 (https://doi.org/10.1109/SBAC-PAD.2016.22)
Citations: 5
Abstract
The emerging die-stacked DRAM technology allows computer architects to design a last-level cache (LLC) with high memory bandwidth and large capacity. There are four key requirements for DRAM cache design: minimizing on-chip tag storage overhead, optimizing access latency, improving hit rate, and reducing off-chip traffic. These requirements seem mutually incompatible. For example, to reduce the tag storage overhead, the recently proposed LH-cache co-locates tags and data in the same DRAM cache row, and the Alloy Cache alloys data and tags in the same cache line in a direct-mapped design. However, these designs either incur significant tag lookup latency or sacrifice hit rate for hit latency. To optimize all four key requirements, we propose the Buffered Way Predictor (BWP). The BWP predicts the way ID of a DRAM cache request with high accuracy and coverage, allowing data and tag to be fetched back to back. Thus, the read latency for the data can be completely hidden, so that requests that hit in the DRAM cache have low access latency. The BWP technique is designed for highly associative block-based DRAM caches and achieves a low miss rate and low off-chip traffic. Our evaluation with multi-programmed workloads and a 128MB DRAM cache shows that a 128KB BWP achieves a 76.2% hit rate. The BWP improves performance by 8.8% and 12.3% compared to LH-cache and Alloy Cache, respectively.
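To illustrate the general idea of way prediction in a set-associative cache, here is a minimal toy sketch. It is not the BWP design from the paper, whose internal organization the abstract does not describe. The class name `ToyWayPredictedCache`, the table sizes, the predictor indexing, and the round-robin replacement are all assumptions made for illustration. The sketch distinguishes hits where the predicted way was correct (tag and data could be fetched together) from hits that need a fallback search of the other ways.

```python
class ToyWayPredictedCache:
    """Toy set-associative cache with a last-hit way predictor (illustrative only)."""

    def __init__(self, num_sets=4, num_ways=4, predictor_entries=16):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.predictor_entries = predictor_entries
        # tags[set][way] holds the tag stored in that way, or None if empty.
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Hypothetical predictor: one predicted way ID per table entry.
        self.predictor = [0] * predictor_entries
        # Round-robin victim pointer per set (an assumed replacement policy).
        self.next_victim = [0] * num_sets

    def access(self, block_addr):
        """Return 'hit_predicted', 'hit_mispredicted', or 'miss'."""
        set_idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        p_idx = block_addr % self.predictor_entries
        ways = self.tags[set_idx]

        # Fast path: check only the predicted way.
        guess = self.predictor[p_idx]
        if ways[guess] == tag:
            return "hit_predicted"

        # Slow path: the block is in another way; retrain the predictor.
        if tag in ways:
            self.predictor[p_idx] = ways.index(tag)
            return "hit_mispredicted"

        # Miss: fill the victim way and train the predictor on it.
        victim = self.next_victim[set_idx]
        ways[victim] = tag
        self.next_victim[set_idx] = (victim + 1) % self.num_ways
        self.predictor[p_idx] = victim
        return "miss"


cache = ToyWayPredictedCache()
results = [cache.access(addr) for addr in (0, 0, 16, 0, 0)]
print(results)
```

In this toy model, block addresses 0 and 16 map to the same set and the same predictor entry, so filling block 16 retrains the shared entry and the next access to block 0 becomes a mispredicted hit. A real predictor would be sized and indexed to keep such conflicts rare.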