Bypass and insertion algorithms for exclusive last-level caches

2011 38th Annual International Symposium on Computer Architecture (ISCA) | Pub Date: 2011-06-04 | DOI: 10.1145/2000064.2000075

Jayesh Gaur, Mainak Chaudhuri, S. Subramoney


Citations: 107

Abstract

Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted cache space leads to bigger performance losses compared to exclusive LLCs. However, exclusive LLCs make the design of replacement policies more challenging. While in an inclusive LLC a block can gather a filtered access history, this is not possible in an exclusive design because the block is de-allocated from the LLC on a hit. As a result, the popular least-recently-used replacement policy and its approximations are rendered ineffective and proper choice of insertion ages of cache blocks becomes even more important in exclusive designs. On the other hand, it is not necessary to fill every block into an exclusive LLC. This is known as selective cache bypassing and is not possible to implement in an inclusive LLC because that would violate inclusion. This paper explores insertion and bypass algorithms for exclusive LLCs. Our detailed execution-driven simulation results show that a combination of our best insertion and bypass policies delivers an improvement of up to 61.2% and on average (geometric mean) 3.4% in terms of instructions retired per cycle (IPC) for 97 single-threaded dynamic instruction traces spanning selected SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC compared to a baseline exclusive design in the presence of well-tuned multi-stream hardware prefetchers. The corresponding improvements in throughput for 35 4-way multi-programmed workloads running with an 8 MB 16-way shared exclusive LLC are 20.6% (maximum) and 2.5% (geometric mean).
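The mechanics the abstract describes (a block leaves an exclusive LLC on a hit, each fill picks an insertion age, and a fill can be skipped entirely) can be illustrated with a small sketch. This is not the paper's algorithm: the class `ExclusiveLLCSet`, the 2-bit age limit `MAX_AGE`, the oldest-block victim rule, and the caller-supplied `insertion_age` and `bypass` arguments are all hypothetical choices made for illustration. The abstract does not say how the paper's policies pick ages or decide when to bypass.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_AGE = 3  # assumption: 2-bit ages, and a block at MAX_AGE is the victim


@dataclass
class ExclusiveLLCSet:
    """One set of a hypothetical exclusive LLC (illustrative only)."""
    ways: int
    ages: Dict[int, int] = field(default_factory=dict)  # block address -> age

    def lookup(self, addr: int) -> bool:
        """On a hit the block moves to the inner cache and is de-allocated
        here, so it cannot accumulate any access history in the LLC."""
        if addr in self.ages:
            del self.ages[addr]
            return True
        return False

    def fill(self, addr: int, insertion_age: int,
             bypass: bool = False) -> Optional[int]:
        """Called when the inner cache evicts `addr`.

        Returns the address of the LLC block evicted to make room, or None.
        With bypass=True the block is simply not allocated, which is legal
        only because the LLC does not have to maintain inclusion.
        """
        if bypass:
            return None
        victim = None
        if addr not in self.ages and len(self.ages) >= self.ways:
            # Age every block by the same amount until the oldest reaches
            # MAX_AGE, then evict the first block found at MAX_AGE.
            bump = MAX_AGE - max(self.ages.values())
            for a in self.ages:
                self.ages[a] += bump
            victim = next(a for a, age in self.ages.items() if age == MAX_AGE)
            del self.ages[victim]
        # The insertion age is the only ordering information the block gets.
        self.ages[addr] = insertion_age
        return victim


if __name__ == "__main__":
    s = ExclusiveLLCSet(ways=2)
    s.fill(0xA, insertion_age=0)               # expected to be reused soon
    s.fill(0xB, insertion_age=2)               # expected to be reused late
    s.fill(0xC, insertion_age=0, bypass=True)  # never allocated
    print(hex(s.fill(0xD, insertion_age=1)))   # evicts the oldest block
    print(s.lookup(0xA), s.lookup(0xA))        # hit, then miss
```

Because a hit removes the block, the age assigned at fill time is the only signal the replacement logic ever sees for that residency, which is why the abstract singles out insertion ages as the critical choice in exclusive designs.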

