Address Translation Conscious Caching and Prefetching for High Performance Cache Hierarchy

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Pub Date: 2022-05-01, DOI: 10.1109/ispass55109.2022.00044

Vasudha, Biswabandan Panda


Citations: 2

Abstract

The performance of translation lookaside buffers (TLBs) and on-chip caches plays a crucial role in delivering high performance for memory-intensive applications with irregular memory accesses. Our observations show that, on average, an L2 TLB (STLB) miss for address translation can stall the head of the reorder buffer (ROB) for up to 50 cycles. The corresponding data request, also called the replay load, can stall the head of the ROB for more than 200 cycles. We show that current state-of-the-art mid-level cache (L2C) and last-level cache (LLC) replacement policies do not treat cache blocks holding address translations or replay data differently from other blocks. As a result, these policies fail to reduce the ROB stalls caused by translation and replay data access misses. To improve performance further on top of high-performing cache replacement policies, we propose address-translation-conscious and replay-data-access-conscious cache replacement policies at the L2C and LLC. These enhancements reduce ROB stalls due to STLB misses by 28.76%. We also find that cache blocks storing replay loads are dead (no reuse after insertion), so cache replacement policies alone cannot mitigate the ROB stalls caused by replay data accesses. Hence, we propose a hardware prefetcher, triggered by an address translation hit, that brings in the replay data when an address translation hits at the L2C or LLC. This enhancement reduces ROB stalls due to replay data accesses by 18.5%. For a group of memory-intensive benchmarks with high STLB miss counts, our enhancements improve performance by 5.1% (reducing ROB stall cycles by 46.7%) and by as much as 10.6%, on top of highly competitive state-of-the-art cache replacement policies. Our enhancements do not incur any additional storage overhead. However, they require additional flags passed from the page-table walker to the cache hierarchy.
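The two ideas in the abstract can be illustrated with a toy model. The sketch below is not the authors' design: it assumes a single cache set managed with an SRRIP-style 2-bit re-reference prediction value (RRPV), and it assumes that blocks holding translations are inserted with the highest retention priority while replay data, which the abstract describes as dead after insertion, is inserted with the lowest. The names `Kind`, `CacheSet`, `access`, and the `replay_addr` parameter are hypothetical. In particular, `replay_addr` stands in for whatever information the additional page-table-walker flags would give the cache so that it can form the data address.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    DATA = 0         # ordinary demand access
    TRANSLATION = 1  # page-table entry requested by the page-table walker
    REPLAY = 2       # data load replayed after its STLB miss resolves


MAX_RRPV = 3  # 2-bit RRPV: 0 = keep longest, 3 = evict first

# Assumed insertion priorities; the abstract does not give exact values.
INSERT_RRPV = {
    Kind.TRANSLATION: 0,      # protect translations from eviction
    Kind.REPLAY: MAX_RRPV,    # replay blocks see no reuse, evict first
    Kind.DATA: MAX_RRPV - 1,  # default SRRIP insertion
}


@dataclass
class Block:
    addr: int
    kind: Kind
    rrpv: int


class CacheSet:
    """One set of a set-associative cache with SRRIP-style replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = []

    def find(self, addr, promote=True):
        for b in self.blocks:
            if b.addr == addr:
                if promote:
                    b.rrpv = 0  # hit promotion
                return b
        return None

    def insert(self, addr, kind):
        if len(self.blocks) >= self.ways:
            self._evict()
        self.blocks.append(Block(addr, kind, INSERT_RRPV[kind]))

    def _evict(self):
        # Evict the first block at MAX_RRPV; age all blocks until one exists.
        while True:
            for b in self.blocks:
                if b.rrpv == MAX_RRPV:
                    self.blocks.remove(b)
                    return
            for b in self.blocks:
                b.rrpv += 1


def access(cache_set, addr, kind, replay_addr=None):
    """Access one block; return (hit, list of prefetch addresses issued)."""
    hit = cache_set.find(addr) is not None
    if not hit:
        cache_set.insert(addr, kind)
    prefetches = []
    # Translation-hit-triggered prefetch: when a translation hits in this
    # cache, fetch the data block that the replay load will ask for next.
    if hit and kind is Kind.TRANSLATION and replay_addr is not None:
        if cache_set.find(replay_addr, promote=False) is None:
            cache_set.insert(replay_addr, Kind.REPLAY)
            prefetches.append(replay_addr)
    return hit, prefetches
```

In this model, a translation block survives pressure from ordinary data blocks in a two-way set, and a later hit on that translation pulls in the replay data ahead of the replayed load. A real design would also have to decide what happens on a translation miss and how prefetched replay blocks interact with the underlying replacement policy; the abstract does not give those details.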

