Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide-Basilio Bartolini, Onur Mutlu
{"title":"受害者:通过利用未充分利用的缓存资源大幅增加地址转换范围","authors":"Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide-Basilio Bartolini, Onur Mutlu","doi":"arxiv-2310.04158","DOIUrl":null,"url":null,"abstract":"Address translation is a performance bottleneck in data-intensive workloads\ndue to large datasets and irregular access patterns that lead to frequent\nhigh-latency page table walks (PTWs). PTWs can be reduced by using (i) large\nhardware TLBs or (ii) large software-managed TLBs. Unfortunately, both\nsolutions have significant drawbacks: increased access latency, power and area\n(for hardware TLBs), and costly memory accesses, the need for large contiguous\nmemory blocks, and complex OS modifications (for software-managed TLBs). We\npresent Victima, a new software-transparent mechanism that drastically\nincreases the translation reach of the processor by leveraging the\nunderutilized resources of the cache hierarchy. The key idea of Victima is to\nrepurpose L2 cache blocks to store clusters of TLB entries, thereby providing\nan additional low-latency and high-capacity component that backs up the\nlast-level TLB and thus reduces PTWs. Victima has two main components. First, a\nPTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on\nthe frequency and cost of the PTWs they lead to. Second, a TLB-aware cache\nreplacement policy prioritizes keeping TLB entries in the cache hierarchy by\nconsidering (i) the translation pressure (e.g., last-level TLB miss rate) and\n(ii) the reuse characteristics of the TLB entries. Our evaluation results show\nthat in native (virtualized) execution environments Victima improves average\nend-to-end application performance by 7.4% (28.7%) over the baseline four-level\nradix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art\nsoftware-managed TLB, across 11 diverse data-intensive workloads. 
Victima (i)\nis effective in both native and virtualized environments, (ii) is completely\ntransparent to application and system software, and (iii) incurs very small\narea and power overheads on a modern high-end CPU.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"24 6","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources\",\"authors\":\"Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide-Basilio Bartolini, Onur Mutlu\",\"doi\":\"arxiv-2310.04158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Address translation is a performance bottleneck in data-intensive workloads\\ndue to large datasets and irregular access patterns that lead to frequent\\nhigh-latency page table walks (PTWs). PTWs can be reduced by using (i) large\\nhardware TLBs or (ii) large software-managed TLBs. Unfortunately, both\\nsolutions have significant drawbacks: increased access latency, power and area\\n(for hardware TLBs), and costly memory accesses, the need for large contiguous\\nmemory blocks, and complex OS modifications (for software-managed TLBs). We\\npresent Victima, a new software-transparent mechanism that drastically\\nincreases the translation reach of the processor by leveraging the\\nunderutilized resources of the cache hierarchy. The key idea of Victima is to\\nrepurpose L2 cache blocks to store clusters of TLB entries, thereby providing\\nan additional low-latency and high-capacity component that backs up the\\nlast-level TLB and thus reduces PTWs. Victima has two main components. First, a\\nPTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on\\nthe frequency and cost of the PTWs they lead to. 
Second, a TLB-aware cache\\nreplacement policy prioritizes keeping TLB entries in the cache hierarchy by\\nconsidering (i) the translation pressure (e.g., last-level TLB miss rate) and\\n(ii) the reuse characteristics of the TLB entries. Our evaluation results show\\nthat in native (virtualized) execution environments Victima improves average\\nend-to-end application performance by 7.4% (28.7%) over the baseline four-level\\nradix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art\\nsoftware-managed TLB, across 11 diverse data-intensive workloads. Victima (i)\\nis effective in both native and virtualized environments, (ii) is completely\\ntransparent to application and system software, and (iii) incurs very small\\narea and power overheads on a modern high-end CPU.\",\"PeriodicalId\":501333,\"journal\":{\"name\":\"arXiv - CS - Operating Systems\",\"volume\":\"24 6\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2310.04158\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2310.04158","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power, and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.
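The two components described above can be illustrated with a minimal software sketch. This is not the paper's hardware design: all thresholds, counter sizes, and helper names below are hypothetical, chosen only to show the decision logic the abstract describes (flagging pages whose PTWs are both frequent and slow, and biasing eviction away from cached TLB-entry clusters when last-level TLB pressure is high).

```python
# Illustrative sketch of Victima's two mechanisms, per the abstract.
# All thresholds and names are hypothetical, not from the paper.
from dataclasses import dataclass

COSTLY_FREQ_THRESHOLD = 4      # hypothetical: PTWs seen before a page counts as "hot"
COSTLY_CYCLES_THRESHOLD = 100  # hypothetical: average PTW latency treated as expensive
HIGH_TLB_PRESSURE = 0.10       # hypothetical: last-level TLB miss rate treated as high

@dataclass
class PTWStats:
    """Per-page bookkeeping used by the PTW cost predictor (PTW-CP)."""
    walks: int = 0
    total_cycles: int = 0

class PTWCostPredictor:
    """Flags addresses whose page table walks are both frequent and slow."""
    def __init__(self):
        self.stats = {}

    def record_walk(self, vpn: int, cycles: int) -> None:
        s = self.stats.setdefault(vpn, PTWStats())
        s.walks += 1
        s.total_cycles += cycles

    def is_costly(self, vpn: int) -> bool:
        s = self.stats.get(vpn)
        if s is None or s.walks < COSTLY_FREQ_THRESHOLD:
            return False
        return s.total_cycles / s.walks >= COSTLY_CYCLES_THRESHOLD

def evict_victim(blocks, llt_miss_rate: float):
    """TLB-aware replacement: under high translation pressure, prefer to
    evict ordinary data blocks over blocks holding TLB-entry clusters,
    using least-recently-used order within the chosen class.
    blocks: list of (is_tlb_cluster: bool, last_use_tick: int)."""
    if llt_miss_rate >= HIGH_TLB_PRESSURE:
        data_blocks = [b for b in blocks if not b[0]]
        candidates = data_blocks if data_blocks else blocks
    else:
        candidates = blocks
    return min(candidates, key=lambda b: b[1])
```

The sketch captures the abstract's two inputs to the replacement decision (translation pressure and reuse, modeled here as LRU ticks); the actual predictor features and policy details are specified in the paper.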