Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

arXiv - CS - Operating Systems Pub Date : 2023-10-06 DOI:arxiv-2310.04158

Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide-Basilio Bartolini, Onur Mutlu


Abstract

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.
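The lookup order described in the abstract (last-level TLB first, then TLB-entry clusters held in L2 cache blocks, then a page table walk) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class `ToyTranslationPath`, the 4 KiB page size, the 8-entry cluster size, the per-cluster walk counter standing in for PTW-CP, and the plain LRU eviction standing in for the TLB-aware replacement policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 12      # assumption: 4 KiB pages
CLUSTER_SIZE = 8     # assumption: 8 translations fit in one cache block


class ToyTranslationPath:
    """Toy model: last-level TLB -> TLB clusters in L2 cache -> page table walk."""

    def __init__(self, page_table, tlb_entries=4, cluster_slots=2, cost_threshold=2):
        self.page_table = page_table      # dict: virtual page number -> physical frame number
        self.tlb = OrderedDict()          # LRU-ordered last-level TLB: vpn -> pfn
        self.tlb_entries = tlb_entries
        self.clusters = OrderedDict()     # cluster tag -> {vpn: pfn}; stands in for repurposed L2 blocks
        self.cluster_slots = cluster_slots
        self.walk_counts = {}             # per-cluster walk counter (hypothetical stand-in for PTW-CP)
        self.cost_threshold = cost_threshold
        self.stats = {"tlb_hit": 0, "cluster_hit": 0, "ptw": 0}

    def _fill_tlb(self, vpn, pfn):
        self.tlb[vpn] = pfn
        self.tlb.move_to_end(vpn)
        if len(self.tlb) > self.tlb_entries:
            self.tlb.popitem(last=False)  # evict the least recently used entry

    def _install_cluster(self, tag):
        # Copy the translations of one aligned group of pages into a "cache block".
        base = tag * CLUSTER_SIZE
        self.clusters[tag] = {
            v: self.page_table[v]
            for v in range(base, base + CLUSTER_SIZE)
            if v in self.page_table
        }
        self.clusters.move_to_end(tag)
        if len(self.clusters) > self.cluster_slots:
            self.clusters.popitem(last=False)  # plain LRU, not the paper's policy

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        tag = vpn // CLUSTER_SIZE

        if vpn in self.tlb:
            self.stats["tlb_hit"] += 1
            self.tlb.move_to_end(vpn)
            pfn = self.tlb[vpn]
        elif tag in self.clusters and vpn in self.clusters[tag]:
            # TLB miss served by a cluster stored in the cache: no walk needed.
            self.stats["cluster_hit"] += 1
            self.clusters.move_to_end(tag)
            pfn = self.clusters[tag][vpn]
            self._fill_tlb(vpn, pfn)
        else:
            # Full page table walk (modeled as a dictionary lookup).
            self.stats["ptw"] += 1
            pfn = self.page_table[vpn]
            self._fill_tlb(vpn, pfn)
            self.walk_counts[tag] = self.walk_counts.get(tag, 0) + 1
            if self.walk_counts[tag] >= self.cost_threshold:
                self._install_cluster(tag)
        return (pfn << PAGE_SHIFT) | offset


if __name__ == "__main__":
    table = {v: v + 100 for v in range(64)}
    path = ToyTranslationPath(table, tlb_entries=2)
    for page in (0, 1, 2, 0, 0):
        path.translate(page << PAGE_SHIFT)
    print(path.stats)
```

In this model, a region is promoted into the cluster store only after it has caused repeated walks, so later TLB misses to the same region are served from the cluster store instead of triggering another walk. The real PTW-CP weighs both the frequency and the cost of walks, and the real replacement policy reacts to translation pressure and TLB-entry reuse; neither is captured by the simple counter and LRU used here.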
