CPpf: a prefetch aware LLC partitioning approach

Proceedings of the 48th International Conference on Parallel Processing. Pub Date: 2019-08-05. DOI: 10.1145/3337821.3337895

Jun Xiao, A. Pimentel, Xu Liu


Citations: 6

Abstract

Hardware cache prefetching is deployed in modern multicore processors to reduce memory latencies, addressing the memory wall problem. However, it tends to increase Last Level Cache (LLC) contention among applications in multiprogrammed workloads, degrading overall system performance. To study the interaction between hardware prefetching and LLC management, we first analyze how application performance varies with the effective LLC space, both with and without hardware prefetching. We observe that hardware prefetching can compensate for the application performance loss caused by reduced effective cache space. Motivated by this observation, we classify applications into two categories, prefetching-sensitive (PS) and non-prefetching-sensitive (NPS), by the degree of performance benefit they gain from hardware prefetchers. To address cache contention and to mitigate potential prefetch-related cache interference, we propose CPpf, a cache partitioning approach that improves shared cache management in the presence of hardware prefetching. CPpf consists of a method that uses Precise Event-Based Sampling techniques for the online classification of PS and NPS applications, and a cache partitioning scheme that uses Cache Allocation Technology to distribute the cache space among PS and NPS applications. We implemented CPpf as a user-level runtime system on Linux. Compared with a non-partitioning approach, CPpf achieves speedups of up to 1.20, 1.08 and 1.06 for workloads with 2, 4 and 8 single-threaded applications, respectively. Moreover, it achieves speedups of up to 1.22 and 1.11 for workloads composed of two applications with 4 threads and 8 threads, respectively.
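The two-step idea in the abstract (classify applications as PS or NPS, then divide LLC ways between the two groups) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the IPC-ratio criterion and its 1.10 threshold stand in for the Precise Event-Based Sampling classification, the policy of giving PS applications a small shared partition is hypothetical, and the 20-way LLC, the `AppSample` type, and the function names are invented. The only hardware constraint relied on is that Cache Allocation Technology capacity bitmasks consist of contiguous set bits.

```python
from dataclasses import dataclass


@dataclass
class AppSample:
    name: str
    ipc_prefetch_on: float   # IPC measured with hardware prefetchers enabled
    ipc_prefetch_off: float  # IPC measured with hardware prefetchers disabled


def classify(app: AppSample, threshold: float = 1.10) -> str:
    """Return "PS" if prefetching speeds the app up by at least `threshold`x.

    Hypothetical criterion; the paper classifies online using PEBS.
    """
    speedup = app.ipc_prefetch_on / app.ipc_prefetch_off
    return "PS" if speedup >= threshold else "NPS"


def contiguous_mask(first_way: int, num_ways: int) -> int:
    """Build a capacity bitmask with `num_ways` consecutive bits set."""
    return ((1 << num_ways) - 1) << first_way


def partition(apps, total_ways: int = 20, ps_ways: int = 4):
    """Map each app name to a way bitmask.

    Assumed policy: PS apps share the low `ps_ways` ways, NPS apps share the rest.
    """
    ps_mask = contiguous_mask(0, ps_ways)
    nps_mask = contiguous_mask(ps_ways, total_ways - ps_ways)
    return {a.name: (ps_mask if classify(a) == "PS" else nps_mask) for a in apps}


apps = [
    AppSample("stream_like", ipc_prefetch_on=1.5, ipc_prefetch_off=1.0),
    AppSample("pointer_chase", ipc_prefetch_on=0.8, ipc_prefetch_off=0.8),
]
masks = partition(apps)
for name, mask in masks.items():
    print(f"{name}: {mask:#07x}")
```

The sketch keeps the two partitions disjoint so that prefetch traffic from the PS group cannot evict lines belonging to the NPS group. A real runtime would also have to apply the masks to the hardware and re-classify as application phases change, which this example does not attempt.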

