Co-optimizing memory-level parallelism and cache-level parallelism

Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. Pub Date: 2019-06-08. DOI: 10.1145/3314221.3314599

Xulong Tang, M. Kandemir, Mustafa Karaköy, Meenakshi Arunachalam

Citations: 10

Abstract

Minimizing cache misses has been the traditional goal in optimizing cache performance using compiler-based techniques. However, continuously increasing dataset sizes, combined with the large numbers of cache banks and memory banks connected by on-chip networks in emerging manycores/accelerators, make optimizing the latencies of cache hits and misses as important as minimizing the cache miss rate. In this paper, we propose compiler support that optimizes both the latencies of last-level cache (LLC) hits and the latencies of LLC misses. Our approach tries to achieve this goal by improving the parallelism exhibited by LLC hits and LLC misses. More specifically, it tries to maximize both cache-level parallelism (CLP) and memory-level parallelism (MLP). This paper presents different incarnations of our approach and evaluates them using a set of 12 multithreaded applications. Our results indicate that (i) optimizing MLP first and CLP later brings, on average, an 11.31% performance improvement over an approach that already minimizes the number of LLC misses, and (ii) optimizing CLP first and MLP later brings a 9.43% performance improvement. In comparison, balancing MLP and CLP brings a 17.32% performance improvement on average.
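
As a toy illustration of the bank-level parallelism the abstract refers to, the sketch below counts how many distinct banks a group of concurrent requests touches. It is not the paper's compiler algorithm. The modulo interleaving of cache lines across banks, the 64-byte line size, the 8-bank configuration, and the names `bank_of` and `bank_parallelism` are all assumptions made for this example; the same counting could be applied to LLC banks (CLP) or memory banks (MLP).

```python
def bank_of(addr, line_bytes=64, num_banks=8):
    """Hypothetical mapping: consecutive cache lines are interleaved across banks."""
    return (addr // line_bytes) % num_banks


def bank_parallelism(addrs, line_bytes=64, num_banks=8):
    """Number of distinct banks touched by a group of concurrent requests."""
    return len({bank_of(a, line_bytes, num_banks) for a in addrs})


# Four concurrent requests with a stride of 8 lines all map to bank 0,
# so they would be served one after another by a single bank.
clustered = [i * 8 * 64 for i in range(4)]

# Four concurrent requests with a stride of 1 line map to banks 0..3,
# so they could be served in parallel.
spread = [i * 64 for i in range(4)]

print(bank_parallelism(clustered))  # 1
print(bank_parallelism(spread))     # 4
```

Under these assumptions, two access patterns with the same number of requests can differ in how many banks serve them at once, which is the kind of difference that would affect hit and miss latencies independently of the miss count.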
