用缓存命中率换取内存性能

2014 23rd International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2014-08-24 DOI:10.1145/2628071.2628082

W. Ding, M. Kandemir, D. Guttman, Adwait Jog, C. Das, Praveen Yedlapalli

{"title":"用缓存命中率换取内存性能","authors":"W. Ding, M. Kandemir, D. Guttman, Adwait Jog, C. Das, Praveen Yedlapalli","doi":"10.1145/2628071.2628082","DOIUrl":null,"url":null,"abstract":"Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Trading cache hit rate for memory performance\",\"authors\":\"W. Ding, M. Kandemir, D. Guttman, Adwait Jog, C. Das, Praveen Yedlapalli\",\"doi\":\"10.1145/2628071.2628082\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.\",\"PeriodicalId\":263670,\"journal\":{\"name\":\"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2628071.2628082\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2628071.2628082","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

以往大多数基于编译器的数据局部性优化工作只针对缓存局部性优化，而DRAM组中的行缓冲局部性很少受到关注。特别是，据我们所知，在执行不规则应用程序时，没有一种基于编译器的方法可以改善行缓冲区局部性。考虑到以高效的方式执行不规则应用程序将是从新兴的多核机器和百亿亿级系统中获取最大利益的关键要求，这就提出了一个关键问题。在这些观察的激励下，本文做出了以下贡献。首先，它提出了一种编译器-运行时协同数据布局优化方法，该方法将已经针对缓存局部性进行了优化的不规则程序作为输入，并生成具有相同缓存性能但更好的行缓冲区局部性(更少的行缓冲区未命中数)的输出代码。其次，它讨论了一种更激进的策略，它牺牲了一些缓存性能，以进一步提高行缓冲区性能(即，它以缓存性能换取内存系统性能)。此策略的最终目标是在缓存性能和行缓冲区性能之间找到合适的折衷点，从而提高整个应用程序的性能。第三，本文使用基于AMD Opteron的多核系统和多核模拟器对这两种方法进行了详细的评估。使用五个实际不规则应用程序收集的实验结果表明:(i)传统的缓存优化并不能显著提高行缓冲局部性;(ii)我们的第一种方法通过保持缓存丢失的数量与缓存优化代码相同，但减少行缓冲区丢失的数量，实现了大约9.8%的执行时间改进;(iii)我们的第二种方法通过牺牲缓存性能来获得额外的内存性能，实现了更高的执行时间改进(平均13.8%)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Trading cache hit rate for memory performance

Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 23rd International Conference on Parallel Architecture and Compilation (PACT)

自引率

0.00%

发文量