Compiler optimizations for improving data locality

ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195557

S. Carr, K. McKinley, C. Tseng


Citations: 332

Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.

To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
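As a rough illustration of why loop permutation matters for cache-line reuse, the sketch below counts misses for two loop orders over a row-major array. It is not the paper's cost model: the cache is a hypothetical, tiny, fully associative LRU cache, and the names `count_misses` and `traversal`, the line size, and the cache size are all assumptions chosen for the example.

```python
from collections import OrderedDict


def count_misses(addresses, line_size=8, num_lines=4):
    """Count misses in a toy fully associative LRU cache (illustrative only)."""
    cache = OrderedDict()  # keys are cache-line numbers, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: mark line as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses


def traversal(n, j_innermost):
    """Yield element addresses of an n x n row-major array A[i][j].

    j_innermost=True walks each row contiguously (stride 1);
    j_innermost=False is the permuted loop nest that walks columns (stride n).
    """
    for outer in range(n):
        for inner in range(n):
            i, j = (outer, inner) if j_innermost else (inner, outer)
            yield i * n + j


N = 32
misses_row_order = count_misses(traversal(N, True))
misses_col_order = count_misses(traversal(N, False))
print(misses_row_order, misses_col_order)
```

With these assumed parameters, the stride-1 order misses once per cache line, while the stride-`n` order evicts each line before it can be reused. A compiler cost model in the spirit of the abstract would prefer the first loop organization.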



