Fast indexing for blocked array layouts to improve multi-level cache locality

Eighth Workshop on Interaction between Compilers and Computer Architectures, 2004. INTERACT-8 2004. Pub Date: 2004-05-24. DOI: 10.1109/INTERA.2004.1299515

Evangelia Athanasaki, N. Koziris


Citations: 13

Abstract

One of the key challenges facing computer architects and compiler writers is the increasing discrepancy between processor cycle times and main memory access times. To overcome this problem, program transformations that decrease cache misses are used to reduce the average latency of memory accesses. Tiling is a widely used loop iteration reordering technique for improving locality of reference. In this paper, we further reduce cache misses by restructuring the memory layout of multi-dimensional arrays that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to translate multi-dimensional array indexing into the blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to Morton-order indexing. Experimental results on three different hardware platforms, using five benchmarks, illustrate that execution time is greatly improved when tiled code is combined with tiled array layouts and binary mask-based index translation functions. Both TLB and L1 cache misses are concurrently minimized for the same tile size; thus, by applying the proposed layouts, locality of reference is greatly improved. Finally, simulations using the SimpleScalar tool verify that the enhanced performance is due to the considerable reduction of cache misses at all levels of the memory hierarchy.
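The abstract does not spell out the index translation itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a square array of side 2^n with square tiles of side 2^t, tiles stored in row-major order, and elements stored row-major within each tile. Under those assumptions the bits of the row and column indices occupy disjoint bit fields of the linear offset, so each index can be kept in "dilated" form and incremented with a single subtract-and-mask step, as in Morton-order indexing. All function names here are hypothetical.

```python
def blocked_offset(i, j, n_bits, t_bits):
    """Reference formula using division and modulo (assumed layout:
    row-major tiles, row-major elements inside each tile)."""
    n, t = 1 << n_bits, 1 << t_bits
    tile_index = (i // t) * (n // t) + (j // t)
    return tile_index * t * t + (i % t) * t + (j % t)


def make_masks(n_bits, t_bits):
    """Bit masks marking where row bits and column bits land in the offset.

    Offset bit fields, from most to least significant:
    [row high bits][column high bits][row low bits][column low bits]
    """
    lo = (1 << t_bits) - 1               # in-tile index bits
    hi = (1 << (n_bits - t_bits)) - 1    # tile-number bits
    col_mask = lo | (hi << (2 * t_bits))
    row_mask = (lo << t_bits) | (hi << (n_bits + t_bits))
    return row_mask, col_mask


def next_masked(x, mask):
    """Increment a dilated integer whose bits live only under `mask`.

    Subtracting the mask fills the gaps with ones so the carry ripples
    across them; the final AND clears the gaps again. The value wraps
    to 0 after the last valid index.
    """
    return (x - mask) & mask


def sweep_offsets(n_bits, t_bits):
    """Yield blocked-layout offsets for (i, j) in ordinary row-major
    loop order, using only masked increments and a bitwise OR."""
    row_mask, col_mask = make_masks(n_bits, t_bits)
    n = 1 << n_bits
    row = 0
    for _ in range(n):
        col = 0
        for _ in range(n):
            yield row | col
            col = next_masked(col, col_mask)
        row = next_masked(row, row_mask)
```

In this sketch the inner loop never divides or takes a remainder: the dilated row and column parts are advanced independently and combined with a bitwise OR, which is the kind of cheap translation the abstract attributes to binary-mask operations. Array sizes that are not powers of two would need padding or a different scheme, which this example does not cover.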



