Improving cache locality with blocked array layouts

12th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2004. Proceedings. Pub Date: 2004-03-08. DOI: 10.1109/EMPDP.2004.1271460

Evangelia Athanasaki, N. Koziris

Citations: 4

Abstract

Minimizing cache misses is one of the most important factors in reducing the average latency of memory accesses. Tiled codes modify the instruction stream to exploit cache locality for array accesses. Here, we further reduce cache misses by restructuring the memory layout of multidimensional arrays that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to translate multidimensional array indices into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to Morton-order indexing. Experimental results, using matrix multiplication and LU decomposition on arrays of various sizes, show that execution time is greatly improved when tiled code is combined with tiled array layouts and binary-mask-based index translation functions. Simulations using the SimpleScalar tool verify that the enhanced performance is due to a considerable reduction in total cache misses.
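The index translation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`dilate`, `morton_index`, `dilated_increment`, `blocked_index`), the 16-bit index width, the power-of-two array and tile sizes, and the row-major ordering of tiles and of elements inside a tile are all assumptions made for illustration. The sketch shows how a dilated integer is built with shifts and masks, how two dilated indices interleave into a Morton-order offset, and how a blocked (tiled) offset can be computed from `(i, j)` with masks and shifts alone.

```python
EVEN_MASK = 0x55555555  # bit positions 0, 2, 4, ... hold a dilated integer
ODD_MASK = 0xAAAAAAAA   # the gaps between those positions


def dilate(x: int) -> int:
    """Spread the low 16 bits of x so that bit k moves to bit 2k."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def morton_index(i: int, j: int) -> int:
    """Interleave the bits of row i (odd positions) and column j (even positions)."""
    return (dilate(i) << 1) | dilate(j)


def dilated_increment(d: int) -> int:
    """Add 1 to a dilated integer without un-dilating it.

    Filling the gap bits with ones lets the carry ripple across them;
    the final mask clears the gaps again.
    """
    return ((d | ODD_MASK) + 1) & EVEN_MASK


def blocked_index(i: int, j: int, n: int, tile_log2: int) -> int:
    """Offset of element (i, j) in an n x n array stored tile by tile.

    Assumes n and the tile edge (1 << tile_log2) are powers of two,
    tiles are laid out in row-major order, and elements inside a tile
    are also row-major.
    """
    mask = (1 << tile_log2) - 1          # selects the position inside a tile
    tiles_per_row = n >> tile_log2
    tile_number = (i >> tile_log2) * tiles_per_row + (j >> tile_log2)
    within_tile = ((i & mask) << tile_log2) | (j & mask)
    return (tile_number << (2 * tile_log2)) | within_tile
```

With 4 x 4 tiles in an 8 x 8 array, `blocked_index(0, 4, 8, 2)` lands at the start of the second tile, so the 16 elements a tiled loop nest touches together occupy one contiguous run of memory. `dilated_increment` hints at why the dilated-integer algebra is cheap inside loops: the loop counter can stay in dilated form and advance with one OR, one add, and one AND.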
