Fine-Granular Computation and Data Layout Reorganization for Improving Locality

M. Kandemir, Xulong Tang, Jagadish B. Kotra, Mustafa Karaköy
{"title":"Fine-Granular Computation and Data Layout Reorganization for Improving Locality","authors":"M. Kandemir, Xulong Tang, Jagadish B. Kotra, Mustafa Karaköy","doi":"10.1145/3508352.3549386","DOIUrl":null,"url":null,"abstract":"While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes – 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508352.3549386","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), one of the important characteristics of prior approaches is that they transform loop and/or data space (e.g., array layout) as a whole. Unfortunately, such coarse-grain approaches bring three critical issues. First, they implicitly assume that all parts of a given array would equally benefit from the identified data layout transformation. Second, they also assume that a given loop transformation would have the same locality impact on an entire data array. Third and more importantly, such coarse-grain approaches are local by their nature and difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves the data locality and outperforms existing schemes – 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
改进局部性的细粒度计算和数据布局重组
虽然数据局部性和缓存性能已经通过先前的研究(在高端系统和嵌入式/移动系统的背景下)进行了深入的研究,但先前方法的一个重要特征是它们将循环和/或数据空间(例如,数组布局)作为一个整体进行转换。不幸的是,这种粗粒度的方法带来了三个关键问题。首先,它们隐式地假设给定数组的所有部分都将同样受益于标识的数据布局转换。其次,它们还假设给定的循环转换对整个数据数组具有相同的局部性影响。第三,更重要的是,这种粗粒度方法本质上是局部的,难以实现全局最优执行。鉴于现有代码和数据空间重组/优化技术的这些缺陷,本文提出为程序中的每个循环巢确定多个循环转换矩阵,并为程序访问的每个数组确定多个数据布局转换,以尝试在更细的粒度上利用数据局部性。它利用二部图匹配,并将所提出的细粒度集成环路布局策略扩展到多核设置。我们的实验结果表明,所提出的方法显着提高了数据局域性,并且优于现有的方案——与最先进的方案相比,单线程执行的平均性能提高了9.1%,多线程执行的平均性能提高了11.5%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信