Cache Accurate Time Skewing in Iterative Stencil Computations

R. Strzodka, Mohammed Shaheen, Dawid Pajak, H. Seidel
Published in: 2011 International Conference on Parallel Processing, 2011-09-13
DOI: 10.1109/ICPP.2011.47 (https://doi.org/10.1109/ICPP.2011.47)
Citations: 74

Abstract

We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-the-art automatic locality optimizer PluTo achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52% of the stencil peak benchmark). We present results for 2D and 3D domains in double precision, including problems with gigabyte-sized data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark, and results from the literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50% of the stencil peak, thus performing within a factor of two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again; however, our algorithm still gains a large improvement over the other schemes.
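To make the idea of time skewing concrete, the following is a minimal 1D sketch of the general technique, not the cache-accurate algorithm of the paper. It assumes a 3-point averaging stencil (slope one) with fixed boundary values and two ping-pong buffers. The function names `naive_sweeps` and `skewed_sweeps` and the parameter `tile` are hypothetical and chosen only for illustration. The naive version streams the whole array through memory once per time step. The skewed version walks tiles in the skewed coordinate s = x + t and advances each tile through all time steps before moving on, so values are reused while they are still close to the processor.

```python
def naive_sweeps(u0, steps):
    """Reference: one full pass over the array per time step."""
    n = len(u0)
    src, dst = list(u0), list(u0)  # boundaries u[0], u[n-1] stay fixed
    for _ in range(steps):
        for x in range(1, n - 1):
            dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
        src, dst = dst, src
    return src


def skewed_sweeps(u0, steps, tile):
    """Time-skewed traversal (illustrative assumption, 1D only).

    Step t at position x is assigned the skewed coordinate s = x + t.
    Every value read by a point with coordinate s was produced at a
    coordinate <= s in an earlier step, so tiles can be processed in
    increasing s, and inside each tile in increasing t.
    """
    n = len(u0)
    bufs = [list(u0), list(u0)]  # ping-pong buffers, level t lives in bufs[t % 2]
    # s runs from 1 (x=1, t=0) to n + steps - 3 (x=n-2, t=steps-1).
    for s0 in range(1, n + steps - 2, tile):
        for t in range(steps):
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            lo = max(1, s0 - t)                # tile slides left as t grows
            hi = min(n - 1, s0 + tile - t)
            for x in range(lo, hi):
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return bufs[steps % 2]


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(40)]
    ref = naive_sweeps(u0, 9)
    out = skewed_sweeps(u0, 9, 5)
    print(max(abs(a - b) for a, b in zip(ref, out)))
```

In this sketch the working set of one tile is roughly `tile + steps` elements per buffer, which is the quantity one would size against the cache. Choosing that size accurately in 2D and 3D, and running tiles in parallel, is the subject of the paper and is not modeled here.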