Cache Accurate Time Skewing in Iterative Stencil Computations

2011 International Conference on Parallel Processing. Pub Date: 2011-09-13. DOI: 10.1109/ICPP.2011.47

R. Strzodka, Mohammed Shaheen, Dawid Pajak, H. Seidel


Citations: 74

Abstract

We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-the-art automatic locality optimizer PluTo achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52% of the stencil peak benchmark). We present results for 2D and 3D domains in double precision, including problems with gigabyte-sized data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark, and results from the literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50% of the stencil peak, thus performing within a factor of two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again; however, our algorithm still achieves a large improvement over the other schemes.
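To make the idea of time skewing concrete, here is a minimal sketch in Python for a 1D 3-point constant stencil of slope one. It is a textbook form of time skewing, not the authors' cache-accurate algorithm: the function names, the weights `w`, and the `tile` parameter are assumptions chosen for illustration, and the sketch ignores vectorization, parallelism, and the choice of tile size relative to the cache. The naive version streams the whole array through memory once per time step; the skewed version advances one tile through all time steps before moving on, shifting the tile one cell to the left per step so that every value it reads has already been computed.

```python
def naive_sweeps(u, steps, w=(0.25, 0.5, 0.25)):
    """Apply `steps` Jacobi sweeps of a 3-point stencil; the two end values stay fixed."""
    old, new = list(u), list(u)
    for _ in range(steps):
        # One full pass over the domain per time step.
        for i in range(1, len(u) - 1):
            new[i] = w[0] * old[i - 1] + w[1] * old[i] + w[2] * old[i + 1]
        old, new = new, old
    return old


def time_skewed_sweeps(u, steps, tile=4, w=(0.25, 0.5, 0.25)):
    """Same result as naive_sweeps, but computed tile by tile across all time steps."""
    n = len(u)
    buf = [list(u), list(u)]  # buf[t % 2] holds time level t
    # x0 is the tile start in skewed coordinates (i + t); tiles run left to right.
    for x0 in range(1, n - 1 + steps, tile):
        for t in range(steps):
            # Slope one: the tile moves one cell to the left per time step.
            lo = max(1, x0 - t)
            hi = min(n - 1, x0 + tile - t)
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            for i in range(lo, hi):
                dst[i] = w[0] * src[i - 1] + w[1] * src[i] + w[2] * src[i + 1]
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(37)]
    # Both orderings perform identical arithmetic per point, so results match exactly.
    print(time_skewed_sweeps(u0, 9, tile=5) == naive_sweeps(u0, 9))
```

In a compiled implementation the benefit comes from the tile staying in cache while it is advanced through many time steps, which cuts the traffic to main memory; the Python version only demonstrates that the reordered traversal respects the data dependencies.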

