TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes
Sanyam Mehta, R. Garg, Nishad Trivedi, P. Yew
Proceedings of the 2016 International Conference on Supercomputing, June 2016
DOI: 10.1145/2925426.2926288
Citations: 17
Abstract
Loop tiling, or blocking, improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this improves reuse, it also has an often-ignored side-effect: it breaks the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance. In this work, we propose a tiling algorithm that leverages prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache, as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable to prefetching, but also reduces the off-chip traffic to memory (and therefore scales better with the number of cores). As a result, although we tile for the last-level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To this end, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache of modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives a significant improvement over existing tiling algorithms that target the higher-level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
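The idea the abstract describes can be illustrated with a minimal sketch (not the paper's actual algorithm): in a tiled matrix multiply, blocking only the i and k loops while leaving the innermost j loop untiled keeps a long unit-stride sweep over each row of B and C, the kind of streaming access a hardware prefetcher can detect. The sizes N and T below are hypothetical placeholders; in the paper's scheme the tile size would be chosen to fit the shared last-level cache.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sizes only: N is the problem size, T stands in for a tile
   edge sized for the last-level cache rather than for L1. */
#define N 64
#define T 16

/* Reference: untiled triple loop. */
static void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Tiled version: i and k are blocked for reuse of B's tile, but the
   innermost j loop stays a full-length, unit-stride sweep, preserving
   the streaming pattern that hardware prefetchers latch onto. */
static void matmul_tiled(const double *A, const double *B, double *C) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int i = ii; i < ii + T; i++)
                for (int k = kk; k < kk + T; k++) {
                    double a = A[i * N + k];
                    for (int j = 0; j < N; j++)  /* unit stride: prefetch-friendly */
                        C[i * N + j] += a * B[k * N + j];
                }
}
```

Both versions accumulate each C[i][j] over k in the same order, so they produce identical results; only the traversal of tiles (and hence the cache behavior) differs.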