A practical tile size selection model for affine loop nests

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Pub Date: 2021-06-03

DOI: 10.1145/3447818.3462213

Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula

Citations: 6

Abstract

A practical tile size selection model for affine loop nests

Loop tiling for locality is an important transformation in general-purpose and domain-specific compilation because it allows programs to exploit deep memory hierarchies. Most code generation tools that can automatically tile loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either resort to modeling complex non-linear optimization problems or handle only a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable, one that can be adopted into compiler infrastructures such as GCC, LLVM, or MLIR. In this paper, we propose a new, fast, and lightweight tile size selection model that considers temporal and spatial reuse along the dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a single-variable polynomial of degree at most n. Our tile size calculation model also accounts for the vectorizability of the innermost dimension. We demonstrate the generality of our approach on benchmarks from several domains: linear algebra kernels, digital signal processing (DSP), and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (a state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to the DSP and linear algebra domains and to incorporate idiom recognition phases, so that optimized vendor-specific library implementations can be used whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over MATLAB on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.
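The abstract does not spell out the polynomial itself, so the following Python sketch only illustrates the general idea of deriving tile sizes from the zero of a single-variable polynomial. Everything specific in it is an assumption made for illustration and is not taken from the paper: the matrix-multiplication footprint model, the per-dimension `ratios`, the 32 KiB cache size, the element size, the vector width, and all function names are hypothetical. Each tile size is taken to be `ratio * t`, the tile's data footprint is then a polynomial in `t`, and `t` is chosen so that the footprint just fits the cache. The innermost tile size is rounded down to a multiple of the vector width as a stand-in for the vectorizability consideration.

```python
from typing import List, Sequence, Tuple


def footprint_poly_matmul(ratios: Sequence[float]) -> List[float]:
    """Coefficients [c0, c1, c2] of an assumed footprint c0 + c1*t + c2*t**2.

    Hypothetical model: a matmul tile touches three 2-D blocks whose
    sizes are pairwise products of the tile sizes (ratio * t).
    """
    a, b, c = ratios
    return [0.0, 0.0, a * b + b * c + a * c]


def eval_poly(coeffs: Sequence[float], t: float) -> float:
    """Evaluate the polynomial at t with Horner's rule."""
    result = 0.0
    for coeff in reversed(coeffs):
        result = result * t + coeff
    return result


def solve_scale(coeffs: Sequence[float], capacity: float, tol: float = 1e-9) -> float:
    """Find t > 0 with poly(t) == capacity by bisection.

    Assumes nonnegative coefficients (at least one positive above c0),
    so the polynomial is increasing for t > 0.
    """
    lo, hi = 0.0, 1.0
    while eval_poly(coeffs, hi) < capacity:  # grow the bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if eval_poly(coeffs, mid) < capacity:
            lo = mid
        else:
            hi = mid
    return lo  # lower end, so the footprint stays within capacity


def select_tile_sizes(
    ratios: Sequence[float],
    cache_bytes: int,
    elem_bytes: int = 8,
    vector_width: int = 4,
) -> Tuple[int, ...]:
    """Return integer tile sizes; the last dimension is treated as innermost."""
    capacity = cache_bytes // elem_bytes  # cache capacity in elements
    t = solve_scale(footprint_poly_matmul(ratios), capacity)
    sizes = [max(1, int(r * t)) for r in ratios]
    # Round the innermost tile down to a multiple of the vector width
    # (never below one full vector, which may overshoot for tiny caches).
    sizes[-1] = max(vector_width, sizes[-1] // vector_width * vector_width)
    return tuple(sizes)


if __name__ == "__main__":
    print(select_tile_sizes((1.0, 1.0, 1.0), 32 * 1024))  # (36, 36, 36)
```

With equal ratios and a 32 KiB cache of 8-byte elements, the assumed footprint is 3t² = 4096 elements, giving t ≈ 36.95 and tile sizes of 36 in each dimension. A real model would derive the polynomial from the reuse pattern of each loop dimension rather than from fixed ratios.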
