Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu
{"title":"SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors","authors":"Christopher I. Rodrigues, Amarin Phaosawasdi, Peng Wu","doi":"10.1145/3178433.3178436","DOIUrl":null,"url":null,"abstract":"Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. 
We discuss potential extensions to vectorize more general algorithms.","PeriodicalId":197479,"journal":{"name":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3178433.3178436","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectorize a single short loop, using (at best) a fraction of the processor's SIMD capacity. It is not straightforward to vectorize multiple nested loops together because they typically have memory accesses with multiple strides, which conventional methods cannot profitably vectorize. We present a solution in the context of compiling small tensor multiplication. Our compiler vectorizes several inner loops in order to utilize wide vector parallelism. To handle complicated strides, we devise a vectorizable form of loop tiling. The compiler transforms loops to improve memory locality, then caches tiles of data in vector registers. Strided access patterns are transformed into permute instructions. We show that our compiler is able to significantly speed up many small tensor multiplication algorithms. It judges 13.5% of a randomly generated sample of algorithms to be profitable to vectorize. On these, it generates code 1.55x as fast on average as that produced by GCC's state-of-the-art vectorizer, with a maximum speedup of 10x. We discuss potential extensions to vectorize more general algorithms.
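To make the core idea concrete, here is a hedged sketch (not the paper's actual compiler output) of the loop-flattening step the abstract describes. For a tiny 4x4 matrix multiply, each loop has only 4 iterations, so a vectorizer that targets just the innermost loop fills at most 4 lanes of a 16-wide SIMD unit. Collapsing the two outer loops into a single loop of 16 iterations exposes enough parallelism to fill all 16 lanes; the resulting strided index expressions are what the paper's compiler would map to permutes of register-resident tiles. All names and the dimension `N` here are illustrative assumptions.

```c
#include <assert.h>

#define N 4  /* tiny tensor dimension: shorter than a 16-wide SIMD unit */

/* Naive small matrix multiply: three nested loops of length 4.
 * An auto-vectorizer that handles only the innermost j loop can
 * use at most 4 of a 16-wide vector unit's lanes. */
static void matmul_naive(const float A[N][N], const float B[N][N],
                         float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Flattened form: the i and j loops are collapsed into one loop of
 * N*N = 16 iterations, so all 16 output elements can be produced in
 * a single pass of a 16-wide vector unit. The strided accesses
 * a[i*N + k] and b[k*N + j] are the "complicated strides" that the
 * paper's approach turns into vector permute instructions. */
static void matmul_flat(const float A[N][N], const float B[N][N],
                        float C[N][N]) {
    const float *a = &A[0][0], *b = &B[0][0];
    float *c = &C[0][0];
    for (int p = 0; p < N * N; p++) {   /* one long, vectorizable loop */
        int i = p / N, j = p % N;
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += a[i * N + k] * b[k * N + j];
        c[p] = acc;
    }
}
```

Both routines compute the same result; the flattened version merely restructures the iteration space so that a wide SIMD unit can be kept busy, which is the transformation the abstract's loop-tiling-in-registers scheme builds on.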