Pooling Acceleration in the DaVinci Architecture Using Im2col and Col2im Instructions

Caio Salvador Rohwedder, J. P. L. Carvalho, J. N. Amaral, G. Araújo, Giancarlo Colmenares, Kai-Ting Amy Wang
DOI: 10.1109/IPDPSW52791.2021.00016
Published in: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), June 2021
Citations: 4

Abstract

Image-to-column (Im2col) and column-to-image (Col2im) are data transformations extensively used to map convolution to matrix multiplication. These transformations rearrange the inputs of convolution to avoid its strided memory access pattern, thus providing a friendlier data layout for CPUs and GPUs. In artificial intelligence (AI) accelerators, these transformations allow convolution to be computed in matrix-multiplier units. Implemented in software, however, they impose a significant overhead that must be compensated by the efficiency gains of matrix multipliers. DaVinci is an AI accelerator architecture that introduces instructions to optimize Im2col and Col2im. Another core layer of convolutional neural networks that presents a strided memory access pattern is pooling. This paper explores the specialized Im2col and Col2im instructions to accelerate pooling layers in DaVinci. An experimental evaluation reveals that the proposed pooling implementations can yield speedups of up to 5.8 times compared to a baseline that does not use these specialized instructions. The speedups follow from an improved memory layout in the inputs of pooling, as this layout leads to better utilization of the vector processing unit in DaVinci.
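To make the Im2col idea concrete, the following is a minimal NumPy sketch (not the DaVinci implementation, whose Im2col/Col2im are hardware instructions): it rearranges each strided pooling window into a dense column, so max pooling becomes a single contiguous per-column reduction — the same layout change that lets convolution be expressed as a matrix multiplication. All function names here are illustrative.

```python
import numpy as np

def im2col(x, k, stride):
    """Rearrange the k-by-k sliding windows of a 2-D input into columns.

    Each output column holds one flattened window, so a strided window
    operation over the input becomes a dense per-column reduction over
    the rearranged layout.
    """
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((k * k, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            cols[:, col] = patch.ravel()  # one window -> one contiguous column
            col += 1
    return cols, out_h, out_w

def max_pool_im2col(x, k=2, stride=2):
    """Max pooling expressed as a column-wise max over the im2col layout."""
    cols, out_h, out_w = im2col(x, k, stride)
    return cols.max(axis=0).reshape(out_h, out_w)

x = np.arange(16, dtype=np.float32).reshape(4, 4)
print(max_pool_im2col(x))
```

In software this rearrangement costs an extra copy, which is the overhead the abstract refers to; DaVinci's dedicated Im2col/Col2im instructions perform it cheaply, which is what makes the layout profitable for pooling as well as convolution.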