Im2win: Memory Efficient Convolution On SIMD Architectures

2022 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2022-09-19 | DOI: 10.1109/HPEC55821.2022.9926408

Shuai-bing Lu, Jun Chu, X. Liu



Abstract

Convolution is the most expensive operation in neural networks, so its performance is critical to overall network performance. Commonly used convolution approaches, including general matrix multiplication (GEMM)-based convolution and direct convolution, respectively rely on the im2col data transformation or use no data transformation at all. However, im2col can lead to at least a 2× memory footprint compared to using no data transformation, which limits the size of neural network models that can run on memory-limited systems. Meanwhile, convolution without data transformation consumes less memory but usually performs poorly because of nonconsecutive memory access. To solve these problems, we propose a new memory-efficient data transformation algorithm called im2win. The algorithm restructures a row of square or rectangular dot product windows of the input image and flattens the unique elements within these windows into a row of the output tensor. This enables consecutive memory access and data reuse, and thus greatly reduces the memory overhead. Furthermore, we propose a high-performance im2win-based convolution algorithm with various optimizations, such as vectorization and loop reordering. Our experimental results show that our algorithm reduces the memory overhead on average to 41.6% compared to PyTorch's im2col-based convolution implementation, and achieves average speedups of 3.6× and 5.3× compared to im2col-based convolution and convolution without data transformation, respectively.
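The transformation described in the abstract can be illustrated with a minimal sketch. This is one reading of the abstract, not the authors' implementation: it assumes a single channel, no padding, and a column-by-column flattening of each strip of input rows. The function names (`im2win`, `conv2d_im2win`, `conv2d_direct`, `im2col_size`, `im2win_size`) are hypothetical, and the vectorization and loop reordering mentioned in the abstract are omitted.

```python
def im2win(image, kh, stride=1):
    """Flatten each strip of kh input rows column by column (assumed layout)."""
    h, w = len(image), len(image[0])
    out_h = (h - kh) // stride + 1
    rows = []
    for i in range(out_h):
        top = i * stride
        # Each element of the strip is stored once: column 0, then column 1, ...
        rows.append([image[top + r][c] for c in range(w) for r in range(kh)])
    return rows


def conv2d_im2win(image, kernel, stride=1):
    """Convolve using the flattened strips; every window is a contiguous slice."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = (len(image[0]) - kw) // stride + 1
    # Flatten the kernel in the same column-by-column order as the strips.
    flat_kernel = [kernel[r][c] for c in range(kw) for r in range(kh)]
    out = []
    for row in im2win(image, kh, stride):
        out_row = []
        for j in range(out_w):
            start = j * stride * kh  # first element of window j in this strip
            window = row[start:start + kh * kw]
            out_row.append(sum(a * b for a, b in zip(window, flat_kernel)))
        out.append(out_row)
    return out


def conv2d_direct(image, kernel, stride=1):
    """Reference convolution with no data transformation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(
                image[i * stride + r][j * stride + c] * kernel[r][c]
                for r in range(kh)
                for c in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def im2col_size(h, w, kh, kw, stride=1):
    """Element count of a single-channel im2col buffer (one column per window)."""
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    return out_h * out_w * kh * kw


def im2win_size(h, w, kh, stride=1):
    """Element count of the single-channel strip buffer used in this sketch."""
    out_h = (h - kh) // stride + 1
    return out_h * kh * w
```

Under these assumptions, windows that are adjacent in the same output row share their overlapping columns inside one strip instead of being copied once per window, which is where the saving over im2col comes from. For example, a 4×4 input with a 3×3 filter and stride 1 needs 36 elements in the im2col buffer and 24 in the strip buffer.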
