An optimization of im2col, an important method of CNNs, based on continuous address access

Haoyu Wang, Chengguang Ma
DOI: 10.1109/ICCECE51280.2021.9342343
Published in: 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), 15 January 2021
Citations: 9

Abstract

Convolutional neural networks (CNNs) are now widely used in common tasks such as image classification, semantic segmentation, and face recognition. Convolution layers are the core layers of a CNN; their computing speed directly affects the speed of the entire network and thus its real-time performance. The current general approach to accelerating convolution layers is to use the image-to-column (im2col) algorithm to unfold the input image into a column matrix, then use general matrix multiplication (GEMM) to multiply that matrix by the convolution kernel. This greatly improves the computing speed of the convolution layer, because most computing platforms have mature optimizations for GEMM, and DSPs in particular are very fast at vector multiplication and addition. During inference of a convolution layer, however, the memory accesses of the im2col algorithm consume far more time than the GEMM itself, and this has become a bottleneck for further optimization of computing speed. In this article, I present an acceleration method for the im2col algorithm in the stride-1 case, based on continuous memory-address reads. With this method, the speed of the im2col step can be increased by more than 10 times when processing a stride-1 convolution layer, and the method is portable. I show the optimization effects on Xtensa BBE64ep DSP cores and STM32F4 processors.
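The contrast the abstract draws can be sketched in a few lines. The following is a minimal illustration (assuming a single-channel input and stride 1; it is not the paper's actual DSP implementation): the naive im2col gathers every patch element with an independent, non-sequential memory access, while the stride-1 variant exploits the fact that each patch row is a contiguous block in the source image and copies it in one go, which is the "continuous address access" pattern the paper builds on.

```python
def im2col_gather(img, k):
    """Naive im2col: gather every patch element individually,
    one (generally non-sequential) memory access per element."""
    h, w = len(img), len(img[0])
    out_h, out_w = h - k + 1, w - k + 1  # output size for stride 1, no padding
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            cols.append([img[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return cols


def im2col_rowcopy(img, k):
    """Stride-1 im2col with continuous address reads: each patch row
    is contiguous in memory, so it can be copied as a single block
    (a slice here; a memcpy or vector load on a DSP)."""
    h, w = len(img), len(img[0])
    out_h, out_w = h - k + 1, w - k + 1
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            patch = []
            for di in range(k):
                patch.extend(img[i + di][j:j + k])  # contiguous block copy
            cols.append(patch)
    return cols
```

Both functions produce the same column matrix; the difference is purely in the access pattern. On a real DSP the inner block copy becomes a wide vector load or memcpy, and because stride-1 patches overlap heavily, entire image rows can be streamed sequentially, which is where (per the abstract) the 10x-plus speedup over element-wise gathering comes from.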