{"title":"LRFAN: a multi-scale large receptive field attention neural network","authors":"Ci Song, Zhao Zhang","doi":"10.1145/3603781.3603834","DOIUrl":null,"url":null,"abstract":"Transformer, which was originally used as a natural language processor, has rapidly gained importance in the field of computer vision since the introduction of ViT. The more efficient Transformer model challenges the dominance of convolution neural networks. In order to capture long-range dependencies, some convolution models have obtained performance gains by convolving very large kernels. However, as the size of convolution kernels grows, the computational complexity grows on the one hand, while speed begins to saturate on the other. In this paper, we propose a multi-scale large receptive field attention module (LRFA) that extracts feature information at different scales by grouping and superimposing different numbers of small-size convolutions. On the other hand, the superposition can have the effect of large kernel convolution, which reduces the computational complexity. LRFA overcomes the inability of conventional convolution neural networks to capture long-range dependencies and the inability of self-attention models to account for local feature information. We design an LRFA-based neural network, a multi-scale large receptive field attention neural network (LRFAN), which adjusts the superimposed convolution kernels size based on network depth and input feature information, and can adapt to the input feature maps to better capture long-range dependencies. Extensive experiments demonstrate that we outperform the conventional convolution neural network and the visual Transformer model in computer vision tasks such as image classification and object detection.","PeriodicalId":391180,"journal":{"name":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3603781.3603834","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The Transformer, originally developed for natural language processing, has rapidly gained importance in computer vision since the introduction of ViT, and its efficiency challenges the dominance of convolutional neural networks. To capture long-range dependencies, some convolutional models have obtained performance gains by using very large convolution kernels. However, as the kernel size grows, computational complexity increases while the accuracy gains begin to saturate. In this paper, we propose a multi-scale large receptive field attention module (LRFA) that extracts feature information at different scales by grouping channels and stacking different numbers of small convolutions. The stacked small kernels approximate the effect of a large-kernel convolution while keeping computational complexity low. LRFA overcomes the inability of conventional convolutional neural networks to capture long-range dependencies and the inability of self-attention models to account for local feature information. We design an LRFA-based neural network, the multi-scale large receptive field attention neural network (LRFAN), which adjusts the size of the stacked convolution kernels according to network depth and input feature information, adapting to the input feature maps to better capture long-range dependencies. Extensive experiments demonstrate that LRFAN outperforms conventional convolutional neural networks and visual Transformer models on computer vision tasks such as image classification and object detection.
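The abstract does not specify the exact layer configuration of LRFA, so the following is only a minimal sketch of the general idea in PyTorch: channels are split into groups, each group stacks a different number of small depthwise 3x3 convolutions (deeper stacks emulate larger effective receptive fields), and the fused multi-scale result is used as an attention map over the input. The group counts, kernel sizes, activation choices, and sigmoid-gated attention here are illustrative assumptions, not the authors' published design.

```python
# Illustrative sketch only: multi-scale large-receptive-field attention via
# grouped stacks of small depthwise convolutions. All hyperparameters are
# assumptions made for this example.
import torch
import torch.nn as nn


class MultiScaleLargeRFAttention(nn.Module):
    """Splits channels into groups; each group stacks a different number of
    small 3x3 depthwise convolutions, so deeper stacks emulate larger
    effective receptive fields at lower cost than a single big kernel."""

    def __init__(self, channels: int, depths=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(depths) == 0
        self.group_channels = channels // len(depths)
        self.branches = nn.ModuleList()
        for depth in depths:
            # A stack of `depth` 3x3 convolutions has an effective
            # receptive field of (2 * depth + 1) x (2 * depth + 1).
            layers = []
            for _ in range(depth):
                layers.append(
                    nn.Conv2d(self.group_channels, self.group_channels,
                              kernel_size=3, padding=1,
                              groups=self.group_channels, bias=False))
                layers.append(nn.GELU())
            self.branches.append(nn.Sequential(*layers))
        # 1x1 convolution mixes information across groups and produces
        # the attention weights.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.split(x, self.group_channels, dim=1)
        multi_scale = torch.cat(
            [branch(g) for branch, g in zip(self.branches, groups)], dim=1)
        attn = torch.sigmoid(self.fuse(multi_scale))
        # Reweight the input features with the multi-scale attention map.
        return x * attn


if __name__ == "__main__":
    module = MultiScaleLargeRFAttention(channels=64)
    out = module(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

In this sketch, the deepest branch (four stacked 3x3 convolutions) covers a 9x9 effective receptive field using far fewer parameters than a single 9x9 kernel, which is the cost argument the abstract makes for superimposing small convolutions instead of enlarging the kernel.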