{"title":"LRFAN: a multi-scale large receptive field attention neural network","authors":"Ci Song, Zhao Zhang","doi":"10.1145/3603781.3603834","DOIUrl":null,"url":null,"abstract":"Transformer, which was originally used as a natural language processor, has rapidly gained importance in the field of computer vision since the introduction of ViT. The more efficient Transformer model challenges the dominance of convolution neural networks. In order to capture long-range dependencies, some convolution models have obtained performance gains by convolving very large kernels. However, as the size of convolution kernels grows, the computational complexity grows on the one hand, while speed begins to saturate on the other. In this paper, we propose a multi-scale large receptive field attention module (LRFA) that extracts feature information at different scales by grouping and superimposing different numbers of small-size convolutions. On the other hand, the superposition can have the effect of large kernel convolution, which reduces the computational complexity. LRFA overcomes the inability of conventional convolution neural networks to capture long-range dependencies and the inability of self-attention models to account for local feature information. We design an LRFA-based neural network, a multi-scale large receptive field attention neural network (LRFAN), which adjusts the superimposed convolution kernels size based on network depth and input feature information, and can adapt to the input feature maps to better capture long-range dependencies. Extensive experiments demonstrate that we outperform the conventional convolution neural network and the visual Transformer model in computer vision tasks such as image classification and object detection.","PeriodicalId":391180,"journal":{"name":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3603781.3603834","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The Transformer, originally developed for natural language processing, has rapidly gained importance in computer vision since the introduction of ViT, and its efficiency challenges the dominance of convolutional neural networks. To capture long-range dependencies, some convolutional models have obtained performance gains by using very large convolution kernels. However, as the kernel size grows, computational complexity increases while the accuracy gains begin to saturate. In this paper, we propose a multi-scale large receptive field attention module (LRFA) that extracts feature information at different scales by grouping channels and stacking different numbers of small convolutions. The stacked small kernels approximate the effect of a large-kernel convolution while keeping computational complexity low. LRFA overcomes the inability of conventional convolutional neural networks to capture long-range dependencies and the inability of self-attention models to account for local feature information. We design an LRFA-based neural network, the multi-scale large receptive field attention neural network (LRFAN), which adjusts the size of the stacked convolution kernels according to network depth and input feature information, adapting to the input feature maps to better capture long-range dependencies. Extensive experiments demonstrate that LRFAN outperforms conventional convolutional neural networks and visual Transformer models on computer vision tasks such as image classification and object detection.
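The abstract does not specify the exact layer configuration of LRFA, so the following is only a minimal sketch of the general idea in PyTorch: channels are split into groups, each group stacks a different number of small depthwise 3x3 convolutions (deeper stacks emulate larger effective receptive fields), and the fused multi-scale result is used as an attention map over the input. The group counts, kernel sizes, activation choices, and sigmoid-gated attention here are illustrative assumptions, not the authors' published design.

```python
# Illustrative sketch only: multi-scale large-receptive-field attention via
# grouped stacks of small depthwise convolutions. All hyperparameters are
# assumptions made for this example.
import torch
import torch.nn as nn


class MultiScaleLargeRFAttention(nn.Module):
    """Splits channels into groups; each group stacks a different number of
    small 3x3 depthwise convolutions, so deeper stacks emulate larger
    effective receptive fields at lower cost than a single big kernel."""

    def __init__(self, channels: int, depths=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(depths) == 0
        self.group_channels = channels // len(depths)
        self.branches = nn.ModuleList()
        for depth in depths:
            # A stack of `depth` 3x3 convolutions has an effective
            # receptive field of (2 * depth + 1) x (2 * depth + 1).
            layers = []
            for _ in range(depth):
                layers.append(
                    nn.Conv2d(self.group_channels, self.group_channels,
                              kernel_size=3, padding=1,
                              groups=self.group_channels, bias=False))
                layers.append(nn.GELU())
            self.branches.append(nn.Sequential(*layers))
        # 1x1 convolution mixes information across groups and produces
        # the attention weights.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.split(x, self.group_channels, dim=1)
        multi_scale = torch.cat(
            [branch(g) for branch, g in zip(self.branches, groups)], dim=1)
        attn = torch.sigmoid(self.fuse(multi_scale))
        # Reweight the input features with the multi-scale attention map.
        return x * attn


if __name__ == "__main__":
    module = MultiScaleLargeRFAttention(channels=64)
    out = module(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

In this sketch, the deepest branch (four stacked 3x3 convolutions) covers a 9x9 effective receptive field using far fewer parameters than a single 9x9 kernel, which is the cost argument the abstract makes for superimposing small convolutions instead of enlarging the kernel.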