A lightweight vision transformer with symmetric modules for vision tasks

IF 0.8 | CAS Zone 4, Computer Science | JCR Q4, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Intelligent Data Analysis | Pub Date: 2023-10-19 | DOI: 10.3233/ida-227205

Shengjun Liang, Mingxin Yu, Wenshuai Lu, Xinglong Ji, Xiongxin Tang, Xiaolin Liu, Rui You


Citations: 0

Abstract

Transformer-based networks have demonstrated strong performance in various vision tasks. However, these networks are heavyweight and cannot be applied to edge computing (mobile) devices. Although lightweight transformer networks have emerged, several problems remain: weak feature extraction ability, feature redundancy, and a lack of convolutional inductive bias. To address these three problems, we propose a lightweight vision transformer, Symmetric Former (SFormer), which contains two novel modules: Symmetric Block and Symmetric FFN. Specifically, we design Symmetric Block to expand feature capacity inside the module and enhance the long-range modeling capability of the attention mechanism. To make the model more compact and introduce inductive bias, we design Symmetric FFN around cheap convolutional operations. We compare SFormer with existing lightweight transformers on several vision tasks. On the ImageNet image recognition task [13], SFormer improves accuracy by 1.2% and 1.6% over PVTv2-b0 and Swin Transformer, respectively. On the ADE20K semantic segmentation task [64], SFormer delivers performance improvements of 0.2% and 0.7% over PVTv2-b0 and Swin Transformer, respectively. On the Cityscapes dataset [11], SFormer delivers performance improvements of 2.5% and 4.2% over PVTv2-b0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/Symmetric_Former.git.
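The abstract does not say how Symmetric FFN uses its "cheap operations", so the following is only a rough illustration of why such operations can make a feed-forward network more compact. It assumes a GhostNet-style layer, in which a 1x1 convolution produces a fraction of the output channels and a depthwise 3x3 convolution derives the rest. The function names, the `ratio` and `kernel` parameters, and the example sizes are all hypothetical and are not taken from the paper or its repository.

```python
def ffn_params(dim: int, expansion: int = 4) -> int:
    """Weights and biases of a standard two-layer FFN: dim -> dim * expansion -> dim."""
    hidden = dim * expansion
    return (dim * hidden + hidden) + (hidden * dim + dim)


def ghost_layer_params(c_in: int, c_out: int, ratio: int = 2, kernel: int = 3) -> int:
    """Parameters of one ghost-style layer (an assumption, not the paper's design).

    A 1x1 convolution produces c_out // ratio "intrinsic" channels; a depthwise
    kernel x kernel convolution (the cheap operation) derives the remaining ones.
    """
    assert c_out % ratio == 0, "c_out must be divisible by ratio"
    intrinsic = c_out // ratio
    primary = c_in * intrinsic + intrinsic           # 1x1 conv weights + biases
    cheap_out = c_out - intrinsic
    cheap = cheap_out * kernel * kernel + cheap_out  # one k x k filter + bias per derived channel
    return primary + cheap


def ghost_ffn_params(dim: int, expansion: int = 4) -> int:
    """FFN in which both projections are replaced by ghost-style layers."""
    hidden = dim * expansion
    return ghost_layer_params(dim, hidden) + ghost_layer_params(hidden, dim)


if __name__ == "__main__":
    dim = 64  # hypothetical embedding width
    standard = ffn_params(dim)
    ghost = ghost_ffn_params(dim)
    print(f"standard FFN:    {standard} parameters")
    print(f"ghost-style FFN: {ghost} parameters ({ghost / standard:.0%} of standard)")
```

Under these assumptions the ghost-style variant needs roughly half the parameters of the dense FFN at the same width, and the depthwise convolution adds the local, convolutional inductive bias that the abstract mentions. The actual Symmetric FFN may differ in structure and savings.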




Source journal

Intelligent Data Analysis (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

2.20

Self-citation rate

5.90%

Articles published

Review time

3.3 months

Journal description: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, the journal prefers papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.