ELAFormer: Early Local Attention in multi-scale vision transFormers

IF 7.2 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xin Zhou, Zhaohui Ren, Yongchao Zhang, Zeyu Jiang, Tianzhuang Yu, Hengfa Luo, Shihua Zhou
{"title":"ELAFormer: Early Local Attention in multi-scale vision transFormers","authors":"Xin Zhou,&nbsp;Zhaohui Ren,&nbsp;Yongchao Zhang,&nbsp;Zeyu Jiang,&nbsp;Tianzhuang Yu,&nbsp;Hengfa Luo,&nbsp;Shihua Zhou","doi":"10.1016/j.knosys.2025.113851","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformers have demonstrated remarkable success in vision tasks and have shown great potential when compared to CNN-based models. However, Transformers tend to prioritize the global context and overlook the local features between patches. Recent studies suggest that initializing the relative position between query and key tokens can limit attention distance, allowing for effective attention to local features without using convolutional blocks, similar to convolutional kernels. Based on this insight, this paper proposes a new hybrid multi-scale model called <strong>E</strong>fficient <strong>L</strong>ocal <strong>A</strong>ttention trans<strong>F</strong>ormer (ELAFormer). In this model, we propose a Window-based Positional Self-Attention (WPSA) module that focuses on adjacent tokens for short-distance features when querying the key token. Furthermore, we improve the conventional Spatial Reduction Attention (SRA) module by employing Depth-wise Separable (DS) convolution instead of standard down-sampling convolution(DSSRA) for long-distance contexts. By stacking these two modules, extensive experiments demonstrate that our model, with a small size of only 28M, achieves 82.9% accuracy on ImageNet classification with an input size of 224 × 224. Our model outperforms state-of-the-art Transformer models. The small ELAFormer model surpasses the tiny focal transformer by +1.3% mAP with RetinaNet 1x on COCO and +1.8/+2.0% mIoU/MS mIouU with UperNet on ADE20k, serving as a strong backbone for the most challenging computer vision tasks.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"325 ","pages":"Article 113851"},"PeriodicalIF":7.2000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125008974","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Vision Transformers have demonstrated remarkable success in vision tasks and have shown great potential compared to CNN-based models. However, Transformers tend to prioritize global context and overlook local features between patches. Recent studies suggest that initializing the relative position between query and key tokens can limit the attention distance, enabling effective attention to local features, much like convolutional kernels, without using convolutional blocks. Based on this insight, this paper proposes a new hybrid multi-scale model called Efficient Local Attention transFormer (ELAFormer). In this model, we propose a Window-based Positional Self-Attention (WPSA) module that focuses on adjacent tokens to capture short-distance features when querying the key token. Furthermore, we improve the conventional Spatial Reduction Attention (SRA) module by employing Depth-wise Separable (DS) convolution instead of the standard down-sampling convolution (DSSRA) for long-distance contexts. By stacking these two modules, extensive experiments demonstrate that our model, at only 28M parameters, achieves 82.9% accuracy on ImageNet classification with a 224 × 224 input. Our model outperforms state-of-the-art Transformer models: the small ELAFormer surpasses the tiny Focal Transformer by +1.3% mAP with RetinaNet 1x on COCO and by +1.8%/+2.0% mIoU/MS mIoU with UperNet on ADE20K, serving as a strong backbone for the most challenging computer vision tasks.
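To make the two modules concrete, below is a minimal PyTorch sketch of the ideas as the abstract describes them. All names and hyper-parameters (`local_attention_bias`, `DepthwiseSeparableReduction`, `window`, `sr_ratio`) are illustrative assumptions, not the authors' implementation: the hard window mask stands in for WPSA's relative-position initialization, and the depth-wise separable reduction stands in for DSSRA's key/value down-sampling.

```python
import torch
import torch.nn as nn

def local_attention_bias(H: int, W: int, window: int) -> torch.Tensor:
    """WPSA-flavoured locality (assumed form): an additive bias that masks
    token pairs farther apart than `window` on the 2-D grid, so each query
    attends only to its neighbourhood, like a convolutional kernel. The
    paper's WPSA initializes relative positions rather than hard-masking."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)          # (N, 2)
    dist = (coords[:, None] - coords[None, :]).abs().max(-1).values    # (N, N)
    bias = torch.zeros(H * W, H * W)
    bias[dist > window] = float("-inf")
    return bias  # added to the attention logits before softmax

class DepthwiseSeparableReduction(nn.Module):
    """DSSRA-flavoured key/value down-sampling (assumed form): a strided
    depth-wise convolution followed by a point-wise (1x1) convolution, in
    place of SRA's single standard strided convolution."""
    def __init__(self, dim: int, sr_ratio: int):
        super().__init__()
        # Depth-wise: one filter per channel (groups=dim), strided to reduce.
        self.depthwise = nn.Conv2d(dim, dim, kernel_size=sr_ratio,
                                   stride=sr_ratio, groups=dim)
        # Point-wise: mixes channels at each spatial location.
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape                                  # tokens, N = H * W
        x = x.transpose(1, 2).reshape(B, C, H, W)          # to feature map
        x = self.pointwise(self.depthwise(x))              # reduced map
        return self.norm(x.flatten(2).transpose(1, 2))     # back to tokens

# Usage: local attention on a 14x14 token grid, plus 2x key/value reduction.
H = W = 14
q = k = v = torch.randn(1, H * W, 64)
logits = q @ k.transpose(-2, -1) / 64 ** 0.5 + local_attention_bias(H, W, 1)
out = logits.softmax(dim=-1) @ v                           # (1, 196, 64)
kv = DepthwiseSeparableReduction(64, sr_ratio=2)(k, H, W)  # (1, 49, 64)
```

The depth-wise stage keeps the reduction cheap (one filter per channel) while the point-wise stage restores channel mixing, which is the usual motivation for replacing a standard strided convolution with a depth-wise separable one.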
Source Journal
Knowledge-Based Systems
Engineering & Technology - Computer Science: Artificial Intelligence
CiteScore: 14.80
Self-citation rate: 12.50%
Annual articles: 1245
Review time: 7.8 months
Journal introduction: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.