{"title":"ELAFormer: Early Local Attention in multi-scale vision transFormers","authors":"Xin Zhou, Zhaohui Ren, Yongchao Zhang, Zeyu Jiang, Tianzhuang Yu, Hengfa Luo, Shihua Zhou","doi":"10.1016/j.knosys.2025.113851","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformers have demonstrated remarkable success in vision tasks and have shown great potential when compared to CNN-based models. However, Transformers tend to prioritize the global context and overlook the local features between patches. Recent studies suggest that initializing the relative position between query and key tokens can limit attention distance, allowing attention to capture local features much like convolutional kernels, without using convolutional blocks. Based on this insight, this paper proposes a new hybrid multi-scale model called <strong>E</strong>fficient <strong>L</strong>ocal <strong>A</strong>ttention trans<strong>F</strong>ormer (ELAFormer). In this model, we propose a Window-based Positional Self-Attention (WPSA) module that focuses on adjacent tokens for short-distance features when querying the key token. Furthermore, we improve the conventional Spatial Reduction Attention (SRA) module by employing Depth-wise Separable (DS) convolution instead of the standard down-sampling convolution (DSSRA) for long-distance contexts. By stacking these two modules, extensive experiments demonstrate that our model, with a small size of only 28M parameters, achieves 82.9% accuracy on ImageNet classification with an input size of 224 × 224. Our model outperforms state-of-the-art Transformer models. The small ELAFormer model surpasses the tiny Focal Transformer by +1.3% mAP with RetinaNet 1x on COCO and +1.8/+2.0% mIoU/MS mIoU with UperNet on ADE20K, serving as a strong backbone for the most challenging computer vision tasks.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"325 ","pages":"Article 113851"},"PeriodicalIF":7.2000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125008974","RegionNum":1,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Vision Transformers have demonstrated remarkable success in vision tasks and have shown great potential when compared to CNN-based models. However, Transformers tend to prioritize the global context and overlook the local features between patches. Recent studies suggest that initializing the relative position between query and key tokens can limit attention distance, allowing attention to capture local features much like convolutional kernels, without using convolutional blocks. Based on this insight, this paper proposes a new hybrid multi-scale model called Efficient Local Attention transFormer (ELAFormer). In this model, we propose a Window-based Positional Self-Attention (WPSA) module that focuses on adjacent tokens for short-distance features when querying the key token. Furthermore, we improve the conventional Spatial Reduction Attention (SRA) module by employing Depth-wise Separable (DS) convolution instead of the standard down-sampling convolution (DSSRA) for long-distance contexts. By stacking these two modules, extensive experiments demonstrate that our model, with a small size of only 28M parameters, achieves 82.9% accuracy on ImageNet classification with an input size of 224 × 224. Our model outperforms state-of-the-art Transformer models. The small ELAFormer model surpasses the tiny Focal Transformer by +1.3% mAP with RetinaNet 1x on COCO and +1.8/+2.0% mIoU/MS mIoU with UperNet on ADE20K, serving as a strong backbone for the most challenging computer vision tasks.
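The abstract only names the two modules, but both ideas can be illustrated briefly. The NumPy sketch below (all names, window sizes, and constants are hypothetical, not the authors' implementation) shows (a) how a relative-position bias can confine self-attention to a local window of neighboring tokens, the intuition behind WPSA, and (b) the parameter saving from replacing a standard convolution with a depth-wise separable one, the substitution behind DSSRA.

```python
import numpy as np

def local_position_bias(n_tokens, window=2, alpha=10.0):
    # Additive attention bias: zero inside the window, and a penalty that
    # grows with distance for key tokens farther than `window` from the query.
    idx = np.arange(n_tokens)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.where(dist <= window, 0.0, -alpha * (dist - window))

def attention(q, k, v, bias):
    # Standard scaled dot-product attention with an additive positional bias.
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

def conv_params(c_in, c_out, k):
    # Parameters of a standard k x k convolution (bias omitted).
    return c_in * c_out * k * k

def ds_conv_params(c_in, c_out, k):
    # Depth-wise k x k conv followed by a 1 x 1 point-wise conv.
    return c_in * k * k + c_in * c_out

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
bias = local_position_bias(n, window=2)
out, w = attention(q, k, v, bias)

# With the bias applied, most attention mass stays within the local window.
inside = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2
print("local attention mass:", w[inside].sum() / n)
print("standard conv params:", conv_params(64, 64, 3))
print("DS conv params:      ", ds_conv_params(64, 64, 3))
```

The bias here is an additive penalty rather than a hard mask, so distant tokens are suppressed smoothly; the paper's point is that such position-based initialization lets attention behave locally, like a convolutional kernel, without convolutional blocks. The parameter count also makes the DSSRA motivation concrete: for 64 input/output channels and a 3 × 3 kernel, the depth-wise separable form needs 4,672 weights versus 36,864 for the standard convolution.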
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.