Yanming Ye, Qiang Sun, Kailong Cheng, Xingfa Shen, Dongjing Wang
Title: A lightweight mechanism for vision-transformer-based object detection
Journal: Complex & Intelligent Systems (JCR Q1, Computer Science, Artificial Intelligence; impact factor 4.6)
Publication date: 2025-05-22
DOI: 10.1007/s40747-025-01904-x (https://doi.org/10.1007/s40747-025-01904-x)
Abstract
DETR (DEtection TRansformer) is a computer vision model for object detection that replaces traditional, more complex detection methods with a Transformer architecture and has achieved significant improvements over them. However, DETR's attention-based detection framework is limited on small and medium-sized objects: it struggles to extract their fine-grained details from low-resolution features, and its computational complexity grows significantly with the input scale, which constrains real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention computation through L2 normalization and two one-dimensional convolutions applied in different directions, which reduces the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational cost without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR's convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP\(_{\textrm{75}}\), surpassing DETR by 4.5 AP and 7.9 AP\(_{\textrm{75}}\), respectively, and establishing new benchmarks among ResNet-50-based detectors. The model is particularly strong on small objects, achieving 24.0 AP\(_{\textrm{S}}\) and 43.9 AP\(_{\textrm{M}}\) on COCO 2017, improvements of 3.3 AP\(_{\textrm{S}}\) and 3.8 AP\(_{\textrm{M}}\) over DETR. XFCOS also reduces encoder FLOPs to 13.5G, a 17.2% decrease from TSP-FCOS's 16.3G, and cuts activation memory from 285.78M to 264.64M, a reduction of 7.4%.
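The efficiency figures and the quadratic-to-linear claim in the abstract can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the scaling part assumes a generic cost model in which cost grows as a power of the token count. It does not reproduce the actual FLOP accounting of XFA or TSP-FCOS.

```python
# Back-of-the-envelope checks on figures quoted in the abstract.
# Function names are illustrative assumptions, not taken from the paper.


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100.0


def cost_growth(token_multiplier: float, order: int) -> float:
    """Growth factor of a cost that scales as tokens**order.

    order=2 models quadratic (standard self-attention) scaling,
    order=1 models linear scaling. This is a generic model only.
    """
    return token_multiplier ** order


# Encoder FLOPs: TSP-FCOS 16.3G -> XFCOS 13.5G (abstract states 17.2%).
flops_drop = relative_reduction(16.3, 13.5)

# Activation memory: 285.78M -> 264.64M (abstract states 7.4%).
memory_drop = relative_reduction(285.78, 264.64)

# Doubling both image height and width yields 4x as many tokens.
quadratic_growth = cost_growth(4, 2)  # cost multiplies by 16
linear_growth = cost_growth(4, 1)     # cost multiplies by 4

print(f"Encoder FLOPs reduction: {flops_drop:.1f}%")       # prints 17.2%
print(f"Activation memory reduction: {memory_drop:.1f}%")  # prints 7.4%
print(f"4x tokens -> quadratic x{quadratic_growth}, linear x{linear_growth}")
```

Under this simplified model, the gap between quadratic and linear cost widens as the input grows, which is consistent with the abstract's point that input scale constrains real-time detection.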
About the journal:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.