Rethinking the multi-scale feature hierarchy in object detection transformer (DETR)

IF 7.2, Zone 1 (Computer Science), JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing, Pub Date: 2025-03-28, DOI: 10.1016/j.asoc.2025.113081

Fanglin Liu, Qinghe Zheng, Xinyu Tian, Feng Shu, Weiwei Jiang, Miaohui Wang, Abdussalam Elhanashi, Sergio Saponara


Citations: 0

Abstract

The Detection Transformer (DETR) has emerged as the dominant paradigm in the field of object detection due to its end-to-end architectural design. Researchers have explored various aspects of DETR, including its structure, pre-training strategies, attention mechanisms, and query embeddings, achieving significant progress. However, high computational costs limit the efficient use of multi-scale feature maps and hinder the full exploitation of complex multi-branch structures. We examine the negative impact of multi-scale features on the computational cost of DETRs and find that introducing long sequence data to the encoder is suboptimal. In this work, we aim to further push the boundaries of DETR’s performance and efficiency from the model structure perspective, and develop the fusion detection Transformer (F-DETR) with a heterogeneous-scale multi-branch structure. To the best of our knowledge, this is the first explicit attempt to integrate multi-scale features into the end-to-end DETR structure. Specifically, we propose a multi-branch structure that simultaneously uses feature maps at different levels, facilitating the interaction of local and global features. Additionally, we select certain joint latent variables from the interactive information flow to initialize the object container, a technique commonly used in query-based detectors. Experimental results show that F-DETR achieves 43.9% AP after 36 training epochs on the popular public COCO dataset. Furthermore, our approach demonstrates a better trade-off between accuracy and complexity compared to the original DETR.
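
The abstract's point that feeding long multi-scale sequences to the encoder is costly can be illustrated with a rough token count. The sketch below is not taken from the paper: the input size (800 × 1333) and the feature strides (32 for a single scale; 8, 16, 32, 64 for multi-scale) are assumed values of the kind commonly used in DETR-style detectors, and the helper names are hypothetical. It only counts pairwise interactions in one full self-attention layer, ignoring channel width, heads, and any sparse-attention tricks.

```python
import math


def tokens_per_level(height, width, strides):
    """Return the number of spatial positions (tokens) at each feature level."""
    return [math.ceil(height / s) * math.ceil(width / s) for s in strides]


def attention_pairs(num_tokens):
    """Pairwise interactions computed by one full self-attention layer."""
    return num_tokens * num_tokens


# Assumed values for illustration only (not reported in the abstract).
HEIGHT, WIDTH = 800, 1333
SINGLE_SCALE_STRIDES = [32]
MULTI_SCALE_STRIDES = [8, 16, 32, 64]

single_total = sum(tokens_per_level(HEIGHT, WIDTH, SINGLE_SCALE_STRIDES))
multi_total = sum(tokens_per_level(HEIGHT, WIDTH, MULTI_SCALE_STRIDES))

# Self-attention cost grows with the square of the sequence length.
ratio = attention_pairs(multi_total) / attention_pairs(single_total)

print(f"single-scale tokens: {single_total}")
print(f"multi-scale tokens:  {multi_total}")
print(f"attention cost ratio (multi / single): {ratio:.1f}x")
```

Under these assumptions, flattening all four levels into one encoder sequence yields roughly 21 times more tokens than the single coarse level, and the quadratic attention term grows by the square of that factor. This is the kind of overhead that motivates processing scales in separate branches rather than as one long sequence.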


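Source journal

Applied Soft Computing (Engineering and Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months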

Journal description: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website will be continuously updated with new articles and the publication time will be short.