DMCTDet: A density map-guided composite transformer network for object detection of UAV images
Junjie Li, Si Guo, Shi Yi, Runhua He, Yong Jia
Signal Processing: Image Communication, Vol. 136, Article 117284 (2025). DOI: 10.1016/j.image.2025.117284
Abstract
The application of unmanned aerial vehicles (UAVs) in urban scene object detection is a vital area of research in urban planning, intelligent monitoring, disaster prevention, and urban surveillance. However, detecting objects in urban scenes captured by UAVs is a challenging task, mainly due to the small size of the objects, the variability within the same class, and the diversity of objects. To design an object detection network that can be applied to complex urban scenes, this study proposes a novel density map-guided composite transformer object detection network (DMCTDet) for urban scene detection in UAV images. Density maps allow the prior distributional information of objects to be fully exploited. In the detection stage, a composite backbone feature extraction network is constructed by combining a Swin Transformer with a Vision Longformer, which can fully extract features of objects with scale variations. Adaptive multiscale feature pyramid enhancement modules (AMFPEM) are inserted in the feature fusion stage between the Swin Transformer and the Vision Longformer to learn the relationships among object scale variations and enhance the feature representation capacity of small objects. In this way, the accuracy of urban scene detection is significantly improved, and weakly aggregated objects are successfully detected in UAV images. Extensive ablation and comparison experiments on the proposed network are conducted on publicly available urban scene detection datasets of UAV images. The experimental results demonstrate the effectiveness of the designed network structure and the superiority of the proposed network over state-of-the-art methods in terms of detection accuracy.
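The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of the data flow the abstract describes, purely for illustration. The transformer stages are lightweight convolutional stand-ins for the actual Swin Transformer and Vision Longformer stages, and the density-prior re-weighting and the gated fusion inside `AMFPEM` are plausible guesses at how the guidance and enhancement could work, not the authors' design; all class and parameter names are hypothetical.

```python
# Hypothetical sketch of the DMCTDet pipeline described in the abstract.
# Stage is a conv stand-in for a real Swin / Vision Longformer stage;
# AMFPEM and DensityHead are illustrative, not the authors' code.
import torch
import torch.nn as nn


class Stage(nn.Module):
    """Stand-in for one transformer backbone stage (downsamples by 2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)


class DensityHead(nn.Module):
    """Predicts an object-density map used as a spatial prior."""
    def __init__(self, c_in):
        super().__init__()
        self.head = nn.Conv2d(c_in, 1, 1)

    def forward(self, x):
        return torch.sigmoid(self.head(x))


class AMFPEM(nn.Module):
    """Simplified adaptive enhancement: gated fusion of the two
    backbones' same-scale features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_swin, f_vil):
        g = torch.sigmoid(self.gate(torch.cat([f_swin, f_vil], dim=1)))
        return g * f_swin + (1 - g) * f_vil


class DMCTDetSketch(nn.Module):
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], 3, padding=1)
        self.swin = nn.ModuleList(Stage(dims[i], dims[i + 1]) for i in range(2))
        self.vil = nn.ModuleList(Stage(dims[i], dims[i + 1]) for i in range(2))
        self.density = DensityHead(dims[0])
        self.amfpem = nn.ModuleList(AMFPEM(dims[i + 1]) for i in range(2))

    def forward(self, img):
        x = self.stem(img)
        # Density map as a prior: re-weight features toward object clusters.
        prior = self.density(x)
        xs = xv = x * (1 + prior)
        pyramid = []
        for swin_stage, vil_stage, fuse in zip(self.swin, self.vil, self.amfpem):
            xs, xv = swin_stage(xs), vil_stage(xv)
            pyramid.append(fuse(xs, xv))  # enhanced multiscale features
        return prior, pyramid


model = DMCTDetSketch()
density, feats = model(torch.randn(1, 3, 256, 256))
print(density.shape, [f.shape for f in feats])
```

The gated fusion lets the network adaptively weight the two backbones' features at each scale, which is one simple way the "adaptive" enhancement named in the abstract could be realized; a detection head would then consume the resulting pyramid.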
Journal Introduction
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.