DMCTDet: A density map-guided composite transformer network for object detection of UAV images
Junjie Li, Si Guo, Shi Yi, Runhua He, Yong Jia
Signal Processing: Image Communication, Vol. 136, Article 117284 (2025). DOI: 10.1016/j.image.2025.117284
Abstract
The application of unmanned aerial vehicles (UAVs) in urban scene object detection is a vital area of research in urban planning, intelligent monitoring, disaster prevention, and urban surveillance. However, detecting objects in urban scenes captured by UAVs is a challenging task, mainly due to the small size of the objects, the variability within the same class, and the diversity of objects. To design an object detection network that can be applied to complex urban scenes, this study proposes a novel density map-guided composite transformer object detection network (DMCTDet) for urban scene detection in UAV images. Density maps allow the prior distributional information of objects to be fully exploited. In the detection stage, a composite backbone feature extraction network is constructed by combining a Swin Transformer with a Vision Longformer, which can fully extract features of objects with scale variations. Adaptive multiscale feature pyramid enhancement modules (AMFPEM) are inserted in the feature fusion stage between the Swin Transformer and the Vision Longformer to learn the relationships among object scale variations and enhance the feature representation capacity of small objects. In this way, the accuracy of urban scene detection is significantly improved, and weakly aggregated objects are successfully detected in UAV images. Extensive ablation and comparison experiments on the proposed network are conducted on publicly available urban scene detection datasets of UAV images. The experimental results demonstrate the effectiveness of the designed network structure and the superiority of the proposed network over state-of-the-art methods in terms of detection accuracy.
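The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of the data flow the abstract describes, purely for illustration. The transformer stages are lightweight convolutional stand-ins for the actual Swin Transformer and Vision Longformer stages, and the density-prior re-weighting and the gated fusion inside `AMFPEM` are plausible guesses at how the guidance and enhancement could work, not the authors' design; all class and parameter names are hypothetical.

```python
# Hypothetical sketch of the DMCTDet pipeline described in the abstract.
# Stage is a conv stand-in for a real Swin / Vision Longformer stage;
# AMFPEM and DensityHead are illustrative, not the authors' code.
import torch
import torch.nn as nn


class Stage(nn.Module):
    """Stand-in for one transformer backbone stage (downsamples by 2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)


class DensityHead(nn.Module):
    """Predicts an object-density map used as a spatial prior."""
    def __init__(self, c_in):
        super().__init__()
        self.head = nn.Conv2d(c_in, 1, 1)

    def forward(self, x):
        return torch.sigmoid(self.head(x))


class AMFPEM(nn.Module):
    """Simplified adaptive enhancement: gated fusion of the two
    backbones' same-scale features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_swin, f_vil):
        g = torch.sigmoid(self.gate(torch.cat([f_swin, f_vil], dim=1)))
        return g * f_swin + (1 - g) * f_vil


class DMCTDetSketch(nn.Module):
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], 3, padding=1)
        self.swin = nn.ModuleList(Stage(dims[i], dims[i + 1]) for i in range(2))
        self.vil = nn.ModuleList(Stage(dims[i], dims[i + 1]) for i in range(2))
        self.density = DensityHead(dims[0])
        self.amfpem = nn.ModuleList(AMFPEM(dims[i + 1]) for i in range(2))

    def forward(self, img):
        x = self.stem(img)
        # Density map as a prior: re-weight features toward object clusters.
        prior = self.density(x)
        xs = xv = x * (1 + prior)
        pyramid = []
        for swin_stage, vil_stage, fuse in zip(self.swin, self.vil, self.amfpem):
            xs, xv = swin_stage(xs), vil_stage(xv)
            pyramid.append(fuse(xs, xv))  # enhanced multiscale features
        return prior, pyramid


model = DMCTDetSketch()
density, feats = model(torch.randn(1, 3, 256, 256))
print(density.shape, [f.shape for f in feats])
```

The gated fusion lets the network adaptively weight the two backbones' features at each scale, which is one simple way the "adaptive" enhancement named in the abstract could be realized; a detection head would then consume the resulting pyramid.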
Journal Introduction
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.