Hongcheng Xue, Zhan Tang, Yuantian Xia, Longhe Wang, Lin Li
{"title":"HCTD:用于无人机航拍图像中精确目标检测的CNN-transformer混合","authors":"Hongcheng Xue , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li","doi":"10.1016/j.cviu.2025.104409","DOIUrl":null,"url":null,"abstract":"<div><div>Object detection in UAV imagery poses substantial challenges due to severe object scale variation, dense distributions of small objects, complex backgrounds, and arbitrary orientations. These factors, compounded by high inter-class similarity and large intra-class variation caused by multi-scale targets, occlusion, and environmental interference, make aerial object detection fundamentally different from conventional scenes. Existing methods often struggle to capture global semantic information effectively and tend to overlook critical issues such as feature loss during downsampling, information redundancy, and inconsistency in cross-level feature interactions. To address these limitations, this paper proposes a hybrid CNN-Transformer-based detector, termed HCTD, specifically designed for UAV image analysis. The proposed framework integrates three novel modules: (1) a Feature Filtering Module (FFM) that enhances discriminative responses and suppresses background noise through dual global pooling (max and average) strategies; (2) a Convolutional Additive Self-attention Feature Interaction (CASFI) module that replaces dot-product attention with a lightweight additive fusion of spatial and channel interactions, enabling efficient global context modeling at reduced computational cost; and (3) a Global Context Flow Feature Pyramid Network (GC2FPN) that facilitates multi-scale semantic propagation and alignment to improve small-object detection robustness. Extensive experiments on the VisDrone2019 dataset demonstrate that HCTD-R18 and HCTD-R50 achieve 38.2%/43.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>50</mn></mrow></msub></math></span>, 23.1%/24.6% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>75</mn></mrow></msub></math></span>, and 13.9%/14.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mi>S</mi></mrow></msub></math></span> respectively. Additionally, the TIDE toolkit is employed to analyze the absolute and relative contributions of six error types, providing deeper insight into the effectiveness of each module and offering valuable guidance for future improvements. The code is available at: <span><span>https://github.com/Mundane-X/HCTD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104409"},"PeriodicalIF":4.3000,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery\",\"authors\":\"Hongcheng Xue , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li\",\"doi\":\"10.1016/j.cviu.2025.104409\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Object detection in UAV imagery poses substantial challenges due to severe object scale variation, dense distributions of small objects, complex backgrounds, and arbitrary orientations. These factors, compounded by high inter-class similarity and large intra-class variation caused by multi-scale targets, occlusion, and environmental interference, make aerial object detection fundamentally different from conventional scenes. 
Existing methods often struggle to capture global semantic information effectively and tend to overlook critical issues such as feature loss during downsampling, information redundancy, and inconsistency in cross-level feature interactions. To address these limitations, this paper proposes a hybrid CNN-Transformer-based detector, termed HCTD, specifically designed for UAV image analysis. The proposed framework integrates three novel modules: (1) a Feature Filtering Module (FFM) that enhances discriminative responses and suppresses background noise through dual global pooling (max and average) strategies; (2) a Convolutional Additive Self-attention Feature Interaction (CASFI) module that replaces dot-product attention with a lightweight additive fusion of spatial and channel interactions, enabling efficient global context modeling at reduced computational cost; and (3) a Global Context Flow Feature Pyramid Network (GC2FPN) that facilitates multi-scale semantic propagation and alignment to improve small-object detection robustness. Extensive experiments on the VisDrone2019 dataset demonstrate that HCTD-R18 and HCTD-R50 achieve 38.2%/43.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>50</mn></mrow></msub></math></span>, 23.1%/24.6% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>75</mn></mrow></msub></math></span>, and 13.9%/14.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mi>S</mi></mrow></msub></math></span> respectively. Additionally, the TIDE toolkit is employed to analyze the absolute and relative contributions of six error types, providing deeper insight into the effectiveness of each module and offering valuable guidance for future improvements. The code is available at: <span><span>https://github.com/Mundane-X/HCTD</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"259 \",\"pages\":\"Article 104409\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001328\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001328","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery
Object detection in UAV imagery poses substantial challenges due to severe object scale variation, dense distributions of small objects, complex backgrounds, and arbitrary orientations. These factors, compounded by high inter-class similarity and large intra-class variation caused by multi-scale targets, occlusion, and environmental interference, make aerial object detection fundamentally different from detection in conventional scenes. Existing methods often struggle to capture global semantic information effectively and tend to overlook critical issues such as feature loss during downsampling, information redundancy, and inconsistency in cross-level feature interactions. To address these limitations, this paper proposes a hybrid CNN-Transformer-based detector, termed HCTD, specifically designed for UAV image analysis. The proposed framework integrates three novel modules: (1) a Feature Filtering Module (FFM) that enhances discriminative responses and suppresses background noise through dual global pooling (max and average) strategies; (2) a Convolutional Additive Self-attention Feature Interaction (CASFI) module that replaces dot-product attention with a lightweight additive fusion of spatial and channel interactions, enabling efficient global context modeling at reduced computational cost; and (3) a Global Context Flow Feature Pyramid Network (GC2FPN) that facilitates multi-scale semantic propagation and alignment to improve small-object detection robustness. Extensive experiments on the VisDrone2019 dataset demonstrate that HCTD-R18 and HCTD-R50 achieve 38.2%/43.7% AP50, 23.1%/24.6% AP75, and 13.9%/14.7% APS (AP on small objects), respectively. Additionally, the TIDE toolkit is employed to analyze the absolute and relative contributions of six error types, providing deeper insight into the effectiveness of each module and offering valuable guidance for future improvements. The code is available at: https://github.com/Mundane-X/HCTD.
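The abstract only names the mechanisms behind FFM and CASFI, so the following is a minimal, illustrative sketch of what a dual-global-pooling feature filter and an additive (non-dot-product) attention block might look like. All class names, layer choices, and hyperparameters here are assumptions for illustration, not the authors' code; the official implementation is in the linked repository.

```python
# Hypothetical sketch of the FFM and CASFI ideas described in the abstract.
# Not the authors' implementation -- see https://github.com/Mundane-X/HCTD.
import torch
import torch.nn as nn


class FeatureFilterSketch(nn.Module):
    """Dual global pooling (max + average) used to reweight feature maps,
    boosting discriminative channels and suppressing background noise."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global max and average pooling capture peak and mean responses.
        max_pool = torch.amax(x, dim=(2, 3), keepdim=True)
        avg_pool = torch.mean(x, dim=(2, 3), keepdim=True)
        # A shared MLP fuses both statistics into a channel-wise gate.
        gate = torch.sigmoid(self.mlp(max_pool) + self.mlp(avg_pool))
        return x * gate


class AdditiveAttentionSketch(nn.Module):
    """Additive fusion of spatial and channel interactions in place of
    dot-product self-attention, avoiding the quadratic attention matrix."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        # Depthwise conv models local spatial interaction cheaply.
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1,
                                 groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Channel context: one global descriptor per channel from q + k,
        # combined additively rather than via a token-by-token product.
        context = torch.mean(q + k, dim=(2, 3), keepdim=True)
        out = self.spatial(v) + torch.sigmoid(context) * v
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(FeatureFilterSketch(64)(x).shape)      # torch.Size([2, 64, 32, 32])
    print(AdditiveAttentionSketch(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

Both blocks preserve the input shape, so under these assumptions they could be dropped into a backbone or neck stage; the additive design keeps cost linear in the number of spatial positions, which matches the abstract's claim of global context modeling at reduced computational cost.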
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems