Adaptive Multiscale Attention Feature Aggregation for Multi-Modal 3D Occluded Object Detection

IF 1.3 | Zone 4, Computer Science | JCR Q4, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IET Computer Vision Pub Date : 2025-07-17 DOI:10.1049/cvi2.70035

Yanfeng Han, Ming Yu, Jing Liu

IET Computer Vision, vol. 19, no. 1. Journal article. Full text: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.70035 (PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70035)

Citations: 0

Abstract

Accurate perception and understanding of the three-dimensional environment are crucial for autonomous vehicles to navigate efficiently and make sound decisions. However, in complex real-world scenarios, the information obtained by a single-modal sensor is often incomplete, which severely degrades the detection accuracy for occluded targets. To address this issue, this paper proposes a novel adaptive multi-scale attention aggregation strategy that efficiently fuses multi-scale feature representations of heterogeneous data to accurately capture the shape details and spatial relationships of targets in three-dimensional space. This strategy utilises learnable sparse keypoints to dynamically align heterogeneous features in a data-driven manner, adaptively modelling the cross-modal mapping relationships between keypoints and their corresponding multi-scale image features. Because accurate three-dimensional shape information is important for understanding the size and rotation pose of occluded targets, this paper adopts a constraint method based on shape prior knowledge, together with a data augmentation strategy, to guide the model to more accurately perceive the complete three-dimensional shape and rotation pose of occluded targets. Experimental results show that the proposed model improves the 3D_R40 mAP score by 2.15%, 3.24% and 2.75% under the easy, moderate and hard difficulty levels, respectively, compared to MVXNet, significantly enhancing the detection accuracy and robustness for occluded targets in complex scenarios.
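The abstract gives no formulas for the aggregation step, so the following is only a minimal sketch of the general idea of attention-weighted fusion across image scales for one projected keypoint. It is not the authors' module. The function names (`bilinear_sample`, `aggregate_scales`), the scaled dot-product scoring, the stride-based coordinate mapping, and the use of a fixed rather than learned keypoint location are all assumptions made for illustration.

```python
import math


def bilinear_sample(fmap, x, y):
    """Bilinearly sample a feature map (H x W x C nested lists) at (x, y).

    Coordinates are clamped to the map borders (an assumed convention).
    """
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    channels = len(fmap[0][0])
    return [
        (1 - ax) * (1 - ay) * fmap[y0][x0][k]
        + ax * (1 - ay) * fmap[y0][x1][k]
        + (1 - ax) * ay * fmap[y1][x0][k]
        + ax * ay * fmap[y1][x1][k]
        for k in range(channels)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_scales(keypoint_uv, query, pyramid, strides):
    """Fuse image features from several scales for one keypoint.

    keypoint_uv: (u, v) pixel position of the keypoint in the full image
    query:       feature vector of the keypoint (length C)
    pyramid:     list of feature maps, one per scale, each H x W x C
    strides:     downsampling factor of each map relative to the image
    Returns the fused feature vector and the per-scale attention weights.
    """
    u, v = keypoint_uv
    # Hypothetical mapping: divide by the stride, ignoring half-pixel offsets.
    samples = [bilinear_sample(f, u / s, v / s) for f, s in zip(pyramid, strides)]
    dim = len(query)
    # Scaled dot-product score between the keypoint query and each scale.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in samples
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[k] for w, feat in zip(weights, samples))
        for k in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Two toy scales: a 4x4 map at stride 1 and a 2x2 map at stride 2.
    fine = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
    coarse = [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]
    fused, weights = aggregate_scales((2.0, 2.0), [4.0, 0.0], [fine, coarse], [1, 2])
    print(fused, weights)
```

In a trained model the query, the sampling locations and the feature maps would all come from learned layers; here they are hand-set so that the weighting behaviour is easy to inspect.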




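Source journal

IET Computer Vision (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 3.30

Self-citation rate: 11.80%

Publication volume: not listed

Review time: 3.4 months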

Journal description: IET Computer Vision seeks original research papers in a wide range of areas of computer vision. The vision of the journal is to publish the highest-quality research that is relevant and topical to the field, without forgetting work that aims to introduce new horizons and set the agenda for future avenues of research in computer vision.

IET Computer Vision welcomes submissions on the following topics: Biologically and perceptually motivated approaches to low level vision (feature detection, etc.); Perceptual grouping and organisation; Representation, analysis and matching of 2D and 3D shape; Shape-from-X; Object recognition; Image understanding; Learning with visual inputs; Motion analysis and object tracking; Multiview scene analysis; Cognitive approaches in low, mid and high level vision; Control in visual systems; Colour, reflectance and light; Statistical and probabilistic models; Face and gesture; Surveillance; Biometrics and security; Robotics; Vehicle guidance; Automatic model acquisition; Medical image analysis and understanding; Aerial scene analysis and remote sensing; Deep learning models in computer vision.

Both methodological and applications-orientated papers are welcome. Manuscripts are expected to include a detailed and analytical review of the literature, a state-of-the-art exposition of the proposed research and its methodology, a thorough experimental evaluation, and a comparative evaluation against relevant state-of-the-art methods. Submissions that do not meet these minimum requirements may be returned to authors without being sent for review.

Special Issues, Current Call for Papers:

Computer Vision for Smart Cameras and Camera Networks - https://digital-library.theiet.org/files/IET_CVI_SC.pdf

Computer Vision for the Creative Industries - https://digital-library.theiet.org/files/IET_CVI_CVCI.pdf