OA-DET3D: Embedding Object Awareness As A General Plug-in for Multi-Camera 3D Object Detection

IF 9.3 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2025-08-25 DOI:10.1007/s11263-025-02544-x

Xiaomeng Chu, Jiajun Deng, Jianmin Ji, Yu Zhang, Houqiang Li, Yanyong Zhang

{"title":"OA-DET3D: Embedding Object Awareness As A General Plug-in for Multi-Camera 3D Object Detection","authors":"Xiaomeng Chu, Jiajun Deng, Jianmin Ji, Yu Zhang, Houqiang Li, Yanyong Zhang","doi":"10.1007/s11263-025-02544-x","DOIUrl":null,"url":null,"abstract":"<p>The recent advance in multi-camera 3D object detection is featured by bird’s-eye view (BEV) representation or object queries. However, the ill-posed transformation from image-plane view to 3D space inevitably causes feature clutter and distortion, making the objects blur into the background. To this end, we explore how to incorporate supplementary cues for differentiating objects in the transformed feature representation. Formally, we introduce OA-DET3D, a general plug-in module that improves 3D object detection by bringing object awareness into a variety of existing 3D object detection pipelines. Specifically, OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points. First, we use object-level supervision from the properties of each 3D bounding box to guide the network in learning the depth distribution. Next, we select foreground pixels using a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation or query feature from the baseline model with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of our proposed OA-DET3D. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and comprehensive detection score. The code is available at https://github.com/cxmomo/OA-DET3D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":9.3000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-025-02544-x","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

The recent advance in multi-camera 3D object detection is featured by bird’s-eye view (BEV) representation or object queries. However, the ill-posed transformation from image-plane view to 3D space inevitably causes feature clutter and distortion, making the objects blur into the background. To this end, we explore how to incorporate supplementary cues for differentiating objects in the transformed feature representation. Formally, we introduce OA-DET3D, a general plug-in module that improves 3D object detection by bringing object awareness into a variety of existing 3D object detection pipelines. Specifically, OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points. First, we use object-level supervision from the properties of each 3D bounding box to guide the network in learning the depth distribution. Next, we select foreground pixels using a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation or query feature from the baseline model with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of our proposed OA-DET3D. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and comprehensive detection score. The code is available at https://github.com/cxmomo/OA-DET3D.

查看原文本刊更多论文

OA-DET3D：嵌入对象感知作为多相机3D对象检测的通用插件

多相机三维目标检测的最新进展是鸟瞰图（BEV）表示或目标查询。然而，从图像平面视图到三维空间的病态变换不可避免地会造成特征的混乱和失真，使目标模糊到背景中。为此，我们探讨了如何在转换后的特征表示中结合补充线索来区分对象。正式地，我们引入OA-DET3D，这是一个通用的插件模块，通过将对象感知引入各种现有的3D对象检测管道来改进3D对象检测。具体来说，OA-DET3D通过利用以对象为中心的深度信息和前景伪点来增强对象的表示。首先，我们从每个3D边界框的属性中使用对象级监督来指导网络学习深度分布。接下来，我们使用2D对象检测器选择前景像素，并将其投影到3D空间中进行伪体素特征编码。最后，利用可变形的注意机制，将目标感知深度特征和伪体素特征整合到基线模型的BEV表示或查询特征中。我们在nuScenes数据集和Argoverse 2数据集上进行了大量实验，以验证我们提出的OA-DET3D的优点。我们的方法在平均精度和综合检测分数方面都比基于bev的基线得到了一致的改进。代码可在https://github.com/cxmomo/OA-DET3D上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.