CoBEV: Elevating Roadside 3D Object Detection With Depth and Height Complementarity

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2024-09-24. DOI: 10.1109/TIP.2024.3463409

Hao Shi; Chengshan Pang; Jiaming Zhang; Kailun Yang; Yuhao Wu; Huajian Ni; Yining Lin; Rainer Stiefelhagen; Kaiwei Wang

Volume 33, pages 5424-5439. Publication type: Journal Article. Full record: https://ieeexplore.ieee.org/document/10693306/


Abstract

Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. Previous studies are limited by using only depth or only height information; we find that both depth and height matter and that they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature primarily distinguishes between categories of height intervals, essentially providing semantic context. This insight motivates the development of Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel’s depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also integrated to further enhance detection accuracy using the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset. The results demonstrate that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbances, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at CoBEV.
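The lifting step described in the abstract (a per-pixel distribution over bins, used to spread 2D camera features into a 3D volume) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, bin counts, and the use of a softmax over random logits are all assumptions chosen for illustration. The CFS fusion module and the projection into BEV space are not sketched, because the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift(features, logits):
    """Spread image features over bins with a per-pixel distribution.

    features: (C, H, W) camera feature map
    logits:   (B, H, W) unnormalised per-pixel scores over B bins
    returns:  (C, B, H, W) volume, the outer product of each pixel's
              feature vector with its bin distribution
    """
    prob = softmax(logits, axis=0)  # each pixel's distribution sums to 1
    return features[:, None, :, :] * prob[None, :, :, :]

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 6
D_BINS, H_BINS = 5, 3

features = rng.standard_normal((C, H, W))
depth_logits = rng.standard_normal((D_BINS, H, W))   # assumed depth-head output
height_logits = rng.standard_normal((H_BINS, H, W))  # assumed height-head output

depth_volume = lift(features, depth_logits)    # shape (C, D_BINS, H, W)
height_volume = lift(features, height_logits)  # shape (C, H_BINS, H, W)
```

Because each pixel's distribution sums to one, summing either volume over its bin axis recovers the original feature map. In this sketch the two branches share the same features and differ only in how those features are distributed in 3D.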

