CoBEV: Elevating Roadside 3D Object Detection With Depth and Height Complementarity

Hao Shi;Chengshan Pang;Jiaming Zhang;Kailun Yang;Yuhao Wu;Huajian Ni;Yining Lin;Rainer Stiefelhagen;Kaiwei Wang
Journal: IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society), vol. 33, pp. 5424-5439. Published: 2024-09-24. DOI: 10.1109/TIP.2024.3463409. URL: https://ieeexplore.ieee.org/document/10693306/

Abstract

Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. Previous studies are limited by using only depth or only height information; we find that both matter and that they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature primarily distinguishes between categories of height intervals, essentially providing semantic context. This insight motivates Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust bird's-eye-view (BEV) representations. In essence, CoBEV estimates each pixel's depth and height distributions and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also integrated to further enhance detection accuracy using the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset. The results show that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at CoBEV.
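The lifting and fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `lift` and `gated_fusion`, the bin counts, and the single sigmoid gate are all assumptions made for illustration, and the gate is only a simple stand-in for the two-stage CFS module, whose details the abstract does not give. The sketch also skips the camera geometry needed to pool lifted features into a BEV grid and simply assumes two BEV maps are already available.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def lift(features, bin_logits):
    """Spread image features over discrete bins (depth or height).

    features:   (C, H, W) per-pixel camera features.
    bin_logits: (B, H, W) per-pixel scores over B bins.
    Returns a (C, B, H, W) tensor: each pixel's feature vector is
    weighted by that pixel's predicted distribution over the bins.
    """
    prob = softmax(bin_logits, axis=0)            # (B, H, W), sums to 1 per pixel
    return features[:, None, :, :] * prob[None, :, :, :]


def gated_fusion(depth_bev, height_bev, weights):
    """Hypothetical stand-in for complementary fusion (NOT the paper's CFS).

    depth_bev, height_bev: (C, X, Y) BEV maps from the two branches.
    weights:               (2C,) linear weights producing one gate per BEV cell.
    Returns a per-cell convex combination of the two maps.
    """
    stacked = np.concatenate([depth_bev, height_bev], axis=0)   # (2C, X, Y)
    score = np.einsum("c,cxy->xy", weights, stacked)            # (X, Y)
    gate = 1.0 / (1.0 + np.exp(-score))                         # sigmoid in (0, 1)
    return gate[None] * depth_bev + (1.0 - gate[None]) * height_bev


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 3, 5))                  # C=4, H=3, W=5
    depth_lifted = lift(feats, rng.normal(size=(6, 3, 5)))   # 6 depth bins (assumed)
    height_lifted = lift(feats, rng.normal(size=(6, 3, 5)))  # 6 height bins (assumed)
    print(depth_lifted.shape, height_lifted.shape)

    # Assume both branches were already pooled into 8x8 BEV grids.
    depth_bev = rng.normal(size=(4, 8, 8))
    height_bev = rng.normal(size=(4, 8, 8))
    fused = gated_fusion(depth_bev, height_bev, rng.normal(size=8))
    print(fused.shape)
```

Because each pixel's distribution sums to one, summing the lifted tensor over the bin axis recovers the original features; the lift redistributes each feature along the ray or height axis rather than creating new information. The gate makes the fused value at each BEV cell lie between the depth-branch and height-branch values, which is one simple way to let the two cues complement each other.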