BEVHeight++: Toward Robust Visual Centric 3D Object Detection

Impact factor: 18.6

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-03-11. DOI: 10.1109/TPAMI.2025.3549711

Lei Yang;Tao Tang;Jun Li;Kun Yuan;Kai Wu;Peng Chen;Li Wang;Yi Huang;Lei Li;Xinyu Zhang;Kaicheng Yu

Volume 47, Issue 6, Pages 5094-5111. Publication type: Journal Article. Full record: https://ieeexplore.ieee.org/document/10919014/

Citations: 0

Abstract


BEVHeight++: Toward Robust Visual Centric 3D Object Detection

While most recent autonomous driving systems focus on developing perception methods for ego-vehicle sensors, an alternative approach is often overlooked: leveraging intelligent roadside cameras to extend perception beyond the visual range. We find that state-of-the-art vision-centric detection methods perform poorly on roadside cameras. This is because these methods mainly focus on recovering depth relative to the camera center, where the depth difference between a car and the ground quickly shrinks as distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight++, to address this issue. In essence, we regress the height above the ground, which yields a distance-agnostic formulation that eases the optimization of camera-only perception methods. By incorporating both height and depth encoding techniques, we achieve a more accurate and robust projection from 2D to BEV space. On popular 3D detection benchmarks for roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. In the ego-vehicle scenario, BEVHeight++ surpasses depth-only methods by +2.8% NDS and +1.7% mAP on the nuScenes test set, with even higher gains of +9.3% NDS and +8.8% mAP on the nuScenes-C benchmark with object-level distortion. Consistent and substantial performance improvements are also achieved on the KITTI, KITTI-360, and Waymo datasets.
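
The geometric intuition in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a simplified 2D side view with a flat ground plane, and the camera mounting height of 6.0 m, the object height of 1.5 m, and the function names `depth_gap` and `lift_with_height` are hypothetical choices made for illustration only. The first function compares the range from the camera center to a roof point and to the ground point directly beneath it; the second recovers horizontal distance from a viewing ray and a known height above the ground.

```python
import math


def depth_gap(distance, cam_height=6.0, obj_height=1.5):
    """Range to a ground point minus range to the roof point directly above it.

    distance is the horizontal distance from the camera pole (meters).
    cam_height and obj_height are illustrative values, not from the paper.
    """
    ground_range = math.hypot(distance, cam_height)
    roof_range = math.hypot(distance, cam_height - obj_height)
    return ground_range - roof_range


def lift_with_height(ray_down_angle, point_height, cam_height=6.0):
    """Horizontal distance at which a viewing ray reaches the plane z = point_height.

    ray_down_angle is the angle of the ray below the horizontal, in radians
    (must be positive), and point_height must be below cam_height.
    """
    return (cam_height - point_height) / math.tan(ray_down_angle)


if __name__ == "__main__":
    # The depth gap shrinks with distance, while the height gap stays at obj_height.
    for distance in (10.0, 50.0, 100.0):
        print(f"distance={distance:6.1f} m  depth gap={depth_gap(distance):.3f} m  height gap=1.500 m")

    # Round trip: a roof point 50 m away at 1.5 m height, seen from a 6 m camera.
    angle = math.atan2(6.0 - 1.5, 50.0)
    print(f"recovered distance={lift_with_height(angle, 1.5):.3f} m")
```

Under these assumptions the depth gap falls off roughly in inverse proportion to distance, so a depth regressor has less and less signal separating a distant car from the road, whereas the height above the ground is the same at every distance. This is one way to read the "distance-agnostic" claim in the abstract.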


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Self-citation rate

0.00%

Number of publications