View-to-label: Multi-view consistency for self-supervised monocular 3D object detection

IF 4.3; Zone 3 (Computer Science); JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computer Vision and Image Understanding Pub Date : 2025-03-01 DOI:10.1016/j.cviu.2025.104320

Issa Mouawad , Nikolas Brasch , Fabian Manhardt , Federico Tombari , Francesca Odone

Volume 254, Article 104320. Full text: https://www.sciencedirect.com/science/article/pii/S1077314225000438

Citations: 0

Abstract


For autonomous vehicles, safe driving depends heavily on correctly perceiving the environment in 3D space, which makes 3D object detection a fundamental aspect of perception. While 3D sensors deliver accurate metric perception, monocular approaches enjoy cost and availability advantages that are valuable in a wide range of applications. Unfortunately, training monocular methods requires a vast amount of annotated data. To reduce this need, we propose a novel approach to self-supervise 3D object detection purely from RGB video sequences, leveraging geometric constraints and weak labels. Unlike other approaches that exploit additional sensors during training, our method relies on the temporal continuity of video sequences. Supervised pre-training on synthetic data produces initial plausible 3D boxes; our geometrically and photometrically grounded losses then provide a strong self-supervision signal that allows the model to be fine-tuned on real data without labels.

Our experiments on autonomous driving benchmark datasets showcase the effectiveness and generality of our approach, as well as its competitive performance compared to other self-supervised approaches.
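
The abstract does not spell out the losses, so the following is only a generic illustration of what a geometric constraint combined with a weak 2D label might look like, not the authors' formulation. It assumes a pinhole camera, a static object, and a camera that translates without rotating between frames; every name here (`box_corners`, `consistency_loss`, `ego_shift`, `weak_box_next`) is hypothetical.

```python
import math

def box_corners(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates (x right, y down, z forward)."""
    cx, cy, cz = center
    length, height, width = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate about the vertical (y) axis, then translate to the center.
                corners.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return corners

def project_to_2d_box(corners, fx, fy, u0, v0):
    """Axis-aligned 2D box enclosing the pinhole projection of the corners."""
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))

def iou(a, b):
    """Intersection over union of two (u_min, v_min, u_max, v_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def consistency_loss(center, dims, yaw, ego_shift, weak_box_next, intrinsics):
    """1 - IoU between a box predicted at frame t, moved into frame t+1,
    and a weak 2D label observed at frame t+1."""
    # Assumption: static object, camera translated by ego_shift with no rotation,
    # so the object center moves by -ego_shift in the new camera frame.
    moved = tuple(c - e for c, e in zip(center, ego_shift))
    projected = project_to_2d_box(box_corners(moved, dims, yaw), *intrinsics)
    return 1.0 - iou(projected, weak_box_next)

intrinsics = (700.0, 700.0, 600.0, 200.0)   # fx, fy, u0, v0 (illustrative values)
weak_label = (400.0, 100.0, 800.0, 300.0)   # 2D box observed in the next frame
good = consistency_loss((0.0, 0.0, 10.0), (4.0, 2.0, 2.0), 0.0, (0.0, 0.0, 2.0), weak_label, intrinsics)
bad = consistency_loss((0.0, 0.0, 10.0), (4.0, 2.0, 2.0), 0.0, (0.0, 0.0, 0.0), weak_label, intrinsics)
```

In this toy setup the loss vanishes when the predicted box, carried forward by the camera motion, reprojects exactly onto the next frame's 2D label, and it grows when the motion is ignored. A differentiable variant of such a term is one way temporal continuity could supervise a detector without 3D labels.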


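Source journal

Computer Vision and Image Understanding (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days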

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems