MoVis: When 3D Object Detection Is Like Human Monocular Vision

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-06. DOI: 10.1109/TIP.2025.3544880

Zijie Wang;Jizheng Yi;Aibin Chen;Guangjie Han

Published in IEEE Transactions on Image Processing, vol. 34, pp. 3025-3040 (Journal Article). Full text: https://ieeexplore.ieee.org/document/10916602/

Citations: 0

Abstract

Monocular 3D object detection has garnered significant attention for its outstanding cost effectiveness compared with multi-sensor systems. However, previous work mainly acquires object 3D properties in a heuristic way, with less emphasis on the cues between objects. Inspired by the mechanisms of monocular vision, we propose MoVis, an innovative 3D object detection framework that skillfully combines object hierarchy and color sequence cues. Specifically, a decoupled Spatial Relationship Encoder (SRE) is designed to effectively feed back the high-level encoding results with object hierarchical relationships to low-level features. This method not only effectively reduces the computational overhead of multi-scale coding, but also significantly improves the detection accuracy of occluded objects by incorporating the hierarchical relationship between objects into multi-scale features. Moreover, to obtain more precise object depth information, an Object-level Depth Modulator (ODM) based on the concept of conditional random fields is designed, which employs color sequences. Ultimately, the results of the SRE and ODM are efficiently fused by our Spatial Context Processor (SCP) to accurately perceive the 3D attributes of the objects. Extensive experiments on the KITTI and Rope3D benchmarks show that MoVis achieves state-of-the-art performance. Our MoVis represents a progressive approach that emulates how human monocular vision utilizes monocular cues to perceive 3D scenes.
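
The abstract describes the SRE as feeding high-level encoding results back to low-level features. The snippet below is a minimal sketch of that general idea only: a generic top-down fusion in which a coarse feature map is upsampled and added to a finer one. It is not the authors' SRE; the function names (`upsample_nearest`, `feed_back`), the nearest-neighbour upsampling, and the `weight` parameter are all assumptions made for illustration.

```python
# Illustrative only: generic top-down feature feedback, NOT the MoVis SRE.
# All names and the fusion rule (weighted addition) are assumptions.
from typing import List

Grid = List[List[float]]


def upsample_nearest(coarse: Grid, factor: int) -> Grid:
    """Repeat every cell `factor` times along both axes."""
    out: Grid = []
    for row in coarse:
        wide = [value for value in row for _ in range(factor)]
        # Copy the widened row so the repeated rows do not share one list.
        out.extend(list(wide) for _ in range(factor))
    return out


def feed_back(low: Grid, high: Grid, weight: float = 0.5) -> Grid:
    """Add an upsampled high-level map to a low-level map, cell by cell."""
    factor = len(low) // len(high)  # assumes an integer scale ratio
    up = upsample_nearest(high, factor)
    return [
        [l + weight * h for l, h in zip(low_row, up_row)]
        for low_row, up_row in zip(low, up)
    ]


# A 4x4 "low-level" map and a 2x2 "high-level" map (toy values).
low = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
high = [[10.0, 20.0], [30.0, 40.0]]
fused = feed_back(low, high)
print(fused[0])  # [6.0, 7.0, 13.0, 14.0]
```

In a real detector the maps would be multi-channel tensors and the fusion would be learned; the toy version only shows how coarse context can reach every fine-resolution cell.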
