MoVis: When 3D Object Detection Is Like Human Monocular Vision
Authors: Zijie Wang, Jizheng Yi, Aibin Chen, Guangjie Han
Journal: IEEE Transactions on Image Processing, vol. 34, pp. 3025-3040
Published: 2025-03-06
DOI: 10.1109/TIP.2025.3544880
URL: https://ieeexplore.ieee.org/document/10916602/
Citation count: 0
Abstract
Monocular 3D object detection has attracted significant attention for its cost-effectiveness compared with multi-sensor systems. However, previous work mainly acquires the 3D properties of objects heuristically and places less emphasis on the cues between objects. Inspired by the mechanisms of monocular vision, we propose MoVis, a 3D object detection framework that combines object hierarchy and color sequence cues. Specifically, a decoupled Spatial Relationship Encoder (SRE) feeds the high-level encoding results, which carry hierarchical relationships between objects, back to the low-level features. This reduces the computational overhead of multi-scale encoding and, by incorporating the hierarchical relationships between objects into multi-scale features, significantly improves the detection accuracy of occluded objects. Moreover, to obtain more precise object depth information, we design an Object-level Depth Modulator (ODM) that employs color sequences and is based on the concept of conditional random fields. Finally, the outputs of the SRE and ODM are fused by our Spatial Context Processor (SCP) to perceive the 3D attributes of the objects accurately. Extensive experiments on the KITTI and Rope3D benchmarks show that MoVis achieves state-of-the-art performance. MoVis represents a progressive approach that emulates how human monocular vision uses monocular cues to perceive 3D scenes.
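The abstract gives no implementation details for the SRE, so the following is only a toy sketch of the general idea it describes: encode the high-level (coarsest) feature map once and feed the result back to a low-level map, rather than running an encoder at every scale. Everything in it is an assumption made for illustration, including the function names `encode_high_level`, `upsample_nearest`, and `feed_back`, the global-mean mixing used as a stand-in encoder, and the additive fusion. None of it is taken from the MoVis paper.

```python
from typing import List

Grid = List[List[float]]  # a single-channel 2D feature map


def encode_high_level(grid: Grid) -> Grid:
    """Hypothetical stand-in for an encoder over the coarsest map.

    Each cell is mixed with the global mean so that it carries some
    context about every other cell (a crude proxy for relations
    between objects). This is NOT the encoder used in MoVis.
    """
    cells = [v for row in grid for v in row]
    mean = sum(cells) / len(cells)
    return [[0.5 * v + 0.5 * mean for v in row] for row in grid]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def feed_back(low: Grid, high_encoded: Grid) -> Grid:
    """Add the upsampled high-level encoding to a low-level map.

    Assumes the low-level map is an integer multiple of the
    high-level map in both dimensions.
    """
    factor = len(low) // len(high_encoded)
    up = upsample_nearest(high_encoded, factor)
    return [[a + b for a, b in zip(r_low, r_up)]
            for r_low, r_up in zip(low, up)]


# Toy usage: a 4x4 low-level map and a 2x2 high-level map.
low = [[1.0] * 4 for _ in range(4)]
high = [[0.0, 2.0], [4.0, 6.0]]

# The (assumed) expensive step runs only on the small, coarse map.
fused = feed_back(low, encode_high_level(high))
```

In this sketch the only step that looks at global context runs on the 2x2 map, while the 4x4 map receives that context through a cheap upsample-and-add. That is one plausible reading of how feeding high-level encodings back to low-level features could lower the cost of multi-scale encoding; the actual SRE design may differ substantially.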