Title: Structure perception and edge refinement network for monocular depth estimation
Authors: Shuangquan Zuo, Yun Xiao, Xuanhong Wang, Hao Lv, Hongwei Chen
DOI: 10.1016/j.cviu.2025.104348
Journal: Computer Vision and Image Understanding, Vol. 256, Article 104348
Publication date: 2025-04-09 (Journal Article)
Impact factor: 4.3; JCR: Q2 (Computer Science, Artificial Intelligence); CAS Region 3 (Computer Science)
URL: https://www.sciencedirect.com/science/article/pii/S1077314225000712
Citations: 0
Abstract
Monocular depth estimation is fundamental for scene understanding and downstream vision tasks. In recent years, with the development of deep learning, increasingly complex networks and powerful mechanisms have significantly improved the performance of monocular depth estimation. Nevertheless, predicting dense per-pixel depth from a single RGB image remains challenging because the problem is ill-posed and inherently ambiguous. Two unresolved issues persist: (1) depth features are limited in perceiving the scene structure accurately, leading to inaccurate region estimation; (2) low-level features, which are rich in detail, are not fully utilized, causing loss of detail and ambiguous edges. The crux of accurate dense depth restoration is to handle the global scene structure as well as local details efficiently. To address these two issues, we propose the Scene perception and Edge refinement network for Monocular Depth Estimation (SE-MDE). Specifically, we carefully design a depth-enhanced encoder (DEE) to effectively perceive the overall structure of the scene while refining the feature responses of different regions. Meanwhile, we introduce a dense edge-guided network (DENet) that maximizes the utilization of low-level features to sharpen depth details and edges. Extensive experiments validate the effectiveness of our method; results on the NYU v2 indoor dataset and the KITTI outdoor dataset demonstrate the state-of-the-art performance of the proposed approach.
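The abstract does not give the internals of DEE or DENet, but the general principle it invokes — using low-level image detail to keep depth edges sharp while smoothing flat regions — can be illustrated with a toy, dependency-free sketch. Everything below (function names, the blending rule, the `alpha` parameter) is a hypothetical illustration of edge-guided refinement in 1-D, not the paper's method.

```python
def image_gradient(row):
    """Horizontal gradient magnitude of a 1-D intensity row (toy edge cue)."""
    return [abs(row[i + 1] - row[i]) for i in range(len(row) - 1)] + [0.0]

def edge_guided_refine(coarse_depth, intensity, alpha=0.5):
    """Edge-aware refinement sketch (hypothetical, for illustration only).

    Where the image gradient is strong, keep the coarse depth value as-is,
    preserving the discontinuity; where the image is flat, average with the
    left neighbour, suppressing spurious depth wobble.
    """
    grads = image_gradient(intensity)
    refined = list(coarse_depth)
    for i in range(1, len(refined)):
        # w -> 1 at image edges (trust the depth jump), w -> 0 in flat areas
        w = min(1.0, alpha * grads[i - 1])
        smoothed = 0.5 * (coarse_depth[i] + coarse_depth[i - 1])
        refined[i] = w * coarse_depth[i] + (1.0 - w) * smoothed
    return refined

# Toy example: an intensity step at index 3 and an over-smoothed depth step.
depth = edge_guided_refine([1.0, 1.0, 1.5, 2.5, 3.0, 3.0],
                           [0.0, 0.0, 0.0, 10.0, 10.0, 10.0])
```

After refinement, the depth jump across the image edge (indices 2 to 3) is larger than in the coarse input, while flat regions are smoothed — the qualitative behaviour edge guidance is meant to provide.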
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems