{"title":"Local plane estimation network with multi-scale fusion for efficient monocular depth estimation","authors":"Lei Song, Bo Jiang, Huaibo Song","doi":"10.1016/j.eswa.2025.128746","DOIUrl":null,"url":null,"abstract":"<div><div>Estimating scene depth from a single image is a critical yet challenging task in computer vision, with widespread applications in autonomous driving, 3D reconstruction, and scene understanding. In monocular depth estimation, effective depth representation and accurate extraction and integration of local details are crucial for reliable results. However, existing methods face two major challenges in dealing with complex depth relationships. (a) a lack of efficient feature representation mechanisms, often relying on pixel-level dense depth estimation to capture local details, which leads to significant computational overhead and (b) inefficiency in extending the depth representation range, particularly when distinguishing near and far objects, making it difficult to effectively balance global depth relationships and local details. To address these challenges, this study introduces the Local Plane Estimation with Multi-Scale Fusion Network (LMNet) for monocular depth estimation. The encoder utilizes stacked Transformer blocks to extract multi-scale global depth features and capture long-range dependencies. The decoder incorporates a Local Plane Estimation (LPE) module that generates local plane parameters from multi-scale features, enabling efficient recovery of depth details and improving local accuracy. Furthermore, the Multi-scale Attentive Fusion (MAF) module performs weighted fusion of multi-scale depth features using attention mechanisms, adaptively assigning contribution weights, reducing redundancy, and dynamically prioritizing feature integration to ensure structural consistency and detailed representation in the depth map. 
The synergistic design of these modules significantly enhances both the quality and efficiency of monocular depth estimation. Extensive experiments show that LMNet achieves significant advantages in both depth estimation accuracy and computational efficiency. Compared to NeWCRFs, LMNet achieves a 15.05 % reduction in RMSE error and a 10.63 % decrease in inference time on the NYU Depth V2 dataset, while compressing its model size to just 11.3 MB. In zero-shot evaluation on the high-resolution HRWSI dataset, LMNet attains an average inference latency of only 90.9 ms and an RMSE of 0.355, further validating its excellent balance between fast inference and high-precision estimation.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"293 ","pages":"Article 128746"},"PeriodicalIF":7.5000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425023644","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citation count: 0
Abstract
Estimating scene depth from a single image is a critical yet challenging task in computer vision, with widespread applications in autonomous driving, 3D reconstruction, and scene understanding. In monocular depth estimation, effective depth representation and the accurate extraction and integration of local details are crucial for reliable results. However, existing methods face two major challenges in dealing with complex depth relationships: (a) a lack of efficient feature representation mechanisms, with many approaches relying on pixel-level dense depth estimation to capture local details, which incurs significant computational overhead; and (b) inefficiency in extending the depth representation range, particularly when distinguishing near from far objects, which makes it difficult to balance global depth relationships against local details. To address these challenges, this study introduces the Local Plane Estimation with Multi-Scale Fusion Network (LMNet) for monocular depth estimation. The encoder utilizes stacked Transformer blocks to extract multi-scale global depth features and capture long-range dependencies. The decoder incorporates a Local Plane Estimation (LPE) module that generates local plane parameters from multi-scale features, enabling efficient recovery of depth details and improving local accuracy. Furthermore, the Multi-scale Attentive Fusion (MAF) module performs weighted fusion of multi-scale depth features using attention mechanisms, adaptively assigning contribution weights, reducing redundancy, and dynamically prioritizing feature integration to ensure structural consistency and detailed representation in the depth map. The synergistic design of these modules significantly enhances both the quality and efficiency of monocular depth estimation. Extensive experiments show that LMNet achieves significant advantages in both depth estimation accuracy and computational efficiency.
Compared to NeWCRFs, LMNet achieves a 15.05% reduction in RMSE and a 10.63% decrease in inference time on the NYU Depth V2 dataset, while its model occupies just 11.3 MB. In zero-shot evaluation on the high-resolution HRWSI dataset, LMNet attains an average inference latency of only 90.9 ms and an RMSE of 0.355, further validating its balance between fast inference and high-precision estimation.
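The abstract does not specify how the LPE module parameterizes its local planes, but the general idea of plane-based depth recovery is that each low-resolution cell predicts the coefficients of a plane d(u, v) = a·u + b·v + c, which is then expanded to dense per-pixel depth. The sketch below illustrates that general technique in numpy; the function name, cell size, and normalized-coordinate convention are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def planes_to_depth(plane_params, cell, H, W):
    """Expand per-cell plane coefficients (a, b, c) into a dense H x W depth map.

    plane_params: array of shape (H//cell, W//cell, 3); each cell predicts a
        plane d(u, v) = a*u + b*v + c over pixel coordinates normalized to [0, 1).
    cell: side length (in pixels) of each local plane region.
    """
    depth = np.empty((H, W), dtype=np.float64)
    for i in range(H // cell):
        for j in range(W // cell):
            a, b, c = plane_params[i, j]
            # Normalized pixel coordinates covered by this cell.
            v = (np.arange(i * cell, (i + 1) * cell) / H)[:, None]  # rows
            u = (np.arange(j * cell, (j + 1) * cell) / W)[None, :]  # cols
            depth[i * cell:(i + 1) * cell,
                  j * cell:(j + 1) * cell] = a * u + b * v + c
    return depth
```

The appeal of this representation, as the abstract argues, is efficiency: a coarse grid of 3 coefficients per cell can recover smooth local detail without predicting every pixel independently.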
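The MAF module's attention-weighted fusion can be illustrated in its simplest form: given feature maps from several scales (already resampled to a common resolution) and one attention logit map per scale, normalize the logits across scales with a softmax so the contribution weights sum to one at every pixel, then take the weighted sum. This is a minimal numpy sketch of that general mechanism, not the paper's architecture; the function names and tensor shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_fusion(features, scores):
    """Fuse multi-scale feature maps with per-pixel attention weights.

    features: list of S arrays, each (C, H, W), upsampled to a common resolution.
    scores:   (S, H, W) unnormalized attention logits, one map per scale.
    Returns the (C, H, W) fused feature map.
    """
    w = softmax(scores, axis=0)                 # per-pixel weights over scales
    stacked = np.stack(features, axis=0)        # (S, C, H, W)
    return (w[:, None] * stacked).sum(axis=0)   # weighted sum over the S axis
```

Because the weights are produced per pixel, the fusion can favor coarse scales in smooth regions and fine scales near depth discontinuities, which matches the abstract's claim of adaptively assigned contribution weights.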
About the journal
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.