Local plane estimation network with multi-scale fusion for efficient monocular depth estimation

IF 7.5 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Lei Song, Bo Jiang, Huaibo Song
DOI: 10.1016/j.eswa.2025.128746
Journal: Expert Systems with Applications, Volume 293, Article 128746
Published: 2025-06-25
URL: https://www.sciencedirect.com/science/article/pii/S0957417425023644
Citations: 0

Abstract

Estimating scene depth from a single image is a critical yet challenging task in computer vision, with widespread applications in autonomous driving, 3D reconstruction, and scene understanding. In monocular depth estimation, effective depth representation and accurate extraction and integration of local details are crucial for reliable results. However, existing methods face two major challenges in dealing with complex depth relationships: (a) a lack of efficient feature representation mechanisms, often relying on pixel-level dense depth estimation to capture local details, which incurs significant computational overhead; and (b) inefficiency in extending the depth representation range, particularly when distinguishing near and far objects, making it difficult to balance global depth relationships and local details. To address these challenges, this study introduces the Local Plane Estimation with Multi-Scale Fusion Network (LMNet) for monocular depth estimation. The encoder uses stacked Transformer blocks to extract multi-scale global depth features and capture long-range dependencies. The decoder incorporates a Local Plane Estimation (LPE) module that generates local plane parameters from multi-scale features, enabling efficient recovery of depth details and improving local accuracy. In addition, a Multi-scale Attentive Fusion (MAF) module performs weighted fusion of multi-scale depth features using attention mechanisms, adaptively assigning contribution weights, reducing redundancy, and dynamically prioritizing feature integration to ensure structural consistency and detailed representation in the depth map. The synergistic design of these modules significantly enhances both the quality and efficiency of monocular depth estimation. Extensive experiments show that LMNet achieves significant advantages in both depth estimation accuracy and computational efficiency.
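To make the local-plane idea concrete: the abstract does not give the LPE module's internals, but the underlying parameterization it relies on can be sketched. A depth patch that lies on a plane d = a·u + b·v + c is fully described by just three parameters instead of one value per pixel, which is where the efficiency claim comes from. The function names and the least-squares fit below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fit_local_plane(depth_patch):
    """Least-squares fit of plane parameters (a, b, c) to an HxW depth patch."""
    h, w = depth_patch.shape
    v, u = np.mgrid[0:h, 0:w]  # per-pixel (row, col) coordinates
    A = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=1)
    params, *_ = np.linalg.lstsq(A, depth_patch.ravel(), rcond=None)
    return params  # (a, b, c)

def plane_to_depth(params, h, w):
    """Reconstruct a dense HxW depth patch from its 3 plane parameters."""
    v, u = np.mgrid[0:h, 0:w]
    a, b, c = params
    return a * u + b * v + c

# A planar 8x8 depth patch (64 values) round-trips through only 3 numbers.
gt = plane_to_depth(np.array([0.02, -0.01, 2.5]), 8, 8)
rec = plane_to_depth(fit_local_plane(gt), 8, 8)
assert np.allclose(gt, rec)
```

In LMNet the plane parameters are predicted from multi-scale features by a learned module rather than fitted, but the representational saving (3 parameters per local region versus dense per-pixel depth) is the same.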
Compared to NeWCRFs, LMNet achieves a 15.05% reduction in RMSE and a 10.63% decrease in inference time on the NYU Depth V2 dataset, while compressing the model to just 11.3 MB. In zero-shot evaluation on the high-resolution HRWSI dataset, LMNet attains an average inference latency of only 90.9 ms and an RMSE of 0.355, further validating its balance between fast inference and high-precision estimation.
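The MAF module's attention-weighted fusion of scales can likewise be sketched. The toy version below scores each scale's feature vector against a scoring vector (random here, learned in practice) and fuses them with softmax weights, so each scale's contribution is assigned adaptively; all names are hypothetical and this is not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_fusion(features, score_w):
    """features: list of S arrays of shape (C,); score_w: (C,) scoring vector.
    Returns the softmax-weighted sum over scales and the per-scale weights."""
    stacked = np.stack(features)   # (S, C): one feature vector per scale
    scores = stacked @ score_w     # one scalar attention score per scale
    weights = softmax(scores)      # contribution weights, summing to 1
    return weights @ stacked, weights

rng = np.random.default_rng(0)
feats = [rng.standard_normal(16) for _ in range(4)]  # 4 scales, C = 16
fused, w = attentive_fusion(feats, rng.standard_normal(16))
assert np.isclose(w.sum(), 1.0) and fused.shape == (16,)
```

The softmax normalization is what lets the module down-weight redundant scales while keeping the fused output on the same feature scale as its inputs.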
Source journal: Expert Systems with Applications (Engineering & Technology – Electrical & Electronic Engineering)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles per year: 2045
Review time: 8.7 months
Aims and scope: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.