DFF-Mono: A lightweight self-supervised monocular depth estimation method based on dual-branch feature fusion

IF 3.4 | Zone 2, Engineering & Technology | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays | Pub Date: 2025-07-20 | DOI: 10.1016/j.displa.2025.103167

Han Zhang, Xiaojun Yu, Hengrong Guo, Liang Shen, Zeming Fan

Displays, vol. 90, Article 103167. https://www.sciencedirect.com/science/article/pii/S0141938225002045

Citations: 0

Abstract

Monocular depth estimation is a fundamental challenge in 3D scene understanding, particularly under the constraints of unsupervised learning paradigms. Although existing self-supervised methods avoid the dependency on annotated depth labels, their high computational complexity significantly hinders deployment on resource-constrained mobile platforms. To address this issue, we propose DFF-Mono, a parameter-efficient framework that jointly optimizes depth estimation accuracy and computational efficiency. DFF-Mono comprises three main components. First, a lightweight encoder integrates Dual-Kernel Dilated Convolution (DKDC) modules with a Dual-branch Feature Fusion (DFF) architecture for multi-scale feature encoding. Second, a novel Attention-guided Large Kernel Inception (ALKI) module with multi-branch large-kernel convolution leverages local–global attention guidance for efficient local feature extraction. Third, a frequency-domain optimization strategy, implemented via adaptive Gaussian low-pass filtering, enhances training efficiency without introducing any additional network parameters. Extensive experiments on standard benchmarks demonstrate that DFF-Mono outperforms existing approaches. Notably, DFF-Mono reduces model parameters by 23% compared with current state-of-the-art solutions while consistently achieving superior depth accuracy.
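To make the frequency-domain component more concrete, the following is a minimal sketch of plain Gaussian low-pass filtering in the Fourier domain, which adds no learnable parameters. It is not the authors' implementation: the abstract does not say how the filter is made adaptive or where it is applied during training, so the fixed `sigma` and the helper names `dft`, `dft2`, and `gaussian_lowpass` are assumptions chosen for illustration. A naive DFT is used only to keep the example dependency-free; a real pipeline would use an FFT library.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(seq):
            acc += x * cmath.exp(sign * 2j * math.pi * k * t / n)
        # The inverse transform carries the 1/n normalisation.
        out.append(acc / n if inverse else acc)
    return out


def dft2(img, inverse=False):
    """2-D DFT computed separably: transform rows, then columns."""
    rows = [dft(row, inverse) for row in img]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def gaussian_lowpass(img, sigma):
    """Attenuate high spatial frequencies with a Gaussian mask.

    `sigma` is the mask width in frequency bins. It is a fixed,
    hypothetical parameter here; the paper's adaptive rule is not
    described in the abstract.
    """
    h, w = len(img), len(img[0])
    spec = dft2(img)
    for u in range(h):
        fu = min(u, h - u)  # distance from the zero frequency, with wrap-around
        for v in range(w):
            fv = min(v, w - v)
            spec[u][v] *= math.exp(-(fu * fu + fv * fv) / (2.0 * sigma * sigma))
    # The mask is real and symmetric, so the result is real up to rounding.
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]


# Usage: a 4x4 checkerboard contains only the mean and the highest frequency.
checkerboard = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
smoothed = gaussian_lowpass(checkerboard, sigma=0.5)
```

With a small `sigma` the checkerboard collapses to approximately its mean value of 0.5, because the mask leaves the zero-frequency term untouched and suppresses the rest. With a very large `sigma` the image passes through essentially unchanged.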


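Source journal

Displays (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 4.60

Self-citation rate: 25.60%

Articles published: 138

Review time: 92 days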

Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface. Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human factors engineers new to the field, will also occasionally be featured.