STDepth: Leveraging semantic-textural information in transformers for self-supervised monocular depth estimation
Xuanang Gao, Bingchao Wang, Zhiwei Ning, Jie Yang, Wei Liu
Computer Vision and Image Understanding, Volume 259, Article 104422
DOI: 10.1016/j.cviu.2025.104422
Published: 2025-06-18
Citations: 0
Abstract
Self-supervised monocular depth estimation, which relies solely on monocular or stereo video for supervision, plays an important role in computer vision. The encoder backbone generates features at various stages, and each stage exhibits distinct properties. Conventional methods, however, fail to take full advantage of these distinctions: they apply the same processing to features from different stages and thus lack the adaptability needed to aggregate the information unique to each stage. In this research, we replace convolutional neural networks (CNNs) with a Transformer as the encoder backbone to enhance the model’s ability to encode long-range spatial dependencies. Furthermore, we introduce a semantic-textural decoder (STDec) that emphasizes local critical regions and processes intricate details more effectively. The STDec incorporates two principal modules: (1) the global feature recalibration (GFR) module, which analyzes the overall scene structure using high-level features and recalibrates features along the spatial dimension through semantic information, and (2) the detail focus (DF) module, which is applied to low-level features to capture texture details precisely. Additionally, we propose a novel multi-arbitrary-scale reconstruction loss (MAS Loss) to fully exploit the depth estimation network’s capabilities. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the KITTI dataset. Moreover, our models show remarkable generalization when applied to the Make3D and NYUv2 datasets. The code is publicly available at: https://github.com/xagao/STDepth.
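For context on the loss design, the sketch below illustrates the multi-scale photometric reconstruction loss that is standard in self-supervised depth estimation: an SSIM-plus-L1 term averaged over reconstructions produced at several decoder scales. This is a minimal example under stated assumptions, not the paper’s implementation; the function names and the 0.85/0.15 weighting are conventional choices borrowed from prior work, and the actual MAS Loss, which reconstructs at arbitrary scales, is not specified in the abstract and may differ.

```python
# Minimal sketch of a multi-scale photometric reconstruction loss,
# the standard training signal in self-supervised depth estimation.
# All names here are illustrative; this is NOT the paper's MAS Loss.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM dissimilarity over 3x3 windows (0 = identical)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(pred, target, alpha=0.85):
    """Weighted SSIM + L1 reconstruction term (weights as in prior work)."""
    return alpha * ssim(pred, target).mean() + (1 - alpha) * (pred - target).abs().mean()

def multi_scale_loss(warped_images, target):
    """Average the photometric loss over reconstructions at several scales,
    upsampling each one to full resolution before comparison."""
    total = 0.0
    for warped in warped_images:  # one reconstructed view per decoder scale
        if warped.shape[-2:] != target.shape[-2:]:
            warped = F.interpolate(warped, size=target.shape[-2:],
                                   mode='bilinear', align_corners=False)
        total = total + photometric_loss(warped, target)
    return total / len(warped_images)
```

In practice, each warped image is produced by view synthesis: a source frame is warped into the target view using the predicted depth and camera pose, so reconstruction error supervises depth without ground-truth labels. The MAS Loss presumably generalizes the fixed set of decoder scales above to arbitrarily chosen ones.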
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems