STDepth: Leveraging semantic-textural information in transformers for self-supervised monocular depth estimation
Xuanang Gao, Bingchao Wang, Zhiwei Ning, Jie Yang, Wei Liu
Computer Vision and Image Understanding, Volume 259, Article 104422
DOI: 10.1016/j.cviu.2025.104422
Published: 2025-06-18
Citations: 0
Abstract
Self-supervised monocular depth estimation, which relies solely on monocular or stereo video for supervision, plays an important role in computer vision. The encoder backbone generates features at various stages, and each stage exhibits distinct properties. Conventional methods, however, fail to take full advantage of these distinctions: they apply the same processing to features from different stages and thus lack the adaptability needed to aggregate the information unique to each stage. In this research, we replace convolutional neural networks (CNNs) with a Transformer as the encoder backbone to enhance the model’s ability to encode long-range spatial dependencies. Furthermore, we introduce a semantic-textural decoder (STDec) that emphasizes local critical regions and processes intricate details more effectively. The STDec incorporates two principal modules: (1) the global feature recalibration (GFR) module, which analyzes the overall scene structure using high-level features and recalibrates features along the spatial dimension through semantic information, and (2) the detail focus (DF) module, which is applied to low-level features to capture texture details precisely. Additionally, we propose a novel multi-arbitrary-scale reconstruction loss (MAS Loss) to fully exploit the depth estimation network’s capabilities. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the KITTI dataset. Moreover, our models show remarkable generalization when applied to the Make3D and NYUv2 datasets. The code is publicly available at: https://github.com/xagao/STDepth.
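For context on the loss design, the sketch below illustrates the multi-scale photometric reconstruction loss that is standard in self-supervised depth estimation: an SSIM-plus-L1 term averaged over reconstructions produced at several decoder scales. This is a minimal example under stated assumptions, not the paper’s implementation; the function names and the 0.85/0.15 weighting are conventional choices borrowed from prior work, and the actual MAS Loss, which reconstructs at arbitrary scales, is not specified in the abstract and may differ.

```python
# Minimal sketch of a multi-scale photometric reconstruction loss,
# the standard training signal in self-supervised depth estimation.
# All names here are illustrative; this is NOT the paper's MAS Loss.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM dissimilarity over 3x3 windows (0 = identical)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(pred, target, alpha=0.85):
    """Weighted SSIM + L1 reconstruction term (weights as in prior work)."""
    return alpha * ssim(pred, target).mean() + (1 - alpha) * (pred - target).abs().mean()

def multi_scale_loss(warped_images, target):
    """Average the photometric loss over reconstructions at several scales,
    upsampling each one to full resolution before comparison."""
    total = 0.0
    for warped in warped_images:  # one reconstructed view per decoder scale
        if warped.shape[-2:] != target.shape[-2:]:
            warped = F.interpolate(warped, size=target.shape[-2:],
                                   mode='bilinear', align_corners=False)
        total = total + photometric_loss(warped, target)
    return total / len(warped_images)
```

In practice, each warped image is produced by view synthesis: a source frame is warped into the target view using the predicted depth and camera pose, so reconstruction error supervises depth without ground-truth labels. The MAS Loss presumably generalizes the fixed set of decoder scales above to arbitrarily chosen ones.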
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems