CLIP can understand depth

Impact Factor 7.6 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Sohee Kim, Jisu Kang, Dunam Kim, Seokju Lee
{"title":"CLIP can understand depth","authors":"Sohee Kim ,&nbsp;Jisu Kang ,&nbsp;Dunam Kim ,&nbsp;Seokju Lee","doi":"10.1016/j.patcog.2025.112475","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called <em>“mirror”</em>. The main design goal of <em>mirror</em> is to derive a non-human language prompt that approximates an optimal natural language prompt: “<em>How far is this location from the camera?</em>” Using this approach, we jointly train two lightweight modules, a <em>mirror</em> and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty <em>et al.</em>, a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating <em>mirror</em> with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of <em>mirror</em> shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112475"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325011380","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called “mirror”. The main design goal of mirror is to derive a non-human language prompt that approximates an optimal natural language prompt: “How far is this location from the camera?” Using this approach, we jointly train two lightweight modules, a mirror and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty et al., a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of mirror shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.
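To make the abstract's architecture concrete, the sketch below illustrates the general idea of replacing CLIP's natural-language token embeddings with a single learnable "mirror" embedding matrix that is passed through a frozen text encoder, correlating the resulting text feature with frozen image patch features, and decoding the similarity map into dense depth with a compact decoder. This is not the authors' code: the frozen CLIP encoders are replaced by stub modules, and all names, dimensions, and the decoder design are illustrative assumptions.

```python
# Minimal sketch of the "mirror" idea, NOT the paper's implementation.
# Frozen CLIP encoders are stood in for by stubs; all shapes and module
# names (MirrorDepth, mirror, decoder) are hypothetical.
import torch
import torch.nn as nn

EMBED_DIM, N_TOKENS, N_PATCHES, FEAT_DIM = 512, 8, 196, 512

class FrozenTextEncoder(nn.Module):
    """Stand-in for CLIP's frozen text transformer: token embeddings -> one text feature."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, FEAT_DIM)
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, tok):                      # tok: (B, N_TOKENS, EMBED_DIM)
        return self.proj(tok.mean(dim=1))        # (B, FEAT_DIM)

class FrozenImageEncoder(nn.Module):
    """Stand-in for CLIP's frozen vision encoder: image -> dense patch features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, FEAT_DIM, kernel_size=16, stride=16)
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, img):                      # img: (B, 3, 224, 224)
        f = self.proj(img)                       # (B, FEAT_DIM, 14, 14)
        return f.flatten(2).transpose(1, 2)      # (B, N_PATCHES, FEAT_DIM)

class MirrorDepth(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_enc = FrozenTextEncoder()
        self.image_enc = FrozenImageEncoder()
        # "mirror": a learnable embedding matrix that replaces the pre-trained
        # natural-language token embeddings fed to the frozen text encoder.
        self.mirror = nn.Parameter(torch.randn(N_TOKENS, EMBED_DIM) * 0.02)
        # Compact decoder: patch-text similarity map -> dense depth (illustrative only).
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, img):
        b = img.shape[0]
        patches = self.image_enc(img)                        # (B, P, D), frozen
        text = self.text_enc(self.mirror.expand(b, -1, -1))  # (B, D), frozen encoder, learnable input
        patches = nn.functional.normalize(patches, dim=-1)
        text = nn.functional.normalize(text, dim=-1)
        sim = torch.einsum("bpd,bd->bp", patches, text)      # cosine similarity per patch
        sim = sim.view(b, 1, 14, 14)                         # fold patches back into a grid
        return self.decoder(sim)                             # (B, 1, 224, 224) dense depth

model = MirrorDepth()
depth = model(torch.randn(2, 3, 224, 224))
print(depth.shape)  # torch.Size([2, 1, 224, 224])
# Only `mirror` and `decoder` receive gradients; both encoders stay frozen.
```

The key design point the sketch tries to convey is that the text encoder's parameters never change: gradients flow only into the mirror matrix (its input) and the decoder, so the semantic prior of the frozen text encoder is distilled into a prompt that need not correspond to any human-readable token sequence.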
Source journal
Pattern Recognition (Engineering & Technology – Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.