CLIP can understand depth

Impact Factor 7.6 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Sohee Kim, Jisu Kang, Dunam Kim, Seokju Lee
{"title":"CLIP can understand depth","authors":"Sohee Kim ,&nbsp;Jisu Kang ,&nbsp;Dunam Kim ,&nbsp;Seokju Lee","doi":"10.1016/j.patcog.2025.112475","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called <em>“mirror”</em>. The main design goal of <em>mirror</em> is to derive a non-human language prompt that approximates an optimal natural language prompt: “<em>How far is this location from the camera?</em>” Using this approach, we jointly train two lightweight modules, a <em>mirror</em> and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty <em>et al.</em>, a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating <em>mirror</em> with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of <em>mirror</em> shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112475"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325011380","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called “mirror”. The main design goal of mirror is to derive a non-human language prompt that approximates an optimal natural language prompt: “How far is this location from the camera?” Using this approach, we jointly train two lightweight modules, a mirror and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty et al., a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of mirror shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.
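To make the abstract's architecture concrete, the sketch below illustrates the general idea of replacing CLIP's natural-language token embeddings with a single learnable "mirror" embedding matrix that is passed through a frozen text encoder, correlating the resulting text feature with frozen image patch features, and decoding the similarity map into dense depth with a compact decoder. This is not the authors' code: the frozen CLIP encoders are replaced by stub modules, and all names, dimensions, and the decoder design are illustrative assumptions.

```python
# Minimal sketch of the "mirror" idea, NOT the paper's implementation.
# Frozen CLIP encoders are stood in for by stubs; all shapes and module
# names (MirrorDepth, mirror, decoder) are hypothetical.
import torch
import torch.nn as nn

EMBED_DIM, N_TOKENS, N_PATCHES, FEAT_DIM = 512, 8, 196, 512

class FrozenTextEncoder(nn.Module):
    """Stand-in for CLIP's frozen text transformer: token embeddings -> one text feature."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, FEAT_DIM)
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, tok):                      # tok: (B, N_TOKENS, EMBED_DIM)
        return self.proj(tok.mean(dim=1))        # (B, FEAT_DIM)

class FrozenImageEncoder(nn.Module):
    """Stand-in for CLIP's frozen vision encoder: image -> dense patch features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, FEAT_DIM, kernel_size=16, stride=16)
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, img):                      # img: (B, 3, 224, 224)
        f = self.proj(img)                       # (B, FEAT_DIM, 14, 14)
        return f.flatten(2).transpose(1, 2)      # (B, N_PATCHES, FEAT_DIM)

class MirrorDepth(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_enc = FrozenTextEncoder()
        self.image_enc = FrozenImageEncoder()
        # "mirror": a learnable embedding matrix that replaces the pre-trained
        # natural-language token embeddings fed to the frozen text encoder.
        self.mirror = nn.Parameter(torch.randn(N_TOKENS, EMBED_DIM) * 0.02)
        # Compact decoder: patch-text similarity map -> dense depth (illustrative only).
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, img):
        b = img.shape[0]
        patches = self.image_enc(img)                        # (B, P, D), frozen
        text = self.text_enc(self.mirror.expand(b, -1, -1))  # (B, D), frozen encoder, learnable input
        patches = nn.functional.normalize(patches, dim=-1)
        text = nn.functional.normalize(text, dim=-1)
        sim = torch.einsum("bpd,bd->bp", patches, text)      # cosine similarity per patch
        sim = sim.view(b, 1, 14, 14)                         # fold patches back into a grid
        return self.decoder(sim)                             # (B, 1, 224, 224) dense depth

model = MirrorDepth()
depth = model(torch.randn(2, 3, 224, 224))
print(depth.shape)  # torch.Size([2, 1, 224, 224])
# Only `mirror` and `decoder` receive gradients; both encoders stay frozen.
```

The key design point the sketch tries to convey is that the text encoder's parameters never change: gradients flow only into the mirror matrix (its input) and the decoder, so the semantic prior of the frozen text encoder is distilled into a prompt that need not correspond to any human-readable token sequence.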
Source journal
Pattern Recognition (Engineering & Technology – Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.