{"title":"CLIP可以理解深度","authors":"Sohee Kim , Jisu Kang , Dunam Kim , Seokju Lee","doi":"10.1016/j.patcog.2025.112475","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called <em>“mirror”</em>. The main design goal of <em>mirror</em> is to derive a non-human language prompt that approximates an optimal natural language prompt: “<em>How far is this location from the camera?</em>” Using this approach, we jointly train two lightweight modules, a <em>mirror</em> and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty <em>et al.</em>, a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating <em>mirror</em> with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of <em>mirror</em> shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112475"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP can understand depth\",\"authors\":\"Sohee Kim , Jisu Kang , Dunam Kim , Seokju Lee\",\"doi\":\"10.1016/j.patcog.2025.112475\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called <em>“mirror”</em>. 
The main design goal of <em>mirror</em> is to derive a non-human language prompt that approximates an optimal natural language prompt: “<em>How far is this location from the camera?</em>” Using this approach, we jointly train two lightweight modules, a <em>mirror</em> and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty <em>et al.</em>, a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating <em>mirror</em> with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of <em>mirror</em> shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"172 \",\"pages\":\"Article 112475\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325011380\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325011380","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called “mirror”. The main design goal of mirror is to derive a non-human language prompt that approximates an optimal natural language prompt: “How far is this location from the camera?” Using this approach, we jointly train two lightweight modules, a mirror and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Specifically, our method reduces the Absolute Relative Error (Abs Rel) by 68.7 % on NYU Depth v2 and by 75.6 % on KITTI compared to the method of Auty et al., a representative CLIP-based baseline. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of mirror shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.
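To make the described setup concrete, below is a minimal PyTorch-style sketch of the idea outlined in the abstract: a learnable "mirror" embedding matrix replaces natural-language token embeddings as the prompt to a frozen CLIP text encoder, and a compact decoder is trained jointly with it on top of frozen CLIP image features to predict dense depth. The encoder interfaces, module names (MirrorDepthHead), token counts, and feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: frozen CLIP encoders are passed in as black boxes with assumed
# interfaces (patch-feature extraction; text-encoder forward on raw embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorDepthHead(nn.Module):
    def __init__(self, clip_image_encoder, clip_text_encoder,
                 num_mirror_tokens=32, text_width=512, embed_dim=512):
        super().__init__()
        self.image_encoder = clip_image_encoder  # frozen; assumed to return (B, N, D) patch features
        self.text_encoder = clip_text_encoder    # frozen; assumed to accept token embeddings directly
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # "Mirror": a single learnable embedding matrix standing in for the
        # natural-language prompt "How far is this location from the camera?"
        self.mirror = nn.Parameter(torch.randn(num_mirror_tokens, text_width) * 0.02)

        # Compact decoder: maps prompt-modulated patch features to a dense depth map.
        self.decoder = nn.Sequential(
            nn.Conv2d(embed_dim, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, images):
        # Frozen image features: (B, N_patches, D), reshaped onto a spatial grid.
        patch_feats = self.image_encoder(images)                    # assumed interface
        B, N, D = patch_feats.shape
        H = W = int(N ** 0.5)

        # Frozen text encoder applied to the learnable mirror prompt: (1, D).
        prompt_feat = self.text_encoder(self.mirror.unsqueeze(0))   # assumed interface

        # Cosine-similarity-style modulation of patch features by the prompt.
        patch_feats = F.normalize(patch_feats, dim=-1)
        prompt_feat = F.normalize(prompt_feat, dim=-1)
        feats = (patch_feats * prompt_feat.unsqueeze(1))            # (B, N, D)
        feats = feats.permute(0, 2, 1).reshape(B, D, H, W)

        depth = self.decoder(feats)                                 # (B, 1, H, W)
        # Upsample to input resolution; factor 16 assumes a ViT-B/16 patch size.
        return F.interpolate(depth, scale_factor=16, mode="bilinear",
                             align_corners=False)
```

In this sketch, only `self.mirror` and `self.decoder` receive gradients, which mirrors the abstract's claim that the framework is lightweight in parameters and computation relative to conventional depth models.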
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.