Konstantin Schall , Kai Uwe Barthel , Nico Hezel , Andre Moelle
{"title":"A comprehensive approach to improving CLIP-based image retrieval while maintaining joint-embedding alignment","authors":"Konstantin Schall , Kai Uwe Barthel , Nico Hezel , Andre Moelle","doi":"10.1016/j.is.2025.102581","DOIUrl":null,"url":null,"abstract":"<div><div>Contrastive Language–Image Pre-training (CLIP) jointly optimizes an image encoder and a text encoder, yet its semantic supervision can blur the distinction between visually different images that share similar captions, hurting instance-level image retrieval. We study two strategies, two-stage fine-tuning (2SFT) and multi-caption-image pairing (MCIP) that strengthen CLIP models for content-based image retrieval while preserving their cross-modal strengths. 2SFT first adapts the image encoder for retrieval and then realigns the text encoder. MCIP injects multiple pseudo-captions per image so that class labels sharpen retrieval and the extra captions keep text alignment. This extended version augments the original SISAP24 study with experiments on additional models, a systematic investigation of key hyperparameters of the presented approach, insights into the effects of the methods on the model, and more a detailed report on training setting and costs. 
Across four CLIP model families, the proposed methods boost image-to-image retrieval accuracy without sacrificing text-to-image performance, simplifying large-scale multimodal search systems by allowing them to store one embedding per image while being effective in image-to-image and text-to-image search.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"134 ","pages":"Article 102581"},"PeriodicalIF":3.4000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437925000651","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Contrastive Language–Image Pre-training (CLIP) jointly optimizes an image encoder and a text encoder, yet its semantic supervision can blur the distinction between visually different images that share similar captions, hurting instance-level image retrieval. We study two strategies, two-stage fine-tuning (2SFT) and multi-caption-image pairing (MCIP), that strengthen CLIP models for content-based image retrieval while preserving their cross-modal strengths. 2SFT first adapts the image encoder for retrieval and then realigns the text encoder. MCIP injects multiple pseudo-captions per image so that class labels sharpen retrieval while the extra captions maintain text alignment. This extended version augments the original SISAP24 study with experiments on additional models, a systematic investigation of key hyperparameters of the presented approach, insights into the effects of the methods on the model, and a more detailed report on training settings and costs. Across four CLIP model families, the proposed methods boost image-to-image retrieval accuracy without sacrificing text-to-image performance. This simplifies large-scale multimodal search systems by allowing them to store a single embedding per image that is effective for both image-to-image and text-to-image search.
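The abstract does not include code, but the MCIP idea of pairing each image with multiple captions maps naturally onto a multi-positive variant of the symmetric InfoNCE loss used to train CLIP. The following NumPy sketch is an illustrative assumption of how such an objective could look (the function name, signature, and exact averaging over positives are not taken from the paper): each caption still has exactly one positive image, while each image averages its loss over all of its positive captions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_caption_clip_loss(img_emb, txt_emb, caption_to_image, temperature=0.07):
    """Symmetric InfoNCE-style loss with multiple positive captions per image.

    img_emb:          (N_img, D) L2-normalized image embeddings
    txt_emb:          (N_txt, D) L2-normalized caption embeddings (N_txt >= N_img)
    caption_to_image: (N_txt,) index of the image each caption describes
    """
    logits = txt_emb @ img_emb.T / temperature            # (N_txt, N_img)

    # Caption -> image direction: one positive image per caption.
    probs_t2i = softmax(logits, axis=1)
    loss_t2i = -np.log(
        probs_t2i[np.arange(len(txt_emb)), caption_to_image]
    ).mean()

    # Image -> caption direction: average over all positive captions per image.
    probs_i2t = softmax(logits.T, axis=1)                 # (N_img, N_txt)
    pos_mask = caption_to_image[None, :] == np.arange(len(img_emb))[:, None]
    per_image = -(np.log(probs_i2t) * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    loss_i2t = per_image.mean()

    return 0.5 * (loss_t2i + loss_i2t)
```

In this sketch, class labels can enter simply by generating pseudo-captions from them (e.g. "a photo of a <class>") and adding those to `caption_to_image`, which is one plausible reading of how extra captions keep the text tower aligned while the labels sharpen instance-level retrieval.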
About the journal:
Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems.
Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.