Konstantin Schall , Kai Uwe Barthel , Nico Hezel , Andre Moelle
{"title":"A comprehensive approach to improving CLIP-based image retrieval while maintaining joint-embedding alignment","authors":"Konstantin Schall , Kai Uwe Barthel , Nico Hezel , Andre Moelle","doi":"10.1016/j.is.2025.102581","DOIUrl":null,"url":null,"abstract":"<div><div>Contrastive Language–Image Pre-training (CLIP) jointly optimizes an image encoder and a text encoder, yet its semantic supervision can blur the distinction between visually different images that share similar captions, hurting instance-level image retrieval. We study two strategies, two-stage fine-tuning (2SFT) and multi-caption-image pairing (MCIP) that strengthen CLIP models for content-based image retrieval while preserving their cross-modal strengths. 2SFT first adapts the image encoder for retrieval and then realigns the text encoder. MCIP injects multiple pseudo-captions per image so that class labels sharpen retrieval and the extra captions keep text alignment. This extended version augments the original SISAP24 study with experiments on additional models, a systematic investigation of key hyperparameters of the presented approach, insights into the effects of the methods on the model, and more a detailed report on training setting and costs. 
Across four CLIP model families, the proposed methods boost image-to-image retrieval accuracy without sacrificing text-to-image performance, simplifying large-scale multimodal search systems by allowing them to store one embedding per image while being effective in image-to-image and text-to-image search.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"134 ","pages":"Article 102581"},"PeriodicalIF":3.4000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437925000651","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Contrastive Language–Image Pre-training (CLIP) jointly optimizes an image encoder and a text encoder, yet its semantic supervision can blur the distinction between visually different images that share similar captions, hurting instance-level image retrieval. We study two strategies, two-stage fine-tuning (2SFT) and multi-caption-image pairing (MCIP), that strengthen CLIP models for content-based image retrieval while preserving their cross-modal strengths. 2SFT first adapts the image encoder for retrieval and then realigns the text encoder. MCIP injects multiple pseudo-captions per image so that class labels sharpen retrieval while the extra captions maintain text alignment. This extended version augments the original SISAP24 study with experiments on additional models, a systematic investigation of key hyperparameters of the presented approach, insights into the effects of the methods on the model, and a more detailed report on training settings and costs. Across four CLIP model families, the proposed methods boost image-to-image retrieval accuracy without sacrificing text-to-image performance. This simplifies large-scale multimodal search systems by allowing them to store a single embedding per image that is effective for both image-to-image and text-to-image search.
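The abstract does not include code, but the MCIP idea of pairing each image with multiple captions maps naturally onto a multi-positive variant of the symmetric InfoNCE loss used to train CLIP. The following NumPy sketch is an illustrative assumption of how such an objective could look (the function name, signature, and exact averaging over positives are not taken from the paper): each caption still has exactly one positive image, while each image averages its loss over all of its positive captions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_caption_clip_loss(img_emb, txt_emb, caption_to_image, temperature=0.07):
    """Symmetric InfoNCE-style loss with multiple positive captions per image.

    img_emb:          (N_img, D) L2-normalized image embeddings
    txt_emb:          (N_txt, D) L2-normalized caption embeddings (N_txt >= N_img)
    caption_to_image: (N_txt,) index of the image each caption describes
    """
    logits = txt_emb @ img_emb.T / temperature            # (N_txt, N_img)

    # Caption -> image direction: one positive image per caption.
    probs_t2i = softmax(logits, axis=1)
    loss_t2i = -np.log(
        probs_t2i[np.arange(len(txt_emb)), caption_to_image]
    ).mean()

    # Image -> caption direction: average over all positive captions per image.
    probs_i2t = softmax(logits.T, axis=1)                 # (N_img, N_txt)
    pos_mask = caption_to_image[None, :] == np.arange(len(img_emb))[:, None]
    per_image = -(np.log(probs_i2t) * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    loss_i2t = per_image.mean()

    return 0.5 * (loss_t2i + loss_i2t)
```

In this sketch, class labels can enter simply by generating pseudo-captions from them (e.g. "a photo of a <class>") and adding those to `caption_to_image`, which is one plausible reading of how extra captions keep the text tower aligned while the labels sharpen instance-level retrieval.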
About the journal:
Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems.
Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.