{"title":"Contrastive Learning-Based Fine-Tuning Method for Cross-Modal Text-Image Retrieval","authors":"Wei Zhao, Xuan Ma, Weigang Wang","doi":"10.1002/cpe.70228","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>With the rapid proliferation of social media and smart devices, multimodal data has grown explosively, making traditional unimodal retrieval methods insufficient for addressing cross-modal semantic correlation tasks. To tackle the challenges caused by text redundancy and image noise in real-world scenarios, this paper proposes a contrastive learning-based, two-stage progressive fine-tuning approach for building a high-precision text-image cross-modal retrieval system. We design an efficient data preprocessing pipeline: Text data undergoes tokenization, stop-word filtering, and TF-IDF-based keyword extraction, while image data is enhanced using Cutout-style random masking to improve robustness against occlusion and noise. The model employs a dual-tower architecture composed of a ResNet50 visual encoder and a RoBERTa-based text encoder, with joint embedding space optimized using InfoNCE loss. A Locked-image Tuning (LiT) strategy is introduced, where the visual encoder is initially frozen and then both encoders are fine-tuned jointly with mixed-precision training and gradient clipping to ensure convergence stability. To improve data loading efficiency, we utilize LMDB to store 50,000 image-text pairs, significantly reducing I/O overhead. Experiments on an industry-scale dataset demonstrate that the fine-tuned model achieves R@5 of 87.1% (text-to-image) and 87.4% (image-to-text), outperforming baselines by over 13% while reducing GPU memory usage by 18%. Our method achieves a balance between accuracy, efficiency, and scalability, making it suitable for applications such as social media content management and e-commerce cross-modal search.</p>\n </div>","PeriodicalId":55214,"journal":{"name":"Concurrency and Computation-Practice & Experience","volume":"37 21-22","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Concurrency and Computation-Practice & Experience","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70228","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
With the rapid proliferation of social media and smart devices, multimodal data has grown explosively, making traditional unimodal retrieval methods insufficient for addressing cross-modal semantic correlation tasks. To tackle the challenges caused by text redundancy and image noise in real-world scenarios, this paper proposes a contrastive learning-based, two-stage progressive fine-tuning approach for building a high-precision text-image cross-modal retrieval system. We design an efficient data preprocessing pipeline: Text data undergoes tokenization, stop-word filtering, and TF-IDF-based keyword extraction, while image data is enhanced using Cutout-style random masking to improve robustness against occlusion and noise. The model employs a dual-tower architecture composed of a ResNet50 visual encoder and a RoBERTa-based text encoder, with joint embedding space optimized using InfoNCE loss. A Locked-image Tuning (LiT) strategy is introduced, where the visual encoder is initially frozen and then both encoders are fine-tuned jointly with mixed-precision training and gradient clipping to ensure convergence stability. To improve data loading efficiency, we utilize LMDB to store 50,000 image-text pairs, significantly reducing I/O overhead. Experiments on an industry-scale dataset demonstrate that the fine-tuned model achieves R@5 of 87.1% (text-to-image) and 87.4% (image-to-text), outperforming baselines by over 13% while reducing GPU memory usage by 18%. Our method achieves a balance between accuracy, efficiency, and scalability, making it suitable for applications such as social media content management and e-commerce cross-modal search.
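To make the preprocessing pipeline concrete, the sketch below approximates the two steps the abstract names: TF-IDF keyword extraction with stop-word filtering, and Cutout-style random masking. This is a minimal Python illustration assuming scikit-learn and PyTorch; the patch size (32) and top-k value (5) are illustrative choices, not values reported in the paper.

import random
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(corpus, k=5):
    # Tokenization and English stop-word filtering are delegated to
    # scikit-learn's built-in analyzer; keep the top-k terms per document
    # by TF-IDF weight. k=5 is an illustrative choice.
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(corpus)               # sparse (n_docs, n_terms)
    vocab = vec.get_feature_names_out()
    keywords = []
    for row in tfidf:
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:k]
        keywords.append([vocab[i] for i in top if weights[i] > 0])
    return keywords

def cutout(image: torch.Tensor, patch: int = 32) -> torch.Tensor:
    # Cutout-style masking: zero one random square patch of a (C, H, W)
    # tensor; assumes the image is larger than the patch.
    _, h, w = image.shape
    y = random.randint(0, h - patch)
    x = random.randint(0, w - patch)
    masked = image.clone()
    masked[:, y:y + patch, x:x + patch] = 0.0
    return masked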
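The dual-tower architecture can be sketched as follows, assuming torchvision's ResNet50 and the Hugging Face roberta-base checkpoint; the 256-dimensional shared embedding space is an assumed projection size, since the abstract does not state one.

import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel

class DualTower(nn.Module):
    # ResNet50 image tower and RoBERTa text tower, each projected into a
    # shared embedding space. embed_dim=256 is an illustrative choice.
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.visual = resnet50(weights="IMAGENET1K_V2")
        self.visual.fc = nn.Identity()              # expose 2048-d pooled features
        self.text = RobertaModel.from_pretrained("roberta-base")
        self.img_proj = nn.Linear(2048, embed_dim)
        self.txt_proj = nn.Linear(self.text.config.hidden_size, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.img_proj(self.visual(images))
        out = self.text(input_ids=input_ids, attention_mask=attention_mask)
        txt_emb = self.txt_proj(out.last_hidden_state[:, 0])   # <s> token embedding
        return img_emb, txt_emb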
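The joint embedding space is optimized with InfoNCE. A minimal symmetric formulation over a batch of matched pairs looks like this; the temperature of 0.07 is a common CLIP-style default assumed here, not a value from the paper.

import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE: each image's positive is its own caption, and all
    # other captions in the batch serve as in-batch negatives (and vice versa).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)        # image -> text direction
            + F.cross_entropy(logits.t(), targets)) / 2   # text -> image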
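The two-stage LiT schedule with mixed precision and gradient clipping might look roughly like the loop below, reusing DualTower and info_nce from the sketches above. The max_norm of 1.0 is an assumption, and in a real run one would rebuild the optimizer between stages so the unfrozen visual parameters are included.

import torch
from torch.cuda.amp import autocast, GradScaler

def run_stage(model, loader, optimizer, lock_image_tower: bool, epochs: int = 1):
    # Stage 1 (lock_image_tower=True) freezes the ResNet50 tower, as in
    # Locked-image Tuning; stage 2 unfreezes it for joint fine-tuning.
    for p in model.visual.parameters():
        p.requires_grad = not lock_image_tower
    scaler = GradScaler()
    for _ in range(epochs):
        for images, input_ids, attention_mask in loader:
            optimizer.zero_grad(set_to_none=True)
            with autocast():                        # mixed-precision forward pass
                img_emb, txt_emb = model(images, input_ids, attention_mask)
                loss = info_nce(img_emb, txt_emb)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)              # clip true (unscaled) gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()

Calling run_stage once with lock_image_tower=True and then again with False reproduces the two-stage progressive schedule described in the abstract.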
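For the LMDB-backed data loading, one possible layout (an assumption, not the paper's actual schema) stores each image-text pair under a zero-padded index key:

import lmdb
import pickle

# Illustrative layout for the 50,000 image-text pairs: key is a zero-padded
# index, value is a pickled (jpeg_bytes, caption) tuple. The 8 GB map_size
# is an assumed upper bound, not a figure from the paper.

def write_pairs(db_path: str, pairs):
    env = lmdb.open(db_path, map_size=8 * 1024 ** 3)
    with env.begin(write=True) as txn:
        for i, (jpeg_bytes, caption) in enumerate(pairs):
            txn.put(f"{i:08d}".encode(), pickle.dumps((jpeg_bytes, caption)))
    env.close()

def read_pair(db_path: str, index: int):
    env = lmdb.open(db_path, readonly=True, lock=False)
    with env.begin() as txn:
        record = txn.get(f"{index:08d}".encode())
    env.close()
    return pickle.loads(record)                     # -> (jpeg_bytes, caption)

A Dataset's __getitem__ can then perform a single keyed read per sample (opening the environment once per worker rather than per call), which avoids the per-file open/close overhead that makes directory-of-images loading slow.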
About the Journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.