Contrastive Learning-Based Fine-Tuning Method for Cross-Modal Text-Image Retrieval

IF 1.5 · CAS Tier 4 (Computer Science) · JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING
Wei Zhao, Xuan Ma, Weigang Wang
DOI: 10.1002/cpe.70228
Journal: Concurrency and Computation: Practice and Experience, vol. 37, no. 21-22
Published: 2025-08-04 (Journal Article)
Full text: https://onlinelibrary.wiley.com/doi/10.1002/cpe.70228
Citations: 0

Abstract


With the rapid proliferation of social media and smart devices, multimodal data has grown explosively, making traditional unimodal retrieval methods insufficient for addressing cross-modal semantic correlation tasks. To tackle the challenges caused by text redundancy and image noise in real-world scenarios, this paper proposes a contrastive learning-based, two-stage progressive fine-tuning approach for building a high-precision text-image cross-modal retrieval system. We design an efficient data preprocessing pipeline: Text data undergoes tokenization, stop-word filtering, and TF-IDF-based keyword extraction, while image data is enhanced using Cutout-style random masking to improve robustness against occlusion and noise. The model employs a dual-tower architecture composed of a ResNet50 visual encoder and a RoBERTa-based text encoder, with joint embedding space optimized using InfoNCE loss. A Locked-image Tuning (LiT) strategy is introduced, where the visual encoder is initially frozen and then both encoders are fine-tuned jointly with mixed-precision training and gradient clipping to ensure convergence stability. To improve data loading efficiency, we utilize LMDB to store 50,000 image-text pairs, significantly reducing I/O overhead. Experiments on an industry-scale dataset demonstrate that the fine-tuned model achieves R@5 of 87.1% (text-to-image) and 87.4% (image-to-text), outperforming baselines by over 13% while reducing GPU memory usage by 18%. Our method achieves a balance between accuracy, efficiency, and scalability, making it suitable for applications such as social media content management and e-commerce cross-modal search.
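The abstract does not include reference code. Purely as an illustration of the TF-IDF keyword-extraction step in the text pipeline, here is a minimal pure-Python sketch; the function name `tfidf_keywords`, the toy corpus, and the `top_k` parameter are my own assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_idx, top_k=3):
    """Rank the terms of one document by TF-IDF against a small corpus."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[doc_idx])          # term frequency in the target doc
    scores = {t: (c / len(docs[doc_idx])) * math.log(n / df[t])
              for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

docs = [
    ["cat", "sits", "mat"],
    ["dog", "sits", "mat"],
    ["cat", "cat", "runs"],
]
# "runs" is unique to doc 2, so its high IDF outweighs the repeated "cat"
assert tfidf_keywords(docs, 2, top_k=1) == ["runs"]
```

In the paper's pipeline this step would run after tokenization and stop-word filtering, keeping only the highest-scoring keywords to reduce text redundancy.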
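The Cutout-style random masking used for image augmentation can be sketched in NumPy as follows; the `mask_size` value and array shapes are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def cutout(image, mask_size=16, rng=None):
    """Zero out one random square patch to simulate occlusion (Cutout-style)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape[:2]
    # choose the top-left corner so the patch stays fully inside the image
    top = rng.integers(0, h - mask_size + 1)
    left = rng.integers(0, w - mask_size + 1)
    out = image.copy()
    out[top:top + mask_size, left:left + mask_size] = 0
    return out

img = np.ones((64, 64, 3), dtype=np.float32)
masked = cutout(img, mask_size=16)
# exactly one 16x16 patch (times 3 channels) is zeroed
assert (masked == 0).sum() == 16 * 16 * 3
```

Training on masked images encourages the visual encoder to rely on the whole scene rather than any single region, which is the robustness-to-occlusion property the abstract describes.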
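The symmetric InfoNCE objective over the joint embedding space can be written compactly in NumPy; the function name `info_nce` and the temperature 0.07 are illustrative assumptions (the paper's hyperparameters are not given in the abstract):

```python
import numpy as np

def info_nce(text_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Matching pairs sit on the diagonal of the similarity matrix;
    every other entry in the same row or column acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature            # (B, B) similarity matrix

    def xent(l):
        # cross-entropy with targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average the text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
e = rng.normal(size=(8, 32))
loss_matched = info_nce(e, e)                       # perfectly aligned pairs
loss_random = info_nce(e, rng.normal(size=(8, 32)))  # unrelated pairs
assert loss_matched < loss_random
```

The two-direction average is what makes retrieval work both ways (text-to-image and image-to-text), matching the two R@5 figures reported in the abstract.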
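The Locked-image Tuning (LiT) schedule, freezing the visual tower first and then fine-tuning both towers with gradient clipping, can be illustrated with a toy two-parameter model; in a real PyTorch implementation the freeze would be `requires_grad_(False)` on the visual encoder, and the learning rate, clip value, and parameter names below are all illustrative assumptions:

```python
# Schematic of the two-stage LiT schedule on a toy "model" with one
# scalar parameter per tower; stage 1 locks the visual tower.
params = {"visual": 1.0, "text": 1.0}
frozen = {"visual"}

def sgd_step(params, grads, lr=0.1, clip=1.0):
    for name, g in grads.items():
        if name in frozen:
            continue                     # frozen towers receive no update
        g = max(-clip, min(clip, g))     # gradient clipping for stability
        params[name] -= lr * g

# Stage 1: only the text tower moves
sgd_step(params, {"visual": 5.0, "text": 5.0})
assert params["visual"] == 1.0 and params["text"] != 1.0

# Stage 2: unfreeze everything and fine-tune both towers jointly
frozen.clear()
sgd_step(params, {"visual": 5.0, "text": 5.0})
assert params["visual"] != 1.0
```

The point of the schedule is that the text tower first adapts to a stable image embedding space before both towers move together, which is the convergence-stability argument the abstract makes.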

Source journal
Concurrency and Computation: Practice and Experience (Engineering & Technology, Computer Science: Theory & Methods)
CiteScore: 5.00
Self-citation rate: 10.00%
Annual output: 664 articles
Review time: 9.6 months
Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.