Dual-stage framework with soft-label distillation and spatial prompting for image-text retrieval

Ran Jin, Zhengang Li, Fang Deng, Yanhong Zhang, Min Luo, Tao Jin, Tengda Hou, Chenjie Du, Xiaozhe Gu, Jie Yuan

PLoS ONE, 20(10): e0333084, published 2025-10-10
DOI: 10.1371/journal.pone.0333084
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12513663/pdf/
Citations: 0
Abstract
Vision-language pre-training (VLP) methods have significantly advanced cross-modal tasks in recent years. However, image-text retrieval still faces two critical challenges: inter-modal matching deficiency and intra-modal fine-grained localization deficiency. These issues significantly impede the accuracy of image-text retrieval. To address these challenges, we propose a novel dual-stage training framework. In the first stage, we employ Soft Label Distillation (SLD) to align the contrastive relationships between images and texts by mitigating the overfitting problem caused by hard labels. In the second stage, we introduce Spatial Text Prompt (STP) to enhance the model's visual grounding capabilities by incorporating spatial prompt information, thereby achieving more precise fine-grained alignment. Extensive experiments on standard datasets show that our method outperforms state-of-the-art approaches in image-text retrieval. The code and supplementary files can be found at https://github.com/Leon001211/DSSLP.
About the Journal:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open-access—freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage