Multi-SEA: Multi-stage Semantic Enhancement and Aggregation for image–text retrieval

IF 7.4 · CAS Tier 1 (Management Science) · JCR Q1: Computer Science, Information Systems
Zijing Tian, Zhonghong Ou, Yifan Zhu, Shuai Lyu, Hanyu Zhang, Jinghua Xiao, Meina Song
Journal: Information Processing & Management, Vol. 62, Issue 5, Article 104165
DOI: 10.1016/j.ipm.2025.104165
Published: 2025-04-14 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0306457325001062
Citations: 0

Abstract

Image–text retrieval aims to find a general embedding space to semantically align cross-modal tokens. Existing studies struggle to adequately integrate information across different modality encoders and usually neglect implicit semantic information mining, making it difficult to accurately understand and represent cross-modal information. To resolve these problems, we propose a Multi-stage Semantic Enhancement and Aggregation framework (Multi-SEA) with novel networks and training schemes, which can more comprehensively integrate global and local information within both intra-modal and inter-modal features. Multi-SEA first designs a fusion module with an agent attention and gating mechanism, which helps the model focus on crucial information. Multi-SEA then introduces a three-stage scheme to enhance uni-modal information and aggregate fine-grained cross-modal information by involving the fusion module in different stages. Finally, Multi-SEA utilizes a negative sample queue and a hierarchical scheme to facilitate robust contrastive learning and promote expressive capabilities from implicit information. Experimental results demonstrate that Multi-SEA significantly outperforms state-of-the-art schemes, achieving notable improvements in image-to-text and text-to-image retrieval on the Flickr30k, MSCOCO (1K), and MSCOCO (5K) datasets, with Recall@sum increased by 13.3, 2.8, and 4.7, respectively.
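The abstract names two ingredients of the fusion module but gives no implementation details. As a rough illustration only, the sketch below shows generic single-head agent attention (a small set of agent tokens summarizes the keys/values, then queries read back from those summaries) combined with a sigmoid gate that mixes the fused signal with the original features. All function names, shapes, and the absence of batching are our assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(q, k, v, agents):
    """Two cheap attention steps via m << n agent tokens (illustrative)."""
    d = q.shape[-1]
    # step 1: agent tokens aggregate information from all keys/values
    v_a = softmax(agents @ k.T / np.sqrt(d)) @ v        # (m, d)
    # step 2: each query reads back from the m agent summaries
    return softmax(q @ agents.T / np.sqrt(d)) @ v_a     # (n, d)

def gated_fuse(x, fused, w_g):
    """Per-dimension sigmoid gate deciding how much fused signal to keep."""
    g = 1.0 / (1.0 + np.exp(-np.concatenate([x, fused], axis=-1) @ w_g))
    return g * fused + (1.0 - g) * x
```

Because the agent tokens act as a bottleneck of size m, the two-step attention costs O(nmd) instead of the O(n²d) of full self-attention, which is the usual motivation for agent attention.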
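The "negative sample queue" for contrastive learning is likewise only named, not specified. A common realization of this idea (MoCo-style) extends the InfoNCE loss with extra negatives drawn from a queue of past embeddings; the minimal sketch below assumes that design, with a made-up queue size and temperature:

```python
import numpy as np
from collections import deque

QUEUE_SIZE = 4096  # assumed capacity; the paper's value is not given in the abstract

def info_nce_with_queue(img, txt, neg_queue, tau=0.07):
    """Image-to-text InfoNCE loss with queued negatives (illustrative sketch).

    img, txt: 1-D embeddings of a matched pair; neg_queue holds
    L2-normalized past text embeddings serving as extra negatives.
    """
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt)
    negs = np.stack(neg_queue)                            # (k, d)
    logits = np.concatenate([[img @ txt], negs @ img]) / tau
    logits -= logits.max()                                # numerical stability
    # -log softmax probability of the positive pair (index 0)
    return -logits[0] + np.log(np.exp(logits).sum())
```

After each step, the current batch's embeddings would be enqueued (and the oldest dequeued, which `deque(maxlen=QUEUE_SIZE)` does automatically), so the model always contrasts against many more negatives than one batch provides.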
Source journal: Information Processing & Management (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 17.00
Self-citation rate: 11.60%
Articles per year: 276
Review time: 39 days
Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.