Multi-SEA: Multi-stage Semantic Enhancement and Aggregation for image–text retrieval
Zijing Tian, Zhonghong Ou, Yifan Zhu, Shuai Lyu, Hanyu Zhang, Jinghua Xiao, Meina Song
Information Processing & Management, vol. 62, no. 5, Article 104165. Published 2025-04-14. DOI: 10.1016/j.ipm.2025.104165
Impact factor: 7.4. JCR: Q1 (Computer Science, Information Systems).
https://www.sciencedirect.com/science/article/pii/S0306457325001062
Citation count: 0
Abstract
Image–text retrieval aims to find a general embedding space in which cross-modal tokens are semantically aligned. Existing studies struggle to adequately integrate information across different modality encoders and usually neglect implicit semantic information mining, making it difficult to accurately understand and represent cross-modal information. To resolve these problems, we propose a Multi-stage Semantic Enhancement and Aggregation framework (Multi-SEA) with novel networks and training schemes, which more comprehensively integrates global and local information within both intra-modal and inter-modal features. Multi-SEA first designs a fusion module with agent attention and a gating mechanism, which helps the model focus on crucial information. It then introduces a three-stage scheme that enhances uni-modal information and aggregates fine-grained cross-modal information by involving the fusion module in different stages. Finally, Multi-SEA uses a negative sample queue and a hierarchical scheme to facilitate robust contrastive learning and promote expressive capabilities from implicit information. Experimental results demonstrate that Multi-SEA significantly outperforms state-of-the-art schemes on image-to-text and text-to-image retrieval on the Flickr30k, MSCOCO (1K), and MSCOCO (5K) datasets, with Recall@sum increased by 13.3, 2.8, and 4.7, respectively.
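The abstract reports gains in Recall@sum without defining it. In image–text retrieval work this metric is commonly taken to be the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions; the sketch below assumes that convention, which may differ in detail from the paper's evaluation code. The function names and the toy similarity scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def recall_at_k(sim_rows: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = 0
    for scores, gold in zip(sim_rows, relevant):
        # Rank candidate indices by descending similarity to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim_rows)


def recall_sum(i2t_sim, i2t_gold, t2i_sim, t2i_gold, ks=(1, 5, 10)) -> float:
    """Sum of Recall@K over both directions (assumed definition of Recall@sum)."""
    total = 0.0
    for k in ks:
        total += recall_at_k(i2t_sim, i2t_gold, k)  # image-to-text
        total += recall_at_k(t2i_sim, t2i_gold, k)  # text-to-image
    return total


# Toy data (hypothetical): 2 images x 2 texts, image i matches text i.
i2t_sim = [[0.9, 0.1],
           [0.8, 0.2]]
# Text-to-image scores are the transpose of the image-to-text matrix.
t2i_sim = [list(col) for col in zip(*i2t_sim)]
gold = [{0}, {1}]

# ks=(1, 2) because the toy set has only two candidates per query.
toy_rsum = recall_sum(i2t_sim, gold, t2i_sim, gold, ks=(1, 2))
print(toy_rsum)
```

With the default `ks=(1, 5, 10)` the maximum attainable value is 600, so the reported increases of 13.3, 2.8, and 4.7 are absolute points on that scale under this assumed definition.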
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.