{"title":"Learning Adequate Alignment and Interaction for Cross-Modal Retrieval","authors":"MingKang Wang , Min Meng , Jigang Liu , Jigang Wu","doi":"10.1016/j.vrih.2023.06.003","DOIUrl":null,"url":null,"abstract":"<div><p>Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, especially image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements on image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features, then use complex methods to contextualize and aggregate those features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: 1) without considering intermediate interaction and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; 2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder, which aims to learn adequate alignment and interaction on aggregated features for effectively bridging the modality gap. Experiments on Microsoft COCO and Flickr30k datasets demonstrates the superiority of our model over the state-of-the-art methods.</p></div>","PeriodicalId":33538,"journal":{"name":"Virtual Reality Intelligent Hardware","volume":"5 6","pages":"Pages 509-522"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S209657962300027X/pdf?md5=12c947f69173683c04a27c84c4b305fc&pid=1-s2.0-S209657962300027X-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Virtual Reality Intelligent Hardware","FirstCategoryId":"1093","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S209657962300027X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 0
Abstract
Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, especially image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements on image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features, then use complex methods to contextualize and aggregate those features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: 1) without considering intermediate interaction and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; 2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder, which aims to learn adequate alignment and interaction on aggregated features for effectively bridging the modality gap. Experiments on Microsoft COCO and Flickr30k datasets demonstrates the superiority of our model over the state-of-the-art methods.