Dual Stream Relation Learning Network for Image-Text Retrieval

IF 8.4 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2024-12-25 DOI:10.1109/TMM.2024.3521736

Dongqing Wu;Huihui Li;Cang Gu;Lei Guo;Hang Liu

{"title":"Dual Stream Relation Learning Network for Image-Text Retrieval","authors":"Dongqing Wu;Huihui Li;Cang Gu;Lei Guo;Hang Liu","doi":"10.1109/TMM.2024.3521736","DOIUrl":null,"url":null,"abstract":"Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1551-1565"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814705/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.

查看原文本刊更多论文

用于图像-文本检索的双流关系学习网络

图像文本检索通过特征提取网络和模型体系结构的发展取得了显著的成就。然而，几乎所有基于区域特征的方法在建模模态交互时都面临两个严重的问题。首先，区域特征在特征提取阶段容易产生特征纠缠，难以准确推断出视觉对象之间复杂的模型内关系。其次，区域特征缺乏丰富的上下文信息、背景和对象细节，难以实现与文本信息的精确多模式对齐。在本文中，我们提出了一种新的双流关系学习网络（DSRLN），通过两个关键组件：几何敏感交互自注意（GISA）模块和双信息融合（DIF）模块来共同解决这些问题。具体来说，GISA从两个方面扩展了香草自注意网络，以更好地模拟不同区域之间的内在关系，从而提高高层次的视觉语义推理能力。DIF将网格特征作为附加的视觉信息源，通过掩蔽交叉注意模块和自适应门融合模块实现两类特征之间更深层次、更复杂的融合，能够捕获更全面的视觉信息，学习更精确的多模态匹配。此外，我们的方法还通过局部和全局对齐来学习图像和句子之间更全面的层次对应关系。在Flickr30K和MS-COCO两个公开数据集上的实验结果充分证明了我们模型的优越性和有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.