Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval

Dongqing Wu, Huihui Li, Cang Gu, Lei Guo, Hang Liu
DOI: 10.1145/3503161.3548223
Published in: Proceedings of the 30th ACM International Conference on Multimedia (October 2022)
Citations: 1

Abstract

In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words that describe global concepts in sentences. Region features also lose fine-grained object details. Fortunately, these disadvantages of region features are precisely the advantages of grid features. In this paper, we propose a novel framework that fuses region features and grid features through a two-step interaction strategy, extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, in which all region features and grid features are represented as graph nodes. By modeling their relationships with the joint graph, information can be passed along the edges. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the different types of features. With these two steps, our model can fully exploit the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art results and pushes the performance of image-text retrieval to a new height.
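The two interaction steps and the pooling stage described above can be sketched roughly as follows. This is a minimal NumPy illustration of the general ideas (graph message passing, cross-attention with a sigmoid gate, and multi-head attention pooling), not the authors' implementation: the projection matrices, the adjacency construction, the head count, and all dimensions are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_message_passing(nodes, adj):
    """Step 1 (sketch): pass information along the edges of a joint graph
    whose nodes are the concatenated region and grid features."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    return nodes + (adj / deg) @ nodes  # residual mean aggregation

def cross_attention_gated_fusion(regions, grids, seed=0):
    """Step 2 (sketch): regions attend over grids, then a sigmoid gate
    adaptively mixes each region feature with its attended grid context."""
    d = regions.shape[1]
    rng = np.random.default_rng(seed)  # random stand-ins for learned weights
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    attn = softmax((regions @ Wq) @ (grids @ Wk).T / np.sqrt(d))
    context = attn @ (grids @ Wv)                  # grid info per region
    gate = 1 / (1 + np.exp(-np.concatenate([regions, context], axis=1) @ Wg))
    return gate * regions + (1 - gate) * context   # adaptive fusion

def multi_attention_pooling(feats, heads=4, seed=1):
    """Sketch: several attention heads each pool the fused features into a
    vector; the head outputs are averaged into one image embedding."""
    n, d = feats.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((heads, d)) / np.sqrt(d)  # per-head scoring vectors
    scores = softmax(feats @ w.T, axis=0)             # (n, heads)
    return (scores.T @ feats).mean(axis=0)            # (d,)

rng = np.random.default_rng(42)
regions = rng.standard_normal((36, 64))   # e.g. 36 detected regions
grids = rng.standard_normal((49, 64))     # e.g. a 7x7 grid of features
nodes = np.concatenate([regions, grids])  # joint graph nodes
adj = (rng.random((85, 85)) < 0.1).astype(float)  # toy spatial adjacency
nodes = graph_message_passing(nodes, adj)
fused = cross_attention_gated_fusion(nodes[:36], nodes[36:])
embed = multi_attention_pooling(np.concatenate([fused, nodes[36:]]))
print(fused.shape, embed.shape)  # (36, 64) (64,)
```

The gate lets the model decide, per dimension, how much of each region feature to keep versus how much attended grid context to mix in, which is one plausible reading of "adaptively fuses different types of features."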