DRSGN: Dual Revised Semantic Graph Structured Network for Image-Text Matching

Authors: Xiao Yang, Xiaojun Wu, Tianyang Xu
Venue: 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS)
Publication date: 2021-11-07
DOI: 10.1109/CCIS53392.2021.9754625
Image-text matching has contributed significantly to bridging the modality gap in multi-modal retrieval for advanced pattern recognition systems. The key to this task is mapping image and text semantics into a common space in which similar pairs can be distinguished from dissimilar ones. To obtain such semantics, most existing methods use an external object detector to extract region-level visual representations that are matched against the word-level textual cues in captions. Beyond local region matching, several recent studies add a classification task using global information in the final matching layer, but such a global supervision signal is difficult to transmit back to the shallow feature-descriptor learning layers. To address this issue, we propose a novel Dual Revised Semantic Graph Structured Network (DRSGN) that adaptively supplements the regional semantics in the shallow layers with global supervision. In principle, DRSGN integrates regional and global descriptors into an attention supervising mechanism that simultaneously highlights regional instances and the global scene to obtain complementary visual cues. A dual semantic supervising module then lets the two modalities interact to extract genuine matching pairs. Finally, a semantic graph built from the obtained multi-modal cues performs similarity reasoning between textual nodes embedding positional relations and semantically related visual nodes. The dedicated global signals provide supervision complementary to the local regions, supporting improved matching capacity. Experimental results on Flickr30K and MSCOCO demonstrate the effectiveness of the proposed DRSGN, which improves matching performance over local region-based approaches.
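The abstract describes an attention supervising mechanism in which a global scene descriptor guides the weighting of detector-extracted region features. The paper's actual formulation is not given in this text, so the following is only a minimal illustrative sketch of that general idea, assuming cosine affinity and a softmax-weighted fusion; the function name, shapes, and scoring choice are all assumptions, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_supervised_attention(regions, global_desc):
    """Hypothetical sketch: weight region features by affinity to the global scene.

    regions: (n_regions, d) region-level visual features (e.g. from a detector)
    global_desc: (d,) global image descriptor
    Returns a fused (d,) visual representation.
    """
    # cosine affinity between each region and the global scene descriptor
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    g = global_desc / np.linalg.norm(global_desc)
    scores = r @ g                 # (n_regions,) region-to-scene affinities
    weights = softmax(scores)      # attention weights steered by the global signal
    return weights @ regions       # (d,) convex combination of region features
```

Because the weights form a convex combination, regions consistent with the global scene dominate the fused descriptor, which is one plausible way a global signal could supervise shallow region-level features as the abstract suggests.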