DRSGN: Dual Revised Semantic Graph Structured Network for Image-Text Matching

Authors: Xiao Yang, Xiaojun Wu, Tianyang Xu
Venue: 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS)
Publication date: 2021-11-07
DOI: 10.1109/CCIS53392.2021.9754625
Image-text matching has contributed significantly to bridging the modality gap in multi-modal retrieval for advanced pattern recognition systems. The key to this task is mapping image and text semantics into a common space in which similar pairs can be distinguished from dissimilar ones. To obtain such semantics, most existing methods use an external object detector to extract region-level visual representations that are matched against the word-level textual cues in captions. Beyond local region matching, several recent studies add a classification task using global information in the final matching layer, but such a global supervision signal is difficult to transmit back to the shallow feature-descriptor learning layers. To address this issue, we propose a novel Dual Revised Semantic Graph Structured Network (DRSGN) that adaptively supplements the regional semantics in the shallow layers with global supervision. In principle, DRSGN integrates regional and global descriptors into an attention supervising mechanism that simultaneously highlights regional instances and the global scene to obtain complementary visual cues. A dual semantic supervising module then lets the two modalities interact to extract genuine matching pairs. Finally, a semantic graph built from the obtained multi-modal cues performs similarity reasoning between textual nodes embedding positional relations and semantically related visual nodes. The dedicated global signals provide supervision complementary to the local regions, supporting improved matching capacity. Experimental results on Flickr30K and MSCOCO demonstrate the effectiveness of the proposed DRSGN, which improves matching performance over local region-based approaches.
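The abstract describes an attention supervising mechanism in which a global scene descriptor guides the weighting of detector-extracted region features. The paper's actual formulation is not given in this text, so the following is only a minimal illustrative sketch of that general idea, assuming cosine affinity and a softmax-weighted fusion; the function name, shapes, and scoring choice are all assumptions, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_supervised_attention(regions, global_desc):
    """Hypothetical sketch: weight region features by affinity to the global scene.

    regions: (n_regions, d) region-level visual features (e.g. from a detector)
    global_desc: (d,) global image descriptor
    Returns a fused (d,) visual representation.
    """
    # cosine affinity between each region and the global scene descriptor
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    g = global_desc / np.linalg.norm(global_desc)
    scores = r @ g                 # (n_regions,) region-to-scene affinities
    weights = softmax(scores)      # attention weights steered by the global signal
    return weights @ regions       # (d,) convex combination of region features
```

Because the weights form a convex combination, regions consistent with the global scene dominate the fused descriptor, which is one plausible way a global signal could supervise shallow region-level features as the abstract suggests.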