{"title":"用于图像-文本匹配的双语义图相似学习","authors":"Wenxin Tan, Hua Ji, Qian Liu, Ming Jin","doi":"10.1109/ICARCE55724.2022.10046452","DOIUrl":null,"url":null,"abstract":"Image-text matching has received increasing attention because it enables the interaction between vision and language. Existing approaches have two limitations. First, most existing methods only pay attention to learning paired samples, ignoring the similar semantic information in the same modality. Second, the current methods lack interaction between local and global features, resulting in the mismatch of certain image regions or words due to the lack of global information. To solve the above problems, we propose a new dual semantic graph similarity learning (DSGSL) network, which consists of a feature enhancement module for learning compact features and a feature alignment module that learns the relations between global and local features. In the feature enhancement module, similar samples are processed as a graph, and a graph convolutional network is used to extract similar features to reconstruct the global feature representation. In addition, we use a gated fusion network to obtain discriminative sample representations by selecting salient features from other modalities and filtering out insignificant information. In the feature alignment module, we construct a dual semantic graph for every sample to learn the association between local features and global features. Numerous experiments on MS-COCO and Flicr30K have shown that our approach reaches the most advanced performance.","PeriodicalId":416305,"journal":{"name":"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dual-semantic Graph Similarity Learning for Image-text Matching\",\"authors\":\"Wenxin Tan, Hua Ji, Qian Liu, Ming Jin\",\"doi\":\"10.1109/ICARCE55724.2022.10046452\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image-text matching has received increasing attention because it enables the interaction between vision and language. Existing approaches have two limitations. First, most existing methods only pay attention to learning paired samples, ignoring the similar semantic information in the same modality. Second, the current methods lack interaction between local and global features, resulting in the mismatch of certain image regions or words due to the lack of global information. To solve the above problems, we propose a new dual semantic graph similarity learning (DSGSL) network, which consists of a feature enhancement module for learning compact features and a feature alignment module that learns the relations between global and local features. In the feature enhancement module, similar samples are processed as a graph, and a graph convolutional network is used to extract similar features to reconstruct the global feature representation. In addition, we use a gated fusion network to obtain discriminative sample representations by selecting salient features from other modalities and filtering out insignificant information. In the feature alignment module, we construct a dual semantic graph for every sample to learn the association between local features and global features. 
Numerous experiments on MS-COCO and Flicr30K have shown that our approach reaches the most advanced performance.\",\"PeriodicalId\":416305,\"journal\":{\"name\":\"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICARCE55724.2022.10046452\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Automation, Robotics and Computer Engineering (ICARCE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICARCE55724.2022.10046452","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Dual-semantic Graph Similarity Learning for Image-text Matching
Image-text matching has received increasing attention because it enables interaction between vision and language. Existing approaches have two limitations. First, most methods focus only on learning from paired samples, ignoring similar semantic information within the same modality. Second, current methods lack interaction between local and global features, so certain image regions or words are mismatched for want of global context. To address these problems, we propose a new dual semantic graph similarity learning (DSGSL) network, which consists of a feature enhancement module that learns compact features and a feature alignment module that learns the relations between global and local features. In the feature enhancement module, similar samples are organized into a graph, and a graph convolutional network extracts shared features to reconstruct the global feature representation. In addition, a gated fusion network yields discriminative sample representations by selecting salient features from the other modality and filtering out insignificant information. In the feature alignment module, we construct a dual semantic graph for each sample to learn the association between its local and global features. Extensive experiments on MS-COCO and Flickr30K show that our approach achieves state-of-the-art performance.
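The abstract describes the feature enhancement module as a graph convolution over a graph of similar samples, followed by gated cross-modal fusion. Below is a minimal PyTorch sketch of that idea; the class names, the k-nearest-neighbour graph construction, the single GCN layer, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of the described feature-enhancement idea (assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGraphEnhancer(nn.Module):
    """Treat a batch of same-modality samples as a similarity graph and
    propagate features between similar samples with one graph convolution."""
    def __init__(self, dim: int, k: int = 5):
        super().__init__()
        self.k = k                      # neighbours kept per sample (assumed)
        self.gcn = nn.Linear(dim, dim)  # GCN weight matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) global features of one modality; batch must be >= k
        xn = F.normalize(x, dim=-1)
        sim = xn @ xn.t()                                # cosine similarities
        vals, idx = sim.topk(self.k, dim=-1)             # keep k most similar samples
        adj = torch.zeros_like(sim).scatter_(-1, idx, vals)
        adj = adj / adj.sum(dim=-1, keepdim=True)        # row-normalized adjacency
        return F.relu(self.gcn(adj @ x)) + x             # propagate, then residual

class GatedFusion(nn.Module):
    """Select salient features from the other modality and filter out
    insignificant information with a learned sigmoid gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, own: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([own, other], dim=-1)))
        return g * own + (1.0 - g) * other               # gated mixture of modalities
```

The gate produces per-dimension weights in (0, 1), so each feature dimension of the enhanced representation is a learned blend of the two modalities rather than a hard selection.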
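The feature alignment module constructs a dual semantic graph per sample to relate local and global features. One plausible realization is shown below: local features (image regions or words) and the sample's global feature become nodes of a single graph, so local nodes can exchange information with the global node before matching. The dense soft adjacency and the max-over-matches score are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSemanticGraph(nn.Module):
    """Link local features with the sample's global feature in one graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)

    def forward(self, local: torch.Tensor, glob: torch.Tensor) -> torch.Tensor:
        # local: (n, dim) region/word features; glob: (dim,) global feature
        nodes = torch.cat([local, glob.unsqueeze(0)], dim=0)   # (n + 1, dim)
        normed = F.normalize(nodes, dim=-1)
        adj = F.softmax(normed @ normed.t(), dim=-1)           # soft adjacency from similarity
        return F.relu(self.gcn(adj @ nodes)) + nodes           # globally informed node features

def match_score(img_nodes: torch.Tensor, txt_nodes: torch.Tensor) -> torch.Tensor:
    """One plausible similarity: average of each node's best cross-modal match."""
    sim = F.normalize(img_nodes, dim=-1) @ F.normalize(txt_nodes, dim=-1).t()
    return 0.5 * (sim.max(dim=1).values.mean() + sim.max(dim=0).values.mean())
```

In this sketch, both the image and the sentence pass through a DualSemanticGraph, and match_score compares the updated node sets symmetrically, so region-word alignment is informed by each sample's global context.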