{"title":"文本引导注意机制细粒度图像分类","authors":"Xin Yang, Heng-Xi Pan","doi":"10.1145/3546607.3546614","DOIUrl":null,"url":null,"abstract":"Scene texts with explicit semantic information in natural images can provide important clues to solve the corresponding computer vision problems. In the text, we usually focus on using multimodal content in the form of visual and text prompts to solve the task of fine-grained image classification and retrieval. In this paper, graph convolution network is used to perform multimodal reasoning, and the features of relationship enhancement are obtained by learning the common semantic space between salient objects and texts found in images. By obtaining a set of enhanced visual and textual functions, the proposed model is highly superior to the existing technologies in two different tasks (fine-grained classification and image retrieval in contextual texts).","PeriodicalId":114920,"journal":{"name":"Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Text-guided Attention Mechanism Fine-grained Image Classification\",\"authors\":\"Xin Yang, Heng-Xi Pan\",\"doi\":\"10.1145/3546607.3546614\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scene texts with explicit semantic information in natural images can provide important clues to solve the corresponding computer vision problems. In the text, we usually focus on using multimodal content in the form of visual and text prompts to solve the task of fine-grained image classification and retrieval. In this paper, graph convolution network is used to perform multimodal reasoning, and the features of relationship enhancement are obtained by learning the common semantic space between salient objects and texts found in images. By obtaining a set of enhanced visual and textual functions, the proposed model is highly superior to the existing technologies in two different tasks (fine-grained classification and image retrieval in contextual texts).\",\"PeriodicalId\":114920,\"journal\":{\"name\":\"Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3546607.3546614\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3546607.3546614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Scene text in natural images carries explicit semantic information that can provide important clues for solving the corresponding computer vision problems. In this work, we focus on using multimodal content, in the form of visual and textual cues, to address fine-grained image classification and retrieval. We use a graph convolutional network to perform multimodal reasoning and obtain relationship-enhanced features by learning a common semantic space between the salient objects and the scene text found in images. With this set of enhanced visual and textual features, the proposed model substantially outperforms existing methods on two different tasks: fine-grained classification and image retrieval with contextual text.
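The multimodal reasoning step described above could be sketched roughly as follows: salient-object features and scene-text (OCR) embeddings are projected into a shared space, treated as nodes of one graph, and passed through graph convolutional layers to yield relationship-enhanced features. This is a minimal illustration, not the authors' implementation; the dimensions, adjacency construction, pooling, and class count are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return F.relu(a_norm @ self.linear(h))


class MultimodalGCN(nn.Module):
    """Projects visual and text features into a common semantic space and
    reasons over the joint graph (illustrative sketch, not the paper's model)."""

    def __init__(self, vis_dim=2048, txt_dim=300, hid_dim=512, num_classes=28):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)   # salient-object region features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)   # scene-text (OCR) token embeddings
        self.gcn1 = GCNLayer(hid_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, vis_feats, txt_feats, adj):
        # Map both modalities into the shared space, then stack them as graph nodes.
        nodes = torch.cat([self.vis_proj(vis_feats), self.txt_proj(txt_feats)], dim=0)
        nodes = self.gcn2(self.gcn1(nodes, adj), adj)   # relationship-enhanced features
        pooled = nodes.mean(dim=0)                      # simple mean pooling over all nodes
        return self.classifier(pooled)


# Toy usage: 5 object regions, 3 OCR tokens, fully connected 8-node graph.
vis = torch.randn(5, 2048)
txt = torch.randn(3, 300)
adj = torch.ones(8, 8)
logits = MultimodalGCN()(vis, txt, adj)
print(logits.shape)  # torch.Size([28])

For retrieval, the same relationship-enhanced node features could instead be pooled into an image-level embedding and matched against a query embedding by cosine similarity, rather than fed to a classifier head.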