Text-guided Attention Mechanism Fine-grained Image Classification
Xin Yang, Heng-Xi Pan
Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations, 2022-03-25
DOI: https://doi.org/10.1145/3546607.3546614
Citations: 0
Abstract
Scene text in natural images carries explicit semantic information and can provide important cues for solving computer vision problems. In this work, we focus on using multimodal content, in the form of visual and textual cues, to address fine-grained image classification and retrieval. A graph convolutional network performs multimodal reasoning, and relation-enhanced features are obtained by learning a common semantic space between the salient objects and the text found in an image. With the resulting set of enhanced visual and textual features, the proposed model substantially outperforms existing methods on two different tasks: fine-grained classification and image retrieval with contextual text.
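The abstract does not specify the network's architecture, so the following is only a minimal sketch of the general idea of graph-convolution reasoning over visual and textual nodes. It assumes the widely used symmetric-normalized propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), and a toy graph in which region and text features have already been projected into a common space. The names `normalize_adjacency`, `gcn_layer`, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    out = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical toy graph: nodes 0-1 stand for salient visual regions,
# nodes 2-3 for scene-text tokens; edges link related nodes.
adj = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
# Assumed node features, already projected into a shared 2-d space.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# Identity weights keep the example easy to follow; a real model learns these.
weight = [[1.0, 0.0], [0.0, 1.0]]

enhanced = gcn_layer(adj, feats, weight)
```

Each row of `enhanced` mixes a node's own feature with those of its neighbours, so a visual node absorbs information from connected text nodes and vice versa. In a full model, these relation-enhanced features would feed a classifier or a retrieval head.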