{"title":"多模态注意机制的属性-图像相似性度量","authors":"Ali Salehi Najafabadi, A. Ghomsheh","doi":"10.1109/CSICC52343.2021.9420626","DOIUrl":null,"url":null,"abstract":"Multimodal attention mechanisms in computer vision applications enable rich feature extraction by attending to specific image regions, highlighted through a second mode of data regarded as auxiliary information. The correspondence between image regions and auxiliary data can be defined as the similarity between parts of the two modes. In this paper, we propose a similarity measure that maximizes the posterior for matching high-level object attributes with image regions. In contrast to previous methods, we rely on attribute space rather than textual descriptions. We evaluate our results on the CUB dataset. The results show that the proposed method better minimizes the similarity loss function compared to the text-image similarity measurement.","PeriodicalId":374593,"journal":{"name":"2021 26th International Computer Conference, Computer Society of Iran (CSICC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Attribute-Image Similarity Measure for Multimodal Attention Mechanism\",\"authors\":\"Ali Salehi Najafabadi, A. Ghomsheh\",\"doi\":\"10.1109/CSICC52343.2021.9420626\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal attention mechanisms in computer vision applications enable rich feature extraction by attending to specific image regions, highlighted through a second mode of data regarded as auxiliary information. The correspondence between image regions and auxiliary data can be defined as the similarity between parts of the two modes. In this paper, we propose a similarity measure that maximizes the posterior for matching high-level object attributes with image regions. In contrast to previous methods, we rely on attribute space rather than textual descriptions. We evaluate our results on the CUB dataset. The results show that the proposed method better minimizes the similarity loss function compared to the text-image similarity measurement.\",\"PeriodicalId\":374593,\"journal\":{\"name\":\"2021 26th International Computer Conference, Computer Society of Iran (CSICC)\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-03-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 26th International Computer Conference, Computer Society of Iran (CSICC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CSICC52343.2021.9420626\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 26th International Computer Conference, Computer Society of Iran (CSICC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSICC52343.2021.9420626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Attribute-Image Similarity Measure for Multimodal Attention Mechanism
Multimodal attention mechanisms in computer vision enable rich feature extraction by attending to specific image regions, which are highlighted through a second data modality that serves as auxiliary information. The correspondence between image regions and the auxiliary data can be defined as a similarity between parts of the two modalities. In this paper, we propose a similarity measure that maximizes the posterior probability of matching high-level object attributes to image regions. In contrast to previous methods, we operate in attribute space rather than on textual descriptions. We evaluate our method on the CUB dataset. The results show that the proposed measure minimizes the similarity loss function more effectively than a text-image similarity measure.
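The abstract only names the idea; the formal definition is in the paper itself. As an illustration only, the Python sketch below shows one way a posterior-maximizing attribute-region similarity could be set up, modeled on the attention-based text-image matching losses the paper contrasts against (e.g., DAMSM in AttnGAN), with attribute embeddings in place of word embeddings. All function names, tensor shapes, and the temperature gamma are assumptions made for this sketch, not the authors' implementation.

import numpy as np

def image_attribute_score(regions, attributes, gamma=5.0):
    # Attention-pooled similarity between one image and one attribute set.
    # regions:    (R, d) region features of a single image   [assumed shapes]
    # attributes: (A, d) embeddings of high-level object attributes
    r = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + 1e-8)
    a = attributes / (np.linalg.norm(attributes, axis=1, keepdims=True) + 1e-8)
    sim = a @ r.T                              # (A, R) cosine similarities
    attn = np.exp(gamma * sim)
    attn /= attn.sum(axis=1, keepdims=True)    # attention over regions, per attribute
    per_attr = (attn * sim).sum(axis=1)        # attended similarity for each attribute
    # smooth-max pooling of per-attribute similarities into one image score
    return np.log(np.exp(gamma * per_attr).sum()) / gamma

def posterior_matching_loss(batch_regions, batch_attributes, gamma=5.0):
    # Negative log-posterior that each attribute set matches its own image.
    # The softmax over candidate images in the batch plays the role of the
    # posterior; minimizing this loss maximizes the posterior of the true
    # attribute-image pairing.
    B = len(batch_regions)
    scores = np.array([[image_attribute_score(batch_regions[j], batch_attributes[i], gamma)
                        for j in range(B)] for i in range(B)])  # (B, B)
    post = np.exp(scores)
    post /= post.sum(axis=1, keepdims=True)    # posterior over images, per attribute set
    return -np.log(np.diag(post) + 1e-8).mean()

# Toy usage: 4 images with 6 regions each, 5 attributes each, 16-dim features.
rng = np.random.default_rng(0)
regs = [rng.standard_normal((6, 16)) for _ in range(4)]
attrs = [rng.standard_normal((5, 16)) for _ in range(4)]
print(posterior_matching_loss(regs, attrs))

In this framing, swapping word embeddings for attribute embeddings is the only structural change from the text-image baseline, which is consistent with the abstract's claim that the contribution lies in the choice of attribute space rather than in a new attention architecture.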