{"title":"用Iconclass视觉概念衡量自然语言监督的文本-图像度量学习的局限性","authors":"Kai Labusch, Clemens Neudecker","doi":"10.1145/3604951.3605516","DOIUrl":null,"url":null,"abstract":"Identification of images that are close to each other in terms of their iconographical meaning requires an applicable distance measure for text-image or image-image pairs. To obtain such a measure of distance, we finetune a group of contrastive loss based text-to-image similarity models (MS-CLIP) with respect to a large number of Iconclass visual concepts by means of natural language supervised learning. We show that there are certain Iconclass concepts that actually can be learned by the models whereas other visual concepts cannot be learned. We hypothesize that the visual concepts that can be learned more easily are intrinsically different from those that are more difficult to learn and that these qualitative differences can provide a valuable orientation for future research directions in text-to-image similarity learning.","PeriodicalId":375632,"journal":{"name":"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gauging the Limitations of Natural Language Supervised Text-Image Metrics Learning by Iconclass Visual Concepts\",\"authors\":\"Kai Labusch, Clemens Neudecker\",\"doi\":\"10.1145/3604951.3605516\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Identification of images that are close to each other in terms of their iconographical meaning requires an applicable distance measure for text-image or image-image pairs. To obtain such a measure of distance, we finetune a group of contrastive loss based text-to-image similarity models (MS-CLIP) with respect to a large number of Iconclass visual concepts by means of natural language supervised learning. We show that there are certain Iconclass concepts that actually can be learned by the models whereas other visual concepts cannot be learned. 
We hypothesize that the visual concepts that can be learned more easily are intrinsically different from those that are more difficult to learn and that these qualitative differences can provide a valuable orientation for future research directions in text-to-image similarity learning.\",\"PeriodicalId\":375632,\"journal\":{\"name\":\"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3604951.3605516\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th International Workshop on Historical Document Imaging and Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3604951.3605516","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Gauging the Limitations of Natural Language Supervised Text-Image Metrics Learning by Iconclass Visual Concepts
Identifying images that are close to each other in terms of their iconographical meaning requires an applicable distance measure for text-image or image-image pairs. To obtain such a distance measure, we fine-tune a group of contrastive-loss-based text-to-image similarity models (MS-CLIP) on a large number of Iconclass visual concepts by means of natural language supervised learning. We show that certain Iconclass concepts can in fact be learned by the models, whereas other visual concepts cannot. We hypothesize that the visual concepts that can be learned more easily are intrinsically different from those that are more difficult to learn, and that these qualitative differences can provide valuable orientation for future research directions in text-to-image similarity learning.
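To illustrate the kind of training objective the abstract refers to, below is a minimal sketch of contrastive text-image fine-tuning in the style of CLIP. It is not the authors' implementation: the toy encoders, feature dimensions, and hyperparameters are illustrative assumptions, and real Iconclass captions and image features would replace the random tensors.

```python
# Minimal sketch of contrastive-loss-based text-image similarity learning
# (CLIP-style symmetric InfoNCE). Encoders and data here are hypothetical
# stand-ins, not the MS-CLIP setup from the paper.
import torch
import torch.nn.functional as F


class TinyEncoder(torch.nn.Module):
    """Stand-in for a CLIP text or image encoder (hypothetical)."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products equal cosine similarities.
        return F.normalize(self.proj(x), dim=-1)


def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Similarity matrix over all image-text pairs in the batch;
    # the matched (positive) pairs lie on the diagonal.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Symmetric loss: pick the right caption per image and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy batch: 8 image feature vectors and 8 caption feature vectors
# standing in for images paired with Iconclass concept descriptions.
img_enc, txt_enc = TinyEncoder(512), TinyEncoder(300)
optimizer = torch.optim.AdamW(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-4)

images, captions = torch.randn(8, 512), torch.randn(8, 300)
loss = clip_contrastive_loss(img_enc(images), txt_enc(captions))
loss.backward()
optimizer.step()
```

After training with such an objective, the cosine similarity between an image embedding and a text embedding serves directly as the distance measure for text-image or image-image pairs that the abstract describes.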