Transfer Learning for the Visual Arts: The Multi-modal Retrieval of Iconclass Codes

IF 2.1; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

ACM Journal on Computing and Cultural Heritage Pub Date : 2023-06-24 DOI:https://dl.acm.org/doi/10.1145/3575865

Nikolay Banar, Walter Daelemans, Mike Kestemont


Citations: 0

Abstract


Transfer Learning for the Visual Arts: The Multi-modal Retrieval of Iconclass Codes

Iconclass is an iconographic thesaurus, which is widely used in the digital heritage domain to describe subjects depicted in artworks. Each subject is assigned a unique descriptive code, which has a corresponding textual definition. The assignment of Iconclass codes is a challenging task for computational systems, due to the large number of available labels in comparison to the limited amount of training data available. Transfer learning has become a common strategy to overcome such a data shortage. In deep learning, transfer learning consists in fine-tuning the weights of a deep neural network for a downstream task. In this work, we present a deep retrieval framework, which can be fully fine-tuned for the task under consideration. Our work is based on a recent approach to this task, which already yielded state-of-the-art performance, although it could not be fully fine-tuned yet. This approach exploits the multi-linguality and multi-modality that is inherent to digital heritage data. Our framework jointly processes multiple input modalities, namely, textual and visual features. We extract the textual features from the artwork titles in multiple languages, whereas the visual features are derived from photographic reproductions of the artworks. The definitions of the Iconclass codes, containing useful textual information, are used as target labels instead of the codes themselves. As our main contribution, we demonstrate that our approach outperforms the state-of-the-art by a large margin. In addition, our approach is superior to the M³P feature extractor and outperforms the multi-lingual CLIP in most experiments due to the better quality of the visual features. Our out-of-domain and zero-shot experiments show poor results and demonstrate that the Iconclass retrieval remains a challenging task. We make our source code and models publicly available to support heritage institutions in the further enrichment of their digital collections.
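The retrieval setup described in the abstract (a query built from an artwork's title and image, matched against the textual definitions of Iconclass codes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are hand-written toy embeddings rather than outputs of any real encoder, the codes `CODE_A`, `CODE_B` and `CODE_C` are hypothetical placeholders rather than real Iconclass notations, and fusing modalities by simple concatenation is a stand-in that is not claimed to be the authors' architecture.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuse(text_vec, image_vec):
    """Combine title and image features into one query vector.

    Assumption: plain concatenation, chosen only to keep the sketch short.
    """
    return list(text_vec) + list(image_vec)


def rank_definitions(query_vec, definition_vecs):
    """Rank candidate codes by similarity of their definition embedding to the query."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in definition_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings of three code definitions (4 dimensions, so that
# they are comparable with the concatenated 2 + 2 dimensional query).
definition_vecs = {
    "CODE_A": [1.0, 0.0, 1.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0, 1.0],
    "CODE_C": [1.0, 1.0, 0.0, 0.0],
}

title_vec = [0.9, 0.1]   # toy features standing in for an encoded artwork title
image_vec = [0.8, 0.2]   # toy features standing in for an encoded reproduction

query = fuse(title_vec, image_vec)
ranking = rank_definitions(query, definition_vecs)

for code, score in ranking:
    print(f"{code}: {score:.3f}")
```

Because the targets are definition embeddings rather than fixed class indices, a code that never appeared in training can still be scored, which is the property that makes zero-shot evaluation of the kind mentioned in the abstract possible in principle.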


Source journal

ACM Journal on Computing and Cultural Heritage; Arts and Humanities-Conservation

CiteScore

4.60

Self-citation rate

8.30%

Articles published

About the journal: ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.