Knowledge graph construction in hyperbolic space for automatic image annotation

IF 4.2 | Region 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence)

Image and Vision Computing Pub Date : 2024-09-30 DOI:10.1016/j.imavis.2024.105293

Fariba Lotfi, Mansour Jamzad, Hamid Beigy, Helia Farhood, Quan Z. Sheng, Amin Beheshti

Image and Vision Computing, Volume 151, Article 105293. Full text: https://www.sciencedirect.com/science/article/pii/S0262885624003986

Citations: 0

Abstract

Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, the method of constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.
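The pipeline outlined in the abstract (hyperbolic tag embeddings, two R-GCN operations over a tag knowledge graph, fusion with visual features) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the Poincaré-ball model for the hyperbolic embeddings, the standard R-GCN propagation rule with per-relation mean normalization, identity weight matrices, and a dot-product fusion step; the tag names, relation names, coordinates, and the 2-D vector standing in for a ViT feature are all hypothetical. It also applies Euclidean linear maps directly to the embedding coordinates, which is a simplification.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges_by_relation, rel_weights, self_weight):
    """One R-GCN step: ReLU(W_0 h_i + sum_r mean_{j in N_r(i)} W_r h_j)."""
    n = len(features)
    out = [matvec(self_weight, h) for h in features]  # self-connection term
    for relation, edges in edges_by_relation.items():
        # Collect incoming neighbours of each node for this relation type.
        neighbours = {i: [] for i in range(n)}
        for src, dst in edges:
            neighbours[dst].append(src)
        for i, srcs in neighbours.items():
            for j in srcs:
                msg = matvec(rel_weights[relation], features[j])
                out[i] = [o + m / len(srcs) for o, m in zip(out[i], msg)]
    return [[max(0.0, x) for x in h] for h in out]


# Hypothetical toy vocabulary with 2-D embeddings inside the unit ball.
tags = ["animal", "dog", "grass"]
embeddings = [[0.1, 0.0], [0.5, 0.1], [0.2, 0.6]]

# Hypothetical relations as (source, target) index pairs.
edges_by_relation = {
    "hypernym": [(0, 1)],          # animal -> dog
    "cooccur": [(1, 2), (2, 1)],   # dog <-> grass
}
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed weights, untrained
rel_weights = {"hypernym": identity, "cooccur": identity}

# Two R-GCN operations produce one descriptor per tag.
hidden = rgcn_layer(embeddings, edges_by_relation, rel_weights, identity)
descriptors = rgcn_layer(hidden, edges_by_relation, rel_weights, identity)

# Assumed fusion: dot product of each tag descriptor with a visual feature.
visual_feature = [1.0, 0.5]  # stand-in for a ViT image embedding
scores = [sum(d * f for d, f in zip(desc, visual_feature)) for desc in descriptors]
ranked_tags = [t for _, t in sorted(zip(scores, tags), reverse=True)]
```

In this toy setup a tag with more incoming relations accumulates a larger descriptor, so its score against the visual feature rises; a trained model would learn the relation weights and the fusion rather than use identity matrices.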


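Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months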

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.