Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph

IF 2.6 · Zone 3, Engineering Technology · Q2 Computer Science, Information Systems

Electronics · Pub Date: 2023-03-14 · DOI: 10.3390/electronics12061390

Lei Jiang, Zuqiang Meng


Citations: 0

Abstract

Visual question answering (VQA) increasingly integrates external knowledge sources to improve performance. However, because external knowledge sources may be incomplete and different forms of data are inherently mismatched, current knowledge-based visual question answering (KBVQA) techniques still struggle to integrate and use multiple heterogeneous data sources effectively. To address this issue, an approach centered on a multi-modal semantic graph (MSG) is proposed. The MSG unifies the representation of heterogeneous data and diverse types of knowledge. In addition, a multi-modal semantic graph knowledge reasoning model (MSG-KRM) is introduced to perform reasoning and deep fusion of image–text information and external knowledge sources. To build the semantic graph, keywords are extracted from the image object detection information, the question text, and the external knowledge texts, and are represented as symbol nodes. Three semantic graphs (vision, question, and external knowledge text) are then constructed based on the knowledge graph, and non-symbol nodes are added to connect these three independent graphs; nodes and edges are marked with their respective types. During the inference stage, the multi-modal semantic graph and image–text information are embedded into the feature semantic graph through three embedding methods, and a type-aware graph attention module is employed for deep reasoning. The final answer prediction combines the output of the pre-trained model, the graph pooling results, and the features of the non-symbol nodes. Experimental results on the OK-VQA dataset show that MSG-KRM outperforms existing methods in overall accuracy, achieving a score of 43.58, and improves accuracy for most question subclasses, supporting the effectiveness of the proposed method.
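The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the type labels, the `SemanticGraph` and `build_msg` names, the choice of one non-symbol "hub" node per modality, and the toy keywords are all assumptions made for illustration, and the relations that the paper derives from a knowledge graph are simply passed in as input.

```python
from dataclasses import dataclass, field

# Hypothetical modality names; the paper's actual type vocabulary is not given in the abstract.
MODALITIES = ("vision", "question", "knowledge")


@dataclass
class SemanticGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, target id, edge type)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, dst, edge_type):
        self.edges.append((src, dst, edge_type))


def build_msg(keywords, relations):
    """Build a typed graph from per-modality keywords and (head, tail) relations."""
    graph = SemanticGraph()
    for modality in MODALITIES:
        words = keywords.get(modality, [])
        # Symbol nodes: one per keyword, namespaced so the three subgraphs stay independent.
        for word in words:
            graph.add_node(f"{modality}:{word}", f"{modality}_symbol")
        # Typed edges between symbol nodes of the same modality.
        for head, tail in relations.get(modality, []):
            graph.add_edge(f"{modality}:{head}", f"{modality}:{tail}", f"{modality}_relation")
        # Non-symbol node (assumed design): one hub per modality, linked to its symbol nodes.
        hub = f"{modality}:<hub>"
        graph.add_node(hub, f"{modality}_non_symbol")
        for word in words:
            graph.add_edge(hub, f"{modality}:{word}", f"{modality}_membership")
    # Edges between hubs connect the three otherwise independent subgraphs.
    for i, first in enumerate(MODALITIES):
        for second in MODALITIES[i + 1:]:
            graph.add_edge(f"{first}:<hub>", f"{second}:<hub>", "cross_modal")
    return graph


# Toy input, invented for this example.
keywords = {
    "vision": ["dog", "frisbee"],
    "question": ["breed", "dog"],
    "knowledge": ["dog", "retriever"],
}
relations = {
    "vision": [("dog", "frisbee")],
    "knowledge": [("retriever", "dog")],
}
msg = build_msg(keywords, relations)
print(len(msg.nodes), "nodes,", len(msg.edges), "edges")
```

Keeping an explicit type on every node and edge is what would let a type-aware attention module, as mentioned in the abstract, treat vision, question, and knowledge connections differently during reasoning.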


Source journal: Electronics (Computer Science, Computer Networks and Communications)

CiteScore: 1.10

Self-citation rate: 10.30%

Articles published: 3515

Review time: 16.71 days

About the journal: Electronics (ISSN 2079-9292; CODEN: ELECGJ) is an international, open access journal on the science of electronics and its applications, published quarterly online by MDPI.