Caption matters: a new perspective for knowledge-based visual question answering

IF 2.5 | CAS Zone 4 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence)
Bin Feng, Shulan Ruan, Likang Wu, Huijie Liu, Kai Zhang, Kun Zhang, Qi Liu, Enhong Chen
{"title":"Caption matters: a new perspective for knowledge-based visual question answering","authors":"Bin Feng, Shulan Ruan, Likang Wu, Huijie Liu, Kai Zhang, Kun Zhang, Qi Liu, Enhong Chen","doi":"10.1007/s10115-024-02166-8","DOIUrl":null,"url":null,"abstract":"<p>Knowledge-based visual question answering (KB-VQA) requires to answer questions according to the given image with the assistance of external knowledge. Recently, researchers generally tend to design different multimodal networks to extract visual and text semantic features for KB-VQA. Despite the significant progress, ‘caption’ information, a textual form of image semantics, which can also provide visually non-obvious cues for the reasoning process, is often ignored. In this paper, we introduce a novel framework, the Knowledge Based Caption Enhanced Net (KBCEN), designed to integrate caption information into the KB-VQA process. Specifically, for better knowledge reasoning, we make utilization of caption information comprehensively from both explicit and implicit perspectives. For the former, we explicitly link caption entities to knowledge graph together with object tags and question entities. While for the latter, a pre-trained multimodal BERT with natural implicit knowledge is leveraged to co-represent caption tokens, object regions as well as question tokens. Moreover, we develop a mutual correlation module to discern intricate correlations between explicit and implicit representations, thereby facilitating knowledge integration and final prediction. We conduct extensive experiments on three publicly available datasets (i.e., OK-VQA v1.0, OK-VQA v1.1 and A-OKVQA). Both quantitative and qualitative results demonstrate the superiority and rationality of our proposed KBCEN.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.5000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge and Information Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10115-024-02166-8","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Knowledge-based visual question answering (KB-VQA) requires answering questions about a given image with the assistance of external knowledge. Recently, researchers have generally tended to design different multimodal networks to extract visual and textual semantic features for KB-VQA. Despite significant progress, 'caption' information, a textual form of image semantics that can provide visually non-obvious cues for the reasoning process, is often ignored. In this paper, we introduce a novel framework, the Knowledge Based Caption Enhanced Net (KBCEN), designed to integrate caption information into the KB-VQA process. Specifically, for better knowledge reasoning, we utilize caption information comprehensively from both explicit and implicit perspectives. For the former, we explicitly link caption entities to a knowledge graph together with object tags and question entities; for the latter, we leverage a pre-trained multimodal BERT with natural implicit knowledge to co-represent caption tokens, object regions, and question tokens. Moreover, we develop a mutual correlation module to discern intricate correlations between explicit and implicit representations, thereby facilitating knowledge integration and final prediction. We conduct extensive experiments on three publicly available datasets (i.e., OK-VQA v1.0, OK-VQA v1.1, and A-OKVQA). Both quantitative and qualitative results demonstrate the superiority and rationality of our proposed KBCEN.
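
The abstract describes two representation streams (an explicit knowledge-graph branch and an implicit multimodal-BERT branch) fused by a mutual correlation module. Below is a minimal, hypothetical sketch of how such a fusion step could look; the class name MutualCorrelation, the cross-attention scheme, the hidden dimension, and the answer-vocabulary size are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch of a mutual-correlation-style fusion of explicit
# (knowledge-graph) and implicit (multimodal BERT) representations.
# Architecture details are assumptions, not the paper's actual design.
import torch
import torch.nn as nn


class MutualCorrelation(nn.Module):
    """Cross-attends the explicit and implicit streams and predicts answer logits."""

    def __init__(self, dim: int = 768, num_answers: int = 2250):
        super().__init__()
        self.exp_to_imp = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.imp_to_exp = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_answers),
        )

    def forward(self, explicit: torch.Tensor, implicit: torch.Tensor) -> torch.Tensor:
        # explicit: (batch, n_entities, dim) -- linked caption/question/tag entities
        # implicit: (batch, n_tokens, dim)   -- co-represented tokens and object regions
        exp_ctx, _ = self.exp_to_imp(explicit, implicit, implicit)  # explicit attends to implicit
        imp_ctx, _ = self.imp_to_exp(implicit, explicit, explicit)  # implicit attends to explicit
        fused = torch.cat([exp_ctx.mean(dim=1), imp_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # answer logits over a fixed vocabulary


if __name__ == "__main__":
    module = MutualCorrelation()
    explicit = torch.randn(2, 30, 768)
    implicit = torch.randn(2, 80, 768)
    print(module(explicit, implicit).shape)  # torch.Size([2, 2250])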

Source journal
Knowledge and Information Systems (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 5.70
Self-citation rate: 7.40%
Annual publications: 152
Review time: 7.2 months
Journal description: Knowledge and Information Systems (KAIS) provides an international forum for researchers and professionals to share their knowledge and report new advances on all topics related to knowledge systems and advanced information systems. This monthly peer-reviewed archival journal publishes state-of-the-art research reports on emerging topics in KAIS, reviews of important techniques in related areas, and application papers of interest to a general readership.