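HiLINK: Hierarchical linking of context-aware knowledge prediction and prompt tuning for bilingual knowledge-based visual question answering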

IF 7.2 · Zone 1 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Knowledge-Based Systems · Pub Date: 2025-04-24 · DOI: 10.1016/j.knosys.2025.113556

Hyeonki Jeong, Taehyeong Kim, Wooseok Shin, Sung Won Han

Citations: 0

Abstract

Knowledge-based visual question answering (KBVQA) is a representative visual reasoning task that draws on external knowledge when the image and query alone are not enough to predict the correct answer. Alongside KBVQA, various visual reasoning tasks have been actively studied for their potential to improve visual understanding by effectively combining text and image modalities. However, this work has focused primarily on high-resource languages such as English, and studies on low-resource languages remain comparatively rare. To address this research gap, we propose HiLINK, which uses multilingual data to enhance KBVQA performance in various languages. Using the BOK-VQA dataset, we design three key methodologies. First, we propose an end-to-end model that uses Link-Tuning to learn the relationships between triplet knowledge components directly within prompts, eliminating the need for a training network based on knowledge graph embeddings. Second, we propose the HK-TriNet and HK-TriNet+ methodologies, which perform triplet prediction based on contextualized knowledge relationships. Third, we apply the frozen training approach as an alternative to conventional joint encoder training to improve the efficiency and performance of bilingual learning. On the BOK-VQA dataset, HiLINK outperforms the GEL-VQA method by +19.40%, +12.01%, and +11.30% in the bilingual, English, and Korean configurations, respectively. A comprehensive visual and numerical analysis of the bilingual embedding spaces further validates the effectiveness of the proposed method. We expect this study to inspire future research on this topic and to encourage practical applications of improved vision-language models.
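The contrast between frozen training and joint encoder training mentioned in the abstract can be illustrated with a toy sketch. This is not the HiLINK implementation: the abstract gives no architectural details, so the linear "encoder", the linear head, the squared-error loss, and every name below (`encode`, `predict`, `train_frozen`, `ENCODER`, `DATA`) are assumptions made purely for illustration. The only point shown is that under frozen training the encoder weights stay fixed while a small task head is fitted on top of them.

```python
"""Toy illustration of frozen training (hypothetical, not the paper's code)."""


def encode(weights, features):
    # Stand-in "encoder": a fixed linear projection of the input features.
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def predict(head, hidden):
    # Stand-in task head: a linear layer that produces a single score.
    return sum(h * z for h, z in zip(head, hidden))


def train_frozen(encoder, head, data, lr=0.05, epochs=200):
    """Fit only the head with squared-error SGD; the encoder is never updated."""
    head = list(head)  # copy so the caller's list is not mutated
    for _ in range(epochs):
        for features, target in data:
            hidden = encode(encoder, features)
            error = predict(head, hidden) - target
            # Gradient of 0.5 * error**2 w.r.t. head[i] is error * hidden[i].
            head = [h - lr * error * z for h, z in zip(head, hidden)]
    return head


def mean_squared_error(encoder, head, data):
    return sum((predict(head, encode(encoder, f)) - t) ** 2 for f, t in data) / len(data)


# Hypothetical "pretrained" encoder weights and a tiny synthetic dataset.
ENCODER = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
DATA = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
INITIAL_HEAD = [0.0, 0.0, 0.0]

encoder_before = [row[:] for row in ENCODER]
trained_head = train_frozen(ENCODER, INITIAL_HEAD, DATA)
loss_before = mean_squared_error(ENCODER, INITIAL_HEAD, DATA)
loss_after = mean_squared_error(ENCODER, trained_head, DATA)
```

In joint training, the update step would also adjust the encoder weights. Keeping them fixed reduces the number of trainable parameters, which is one plausible reading of the efficiency gain the abstract attributes to frozen training.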

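Source journal

Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months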

Journal description: Knowledge-Based Systems is an international, interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.