Hyeonki Jeong, Taehyeong Kim, Wooseok Shin, Sung Won Han
Knowledge-Based Systems, Volume 319, Article 113556. Published 2025-04-24. DOI: 10.1016/j.knosys.2025.113556
Impact factor: 7.2. JCR: Q1 (Computer Science, Artificial Intelligence).
https://www.sciencedirect.com/science/article/pii/S0950705125006021
HiLINK: Hierarchical linking of context-aware knowledge prediction and prompt tuning for bilingual knowledge-based visual question answering
Knowledge-based visual question answering (KBVQA) is a representative visual reasoning task that leverages external knowledge for question answering in situations where predicting the correct answer using only image and query data is difficult. In addition to KBVQA, various visual reasoning tasks have been actively studied for their potential to improve visual understanding by combining text and image modalities effectively. However, these tasks have primarily focused on high-resource languages, such as English. In contrast, studies on low-resource languages remain comparatively rare. To mitigate this research gap, we propose HiLINK, which utilizes multilingual data to enhance KBVQA performance in various languages. In this study, we use the BOK-VQA dataset to design the following key methodologies: We propose an end-to-end model that eliminates the need for a knowledge graph embedding-based training network by learning relationships between triplet knowledge components within prompts directly using Link-Tuning. We propose the HK-TriNet and HK-TriNet+ methodologies to perform triplet prediction based on contextualized knowledge relationships. Finally, we apply the frozen training approach as an alternative to conventional encoder joint training to improve the efficiency and performance of bilingual learning. HiLINK exhibits outstanding performance on the BOK-VQA dataset in three language configurations: bilingual, English, and Korean, outperforming the GEL-VQA method by +19.40%, +12.01%, and +11.30%, respectively. Furthermore, the effectiveness of the proposed method is validated based on a comprehensive analysis of bilingual embedding spaces, both visually and numerically. We expect this study to inspire future research on this topic and encourage practical applications of improved vision-language models.
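The abstract gives no implementation details, so the following is only a rough illustration of two ideas it mentions: a knowledge triplet, and a prompt that carries the triplet's components alongside the question. The `Triplet` class, the `build_prompt` template, the `triplet_accuracy` helper, and the example entities are all assumptions made for this sketch; none of them is taken from the paper or from the BOK-VQA dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A knowledge triplet (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


def build_prompt(question: str, triplet: Triplet) -> str:
    """Place the triplet components next to the question.

    The template is hypothetical; the paper's actual prompt format
    is not described in the abstract.
    """
    return (
        f"knowledge: {triplet.head} | {triplet.relation} | {triplet.tail}\n"
        f"question: {question}"
    )


def triplet_accuracy(predicted: list[Triplet], gold: list[Triplet]) -> float:
    """Fraction of examples where head, relation, and tail all match."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


if __name__ == "__main__":
    # Illustrative entities only, not dataset entries.
    t = Triplet("Gyeongbokgung", "located in", "Seoul")
    print(build_prompt("Where is this palace?", t))
```

A triplet counts as correct here only when all three components match, which is one plausible way to score triplet prediction; the metric actually used in the paper may differ.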
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.