Iterative Caption Generation with Heuristic Guidance for enhancing knowledge-based visual question answering
Fengyuan Liu, Zhongjian Hu, Peng Yang, Xingyu Liu
Computer Vision and Image Understanding, Volume 261, Article 104515
DOI: 10.1016/j.cviu.2025.104515
Published: 2025-09-23
URL: https://www.sciencedirect.com/science/article/pii/S1077314225002383
Citations: 0
Abstract
The advent of large language models (LLMs) has significantly advanced Knowledge-based Visual Question Answering (KBVQA) by reducing the reliance on external knowledge bases. Traditional methods often generate captions in a single pass and can struggle with complex questions because they have difficulty precisely identifying the key visual components. This challenge undermines the reasoning capabilities of LLMs, which require accurate, semantically aligned captions to answer complex questions effectively. To address this limitation, we propose ICGHG (Iterative Caption Generation with Heuristic Guidance), a novel framework that refines captions iteratively. Our approach incorporates a dynamic loop in which captions are continuously refined based on heuristic feedback from a set of candidate answers and the question itself, ensuring that the final caption is semantically aligned with both the visual content and the question. By leveraging this iterative process, ICGHG mitigates common issues such as hallucinations and improves the quality of the generated captions. Extensive experiments on the OK-VQA, A-OKVQA, and FVQA datasets demonstrate that ICGHG significantly outperforms existing methods, achieving 57.5%, 60.2%, and 69.4% accuracy on their respective test sets and setting new benchmarks in KBVQA accuracy.
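The abstract describes the refinement loop only at a high level. The Python sketch below illustrates one plausible shape of such a loop; the helper functions (generate_caption, propose_candidates, heuristic_feedback, answer_with_llm), the iteration cap, and the score-based stopping criterion are hypothetical placeholders for the underlying vision-language and language models, not the authors' actual implementation.

```python
# Minimal sketch, assuming a caption-refine-answer pipeline as outlined in the
# abstract. All helpers are hypothetical stand-ins for model calls.

from typing import Callable, List, Tuple


def icghg_loop(
    image,                              # image input for the captioner
    question: str,                      # the KBVQA question
    generate_caption: Callable,         # (image, question, feedback) -> caption
    propose_candidates: Callable,       # (caption, question) -> candidate answers
    heuristic_feedback: Callable,       # (caption, question, candidates) -> (score, feedback)
    answer_with_llm: Callable,          # (caption, question) -> final answer
    max_iters: int = 3,                 # assumed iteration cap
    score_threshold: float = 0.8,       # assumed alignment threshold
) -> Tuple[str, str]:
    """Iteratively refine a caption guided by heuristic feedback,
    then answer the question with the final caption as context."""
    feedback = ""                       # no guidance on the first pass
    caption = ""
    for _ in range(max_iters):
        # (Re)generate a caption, conditioned on any prior feedback.
        caption = generate_caption(image, question, feedback)

        # Derive candidate answers from the current caption and the question.
        candidates: List[str] = propose_candidates(caption, question)

        # Score how well the caption supports answering the question and
        # obtain textual feedback on what is missing or misaligned.
        score, feedback = heuristic_feedback(caption, question, candidates)

        # Stop once the caption is judged sufficiently aligned.
        if score >= score_threshold:
            break

    # Answer with the refined caption as the LLM's visual context.
    answer = answer_with_llm(caption, question)
    return caption, answer
```

In this sketch the feedback string closes the loop between answer candidates and caption generation, which is the mechanism the abstract credits with reducing hallucinated or question-irrelevant captions.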
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems