Iterative Caption Generation with Heuristic Guidance for enhancing knowledge-based visual question answering

IF 3.5 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Fengyuan Liu, Zhongjian Hu, Peng Yang, Xingyu Liu
{"title":"Iterative Caption Generation with Heuristic Guidance for enhancing knowledge-based visual question answering","authors":"Fengyuan Liu ,&nbsp;Zhongjian Hu ,&nbsp;Peng Yang ,&nbsp;Xingyu Liu","doi":"10.1016/j.cviu.2025.104515","DOIUrl":null,"url":null,"abstract":"<div><div>The advent of large language models (LLMs) has significantly advanced Knowledge-based Visual Question Answering (KBVQA) by reducing the reliance on external knowledge bases. Traditional methods often generate captions in a single pass, which can struggle with complex questions due to difficulty in precisely identifying key visual components. This challenge undermines the reasoning capabilities of LLMs, which require accurate, semantically aligned captions to answer complex questions effectively. To address this limitation, we propose ICGHG <strong><u>I</u></strong>terative <strong><u>C</u></strong>aption <strong><u>G</u></strong>eneration with <strong><u>H</u></strong>euristic <strong><u>G</u></strong>uidance, a novel framework that refines captions iteratively. Our approach incorporates a dynamic loop where captions are continuously refined based on heuristic feedback from a set of candidate answers and the question itself, ensuring that the final caption provides accurate semantic alignment with both the visual content and the question. By leveraging this iterative process, ICGHG mitigates common issues such as hallucinations and improves the quality of the generated captions. Extensive experiments on OK-VQA, A-OKVQA, and FVQA datasets demonstrate that ICGHG significantly outperforms existing methods, achieving 57.5%, 60.2%, and 69.4% accuracy on their respective test sets, setting new benchmarks in KBVQA accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104515"},"PeriodicalIF":3.5000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225002383","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The advent of large language models (LLMs) has significantly advanced Knowledge-based Visual Question Answering (KBVQA) by reducing the reliance on external knowledge bases. Traditional methods often generate captions in a single pass, which can struggle with complex questions due to difficulty in precisely identifying key visual components. This challenge undermines the reasoning capabilities of LLMs, which require accurate, semantically aligned captions to answer complex questions effectively. To address this limitation, we propose ICGHG (Iterative Caption Generation with Heuristic Guidance), a novel framework that refines captions iteratively. Our approach incorporates a dynamic loop where captions are continuously refined based on heuristic feedback from a set of candidate answers and the question itself, ensuring that the final caption provides accurate semantic alignment with both the visual content and the question. By leveraging this iterative process, ICGHG mitigates common issues such as hallucinations and improves the quality of the generated captions. Extensive experiments on OK-VQA, A-OKVQA, and FVQA datasets demonstrate that ICGHG significantly outperforms existing methods, achieving 57.5%, 60.2%, and 69.4% accuracy on their respective test sets, setting new benchmarks in KBVQA accuracy.
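The refinement loop sketched in the abstract can be illustrated in code. The Python sketch below is only an illustration of the iterative idea under assumed interfaces: generate_caption, propose_answers, and score_alignment are hypothetical placeholders, and the stopping heuristic is an assumption for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of the iterative caption-refinement loop described in
# the abstract. The captioner/LLM interfaces passed in are placeholders, not
# the authors' actual components.

from typing import Callable, List, Tuple


def iterative_caption_generation(
    image,                                                    # any image representation the captioner accepts
    question: str,
    generate_caption: Callable[[object, str, str], str],      # (image, question, feedback) -> caption
    propose_answers: Callable[[str, str], List[str]],         # (caption, question) -> candidate answers
    score_alignment: Callable[[str, str, List[str]], float],  # (caption, question, answers) -> score in [0, 1]
    max_iters: int = 5,
    threshold: float = 0.8,
) -> Tuple[str, List[str]]:
    """Refine a caption until it aligns with the question and candidate answers."""
    feedback = ""          # heuristic guidance carried between iterations
    caption = ""
    answers: List[str] = []

    for _ in range(max_iters):
        # 1. Generate (or refine) a caption, conditioned on the question and
        #    the heuristic feedback from the previous iteration.
        caption = generate_caption(image, question, feedback)

        # 2. Ask the answerer for candidate answers grounded in the caption.
        answers = propose_answers(caption, question)

        # 3. Heuristically score how well caption, question, and candidate
        #    answers agree; stop once alignment is good enough.
        if score_alignment(caption, question, answers) >= threshold:
            break

        # 4. Turn the disagreement into feedback guiding the next refinement.
        feedback = (
            f"The caption '{caption}' did not sufficiently support the "
            f"candidate answers {answers}; describe the visual details "
            f"relevant to the question: '{question}'."
        )

    return caption, answers
```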
Source Journal
Computer Vision and Image Understanding (Engineering Technology - Engineering: Electrical & Electronic)
CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days
Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems