Enriched Image Captioning Based on Knowledge Divergence and Focus
An-An Liu; Quanhan Wu; Ning Xu; Hongshuo Tian; Lanjun Wang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4937-4948. DOI: 10.1109/TCSVT.2024.3525158
Abstract
Image captioning is a fundamental task in computer vision that aims to generate precise and comprehensive descriptions of images automatically. Intuitively, humans initially rely on the image content, e.g., “cake on a plate”, to gradually gather relevant knowledge facts, e.g., “birthday party” and “candles”, a process referred to as divergence. Then, we perform step-by-step reasoning over the images to refine and rearrange these knowledge facts for explicit sentence generation, a process referred to as focus. However, existing image captioning methods mainly rely on the encoder-decoder framework, which does not fit the “divergence-focus” nature of the task well. To this end, we propose the Knowledge “Divergence-Focus” method for Image Captioning (K-DFIC), which gathers and polishes knowledge facts for image understanding and consists of two components: 1) the Knowledge Divergence Module leverages the divergence capability of a large-scale pre-trained model to acquire knowledge facts relevant to the image content. To achieve this, we design a scene-graph-aware prompt that serves as a “trigger” for GPT-3.5, encouraging it to “diverge” and generate more sophisticated, human-like knowledge. 2) The Knowledge Focus Module refines the acquired knowledge facts and rearranges them in a coherent manner. We design an interactive refining network to encode the knowledge, which is refined with the visual features to remove irrelevant words. Then, to generate fluent image descriptions, we design a rearrangement method based on a large-scale pre-trained model to estimate the importance of each knowledge word for an image. Finally, we fuse the refined knowledge and visual features to assist the decoder in generating captions. We demonstrate the superiority of our approach through extensive experiments on the MSCOCO dataset, where it surpasses state-of-the-art performance across all metrics on the Karpathy split; for example, our model obtains the best CIDEr-D score of 148.4%. Additional ablation studies and visualizations further validate the effectiveness of our approach.
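To make the “knowledge divergence” step more concrete, the sketch below shows one plausible way to serialize scene-graph triplets into a prompt for GPT-3.5 and collect candidate knowledge facts. This is only an illustrative sketch: the prompt wording, the `build_divergence_prompt` and `diverge_knowledge` helpers, and the output parsing are assumptions, not the authors' actual implementation, which is not detailed in the abstract.

```python
# Hypothetical sketch of a scene-graph-aware "divergence" prompt for GPT-3.5.
# The prompt text and parsing are illustrative assumptions, not the paper's code.
from openai import OpenAI  # official OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_divergence_prompt(triplets: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, relation, object) triplets into a knowledge-eliciting prompt."""
    graph_text = "; ".join(f"{s} {r} {o}" for s, r, o in triplets)
    return (
        f"An image contains the following scene graph: {graph_text}. "
        "List short knowledge facts (related objects, events, attributes) that are "
        "likely associated with this scene, one per line."
    )


def diverge_knowledge(triplets: list[tuple[str, str, str]]) -> list[str]:
    """Query GPT-3.5 with the scene-graph-aware prompt and return candidate facts."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_divergence_prompt(triplets)}],
        temperature=0.7,  # mild randomness encourages more "divergent" facts
    )
    text = response.choices[0].message.content or ""
    # Keep non-empty lines, stripping simple list markers such as "- " or "1. "
    return [line.lstrip("-*0123456789. ").strip()
            for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    facts = diverge_knowledge([("cake", "on", "plate"), ("candles", "on", "cake")])
    print(facts)  # e.g. ["birthday party", "celebration", "frosting", ...]
```

In the paper's pipeline, facts like these would then be passed to the Knowledge Focus Module, which refines them against the visual features and estimates per-word importance before fusion with the decoder; that stage is not sketched here.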
About the Journal
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.