Enriched Image Captioning Based on Knowledge Divergence and Focus
An-An Liu; Quanhan Wu; Ning Xu; Hongshuo Tian; Lanjun Wang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4937-4948. DOI: 10.1109/TCSVT.2024.3525158
Abstract
Image captioning is a fundamental task in computer vision that aims to generate precise and comprehensive descriptions of images automatically. Intuitively, humans initially rely on the image content, e.g., “cake on a plate”, to gradually gather relevant knowledge facts, e.g., “birthday party” and “candles”, a process referred to as divergence. Then, we perform step-by-step reasoning over the images to refine and rearrange these knowledge facts for explicit sentence generation, a process referred to as focus. However, existing image captioning methods mainly rely on the encoder-decoder framework, which does not fit the “divergence-focus” nature of the task well. To this end, we propose the Knowledge “Divergence-Focus” method for Image Captioning (K-DFIC), which gathers and polishes knowledge facts for image understanding and consists of two components: 1) the Knowledge Divergence Module leverages the divergence capability of a large-scale pre-trained model to acquire knowledge facts relevant to the image content. To achieve this, we design a scene-graph-aware prompt that serves as a “trigger” for GPT-3.5, encouraging it to “diverge” and generate more sophisticated, human-like knowledge. 2) The Knowledge Focus Module refines the acquired knowledge facts and rearranges them in a coherent manner. We design an interactive refining network to encode the knowledge, which is refined with the visual features to remove irrelevant words. Then, to generate fluent image descriptions, we design a rearrangement method based on a large-scale pre-trained model to estimate the importance of each knowledge word for an image. Finally, we fuse the refined knowledge and visual features to assist the decoder in generating captions. We demonstrate the superiority of our approach through extensive experiments on the MSCOCO dataset, where it surpasses state-of-the-art performance across all metrics on the Karpathy split; for example, our model obtains the best CIDEr-D score of 148.4%. Additional ablation studies and visualizations further validate the effectiveness of our approach.
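To make the “knowledge divergence” step more concrete, the sketch below shows one plausible way to serialize scene-graph triplets into a prompt for GPT-3.5 and collect candidate knowledge facts. This is only an illustrative sketch: the prompt wording, the `build_divergence_prompt` and `diverge_knowledge` helpers, and the output parsing are assumptions, not the authors' actual implementation, which is not detailed in the abstract.

```python
# Hypothetical sketch of a scene-graph-aware "divergence" prompt for GPT-3.5.
# The prompt text and parsing are illustrative assumptions, not the paper's code.
from openai import OpenAI  # official OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_divergence_prompt(triplets: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, relation, object) triplets into a knowledge-eliciting prompt."""
    graph_text = "; ".join(f"{s} {r} {o}" for s, r, o in triplets)
    return (
        f"An image contains the following scene graph: {graph_text}. "
        "List short knowledge facts (related objects, events, attributes) that are "
        "likely associated with this scene, one per line."
    )


def diverge_knowledge(triplets: list[tuple[str, str, str]]) -> list[str]:
    """Query GPT-3.5 with the scene-graph-aware prompt and return candidate facts."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_divergence_prompt(triplets)}],
        temperature=0.7,  # mild randomness encourages more "divergent" facts
    )
    text = response.choices[0].message.content or ""
    # Keep non-empty lines, stripping simple list markers such as "- " or "1. "
    return [line.lstrip("-*0123456789. ").strip()
            for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    facts = diverge_knowledge([("cake", "on", "plate"), ("candles", "on", "cake")])
    print(facts)  # e.g. ["birthday party", "celebration", "frosting", ...]
```

In the paper's pipeline, facts like these would then be passed to the Knowledge Focus Module, which refines them against the visual features and estimates per-word importance before fusion with the decoder; that stage is not sketched here.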
About the Journal
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.