{"title":"端到端图像字幕的完全语义差距恢复","authors":"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao","doi":"10.1109/TCSVT.2025.3558088","DOIUrl":null,"url":null,"abstract":"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9365-9383"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fully Semantic Gap Recovery for End-to-End Image Captioning\",\"authors\":\"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao\",\"doi\":\"10.1109/TCSVT.2025.3558088\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. 
Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"9365-9383\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949179/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10949179/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Fully Semantic Gap Recovery for End-to-End Image Captioning
Image captioning (IC) involves comprehending images in the visual domain to generate descriptions in the linguistic domain that are grounded in visual elements. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities, and then employ unimodal self-attention fusion to uncover high-level semantic associations. However, we find that this paradigm suffers from an inherent intra-modal semantic gap rooted in the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information because of modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, are confined to global cross-modal alignment, so local visual features belonging to the same object can exhibit distinct semantics. Given such semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. We therefore propose a novel Fully Semantic Gap Recovery (FSGR) method that extends CLIP's robust cross-modal bridge to a fine-grained level and consolidates vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method that aggregates semantically similar visual patches. Next, we design a semantic quantification module to abstract a language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing plausible captions to be generated from the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model achieves new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. The source code is released at https://github.com/gjc0824/FSGR.
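The abstract describes the local contrastive learning step only at a high level. As a rough illustration of what patch-level (local) contrastive learning over CLIP-style visual features could look like, the PyTorch sketch below pulls together patches assigned to the same pseudo-semantic group with an InfoNCE-style objective. The grouping strategy, function names, and tensor shapes are assumptions made for demonstration; they are not taken from the FSGR paper or its released code.

```python
# Illustrative sketch only: a patch-level (local) contrastive loss over
# CLIP-style visual features. Pseudo-group ids and all names are assumptions.
import torch
import torch.nn.functional as F


def local_patch_contrastive_loss(patch_feats: torch.Tensor,
                                 patch_groups: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pull together patches that share a (pseudo) semantic group.

    patch_feats:  [B, N, D] patch embeddings from a frozen CLIP visual encoder.
    patch_groups: [B, N] integer ids; patches with the same id are treated as
                  belonging to the same object/region (positives).
    """
    B, N, D = patch_feats.shape
    feats = F.normalize(patch_feats, dim=-1)           # work in cosine space
    sim = torch.matmul(feats, feats.transpose(1, 2))   # [B, N, N] similarities
    logits = sim / temperature

    # Positive mask: same group id, excluding self-pairs.
    same_group = patch_groups.unsqueeze(2) == patch_groups.unsqueeze(1)  # [B, N, N]
    eye = torch.eye(N, dtype=torch.bool, device=patch_feats.device)
    pos_mask = same_group & ~eye

    # InfoNCE-style objective: for each patch, all other patches are candidates;
    # average the log-probability assigned to its positives.
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float("-inf")), dim=-1, keepdim=True)
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=-1) / pos_count

    # Average only over patches that actually have at least one positive.
    valid = pos_mask.any(dim=-1)
    return loss[valid].mean() if valid.any() else loss.new_zeros(())


if __name__ == "__main__":
    # Toy example: 2 images, 16 patches each, 512-d CLIP-like features,
    # with 4 pseudo-groups per image.
    feats = torch.randn(2, 16, 512)
    groups = torch.randint(0, 4, (2, 16))
    print(local_patch_contrastive_loss(feats, groups))
```

In a full captioning pipeline such a loss would be combined with the generation objective, and the pseudo-groups would come from whatever object or region cues the model derives; the authors' actual implementation is available at https://github.com/gjc0824/FSGR.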
About the Journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.