{"title":"端到端图像字幕的完全语义差距恢复","authors":"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao","doi":"10.1109/TCSVT.2025.3558088","DOIUrl":null,"url":null,"abstract":"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9365-9383"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fully Semantic Gap Recovery for End-to-End Image Captioning\",\"authors\":\"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao\",\"doi\":\"10.1109/TCSVT.2025.3558088\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. 
Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"9365-9383\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949179/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10949179/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Fully Semantic Gap Recovery for End-to-End Image Captioning
Image captioning (IC) involves comprehending images in the visual domain to generate descriptions in the linguistic domain that are grounded in visual elements. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities, and then employ unimodal self-attention fusion to uncover high-level semantic associations. However, we find that this paradigm suffers from an inherent intra-modal semantic gap rooted in the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information because of modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, are confined to global cross-modal alignment, so local visual features belonging to the same object can exhibit distinct semantics. Given such semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. We therefore propose a novel Fully Semantic Gap Recovery (FSGR) method that extends CLIP's robust cross-modal bridge to a fine-grained level and consolidates vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method that aggregates semantically similar visual patches. Next, we design a semantic quantification module to abstract a language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing plausible captions to be generated from the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model achieves new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. The source code is released at https://github.com/gjc0824/FSGR.
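The abstract describes the local contrastive learning step only at a high level. As a rough illustration of what patch-level (local) contrastive learning over CLIP-style visual features could look like, the PyTorch sketch below pulls together patches assigned to the same pseudo-semantic group with an InfoNCE-style objective. The grouping strategy, function names, and tensor shapes are assumptions made for demonstration; they are not taken from the FSGR paper or its released code.

```python
# Illustrative sketch only: a patch-level (local) contrastive loss over
# CLIP-style visual features. Pseudo-group ids and all names are assumptions.
import torch
import torch.nn.functional as F


def local_patch_contrastive_loss(patch_feats: torch.Tensor,
                                 patch_groups: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pull together patches that share a (pseudo) semantic group.

    patch_feats:  [B, N, D] patch embeddings from a frozen CLIP visual encoder.
    patch_groups: [B, N] integer ids; patches with the same id are treated as
                  belonging to the same object/region (positives).
    """
    B, N, D = patch_feats.shape
    feats = F.normalize(patch_feats, dim=-1)           # work in cosine space
    sim = torch.matmul(feats, feats.transpose(1, 2))   # [B, N, N] similarities
    logits = sim / temperature

    # Positive mask: same group id, excluding self-pairs.
    same_group = patch_groups.unsqueeze(2) == patch_groups.unsqueeze(1)  # [B, N, N]
    eye = torch.eye(N, dtype=torch.bool, device=patch_feats.device)
    pos_mask = same_group & ~eye

    # InfoNCE-style objective: for each patch, all other patches are candidates;
    # average the log-probability assigned to its positives.
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float("-inf")), dim=-1, keepdim=True)
    pos_count = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=-1) / pos_count

    # Average only over patches that actually have at least one positive.
    valid = pos_mask.any(dim=-1)
    return loss[valid].mean() if valid.any() else loss.new_zeros(())


if __name__ == "__main__":
    # Toy example: 2 images, 16 patches each, 512-d CLIP-like features,
    # with 4 pseudo-groups per image.
    feats = torch.randn(2, 16, 512)
    groups = torch.randint(0, 4, (2, 16))
    print(local_patch_contrastive_loss(feats, groups))
```

In a full captioning pipeline such a loss would be combined with the generation objective, and the pseudo-groups would come from whatever object or region cues the model derives; the authors' actual implementation is available at https://github.com/gjc0824/FSGR.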
About the Journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.