MCT-CCDiff: Context-Aware Contrastive Diffusion Model With Mediator-Bridging Cross-Modal Transformer for Image Change Captioning

Jinhong Hu; Guojin Zhong; Jin Yuan; Wenbo Pan; Xiaoping Wang

IEEE Transactions on Image Processing, vol. 34, pp. 3294-3308, 2025. DOI: 10.1109/TIP.2025.3573471
Abstract
Recent advancements in diffusion models (DMs) have showcased superior capabilities in generating images and text. This paper is the first to introduce DMs into image change captioning (ICC), proposing a novel Context-aware Contrastive Diffusion model with a Mediator-bridging Cross-modal Transformer (MCT-CCDiff) to accurately predict visual difference descriptions conditioned on two similar images. Technically, MCT-CCDiff develops a Text Embedding Contrastive Loss (TECL) that leverages both positive and negative samples to distinguish text embeddings more effectively, yielding more discriminative text representations for ICC. To accurately predict visual difference descriptions, MCT-CCDiff introduces a Mediator-bridging Cross-modal Transformer (MCTrans) that efficiently explores the cross-modal correlations between visual differences and the corresponding text through a lightweight mediator, mitigating interference from visual redundancy and reducing interaction overhead. Additionally, MCT-CCDiff incorporates context-augmented denoising, implemented via a revised diffusion loss, to better capture the contextual relationships among caption words; the revised loss provides a tighter optimization bound and thus stronger optimization for high-quality text generation. Extensive experiments on four benchmark datasets demonstrate that MCT-CCDiff significantly outperforms state-of-the-art ICC methods, marking a notable advancement in generating precise visual difference descriptions.
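The abstract does not specify the exact form of TECL or MCTrans, but the two ideas have familiar shapes. As a rough illustration of a contrastive loss over text embeddings that uses both positive and negative samples, the following is a minimal InfoNCE-style sketch; it is not the authors' loss, and all names, shapes, and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def text_embedding_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss over text embeddings (illustrative only).

    anchor:    (B, D) embeddings, e.g. of generated captions
    positive:  (B, D) embeddings of the matching ground-truth captions
    negatives: (B, K, D) embeddings of K non-matching captions per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity to the one positive and the K negatives.
    pos_logits = (anchor * positive).sum(-1, keepdim=True) / temperature      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives) / temperature  # (B, K)

    # Cross-entropy with the positive always in column 0.
    logits = torch.cat([pos_logits, neg_logits], dim=1)                       # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Likewise, one common way a "lightweight mediator" can cut cross-modal interaction cost is a small set of learnable tokens that first summarize the visual features and then serve as the sole keys/values for the text. The sketch below, with hypothetical module and parameter names, only illustrates that bridging pattern, not the paper's MCTrans: direct text-to-visual attention costs on the order of N_text * N_vis, while bridging through K mediators costs K * N_vis + N_text * K.

```python
import torch
import torch.nn as nn

class MediatorBridgedAttention(nn.Module):
    """Hypothetical mediator-bridged cross-attention (illustrative only)."""

    def __init__(self, dim=512, num_mediators=8, num_heads=8):
        super().__init__()
        self.mediators = nn.Parameter(torch.randn(1, num_mediators, dim) * 0.02)
        self.vis_to_med = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.med_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, visual_diff_tokens):
        # text_tokens:        (B, N_text, D)
        # visual_diff_tokens: (B, N_vis, D)
        B = text_tokens.size(0)
        med = self.mediators.expand(B, -1, -1)
        # Step 1: mediators gather salient visual-difference information,
        # filtering out redundant visual content.
        med, _ = self.vis_to_med(med, visual_diff_tokens, visual_diff_tokens)
        # Step 2: text tokens read from the compact mediator summary instead
        # of attending to all visual tokens directly.
        out, _ = self.med_to_text(text_tokens, med, med)
        return text_tokens + out  # residual connection
```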