MCT-CCDiff: Context-Aware Contrastive Diffusion Model With Mediator-Bridging Cross-Modal Transformer for Image Change Captioning

Impact Factor: 13.7
Jinhong Hu, Guojin Zhong, Jin Yuan, Wenbo Pan, Xiaoping Wang
DOI: 10.1109/TIP.2025.3573471
Journal: IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society), vol. 34, pp. 3294-3308
Publication date: 2025-06-02 (Journal Article)
URL: https://ieeexplore.ieee.org/document/11021330/
Citations: 0

Abstract

Recent advances in diffusion models (DMs) have demonstrated strong capabilities in generating images and text. This paper is the first to introduce DMs into image change captioning (ICC), proposing a novel Context-aware Contrastive Diffusion model with Mediator-bridging Cross-modal Transformer (MCT-CCDiff) that accurately predicts visual difference descriptions conditioned on two similar images. Technically, MCT-CCDiff develops a Text Embedding Contrastive Loss (TECL) that leverages both positive and negative samples to distinguish text embeddings more effectively, yielding more discriminative text representations for ICC. To predict visual difference descriptions accurately, MCT-CCDiff introduces a Mediator-bridging Cross-modal Transformer (MCTrans) that efficiently explores cross-modal correlations between visual differences and the corresponding text through a lightweight mediator, mitigating interference from visual redundancy and reducing interaction overhead. Additionally, it incorporates context-augmented denoising, implemented via a revised diffusion loss, to better capture the contextual relationships among caption words; the revised loss provides a tighter optimization bound and thus stronger optimization for high-quality text generation. Extensive experiments on four benchmark datasets demonstrate that MCT-CCDiff significantly outperforms state-of-the-art ICC methods, marking a notable advance in generating precise visual difference descriptions.
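The abstract does not give TECL's exact formulation, but a loss that "leverages both positive and negative samples to distinguish text embeddings" is commonly realized in InfoNCE style. The sketch below is purely illustrative of that general pattern, not the paper's actual loss; the function name, the temperature value, and the batching convention are all assumptions.

```python
import numpy as np

def tecl_sketch(anchors, positives, negatives, tau=0.07):
    """Hypothetical InfoNCE-style text-embedding contrastive loss.

    anchors:   (B, D) caption embeddings predicted by the model
    positives: (B, D) ground-truth caption embeddings
    negatives: (B, K, D) caption embeddings drawn from other image pairs
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = l2norm(anchors), l2norm(positives), l2norm(negatives)
    pos_sim = np.sum(a * p, axis=-1, keepdims=True) / tau   # (B, 1)
    neg_sim = np.einsum('bd,bkd->bk', a, n) / tau           # (B, K)
    logits = np.concatenate([pos_sim, neg_sim], axis=1)     # (B, 1+K)
    # cross-entropy with the positive always at index 0
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

When an anchor matches its positive and is far from all negatives, the loss approaches zero; when an anchor sits closer to a negative than to its positive, the loss grows, pushing the embeddings apart and producing the more discriminative text representations the abstract describes.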