对比条件潜扩散在视听分割中的应用

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-06-23 DOI:10.1109/TIP.2025.3580269

Yuxin Mao;Jing Zhang;Mochu Xiang;Yunqiu Lv;Dong Li;Yiran Zhong;Yuchao Dai

{"title":"对比条件潜扩散在视听分割中的应用","authors":"Yuxin Mao;Jing Zhang;Mochu Xiang;Yunqiu Lv;Dong Li;Yiran Zhong;Yuchao Dai","doi":"10.1109/TIP.2025.3580269","DOIUrl":null,"url":null,"abstract":"Audio-visual Segmentation (AVS) is conceptualized as a conditional generation task, where audio is considered as the conditional variable for segmenting the sound producer(s). In this case, audio should be extensively explored to maximize its contribution for the final segmentation task. We propose a contrastive conditional latent diffusion model for audio-visual segmentation (AVS) to thoroughly investigate the impact of audio, where the correlation between audio and the final segmentation map is modeled to guarantee the strong correlation between them. To achieve semantic-correlated representation learning, our framework incorporates a latent diffusion model. The diffusion model learns the conditional generation process of the ground-truth segmentation map, resulting in ground-truth aware inference during the denoising process at the test stage. As our model is conditional, it is vital to ensure that the conditional variable contributes to the model output. We thus extensively model the contribution of the audio signal by minimizing the density ratio between the conditional probability of the multimodal data, e.g. conditioned on the audio-visual data, and that of the unimodal data, e.g. conditioned on the audio data only. In this way, our latent diffusion model via density ratio optimization explicitly maximizes the contribution of audio for AVS, which can then be achieved with contrastive learning as a constraint, where the diffusion part serves as the main objective to achieve maximum likelihood estimation, and the density ratio optimization part imposes the constraint. By adopting this latent diffusion model via contrastive learning, we effectively enhance the contribution of audio for AVS. The effectiveness of our solution is validated through experimental results on the benchmark dataset. Code and results are online via our project page: <uri>https://github.com/OpenNLPLab/DiffusionAVS</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4108-4119"},"PeriodicalIF":13.7000,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Contrastive Conditional Latent Diffusion for Audio-Visual Segmentation\",\"authors\":\"Yuxin Mao;Jing Zhang;Mochu Xiang;Yunqiu Lv;Dong Li;Yiran Zhong;Yuchao Dai\",\"doi\":\"10.1109/TIP.2025.3580269\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-visual Segmentation (AVS) is conceptualized as a conditional generation task, where audio is considered as the conditional variable for segmenting the sound producer(s). In this case, audio should be extensively explored to maximize its contribution for the final segmentation task. We propose a contrastive conditional latent diffusion model for audio-visual segmentation (AVS) to thoroughly investigate the impact of audio, where the correlation between audio and the final segmentation map is modeled to guarantee the strong correlation between them. To achieve semantic-correlated representation learning, our framework incorporates a latent diffusion model. The diffusion model learns the conditional generation process of the ground-truth segmentation map, resulting in ground-truth aware inference during the denoising process at the test stage. As our model is conditional, it is vital to ensure that the conditional variable contributes to the model output. We thus extensively model the contribution of the audio signal by minimizing the density ratio between the conditional probability of the multimodal data, e.g. conditioned on the audio-visual data, and that of the unimodal data, e.g. conditioned on the audio data only. In this way, our latent diffusion model via density ratio optimization explicitly maximizes the contribution of audio for AVS, which can then be achieved with contrastive learning as a constraint, where the diffusion part serves as the main objective to achieve maximum likelihood estimation, and the density ratio optimization part imposes the constraint. By adopting this latent diffusion model via contrastive learning, we effectively enhance the contribution of audio for AVS. The effectiveness of our solution is validated through experimental results on the benchmark dataset. Code and results are online via our project page: <uri>https://github.com/OpenNLPLab/DiffusionAVS</uri>\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"34 \",\"pages\":\"4108-4119\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11048437/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11048437/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

视听分割（AVS）被定义为条件生成任务，其中音频被视为分割声音生产者的条件变量。在这种情况下，应该广泛探索音频，以最大限度地发挥其对最终分割任务的贡献。我们提出了一种对比条件潜扩散模型用于视听分割（AVS），以彻底研究音频的影响，其中音频与最终分割图之间的相关性建模，以保证它们之间的强相关性。为了实现语义相关表示学习，我们的框架结合了一个潜在扩散模型。扩散模型学习地真分割图的条件生成过程，在测试阶段去噪过程中进行地真感知推理。由于我们的模型是有条件的，因此确保条件变量对模型输出有贡献是至关重要的。因此，我们通过最小化多模态数据（例如以视听数据为条件的数据）和单模态数据（例如仅以音频数据为条件的数据）的条件概率之间的密度比，广泛地建模音频信号的贡献。这样，我们通过密度比优化的潜在扩散模型明确地最大化了音频对AVS的贡献，然后可以用对比学习作为约束来实现，其中扩散部分是实现最大似然估计的主要目标，密度比优化部分施加约束。通过对比学习，采用这种潜在扩散模型，有效地提高了音频对AVS的贡献。通过在基准数据集上的实验结果验证了该解决方案的有效性。代码和结果通过我们的项目页面：https://github.com/OpenNLPLab/DiffusionAVS在线

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Contrastive Conditional Latent Diffusion for Audio-Visual Segmentation

Audio-visual Segmentation (AVS) is conceptualized as a conditional generation task, where audio is considered as the conditional variable for segmenting the sound producer(s). In this case, audio should be extensively explored to maximize its contribution for the final segmentation task. We propose a contrastive conditional latent diffusion model for audio-visual segmentation (AVS) to thoroughly investigate the impact of audio, where the correlation between audio and the final segmentation map is modeled to guarantee the strong correlation between them. To achieve semantic-correlated representation learning, our framework incorporates a latent diffusion model. The diffusion model learns the conditional generation process of the ground-truth segmentation map, resulting in ground-truth aware inference during the denoising process at the test stage. As our model is conditional, it is vital to ensure that the conditional variable contributes to the model output. We thus extensively model the contribution of the audio signal by minimizing the density ratio between the conditional probability of the multimodal data, e.g. conditioned on the audio-visual data, and that of the unimodal data, e.g. conditioned on the audio data only. In this way, our latent diffusion model via density ratio optimization explicitly maximizes the contribution of audio for AVS, which can then be achieved with contrastive learning as a constraint, where the diffusion part serves as the main objective to achieve maximum likelihood estimation, and the density ratio optimization part imposes the constraint. By adopting this latent diffusion model via contrastive learning, we effectively enhance the contribution of audio for AVS. The effectiveness of our solution is validated through experimental results on the benchmark dataset. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

自引率

0.00%

发文量