Conditional Dual Diffusion for Multimodal Clustering of Optical and SAR Images
Shujun Liu; Ling Chang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 6, pp. 5318-5330, 2025. DOI: 10.1109/TCSVT.2025.3533301 (published online 2025-01-23). https://ieeexplore.ieee.org/document/10851381/
Owing to the different wavelengths of their imaging mechanisms, optical images usually embed low-dimensional manifolds into higher-dimensional ambient spaces than SAR images do. Exploiting their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) with a unique solution. Unlike vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder that predicts the noise and the clean image simultaneously, preserving the data manifolds embedded in latent space. To fuse the manifolds of optical and SAR images, we train the model to generate optical images conditioned on SAR images, mapping both modalities into a unified latent space. The features extracted from the model are fed to the K-means algorithm to produce the final clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show that our method outperforms state-of-the-art (SOTA) methods overall in both clustering performance and time consumption. The source code is available at https://github.com/suldier/CDD.
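The pipeline sketched in the abstract can be illustrated with two standard ingredients: the diffusion identity that links noise prediction and clean-image prediction (which is what makes a "dual" predictor consistent), and K-means clustering of the learned features. The sketch below is a minimal NumPy illustration under assumed names (`abar_t`, toy latents, a hand-rolled Lloyd's K-means); it uses the generic DDPM forward process, not necessarily the paper's exact CDD parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Relation between noise prediction and clean-image prediction ---
# In a DDPM-style forward process, x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps,
# so a network that predicts eps implicitly determines a prediction of x0 too.
abar_t = 0.7                          # hypothetical cumulative alpha at step t
x0 = rng.normal(size=(4, 8))          # toy "clean" latents
eps = rng.normal(size=x0.shape)       # Gaussian noise
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_from_eps = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
assert np.allclose(x0_from_eps, x0)   # recovering x0 from the (true) noise

# --- Feeding learned features to K-means (Lloyd's algorithm) ---
def kmeans(X, k, iters=50):
    """Minimal Lloyd's K-means; deterministic init from evenly spaced points."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # squared Euclidean distance of every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two well-separated blobs standing in for fused optical/SAR latent features
feats = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                   rng.normal(5.0, 0.1, (20, 2))])
labels = kmeans(feats, k=2)           # each blob lands in a single cluster
```

In the actual method, `feats` would be the latent representations produced by the trained CDD model rather than synthetic blobs; the point of the sketch is only the interface between the generative model and the clustering step.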
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.