Latest Articles: IEEE Transactions on Circuits and Systems for Video Technology

An Attention-Locating Algorithm for Eliminating Background Effects in Fine-Grained Visual Classification
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-28 | DOI: 10.1109/TCSVT.2025.3535818
Yueting Huang; Zhenzhe Hechen; Mingliang Zhou; Zhengguo Li; Sam Kwong
Abstract: Fine-grained visual classification (FGVC) is a challenging task characterized by interclass similarity and intraclass diversity and has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) in FGVC tasks since the data specificity of the multihead self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leads to the model being easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT contains a two-stage framework to identify crucial regions effectively within images and enhance features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural scores of the MSA, extracting finer low-level features to offer more comprehensive information through position mapping. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms other methods, confirming the effectiveness of our proposed approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
Volume 35, Issue 6, pp. 5993-6006.
Citations: 0
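The ASM described above locates target regions from the "natural scores" of multi-head self-attention. As a rough illustration of that general idea (not the authors' implementation; all names and shapes are assumptions), the sketch below averages a ViT block's CLS-to-patch attention over heads, keeps the top-k most attended patch tokens, and maps their indices back to grid positions:

```python
import torch

def select_salient_patches(attn, patch_tokens, top_k=32):
    """Keep the patch tokens that the CLS token attends to most.

    attn:         (B, heads, 1+N, 1+N) self-attention weights of one ViT block
    patch_tokens: (B, N, C) patch embeddings from the same block
    Returns the selected tokens, their scores, and their (row, col) grid positions.
    """
    cls_to_patch = attn.mean(dim=1)[:, 0, 1:]             # (B, N): head-averaged CLS -> patch scores
    scores, idx = cls_to_patch.topk(top_k, dim=-1)        # most attended patches
    gathered = torch.gather(
        patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))
    grid = int(patch_tokens.shape[1] ** 0.5)
    pos = torch.stack([idx // grid, idx % grid], dim=-1)  # map indices back to the patch grid
    return gathered, scores, pos

# toy example: 14x14 patch grid (196 patches + 1 CLS token), 3 heads
attn = torch.softmax(torch.randn(2, 3, 197, 197), dim=-1)
tokens = torch.randn(2, 196, 768)
feats, scores, pos = select_salient_patches(attn, tokens, top_k=16)
print(feats.shape, pos.shape)   # torch.Size([2, 16, 768]) torch.Size([2, 16, 2])
```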
Relighting Scenes With Object Insertions in Neural Radiance Fields
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-28 | DOI: 10.1109/TCSVT.2025.3535599
Xuening Zhu; Renjiao Yi; Xin Wen; Chenyang Zhu; Kai Xu
Abstract: Inserting objects into scenes and performing realistic relighting are common applications in augmented reality (AR). Previous methods focused on inserting virtual objects using CAD models or real objects from single-view images, resulting in highly limited AR application scenarios. We introduce a novel pipeline based on Neural Radiance Fields (NeRFs) for seamlessly integrating objects into scenes from two sets of images depicting the object and the scene. This approach enables novel view synthesis and realistic relighting, and supports physical interactions such as shadow casting between objects. The lighting environment is represented by a hybrid of Spherical Harmonics and Spherical Gaussians, capturing both high- and low-frequency lighting components and supporting non-Lambertian surfaces. Specifically, we leverage the benefits of volume rendering and introduce an efficient approach for shadow rendering that compares the depth maps between the camera view and the light source view to generate vivid soft shadows. The proposed method achieves realistic relighting effects in extensive experimental evaluations.
Volume 35, Issue 7, pp. 6787-6802.
Citations: 0
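The shadow rendering step above compares depth maps between the camera view and the light-source view. The sketch below shows the classic shadow-map test that this comparison amounts to; the function and its inputs (project_to_light_pixel, light_dist_map, and so on) are illustrative assumptions, not the paper's renderer. Averaging this test over several jittered light samples is one common way to soften the shadow boundary.

```python
import numpy as np

def shadow_test(points, light_pos, light_view, project_to_light_pixel, light_dist_map, eps=1e-2):
    """Classic shadow-map test: a surface point is lit only if nothing sits
    between it and the light, i.e. its distance to the light does not exceed
    the distance recorded from the light's viewpoint (plus a small bias).

    points:                 (N, 3) world-space surface points
    light_pos:              (3,)   light position in world space
    light_view:             (4, 4) world -> light-camera transform
    project_to_light_pixel: callable mapping light-camera coords (N, 3) -> integer pixel arrays (u, v)
    light_dist_map:         (H, W) distance-to-light map rendered from the light source view
    """
    hom = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # homogeneous coordinates
    cam = (light_view @ hom.T).T[:, :3]                                # points in the light's frame
    dist_to_light = np.linalg.norm(points - light_pos, axis=1)
    u, v = project_to_light_pixel(cam)                                 # pixel each point falls on
    visible_dist = light_dist_map[v, u]                                # closest surface the light sees
    return dist_to_light <= visible_dist + eps                         # False -> point is in shadow
```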
Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-27 | DOI: 10.1109/TCSVT.2025.3534204
Kunshan Yang; Lin Zuo; Mengmeng Jing; Xianlong Tian; Kunbin He; Yongqi Ding
Abstract: Existing computer vision methods mainly focus on the recognition of rigid objects, whereas the recognition of flexible objects remains unexplored. Recognizing flexible objects poses significant challenges due to their inherently diverse shapes and sizes, translucent attributes, ambiguous boundaries, and subtle inter-class differences. In this paper, we claim that these problems primarily arise from the lack of object saliency. To this end, we propose the Flexible Vision Graph Neural Network (FViG) to optimize the self-saliency and thereby improve the discrimination of the representations for flexible objects. Specifically, on one hand, we propose to maximize the channel-aware saliency by extracting the weights of neighboring graph nodes, which is employed to identify flexible objects with minimal inter-class differences. On the other hand, we maximize the spatial-aware saliency based on clustering to aggregate neighborhood information for the centroid graph nodes. This introduces local context information and enables the extraction of consistent representations, effectively adapting to the shape and size variations of flexible objects. To thoroughly verify the performance of flexible object recognition, we propose, for the first time, the Flexible Dataset (FDA), which consists of various images of flexible objects collected from real-world scenarios or online. Extensive experiments evaluated on our FDA, FireNet, CIFAR-100 and ImageNet-Hard datasets demonstrate the effectiveness of our method in enhancing the discrimination of flexible objects.
Volume 35, Issue 7, pp. 6424-6436.
Citations: 0
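To make the "weights of neighboring graph nodes" idea concrete, here is a minimal, hypothetical sketch that builds a k-NN graph over patch features and turns each node's similarity to its neighbours into a saliency weight; it only illustrates the general mechanism, not FViG itself:

```python
import torch
import torch.nn.functional as F

def knn_neighbor_weights(feats, k=8):
    """Build a k-NN graph over node features and derive a per-node saliency
    weight from how strongly each node resembles its neighbours.

    feats: (N, C) patch/node features of one image
    Returns (N,) saliency weights and (N, k) neighbour indices.
    """
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t()                                      # cosine similarity between all nodes
    sim.fill_diagonal_(-1.0)                             # exclude self-loops from the graph
    nn_sim, nn_idx = sim.topk(k, dim=-1)                 # k nearest neighbours per node
    weights = torch.softmax(nn_sim.mean(dim=-1), dim=0)  # tighter neighbourhoods -> larger weight
    return weights, nn_idx

w, idx = knn_neighbor_weights(torch.randn(196, 256))
print(w.shape, idx.shape)   # torch.Size([196]) torch.Size([196, 8])
```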
SGFormer: Spherical Geometry Transformer for 360° Depth Estimation
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-27 | DOI: 10.1109/TCSVT.2025.3534220
Junsong Zhang; Zisong Chen; Chunyu Lin; Zhijie Shen; Lang Nie; Kang Liao; Yao Zhao
Abstract: Panoramic distortion poses a significant challenge in 360° depth estimation and is particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, resulting in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar reprojection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions. Our code will be made publicly available at https://github.com/iuiuJaon/SGFormer.
Volume 35, Issue 6, pp. 5738-5748.
Citations: 0
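The spherical priors mentioned above (equidistortion near the poles, surface distance) come from treating equirectangular pixels as points on a sphere. A small, self-contained illustration of that geometry, independent of SGFormer's actual SPDecoder, follows:

```python
import numpy as np

def pixel_to_latlon(u, v, width, height):
    """Map equirectangular pixel coordinates to spherical (lat, lon) in radians."""
    lon = (u / width) * 2.0 * np.pi - np.pi        # [-pi, pi); horizontal stretching grows toward the poles
    lat = np.pi / 2.0 - (v / height) * np.pi       # [pi/2, -pi/2]
    return lat, lon

def great_circle_distance(lat1, lon1, lat2, lon2):
    """Surface distance on the unit sphere between two viewing directions."""
    cos_d = (np.sin(lat1) * np.sin(lat2) +
             np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

# two pixels that are far apart in image space but close on the sphere (near a pole)
lat_a, lon_a = pixel_to_latlon(0, 5, 1024, 512)
lat_b, lon_b = pixel_to_latlon(1000, 5, 1024, 512)
print(great_circle_distance(lat_a, lon_a, lat_b, lon_b))   # small angular distance
```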
TRNet: Two-Tier Recursion Network for Co-Salient Object Detection
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-27 | DOI: 10.1109/TCSVT.2025.3534908
Runmin Cong; Ning Yang; Hongyu Liu; Dingwen Zhang; Qingming Huang; Sam Kwong; Wei Zhang
Abstract: Co-salient object detection (CoSOD) aims to find the salient and recurring objects from a series of relevant images, where modeling inter-image relationships plays a crucial role. Different from the commonly used direct learning structure that inputs all the intra-image features into some well-designed modules to represent the inter-image relationship, we adopt a recursive structure for inter-image modeling and propose a two-tier recursion network (TRNet) to achieve CoSOD in this paper. The two-tier recursive structure of the proposed TRNet is embodied in the two stages of inter-image extraction and distribution. On the one hand, considering task adaptability and inter-image correlation, we design an inter-image exploration with recursive reinforcement module to learn local and global inter-image correspondences, guaranteeing the validity and discriminativeness of the information in the step-by-step propagation. On the other hand, we design a dynamic recursion distribution module to fully exploit the role of inter-image correspondences in a recursive structure, adaptively assigning common attributes to each individual image through an improved semi-dynamic convolution. Experimental results on five prevailing CoSOD benchmarks demonstrate that our TRNet outperforms other competitors in terms of various evaluation metrics. The code and results of our method are available at https://github.com/rmcong/TRNet_TCSVT2025.
Volume 35, Issue 6, pp. 5844-5857.
Citations: 0
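The two-tier recursion alternates between extracting inter-image information and distributing it back to each image. The sketch below is a deliberately generic consensus loop that only mirrors this extract-and-distribute pattern; it is not the paper's recursive reinforcement or semi-dynamic convolution modules, and the gating is an assumption for illustration:

```python
import torch

def recursive_consensus(feats, steps=3):
    """Generic two-step recursion for a group of related images:
    (1) extract a group-level consensus from all image features,
    (2) distribute it back to refine each image feature, then repeat.

    feats: (N, C) one pooled feature vector per image in the group
    """
    for _ in range(steps):
        consensus = feats.mean(dim=0, keepdim=True)      # inter-image extraction
        gate = torch.sigmoid(feats @ consensus.t())      # per-image affinity to the consensus
        feats = feats + gate * consensus                 # inter-image distribution
    return feats

group = torch.randn(5, 128)           # five relevant images
print(recursive_consensus(group).shape)   # torch.Size([5, 128])
```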
Facial Depression Estimation via Multi-Cue Contrastive Learning
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-24 | DOI: 10.1109/TCSVT.2025.3533543
Xinke Wang; Jingyuan Xu; Xiao Sun; Mingzheng Li; Bin Hu; Wei Qian; Dan Guo; Meng Wang
Abstract: Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Simply characterizing sample embeddings while ignoring the relations among multiple cues therefore leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework to mine the relations among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationship among multiple cues from four facial features (e.g., 3D landmarks, head poses, gazes, FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by regarding the embeddings of inter-cue and intra-cue pairs as positives while considering embeddings from other samples as negatives. In this way, the proposed MCCL framework leverages the relationships among facial features and the variations among samples to enhance multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on the public datasets DAIC-WOZ, CMDC, and E-DAIC demonstrate not only that our model outperforms advanced depression estimation methods but also that the discriminative representations of facial behaviors provide potential insights about depression. Our code is available at: https://github.com/xkwangcn/MCCL.git
Volume 35, Issue 6, pp. 6007-6020.
Citations: 0
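The contrastive component treats inter-cue and intra-cue embeddings of the same sample as positives and embeddings from other samples as negatives. A minimal InfoNCE-style sketch of that positive/negative construction (assumed shapes and temperature, not the authors' loss) is given below:

```python
import torch
import torch.nn.functional as F

def multi_cue_contrastive_loss(cue_embs, temperature=0.1):
    """InfoNCE-style loss: cue embeddings from the same sample are positives,
    embeddings from every other sample are negatives.

    cue_embs: (B, M, D) - B samples, M cues per sample (e.g. landmarks, pose, gaze, FAUs)
    """
    B, M, D = cue_embs.shape
    z = F.normalize(cue_embs.reshape(B * M, D), dim=-1)
    sim = z @ z.t() / temperature                                # (B*M, B*M) similarities
    sample_id = torch.arange(B).repeat_interleave(M)             # which sample each row came from
    self_mask = torch.eye(B * M, dtype=torch.bool)
    pos_mask = (sample_id[:, None] == sample_id[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))              # never contrast a cue with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1)
    return loss.mean()

embs = torch.randn(4, 4, 64, requires_grad=True)   # 4 videos x 4 facial cues each
print(multi_cue_contrastive_loss(embs).item())
```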
Underwater Image Quality Assessment Using Feature Disentanglement and Dynamic Content-Distortion Guidance
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-24 | DOI: 10.1109/TCSVT.2025.3533598
Junjie Zhu; Liquan Shen; Zhengyong Wang; Yihan Yu
Abstract: Due to the complex underwater imaging process, underwater images contain a variety of unique distortions. While existing underwater image quality assessment (UIQA) methods have made progress by highlighting these distortions, they overlook the fact that image content also affects how distortions are perceived, as different content exhibits varying sensitivities to different types of distortions. Both the characteristics of the content itself and the properties of the distortions determine the quality of underwater images. Additionally, the intertwined nature of content and distortion features in underwater images complicates the accurate extraction of both. In this paper, we address these issues by comprehensively accounting for both content and distortion information and explicitly disentangling underwater image features into content and distortion components. To achieve this, we introduce a dynamic content-distortion guiding and feature disentanglement network (DysenNet), composed of three main components: the feature disentanglement sub-network (FDN), the dynamic content guidance module (DCM), and the dynamic distortion guidance module (DDM). Specifically, the FDN disentangles underwater features into content and distortion elements, allowing us to more clearly measure their respective contributions to image quality. The DCM generates dynamic multi-scale convolutional kernels tailored to the unique content of each image, enabling content-adaptive feature extraction for quality perception. The DDM, on the other hand, addresses both global and local underwater distortions by identifying distortion cues from both channel and spatial perspectives, focusing on regions and channels with severe degradation. Extensive experiments on UIQA datasets demonstrate the state-of-the-art performance of the proposed method.
Volume 35, Issue 6, pp. 5602-5616.
Citations: 0
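The DCM is described as generating dynamic convolutional kernels from image content. As a rough, single-scale illustration of per-sample dynamic convolution (the multi-scale and distortion-guidance parts are omitted, and all names are hypothetical), one could write:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentDynamicConv(nn.Module):
    """Predict a depthwise kernel per sample from its global content vector,
    then apply it as a grouped convolution (one group per sample-channel pair)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Linear(channels, channels * k * k)

    def forward(self, x):
        B, C, H, W = x.shape
        content = x.mean(dim=(2, 3))                                   # (B, C) global content descriptor
        kernels = self.kernel_head(content).view(B * C, self.k * self.k)
        kernels = torch.softmax(kernels, dim=-1).view(B * C, 1, self.k, self.k)  # normalized per kernel
        out = F.conv2d(x.reshape(1, B * C, H, W), kernels,
                       padding=self.k // 2, groups=B * C)              # per-sample, per-channel filtering
        return out.view(B, C, H, W)

layer = ContentDynamicConv(16)
print(layer(torch.randn(2, 16, 32, 32)).shape)   # torch.Size([2, 16, 32, 32])
```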
Edge Guided Network With Motion Enhancement for Few-Shot Action Recognition
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-24 | DOI: 10.1109/TCSVT.2025.3533573
Kaiwen Du; Weirong Ye; Hanyu Guo; Yan Yan; Hanzi Wang
Abstract: Existing state-of-the-art methods for few-shot action recognition (FSAR) achieve promising performance through spatial and temporal modeling. However, most current methods ignore the importance of edge information and motion cues, leading to inferior performance. For the few-shot task, it is important to effectively explore limited data. Additionally, effectively utilizing edge information is beneficial for exploring motion cues, and vice versa. In this paper, we propose a novel edge guided network with motion enhancement (EGME) for FSAR. To the best of our knowledge, this is the first work to utilize edge information as guidance in the FSAR task. Our EGME contains two crucial components: an edge information extractor (EIE) and a motion enhancement module (ME). Specifically, the EIE is used to obtain edge information from video frames. Afterward, the edge information is used as guidance to fuse with the frame features. In addition, the ME can adaptively capture motion-sensitive features of videos. It adopts a self-gating mechanism to highlight motion-sensitive regions in videos from a large temporal receptive field. Based on the above components, EGME can capture edge information and motion cues, resulting in superior recognition performance. Experimental results on four challenging benchmarks show that EGME performs favorably against recent advanced methods.
Volume 35, Issue 6, pp. 5331-5342.
Citations: 0
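To ground the two components, the sketch below pairs a plain Sobel edge extractor with a temporal-difference self-gate that amplifies motion-sensitive regions; both are generic stand-ins for the EIE and ME modules, under assumed tensor shapes:

```python
import torch
import torch.nn.functional as F

def sobel_edges(frames):
    """frames: (B, T, 1, H, W) grayscale clip -> per-frame edge magnitude."""
    gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    gy = gx.transpose(2, 3)
    B, T, C, H, W = frames.shape
    x = frames.reshape(B * T, C, H, W)
    ex = F.conv2d(x, gx, padding=1)
    ey = F.conv2d(x, gy, padding=1)
    return torch.sqrt(ex ** 2 + ey ** 2 + 1e-8).reshape(B, T, 1, H, W)

def motion_self_gate(feats):
    """Self-gating on temporal differences: regions that change across frames
    are amplified, static background is left largely untouched.

    feats: (B, T, C, H, W) frame features
    """
    diff = feats[:, 1:] - feats[:, :-1]                          # frame-to-frame change
    gate = torch.sigmoid(diff.abs().mean(dim=2, keepdim=True))   # motion-sensitivity map
    return feats[:, 1:] + feats[:, 1:] * gate                    # residual enhancement

edges = sobel_edges(torch.rand(2, 8, 1, 64, 64))
enhanced = motion_self_gate(torch.rand(2, 8, 16, 64, 64))
print(edges.shape, enhanced.shape)   # (2, 8, 1, 64, 64) and (2, 7, 16, 64, 64)
```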
Semantic Concept Perception Network With Interactive Prompting for Cross-View Image Geo-Localization
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-24 | DOI: 10.1109/TCSVT.2025.3533574
Yuan Gao; Haibo Liu; Xiaohui Wei
Abstract: Cross-view image geo-localization aims to estimate the geographic position of a query image from the ground platform (such as a mobile phone or vehicle camera) by matching it with geo-tagged reference images from the aerial platform (such as a drone or satellite). Although existing studies have achieved promising results, they usually rely only on depth features and fail to effectively handle the serious changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core is to extract and integrate semantic concept information reflecting the spatial position relationships between objects. Specifically, for a given pair of input images, a CNN stem with positional embedding is first adopted to extract depth features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby extracting semantic concept information. Furthermore, to obtain global descriptions of different views, a feature bidirectional injection fusion module based on an attention mechanism is proposed to exploit the long-range dependencies of semantic concept and depth features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results show that our proposed method achieves better performance than state-of-the-art methods on mainstream cross-view datasets.
Volume 35, Issue 6, pp. 5343-5354.
Citations: 0
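The optimization relies on a triplet loss with hard sample mining. The sketch below implements the standard batch-hard variant of that loss; the paper's "flexible" mining strategy is not specified here, so this is only the common baseline form under assumed inputs:

```python
import torch

def batch_hard_triplet_loss(embs, labels, margin=0.3):
    """Standard batch-hard triplet loss: for every anchor, use its hardest
    (farthest) positive and hardest (closest) negative within the batch.

    embs:   (N, D) embeddings of ground and aerial views in one batch
    labels: (N,)   geo-location id of each embedding
    """
    dist = torch.cdist(embs, embs)                               # pairwise L2 distances
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~torch.eye(len(embs), dtype=torch.bool)
    hardest_pos = (dist * pos_mask).max(dim=1).values            # farthest same-location sample
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

embs = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])                  # ground/aerial pair per location
print(batch_hard_triplet_loss(embs, labels).item())
```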
Conditional Dual Diffusion for Multimodal Clustering of Optical and SAR Images
IF 8.3 | CAS Region 1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2025-01-23 | DOI: 10.1109/TCSVT.2025.3533301
Shujun Liu; Ling Chang
Abstract: Owing to the different wavelengths of their imaging mechanisms, optical images usually embed higher low-dimensional manifolds into ambient spaces than SAR images do. How to utilize their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) having a unique solution. Different from vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder to predict noise and clean images simultaneously, preserving the data manifolds embedded in latent space. To fuse the manifolds of optical and SAR images, we train the model to generate optical images conditioned on SAR images, mapping them into a unified latent space. The learned features extracted from the model are fed to the K-means algorithm to produce the resulting clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show the superiority of our method over state-of-the-art (SOTA) methods overall in terms of clustering performance and time consumption. The source code is available at https://github.com/suldier/CDD.
Volume 35, Issue 6, pp. 5318-5330.
Citations: 0
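The equivalence to a probability-flow ODE means sampling can be done deterministically by integrating an ODE whose drift involves the score function. The toy sketch below Euler-integrates that ODE for a 1-D Gaussian data distribution whose score is known in closed form; the Gaussian score is only a stand-in for the trained CDD network, and the schedule parameters are assumptions:

```python
import numpy as np

def probability_flow_ode_sample(mu=2.0, sigma=0.5, n=5000, steps=1000,
                                beta0=0.1, beta1=20.0, seed=0):
    """Euler integration of the probability-flow ODE of a VP diffusion,
    dx/dt = -0.5*beta(t)*x - 0.5*beta(t)*score(x, t),
    using the closed-form score of a 1-D Gaussian data distribution N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                       # start from the prior at t = 1
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        beta = beta0 + t * (beta1 - beta0)
        alpha = np.exp(-0.5 * (beta0 * t + 0.5 * (beta1 - beta0) * t ** 2))
        var = sigma ** 2 * alpha ** 2 + (1.0 - alpha ** 2)   # marginal variance at time t
        score = -(x - mu * alpha) / var                       # exact score of the Gaussian marginal
        drift = -0.5 * beta * x - 0.5 * beta * score          # f(x,t) - 0.5*g(t)^2*score
        x = x - drift * dt                                    # Euler step backwards in time
    return x

samples = probability_flow_ode_sample()
print(samples.mean(), samples.std())   # should approach mu = 2.0 and sigma = 0.5
```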