{"title":"Exponential Dissimilarity-Dispersion Family for Domain-Specific Representation Learning","authors":"Ren Togo;Nao Nakagawa;Takahiro Ogawa;Miki Haseyama","doi":"10.1109/TIP.2025.3608661","DOIUrl":"10.1109/TIP.2025.3608661","url":null,"abstract":"This paper presents a new domain-specific representation learning method, exponential dissimilarity-dispersion family (EDDF), a novel distribution family that includes a dissimilarity function and a global dispersion parameter. In generative models, variational autoencoders (VAEs) has a solid theoretical foundation based on variational inference in visual representation learning and are also used as one of core components of other generative models. This paper addresses the issue where conventional VAEs, with the commonly adopted Gaussian settings, tend to experience performance degradation in generative modeling for high-dimensional data. This degradation is often caused by their excessively limited model family. To tackle this problem, we propose EDDF, a new domain-specific method introducing a novel distribution family with a dissimilarity function and a global dispersion parameter. A decoder using this family employs dissimilarity functions for the evidence lower bound (ELBO) reconstruction loss, leveraging domain-specific knowledge to enhance high-dimensional data modeling. We also propose an ELBO optimization method for VAEs with EDDF decoders that implicitly approximates the stochastic gradient of the normalizing constant using log-expected dissimilarity. Empirical evaluations of the generative performance show the effectiveness of our model family and proposed method. Our framework can be integrated into any VAE-based generative models in representation learning. The code and model are available at <uri>https://github.com/ganmodokix/eddf-vae</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6110-6125"},"PeriodicalIF":13.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11175279","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145116233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Volume Fusion-Based Self-Supervised Pretraining for 3D Medical Image Segmentation","authors":"Guotai Wang;Jia Fu;Jianghao Wu;Xiangde Luo;Yubo Zhou;Xinglong Liu;Kang Li;Jingsheng Lin;Baiyong Shen;Shaoting Zhang","doi":"10.1109/TIP.2025.3610249","DOIUrl":"10.1109/TIP.2025.3610249","url":null,"abstract":"The performance of deep learning models for medical image segmentation is often limited in scenarios where training data or annotations are limited. Self-Supervised Learning (SSL) is an appealing solution for this dilemma due to its feature learning ability from a large amount of unannotated images. Existing SSL methods have focused on pretraining either an encoder for global feature representation or an encoder-decoder structure for image restoration, where the gap between pretext and downstream tasks limits the usefulness of pretrained decoders in downstream segmentation. In this work, we propose a novel SSL strategy named Volume Fusion (VolF) for pretraining 3D segmentation models. It minimizes the gap between pretext and downstream tasks by introducing a pseudo-segmentation pretext task, where two sub-volumes are fused by a discretized block-wise fusion coefficient map. The model takes the fused result as input and predicts the category of fusion coefficient for each voxel, which can be trained with standard supervised segmentation loss functions without manual annotations. Experiments with an abdominal CT dataset for pretraining and both in-domain and out-domain downstream datasets showed that VolF led to large performance gain from training from scratch with faster convergence speed, and outperformed several state-of-the-art SSL methods. In addition, it is general to different network structures, and the learned features have high generalizability to different body parts and modalities.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6041-6052"},"PeriodicalIF":13.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145116208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial Clustering Guided Two-View Multi-Structural Deterministic Geometric Model Fitting","authors":"Guobao Xiao","doi":"10.1109/TIP.2025.3610248","DOIUrl":"10.1109/TIP.2025.3610248","url":null,"abstract":"This paper addresses the two-view geometric model fitting problem on the multi-structural data with severe outliers for providing reliable and consistent fitting results. The key idea is to adopt spatial clustering to guide deterministically sample minimum subsets. Specifically, we firstly improve the effectiveness of spatial clustering with good neighbors that preserve the consensus of neighborhood elements and neighborhood topology, for enhancing the quality of sampled minimum subsets. Then we further design a multi-scale fusion strategy, which not only boosts more high-quality minimum subsets, but also enables our method to cover all model instances in data. Moreover, we propose a simple and effective model selection algorithm to estimate the parameters of model instances in data. The final proposed method is able to guarantee fast, accurate and stable model fitting results for the multi-structural data. In addition, we construct two large labeled datasets, for homography and fundamental matrix estimation, respectively. Experimental results on real images from six datasets show the significant superiority of the proposed method on both accuracy and speed over several state-of-the-art alternatives. Especially for the MS-COCO-F and YFCC100M-F datasets, the proposed method yields a performance boost of over three times on segmentation error, parameter error and the CPU time.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6016-6028"},"PeriodicalIF":13.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145116210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COLA: Context-Aware Language-Driven Test-Time Adaptation","authors":"Aiming Zhang;Tianyuan Yu;Liang Bai;Jun Tang;Yanming Guo;Yirun Ruan;Yun Zhou;Zhihe Lu","doi":"10.1109/TIP.2025.3607634","DOIUrl":"10.1109/TIP.2025.3607634","url":null,"abstract":"Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing “distribution shift” issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source domain model and target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), e.g., CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method – Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at <uri>https://github.com/NUDT-Bai-Group/COLA-TTA</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6002-6015"},"PeriodicalIF":13.7,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145089105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-View Alignment Learning With Hierarchical-Prompt for Class-Imbalance Multi-Label Image Classification","authors":"Sheng Huang;Jiexuan Yan;Beiyan Liu;Bo Liu;Richang Hong","doi":"10.1109/TIP.2025.3609185","DOIUrl":"10.1109/TIP.2025.3609185","url":null,"abstract":"Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5989-6001"},"PeriodicalIF":13.7,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145083515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HUNTNet: Homomorphic Unified Nexus Topology for Camouflaged Object Detection","authors":"Haolin Ji;Fengying Xie;Linpeng Pan;Yushan Zheng;Zhenwei Shi","doi":"10.1109/TIP.2025.3607635","DOIUrl":"10.1109/TIP.2025.3607635","url":null,"abstract":"Camouflaged object detection (COD) is challenging for both human and computer vision, as targets often blend into the background by sharing similar color, texture, or shape. While many feature enhancement techniques exist, single-view methods tend to overemphasize certain Recognizing that camouflaged objects exhibit different concealment strategies under varying observational perspectives, we propose HUNTNet, a network that establishes a dynamic detection mechanism to decouple target features from RGB images and perform topological decamouflage across multiple homomorphic feature spaces through a unified feature focusing architecture. We adopt PVTv2 as the backbone to extract multi-perspective spatial features. Detail representation is enhanced via a feature module that integrates Dual-Channel Recursive (DCR), Wavelet-Gabor Transform (WGT), and Anisotropic Gradient Responding (AGR), which together improve boundary discrimination and edge contour detection. To further boost performance, the Simplicial Feature Integration (SFI) module recursively fuses multi-layer features, enabling high-resolution focus on target regions. Experiments show that HUNTNet surpasses state-of-the-art methods in both accuracy and generalization, offering a robust solution for COD and improving segmentation in complex scenes. Our code is available at <uri>https://github.com/HaolinJi817/HUNTNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6068-6082"},"PeriodicalIF":13.7,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145083516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Color Constancy via Efficient Spectral Feature Extraction","authors":"Dong-Keun Han;Dong-Hoon Kang;Jong-Ok Kim","doi":"10.1109/TIP.2025.3607631","DOIUrl":"10.1109/TIP.2025.3607631","url":null,"abstract":"This paper presents an empirical investigation into illuminant estimation using multi-spectral images. Our study emphasizes two key contributions: (1) the utilization of the estimated multi-spectral images and (2) the incorporation of a hierarchical structure. Firstly, exploiting multi-spectral images proves to have a positive influence on illuminant estimation, particularly in scenarios characterized by monochromatic images where conventional color constancy methods face challenges. Our experimental results vividly illustrate the effectiveness of leveraging spectral information in enhancing illuminant estimation. Secondly, the adoption of a hierarchical structure stems from the need for spatial invariance in the task of estimating a global illuminant. To further enhance the performance of the hierarchical structure, we employ a contrastive loss applied to different scaled outputs. This approach demonstrates remarkable effectiveness on our custom dataset, showcasing superior performance compared to the existing methods. In addition, we extend the evaluation to the widely recognized NUS-8 dataset, where the proposed method showcases a notable 26.7% relative improvement over the previous state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6029-6040"},"PeriodicalIF":13.7,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145083401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptually-Guided VR Style Transfer","authors":"Seonghwa Choi;Jungwoo Huh;Sanghoon Lee;Alan Conrad Bovik","doi":"10.1109/TIP.2025.3607611","DOIUrl":"10.1109/TIP.2025.3607611","url":null,"abstract":"Virtual reality (VR) makes it possible to provide immersive multimedia content composed of omnidirectional videos (ODVs). Towards enabling more immersive and satisfying VR content, methods are needed to manipulate VR scenes, taking into account perceptual factors related to viewers’ quality of experience (QoE). For example, style transfer methods can be applied to VR content, allowing users to create artistic or surreal effects in their immersive environments. Here, we study perceptual factors that affect the sensation of stylized immersiveness, including color dynamics and spatio-temporal consistency. To do this, we introduce an immersiveness sensitivity model of luminance and color perception, and use it to measure the color dynamics and spatio-temporal consistency of stylized VR contents. We subsequently use this model to construct a perceptually-guided VR style transfer model called VR Style Transfer GAN (VRST-GAN). VRST-GAN learns to transfer a desired style into VR to enhance immersiveness by considering color dynamics while preserving spatio-temporal consistency. We demonstrate the effectiveness of VRST-GAN via qualitative and quantitative experiments. We also develop a VR Immersiveness Predictor (VR-IP) that is able to predict the sensation of immersiveness using the perceptual model. In our experiments, VR-IP predicts immersiveness with an accuracy of 91%.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6083-6097"},"PeriodicalIF":13.7,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145072842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Anchor-Guided Representation Learning for Efficient Multi-View Subspace Clustering","authors":"Mengjiao Zhang;Xinwang Liu;Tianhao Han;Xiaofeng Qu;Sijie Niu","doi":"10.1109/TIP.2025.3607587","DOIUrl":"10.1109/TIP.2025.3607587","url":null,"abstract":"Multi-view Subspace Clustering (MVSC) effectively aggregating multiple data sources to promise clustering performance. Recently, various anchor-based variants have been introduced to effectively alleviate the computation complexity of MVSC. Although satisfactory advancement has been achieved, existing methods either independently learn anchor matrices and their anchor representations or learn a consensus anchor matrix and unified anchor representation, failing to capture both consistency and complementary information simultaneously. In addition, the time complexity of obtaining clustering results by applying Singular Value Decomposition (SVD) on the anchor representation matrix remains high. To tackle the above problems, we propose an Adaptive Anchor-guided Representation Learning for Efficient Multi-view Subspace Clustering (A2RL-EMVSC) framework, which integrates consensus anchors learning, anchor-guided representation learning and matrix factorization to enhance clustering performance and scalability. Technically, the proposed method learns view-specific anchor representation matrices by consensus anchors guidance, which simultaneously exploit consistency and complementary information. Moreover, by applying matrix decomposition to the view-specific anchor representation matrices, clustering results can be achieved with linear time complexity. Extensive experiments on ten challenging multi-view datasets show that the proposed method can improve the effectiveness and superiority of clustering compared with state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6053-6067"},"PeriodicalIF":13.7,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cycle Translation-Based Collaborative Training for Hyperspectral-RGB Multimodal Change Detection","authors":"Wenqian Dong;Junying Ren;Song Xiao;Leyuan Fang;Jiahui Qu;Yunsong Li","doi":"10.1109/TIP.2025.3607609","DOIUrl":"10.1109/TIP.2025.3607609","url":null,"abstract":"Hyperspectral image change detection (HSI-CD) benefits from HSIs with continuous spectral bands, which uniquely enables the analysis of more subtle changes. Existing methods have achieved desirable performance relying on multi-temporal homogenous HSIs over the same region, which is generally difficult to obtain in real scenes. HSI-RGB multimodal CD overcomes the constraint of limited HSI availability by incorporating another temporal RGB data, and the combination of advantages within different modalities enhances the robustness of detection results. Nevertheless, due to the different imaging mechanisms between two modalities, existing HSI CD methods cannot be directly applied. In this paper, we propose a cycle translation-based collaborative training (co-training) for HSI-RGB multimodal CD, which achieves cross-modal mutual guidance to collaboratively learn complementary difference information from diverse modalities for identifying changes. Specifically, a cross-modal guided CycleGAN-based image translation module is designed to implement bi-directional image translation, which mitigates modal difference and enables the extraction of information related to land cover changes. Then, a spatial-spectral interactive co-training CD module is proposed to achieve iterative interaction between cross-modal information, which jointly extracts the multimodal difference features to generate the final results. The proposed method outperforms several leading CD methods in extensive experiments carried out on both real and synthetic datasets. In addition, a new public HSI-RGB multimodal dataset along with our code are available at <uri>https://github.com/Jiahuiqu/CT2Net</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"6347-6360"},"PeriodicalIF":13.7,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}