{"title":"Mgs-Stereo: Multi-scale Geometric-Structure-Enhanced Stereo Matching for Complex Real-World Scenes.","authors":"Zhien Dai,Zhaohui Tang,Hu Zhang,Yongfang Xie","doi":"10.1109/tip.2025.3612754","DOIUrl":"https://doi.org/10.1109/tip.2025.3612754","url":null,"abstract":"Complex imaging environments and conditions in real-world scenes pose significant challenges for stereo matching tasks. Models are susceptible to underperformance in non-Lambertian surfaces, weakly textured regions, and occluded regions, due to the difficulty in establishing accurate matching relationships between pixels. To alleviate these problems, we propose a multi-scale geometrically enhanced stereo matching model that exploits the geometric structural relationships of the objects in the scene to mitigate these problems. Firstly, a geometric structure perception module is designed to extract geometric information from the reference view. Secondly, a geometric structure-adaptive embedding module is proposed to integrate geometric information with matching similarity information. This module integrates multi-source features dynamically to predict disparity residuals in different regions. Third, a geometric-based normalized disparity correction module is proposed to improve matching robustness for pathological regions in realistic complex scenes. Extensive evaluations on popular benchmarks demonstrate that our method achieves competitive performance against leading approaches. Notably, our model provides robust and accurate predictions in challenging regions containing edges, occlusions, reflective, and non-Lambertian surfaces. Our source code will be publicly available.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"42 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145153460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduced Biquaternion Dual-Branch Deraining U-Network via Multi-Attention Mechanism.","authors":"Shan Gai,Yihao Ni","doi":"10.1109/tip.2025.3612841","DOIUrl":"https://doi.org/10.1109/tip.2025.3612841","url":null,"abstract":"As a prerequisite for many vision-oriented tasks, image deraining is an effective solution to alleviate performance degradation of these tasks on rainy days. In recent years, the introduction of deep learning has obtained the significant developments in deraining techniques. However, due to the inherent constraints of synthetic datasets and the insufficient robustness of network architecture designs, most existing methods are difficult to fit varied rain patterns and adapt to the transition from synthetic rainy images to real ones, ultimately resulting in unsatisfactory restoration outcomes. To address these issues, we propose a reduced biquaternion dual-branch deraining U-Network (RQ-D2UNet) for better deraining performance, which is the first attempt to apply the reduced biquaternion-valued neural network in the deraining task. The algebraic properties of reduced biquaternion (RQ) can facilitate modeling the rainy artifacts more accurately while preserving the underlying spatial structure of the background image. The comprehensive design scheme of U-shaped architecture and dual-branch structure can extract multi-scale contextual information and fully explore the mixed correlation between rain and rain-free features. Moreover, we also extend the self-attention and convolutional attention mechanisms in the RQ domain, which allow the proposed model to balance both global dependency capture and local feature extraction. Extensive experimental results on various rainy datasets (i.e., rain streak/rain-haze/raindrop/real rain), downstream vision applications (i.e., object detection and segmentation), and similar image restoration tasks (i.e., image desnowing and low-light image enhancement) demonstrate the superiority and versatility of our proposed method.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"52 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145153461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hyperbolic Self-Paced Multi-Expert Network for Cross-Domain Few-Shot Facial Expression Recognition.","authors":"Xueting Chen,Yan Yan,Jing-Hao Xue,Chang Shu,Hanzi Wang","doi":"10.1109/tip.2025.3612281","DOIUrl":"https://doi.org/10.1109/tip.2025.3612281","url":null,"abstract":"Recently, cross-domain few-shot facial expression recognition (CF-FER), which identifies novel compound expressions with a few images in the target domain by using the model trained only on basic expressions in the source domain, has attracted increasing attention. Generally, existing CF-FER methods leverage the multi-dataset to increase the diversity of the source domain and alleviate the discrepancy between the source and target domains. However, these methods learn feature embeddings in the Euclidean space without considering imbalanced expression categories and imbalanced sample difficulty in the multi-dataset. Such a way makes the model difficult to capture hierarchical relationships of facial expressions, resulting in inferior transferable representations. To address these issues, we propose a hyperbolic self-paced multi-expert network (HSM-Net), which contains multiple mixture-of-experts (MoE) layers located in the hyperbolic space, for CF-FER. Specifically, HSM-Net collaboratively trains multiple experts in a self-distillation manner, where each expert focuses on learning a subset of expression categories from the multi-dataset. Based on this, we introduce a hyperbolic self-paced learning (HSL) strategy that exploits sample difficulty to adaptively train the model from easy-to-hard samples, greatly reducing the influence of imbalanced expression categories and imbalanced sample difficulty. Our HSM-Net can effectively model rich hierarchical relationships of facial expressions and obtain a highly transferable feature space. Extensive experiments on both in-the-lab and in-the-wild compound expression datasets demonstrate the superiority of our proposed method over several state-of-the-art methods. Code will be released at https://github.com/cxtjl/HSM-Net.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"42 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145140286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VisionHub: Learning Task-Plugins for Efficient Universal Vision Model.","authors":"Haolin Wang,Yixuan Zhu,Wenliang Zhao,Jie Zhou,Jiwen Lu","doi":"10.1109/tip.2025.3611645","DOIUrl":"https://doi.org/10.1109/tip.2025.3611645","url":null,"abstract":"Building on the success of universal language models in natural language processing (NLP), researchers have recently sought to develop methods capable of tackling a broad spectrum of visual tasks within a unified foundation framework. However, existing universal vision models face significant challenges when adapting to the rapidly expanding scope of downstream tasks. These challenges stem not only from the prohibitive computational and storage expenses associated with training such models but also from the complexity of their workflows, which makes efficient adaptations difficult. Moreover, these models often fail to deliver the required performance and versatility for a broad spectrum of applications, largely due to their incomplete visual generation and perception capabilities, limiting their generalizability and effectiveness in diverse settings. In this paper, we present VisionHub, a novel universal vision model designed to concurrently manage multiple visual restoration and perception tasks, while offering streamlined transferability to downstream tasks. Our model leverages the frozen denoising U-Net architecture from Stable Diffusion as the backbone, fully exploiting its inherent potential for both visual restoration and perception. To further enhance the model's flexibility, we propose the incorporation of lightweight task-plugins and the task router, which are seamlessly integrated onto the U-Net backbone. This architecture enables VisionHub to efficiently handle various vision tasks according to user-provided natural language instructions, all while maintaining minimal storage costs and operational overhead. Extensive experiments across 11 different vision tasks showcase both the efficiency and effectiveness of our approach. Remarkably, VisionHub achieves competitive performance across a variety of benchmarks, including 53.3% mIoU on ADE20K semantic segmentation, 0.253 RMSE on NYUv2 depth estimation, and 74.2 AP on MS-COCO pose estimation.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"91 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145140450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistent Assistant Domains Transformer for Source-free Domain Adaptation.","authors":"Renrong Shao,Wei Zhang,Kangyang Luo,Qin Li,Jun Wang","doi":"10.1109/tip.2025.3611799","DOIUrl":"https://doi.org/10.1109/tip.2025.3611799","url":null,"abstract":"Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel max mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches. Code is available at https://github.com/RoryShao/CADTrans.git.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"93 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145140452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Solution","authors":"Zhangyong Tang, Tianyang Xu, Xiao-Jun Wu, Xuefeng Zhu, Chunyang Cheng, Zhenhua Feng, Josef Kittler","doi":"10.1109/tip.2025.3611687","DOIUrl":"https://doi.org/10.1109/tip.2025.3611687","url":null,"abstract":"","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"31 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Sparse-to-Dense Inbetweening for Multi-View Light Fields.","authors":"Yifan Mao,Zeyu Xiao,Ping An,Deyang Liu,Caifeng Shan","doi":"10.1109/tip.2025.3612257","DOIUrl":"https://doi.org/10.1109/tip.2025.3612257","url":null,"abstract":"Light field (LF) imaging, which captures both intensity and directional information of light rays, extends the capabilities of traditional imaging techniques. In this paper, we introduce a task in the field of LF imaging, sparse-to-dense inbetweening, which focuses on generating dense novel views from sparse multi-view LFs. By synthesizing intermediate views from sparse inputs, this task enhances LF view synthesis through filling in interperspective gaps within an expanded field of view and increasing data robustness by leveraging complementary information between light rays from different perspectives, which are limited by non-robust single-view synthesis and the inability to handle sparse inputs effectively. To address these challenges, we construct a high-quality multi-view LF dataset, consisting of 60 indoor scenes and 59 outdoor scenes. Building upon this dataset, we propose a baseline method. Specifically, we introduce an adaptive alignment module to dynamically align information by capturing relative displacements. Next, we explore angular consistency and hierarchical information using a multi-level feature decoupling module. Finally, a multi-level feature refinement module is applied to enhance features and facilitate reconstruction. Additionally, we introduce a universally applicable artifact-aware loss function to effectively suppress visual artifacts. Experimental results demonstrate that our method outperforms existing approaches, establishing a benchmark for sparse-to-dense inbetweening. The code is available at https://github.com/Starmao1/MutiLF.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"41 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145140288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}