Adaptive 3D Convolution for Remote Sensing Image Fusion
Siran Peng, Xiangyu Zhu, Shang-Qi Deng, Liang-Jian Deng, Zhen Lei
IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3689418, published 2026-05-08

Abstract: Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves state-of-the-art (SOTA) performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

Towards Robust Alignment for Video Dehazing with Temporal Lookup Table
Haoyou Deng, Zhiqiang Li, Feng Zhang, Bin Xu, Qingbo Lu, Changxin Gao, Nong Sang
IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3689423, published 2026-05-08

Abstract: Video dehazing aims to restore clean scenarios from a sequence of hazy frames, where frame alignment is a critical stage for leveraging temporal information. However, haze degrades contrast and obscures details, making alignment challenging. Existing methods ignore the impairment of haze on alignment and thus struggle to align frames accurately. To address this challenge, we propose an alignment network with the temporal lookup table (temporal-LUT), which effectively enhances the haze-degraded frames and provides vivid cues for precise alignment. Specifically, to tackle the color degradation of haze, we employ a learnable lookup table (LUT) to enhance hazy color. The color mapping nature of LUT favorably preserves the naturalness of enhanced outcomes. Besides, we introduce a temporal weight prediction strategy to strengthen inter-frame interaction, which ensures temporal consistency across enhanced results and thereby benefits alignment. Extensive experimental results on two widely used benchmarks and real-world scenes demonstrate the superiority of our method.

OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu
IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3687468, published 2026-05-07

Abstract: Understanding the evolution of 3D scenes is crucial for autonomous driving. While conventional methods describe scene development through individual instance motions, world models provide a generative framework for modeling overall scene dynamics. However, most existing approaches rely on autoregressive next-token prediction, which suffers from error accumulation and limited global spatiotemporal reasoning, leading to degraded long-term consistency. To address these issues, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate 3D world evolution for autonomous driving. A 4D scene tokenizer is introduced to obtain compact spatiotemporal representations and enable high-quality reconstruction of long occupancy sequences. We then train a diffusion transformer on these representations to generate 4D occupancy conditioned on trajectory prompts. Experiments on the nuScenes dataset with Occ3D annotations show that OccSora can generate 16-second videos with authentic 3D layout and strong temporal consistency. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for autonomous driving decision-making.
{"title":"SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition.","authors":"Xu Hu, Yuxi Wang, Lue Fan, Chuanchen Luo, Junsong Fan, Zhen Lei, Qing Li, Junran Peng, Zhaoxiang Zhang","doi":"10.1109/TIP.2026.3689408","DOIUrl":"https://doi.org/10.1109/TIP.2026.3689408","url":null,"abstract":"<p><p>3D Gaussian Splatting has emerged as an alternative 3D representation for novel view synthesis, benefiting from its high-quality rendering results and real-time rendering speed. However, the 3D Gaussians learned by 3D-GS have ambiguous structures without any geometry constraints. This inherent issue in 3D-GS leads to a rough boundary when segmenting individual objects. To remedy these problems, we propose SAGD, a conceptually simple yet effective boundary-enhanced segmentation pipeline for 3D-GS to improve segmentation accuracy while preserving segmentation speed. Specifically, we introduce a Gaussian Decomposition scheme, which ingeniously utilizes the special structure of 3D Gaussians, finds out, and then decomposes the boundary Gaussians. Moreover, to achieve fast interactive 3D segmentation, we introduce a novel training-free pipeline by lifting a 2D foundation model to 3D-GS. Extensive experiments demonstrate that our approach achieves high-quality 3D segmentation without rough boundary issues, which can be easily applied to other scene editing tasks. Our code is publicly available at https://github.com/XuHu0529/SAGS.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prompting Rain Off: Evolving Compact Dual Prompts for Continual De-Raining.","authors":"Minghao Liu, Wenhan Yang, Jiaying Liu","doi":"10.1109/TIP.2026.3689428","DOIUrl":"https://doi.org/10.1109/TIP.2026.3689428","url":null,"abstract":"<p><p>In recent years, there has been notable progress in single-image rain removal, particularly focusing on static data distributions in these approaches. When dealing with data that constantly changes, the challenge of catastrophic forgetting arises, which is quite common and critical in real-world scenarios. To address this, we propose Evolving COmpact Dual Prompt Learning (EcoDPL), an efficient rehearsal-free continual learning deraining framework designed specifically for low-level vision tasks. Specifically, we design two prompt pools at both image and feature levels and insert these prompts into images and embedding tokens, for better knowledge transfer across tasks. Our adaptive weight generation module, P-Fuser, attaches an attention map to each prompt, to adaptively pay attention to different inputs, and get different weights to fuse prompts, making the inserted prompts more flexible with various inputs. Also, we introduce Grad-Tuner, a dictionary learning strategy, to compress knowledge into fewer prompts. This makes the knowledge more compact and provides more space for new prompts to learn new tasks. Our method stands out by leveraging small, learnable prompts for efficient knowledge retention across tasks, not increasing training time or parameters. Furthermore, we present an augmented method that upgrades the distance function γ from simple cosine distance to a more advanced weight generation network. We also employ a fine-tuned dictionary learning technique, compressing knowledge into a more compact form, and enhancing the ability of prompts to learn new tasks. With our new designs, the model becomes more flexible with various inputs and it compresses knowledge into fewer prompts to free up spaces to learn new tasks. Through extensive experiments on various rain removal datasets, our EcoDPL method consistently outperforms previous continual learning techniques. Notably, although EcoDPL is designed for continual learning with changing data, it also performs well with stationary data, proving its robustness and versatility.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Hierarchical Causal Learning for Face Age Synthesis
Ye Wang, Pan Sun, Xuyang Zhou, Lifeng Shen, Jiaxu Leng, Guoyin Wang, Hong Yu
IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3689413, published 2026-05-07

Abstract: Face age synthesis (FAS) predicts a person's future or past facial appearance. In FAS, modifying one facial attribute usually affects the generation of other attributes during face image generation. Current models directly learn entangled representations of age-related features, resulting in insufficient feature disentanglement, which consequently impairs their causal reasoning capability for FAS tasks. To this end, we propose a hierarchical causal learning model for face age synthesis (HCFace), which integrates hierarchical structures and causal relationships into the facial generative model. Specifically, we propose to leverage hierarchical causal relationships to align with facial features for feature disentanglement. Furthermore, we design a novel nonlinear mapping function that captures the true patterns of facial attribute changes with age, enhancing the disentanglement of these attributes. We conduct extensive experiments to validate the superiority of our proposed model. Compared to other advanced baseline methods, HCFace improves overall accuracy by 2.47%, with improvements of 9.75% and 9.69% in certain age-related attributes, such as skin and hair. Our source code is available at https://github.com/SE-hash/HCFace.
{"title":"SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation.","authors":"Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, Guo-Jun Qi","doi":"10.1109/TIP.2026.3688157","DOIUrl":"https://doi.org/10.1109/TIP.2026.3688157","url":null,"abstract":"<p><p>The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT back-bones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for Keys and Values. The CLB also incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers, thus enhancing feature interaction at different scales and improving overall efficiency. To further optimize computational efficiency, SCASeg compresses the channels of queries and keys into one dimension, creating strip-like patterns that reduce memory usage and increase inference speed compared to traditional vanilla cross-attention. Experiments show that SCASeg's adaptable decoder delivers competitive performance across various setups, outperforming leading segmentation architectures on benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under diverse computational constraints.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Context-Infused Trajectories: Enhancing Context and Frame Consistency in Reasoning Video Object Segmentation
Yunzhi Zhuge, Sitong Gong, Lu Zhang, Qi Xu, Wenda Zhao, Jin Zhan, Huchuan Lu
IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3689427, published 2026-05-06

Abstract: Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting reasoning capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide the generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in J&F scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.
{"title":"ADANet: Adversarial Distribution Alignment Network for Multi-view Semi-supervised Classification.","authors":"Sujia Huang, Lele Fu, Zhaoliang Chen, Tong Zhang, Xiaoli Li, Zhen Cui","doi":"10.1109/TIP.2026.3689430","DOIUrl":"https://doi.org/10.1109/TIP.2026.3689430","url":null,"abstract":"<p><p>Multi-view learning aims to integrate multi-source information for a comprehensive data representation, which has gained widespread attention in image processing. Each view contains view-specific noise and joint features associated with other views, and thus exploring the specificity and consistency among views is a typical solution to deal with multi-view data for learning discriminative representations. In this paper, we present a theory-induced model, termed Adversarial Distribution Alignment Network (ADANet), which learns view-invariant features and alleviate the negative impact of view-specific noise. We first demonstrate the necessity of suppressing view-specific noise and capturing view-invariant features inspired by the theory of view generalization, and then derive two collaborative modules: a feature disentangler and an adversarial alignment module. In detail, the feature disentanglement separates view-specific noise and view-invariant features by minimizing the mutual information between them. Following this, a negative entropy is proposed to suppress the negative impact of view-specific noise. Meanwhile, the adversarial module uses the adversarial technique that can fit more complex data conformed to different distributions to adaptively align cross-view features so that features encoded in different views converge. Substantial experiments are constructed on multi-view datasets, demonstrating that ADANet can achieve more promising performance compared to other superior methods. Code is available at https://github.com/huangsuj/ADANet.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open Set Domain Adaptation via Target-relaxed Optimal Transport.","authors":"Chuan-Xian Ren, Zi-Xian Huang, Hong Yan","doi":"10.1109/TIP.2026.3689416","DOIUrl":"https://doi.org/10.1109/TIP.2026.3689416","url":null,"abstract":"<p><p>Open set domain adaptation (OSDA) aims to transfer classification-oriented knowledge from a labeled source domain to an unlabeled target domain, which faces the challenges from unseen knowledge in open-set scenarios, i.e., unknown classes privileged to the target domain. Existing methods usually identify unknown classes from classifier prediction directly, which are sensitive to the intrinsic clustering structure and cluster numbers of the unknown class data. In this paper, inspired by the sample relation characterization ability of Optimal Transport (OT), we propose a new type of OT method for OSDA, namely, Target-relaxed Optimal Transport (TROT). Compared with existing OT with strict marginal constraints, TROT imposes a single-side relaxation to the mass requirement on the open-set target domain. Theoretically, we prove that such a relaxation can reduce mis-matches between known and unknown classes, which indicates the transport plan of TROT is promising to identify unknown classes. Methodologically, TROT can identify unknown classes adaptively and map the cross-domain shared data with a sparse plan assignment, which improves both the effectiveness and robustness of known class alignment; besides, a graph embedding with multi-cluster structure of unknown classes is designed to learn a discriminative metric space for open-set classification. Empirically, extensive evaluations are conducted on several image datasets, where TROT achieves significant performance improvements compared with existing techniques for visual recognition in open-set scenarios.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}