Title: Infrared and Visible Image Fusion Using Bimodal Neuron and Dynamic Receptive Field Mechanisms
Authors: Shaobing Gao, Minjie Tan, Shun Lv, Yiguang Liu, Yongjie Li
IEEE Transactions on Image Processing, vol. PP, published 2026-05-06. DOI: 10.1109/TIP.2026.3689405
Abstract: Infrared and visible image fusion (IVIF) significantly enhances scene interpretation by integrating broad-spectrum information. Drawing inspiration from specific snakes that possess an evolutionarily optimized bimodal sensory system capable of parallel processing of infrared and visible radiation, we propose a novel IVIF framework incorporating two key elements: nonlinear cross-modal interactions across six distinct classes of snake bimodal neurons and dynamic center-surround receptive field organization. These biological principles are mathematically formalized and integrated within a deep neural network (DNN), optimized through an object detection region-guided loss and a frequency-dependent fusion loss that enable data-driven fusion strategy learning. Experimental results demonstrate that the optimized model effectively emulates the infrared-visible information integration observed in snake bimodal neurons. Critically, the nonlinear bimodal neurons capture a significantly greater amount of edge information and finer mid-to-high-frequency details, which are essential for the subsequent reconstruction of the fused image. Furthermore, a comprehensive evaluation of visual quality, encompassing both qualitative and quantitative assessments on six datasets, along with extensive object detection and semantic segmentation experiments using the fused images in both daytime and nighttime scenarios, demonstrates that our model outperforms traditional biologically-inspired IVIF algorithms, achieving performance comparable to SOTA DNN-based methods.

Title: HRMamba: A Hybrid Retinex and State-Space Model for Underwater Image Enhancement
Authors: Ye Fan, Lina Gao, Fuheng Zhou, Ning Li, Yulong Huang
IEEE Transactions on Image Processing, vol. PP, published 2026-05-06. DOI: 10.1109/TIP.2026.3689407
Abstract: Underwater light absorption and scattering lead to severe color distortion, reduced visibility, contrast loss, and a significant degradation in image quality, thereby impeding both human visual analysis and machine vision tasks. Although considerable progress has been achieved in improving image quality, existing deep learning-based methods for underwater image enhancement (UIE) remain constrained by high computational complexity and insufficient modeling of global dependencies, which restricts their practical deployment in resource-limited underwater environments. To tackle these issues, we propose a novel hybrid framework integrating Retinex theory and state-space models (SSMs) for underwater image enhancement, named HRMamba. Different from existing Transformer-based approaches constrained by quadratic complexity, HRMamba attains computational efficiency through linear-complexity state-space operations while maintaining global dependency modeling capabilities. Moreover, to achieve comprehensive feature fusion, an Illumination Feature Fusion Module (IFFM) is proposed, which synergizes the global dependency modeling of SSMs with the local adaptation capability of convolutional neural networks (CNNs). For context-sensitive noise suppression with illumination awareness, we propose an Illumination-Guided Denoising Module (IGDM) that employs directional-scanning Vision State Space Module (VSSM) blocks. Experiments demonstrate that HRMamba achieves state-of-the-art enhancement quality via an efficient architecture, significantly improving color fidelity and visibility restoration while substantially reducing computational demands. The project code will be released upon paper acceptance.

Title: MIST: A Benchmark and Baseline for Multi-frame Infrared Small Target Detection in Complex Motion
Authors: Rui Gao, Meihong Zhang, Gongyang Li, Guanyi Li, Kai Zhao, Xianchao Zhang, Dan Zeng
IEEE Transactions on Image Processing, vol. PP, published 2026-05-06. DOI: 10.1109/TIP.2026.3689420
Abstract: Motion cues play a vital role in multi-frame infrared small target detection (MISTD). However, most targets in existing datasets exhibit regular and slow motion, which cannot reflect the complex and diverse motion patterns in real-world scenarios. This biased data distribution makes recent data-driven methods rely heavily on simplified motion assumptions that tend to fail under irregular or fast motion, resulting in noisy feature representations cluttered with target-irrelevant factors. Hence, we stress that methods for MISTD should also work when targets are in complex motion. To enable this research, we propose a large-scale dataset called MIST for airborne infrared detection scenarios. The dataset is built on a synthetic data engine that models variations in pose, size, and intensity of moving targets while seamlessly blending them into real backgrounds for physical, geometric, and visual realism. Targets in MIST exhibit low signal-to-clutter ratios and complex motion, making it a promising yet challenging benchmark for developing algorithms focused on motion analysis. To tackle the challenges of MIST, we develop MISTNet, a robust baseline based on the Information Bottleneck theory. To handle irregular and fast motion, we propose a shifted neighborhood compensation block to efficiently model multi-scale correspondences for implicit motion compensation. To distill compact representations free from irrelevant cues, we design a progressive distillation decoder to hierarchically filter out redundancy while preserving target-relevant information. We benchmark 31 state-of-the-art methods and find that their performance on MIST drops significantly compared with that on the widely used NUDT-MIRSDT dataset. Our MISTNet outperforms all other methods by a large margin, with a gain of over 6% in the IoU metric, demonstrating its superiority. The dataset, code, and model weights are available at https://github.com/GR-ray/MIST.

{"title":"FinePruner: Unbiased Attention-Head-Level Fine-grained Token Reduction for Efficient Inference of Large Vision-Language Models.","authors":"Zishuo Wang, Xiangtian Zheng, Yuxin Peng","doi":"10.1109/TIP.2026.3687073","DOIUrl":"https://doi.org/10.1109/TIP.2026.3687073","url":null,"abstract":"<p><p>Large Vision-Language Models (LVLMs) suffer from the high computational cost of the attention mechanism caused by the large number of visual tokens. Token reduction has emerged as a promising approach to reduce the complexity by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens and eliminate irrelevant ones. This is due to the attention biases of LVLMs, where tokens with high attention scores are not always the critical ones. Such biases force existing methods into a dilemma: they face either high performance degradation or limited inference acceleration. This issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address the above issue, we propose an unbiased fine-grained token reduction method named FinePruner, which explores the attention patterns of LVLMs at the attention-head-level to mitigate the interference of attention biases. Concretely, we first conducted comparative studies to validate the impact of tokens corresponding to visual objects on final task performance, which established the conclusion that these tokens should be preserved while others can be pruned. Also, a series of visualizations unveils the changing patterns of LVLMs' attention biases across layers and attention heads. Based on the patterns of attention biases, the pipeline of FinePruner is divided into two stages. The first stage, named Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings to exclude the attention biases. The second stage, named Attention-Refined Pruning, selects attention heads with less bias by the divergence, which are used to identify the preserved tokens. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that our FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art methods. The code is available at https: //github.com/PKU-ICST-MIPL/FinePruner TIP2026.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cross-modal network for facial expression recognition.","authors":"Chunwei Tian, Jingyuan Xie, Qi Zhang, Chao Li, Wangmeng Zuo, Shichao Zhang","doi":"10.1109/TIP.2026.3688163","DOIUrl":"https://doi.org/10.1109/TIP.2026.3688163","url":null,"abstract":"<p><p>Deep neural networks enriched with structural information have been widely employed for facial expression recognition tasks. However, these methods often depend on hierarchical information rather than face property to finish expression recognition. In this paper, we propose a cross-modal network with strong biological and structural information for facial expression recognition (CMNet). CMNet can respectively learn expression information via face symmetry on a whole face, left and right half faces to extract complementary facial features. To prevent native effect of biological and structural information fusion, a salient facial information refinement module can obtain salient facial expression information to improve stability of an obtained facial expression classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism is designed to align obtained expression information of learned left and right half faces. Our experimental results demonstrate that CMNet outperforms several novel methods, i.e., SCN and LAENet-SA for facial expression recognition. Codes can be obtained at https://github.com/hellloxiaotian/CMNet.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147846961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Time-variant Image Inpainting via Interactive Distribution Transition Estimation
Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Luqi Gong, Di Lin, Ivor Tsang, Lei Ma
IEEE Transactions on Image Processing, vol. PP, published 2026-04-30. DOI: 10.1109/TIP.2026.3687440
Abstract: In this work, we focus on a novel and practical task, Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging complementary information from a reference image, where both images capture the same scene but with a significant time gap between them, i.e., time-variant images. Unlike conventional reference-guided image inpainting, the reference image under the TAMP setup differs significantly in content from the target image and may itself be damaged. Such a situation arises frequently in daily life, where a damaged image must be restored by referring to another image with no guarantee of the reference image's source or quality. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to chaotic image complementation. To address this ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with appropriate semantics, thus facilitating the restoration of damaged regions. To further boost performance, we propose our TAMP solution, Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, given the lack of benchmarks for the TAMP task, we assemble a new dataset, TAMP-Street, based on existing image and mask datasets. Experiments on TAMP-Street under two different time-variant image inpainting settings show that our method consistently outperforms SOTA reference-guided image inpainting methods on TAMP.

{"title":"FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning.","authors":"Hyungyu Choi, Young Kyun Jang, Chanho Eom","doi":"10.1109/TIP.2026.3687083","DOIUrl":"https://doi.org/10.1109/TIP.2026.3687083","url":null,"abstract":"<p><p>Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147793047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Soft Supervision Guided Spatial-Temporal Refinement Network For Video-based Visible-Infrared Person Re-Identification.","authors":"Jinxing Li, Chuhao Zhou, Rundong Li, Huafeng Li, Xinyu Lin, Guangming Lu, Yong Xu, David Zhang","doi":"10.1109/TIP.2026.3687081","DOIUrl":"https://doi.org/10.1109/TIP.2026.3687081","url":null,"abstract":"<p><p>Thanks to automatic switch between visible and infrared modes, person re-identification (Re-ID) in 24-hour has been possible through cross-modal retrieval. Instead of exploiting still images, video-based cross-modal person Re-ID is studied in this paper. Specifically, a large-scale dataset 'HITSZ-PVCM' is first collected, consisting of as many as 1,681 identities and 839,632 frames. Generally, videos contain much richer pedestrian appearances. However, most existing works only generate temporal representations by whole frames, inevitably losing fine-grained details. Furthermore, training a network by metric losses (e.g., center loss) is a common strategy, while such point-to-point constraints are too strong and limit model generalization due to existing diversity among intra-class samples. Here, we propose a Soft Supervision guided Spatial-Temporal Refinement (S<sup>3</sup>TR) network to tackle these problems. Specifically, S<sup>3</sup>TR refines each frame guided by a coarse temporal feature, so that more discriminative features are extracted and transformed to a sequential representation. Followed by a global-local mutual learning module, the modality gap is then erased without losing fine-grained details. Furthermore, we propose a novel soft-clustering center loss to measure intra-/inter-class similarity/dissimilarity in a group-to-group way, efficiently improving model generalization. To the best of our knowledge, HITSZ-PVCM is the largest dataset and S<sup>3</sup>TR achieves superior performances compared with state-of-the-arts.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147793033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LSRNet: A Novel Interpretable Low-rank Sparse Representation Guided Fusion Network for Polarization and Intensity Images.","authors":"Bin Yang, Yuxuan Hu, Licheng Liu, Yu Liu, Jing Li","doi":"10.1109/TIP.2026.3687104","DOIUrl":"https://doi.org/10.1109/TIP.2026.3687104","url":null,"abstract":"<p><p>Polarization and intensity images fusion (PIF) has extracted extensive attentions as it can generate images with clear scene information and salient texture details of the object surface that are important for downstream applications. However, existing deep learning-based PIF methods usually lack interpretability and ignore the interactions among multi-modal features. To this end, we propose a novel interpretable low-rank sparse representation guided fusion network for polarization and intensity images (termed LSRNet). Specifically, a low-rank sparse representation deep unfolding module is designed to acquire the base and detail features of the source images, with the ability of improving the interpretability of the network. In addition, a cross-modal connection complementary feature extraction module is proposed, which aims to establish dependency among features of multi-modalities to fully extract complementary features of the source images. In order to demonstrate the validity of our LSRNet and take into account shortcomings of existing datasets for PIF, a multi-scene polarization and intensity image dataset, named MSPI dataset, is constructed, which includes 1034 high-resolution aligned image pairs. According to the best of our knowledge, this is the most comprehensive dataset for PIF that with a large number of image pairs, high resolution and multiple scene types. Extensive experiments on our MSPI dataset and two publicly available datasets (i.e., 12CFC and HCP) demonstrate the superior fusion performance, generalization ability, and desirable running efficiency of our LSRNet. Our codes and dataset will be publicly available at https://github.com/thebinyang/LSRNet.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147792960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Bias Alleviation through Network Pruning for Sparse and Debiased Models
Authors: Sangwoo Hong, Sehwan Kim, Hyungjun Joo, Hyeonggeun Han, Jiyoon Shin, Yoav Wald, Jungwoo Lee
IEEE Transactions on Image Processing, vol. PP, published 2026-04-29. DOI: 10.1109/TIP.2026.3687070
Abstract: Pruning is a highly effective method for reducing the size of neural networks with negligible impact on their average performance. However, recent studies have revealed that pruning actually amplifies the bias in models, leading to decreased performance for underrepresented groups. To address this issue, we first analyze the impact of pruning on the confidence of each sample and introduce Accumulated Confidence (AC). AC is a proxy that facilitates the identification of bias-conflicting and bias-aligned samples without relying on group annotations. We then propose a debiasing algorithm called DEbiasing Network through Pruning (DENP). DENP utilizes AC to mitigate bias within the network. Even without bias information, DENP exhibits remarkable debiasing performance at varying levels of sparsity, effectively mitigating the bias-exacerbating property of pruning and yielding neural networks that are both sparse and debiased. Moreover, even when compared with state-of-the-art debiasing baselines under identical conditions, DENP still achieves the best performance on multiple benchmark datasets, demonstrating its superior debiasing capabilities.
