{"title":"MTMLNet: Multi-Task Mutual Learning Network for Infrared Small Target Detection and Segmentation","authors":"Bo Yang;Fengqian Li;Songliang Zhao;Wei Wang;Jun Luo;Huayan Pu;Mingliang Zhou;Yangjun Pi","doi":"10.1109/TIP.2025.3587576","DOIUrl":"10.1109/TIP.2025.3587576","url":null,"abstract":"Infrared small target detection has been extensively studied due to its wide range of applications. Most studies treat infrared small target detection as an independent task, either as a detection-based or a segmentation-based, failing to fully leverage the supervisory information from different annotation forms. To address this issue, we propose a multi-task mutual learning network (MTMLNet) specifically designed for infrared small targets, aiming to enhance both detection and segmentation performance by effectively utilizing various forms of supervisory information. Specifically, we design a multi-stage feature aggregation (MFA) module capable of capturing features with varying gradients and receptive fields simultaneously. Additionally, a hybrid pooling down-sampling (HPDown) module is proposed to mitigate information loss during the down-sampling process of infrared small targets. Finally, the hierarchical feature fusion (HFF) module is designed to adaptively select and fuse features from different semantic layers, learning the optimal way to fuse features across semantic layers. The results on IRSTD-1k and SIRST-V2 datasets show that our proposed MTMLNet achieves state-of-the-art (SOTA) performance in both detection-based and segmentation-based methods. The codes are available at <uri>https://github.com/YangBo0411/MTMLNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4414-4425"},"PeriodicalIF":0.0,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sharing Task-Relevant Information in Visual Prompt Tuning by Cross-Layer Dynamic Connection","authors":"Nan Zhou;Jiaxin Chen;Di Huang","doi":"10.1109/TIP.2025.3587587","DOIUrl":"10.1109/TIP.2025.3587587","url":null,"abstract":"Recent progress has shown great potential of visual prompt tuning (VPT) when adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions independently optimize prompts at each layer, thereby neglecting the usage of task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can adversely affect the sharing of task-relevant information. In this paper, we propose a novel VPT approach, SVPT. It innovatively incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, SVPT introduces an attentive enhancement (AE) mechanism that automatically identifies salient image tokens and refines them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantages of the proposed SVPT, compared to the state-of-the-art counterparts.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4527-4540"},"PeriodicalIF":0.0,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FullLoRA: Efficiently Boosting the Robustness of Pretrained Vision Transformers","authors":"Zheng Yuan;Jie Zhang;Shiguang Shan;Xilin Chen","doi":"10.1109/TIP.2025.3587598","DOIUrl":"10.1109/TIP.2025.3587598","url":null,"abstract":"In recent years, the Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks, and the robustness of the model has received increasing attention. However, existing large models tend to prioritize performance during training, potentially neglecting the robustness, which may lead to serious security concerns. In this paper, we establish a new challenge: exploring how to use a small number of additional parameters for adversarial finetuning to quickly and effectively enhance the adversarial robustness of a standardly trained model. To address this challenge, we develop novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module, which helps mitigate magnitude differences in parameters between the adversarial and standard training paradigms. Furthermore, we propose the FullLoRA framework by integrating the learnable LNLoRA modules into all key components of ViT-based models while keeping the pretrained model frozen, which can significantly improve the model robustness via adversarial finetuning in a parameter-efficient manner. Extensive experiments on several datasets demonstrate the superiority of our proposed FullLoRA framework. It achieves comparable robustness with full finetuning while only requiring about 5% of the learnable parameters. This also effectively addresses concerns regarding extra model storage space and enormous training time caused by adversarial finetuning.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4580-4590"},"PeriodicalIF":0.0,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Joint Visual Compression and Perception Framework for Neuromorphic Spiking Camera","authors":"Kexiang Feng;Chuanmin Jia;Siwei Ma;Wen Gao","doi":"10.1109/TIP.2025.3581372","DOIUrl":"10.1109/TIP.2025.3581372","url":null,"abstract":"The advent of Neuromorphic spike cameras has garnered significant attention for their ability to capture continuous motion with unparalleled temporal resolution. However, this imaging attribute necessitates considerable resources for binary spike data storage and transmission. In light of compression and spike-driven intelligent applications, we present the notion of Spike Coding for Intelligence (SCI), wherein spike sequences are compressed and optimized for both bit-rate and task performance. Drawing inspiration from the mammalian vision system, we propose a dual-pathway architecture for separate processing of spatial semantics and motion information, which is then merged to produce features for compression. A refinement scheme is also introduced to ensure consistency between decoded features and motion vectors. We further propose a temporal regression approach that integrates various motion dynamics, capitalizing on the advancements in warping and deformation simultaneously. Comprehensive experiments demonstrate our scheme achieves state-of-the-art (SOTA) performance for spike compression and analysis. We achieve an average 17.25% BD-rate reduction compared to SOTA codecs and a 4.3% accuracy improvement over SpiReco for spike-based classification, with 88.26% complexity reduction and 42.41% inference time saving on the encoding side.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4343-4356"},"PeriodicalIF":0.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144602617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Soft Neighbors Supported Contrastive Clustering","authors":"Yu Duan;Huimin Chen;Runxin Zhang;Rong Wang;Feiping Nie;Xuelong Li","doi":"10.1109/TIP.2025.3583194","DOIUrl":"10.1109/TIP.2025.3583194","url":null,"abstract":"Existing deep clustering methods leverage contrastive or non-contrastive learning to facilitate downstream tasks. Most contrastive-based methods typically learn representations by comparing positive pairs (two views of the same sample) against negative pairs (views of different samples). However, we spot that this hard treatment of samples ignores inter-sample relationships, leading to class collisions and degrade clustering performances. In this paper, we propose a soft neighbor supported contrastive clustering method to address this issue. Specifically, we first introduce a concept called perception radius to quantify similarity confidence between a sample and its neighbors. Based on this insight, we design a two-level soft neighbor loss that captures both local and global neighborhood relationships. Additionally, a cluster-level loss enforces compact and well-separated cluster distributions. Finally, we conduct a pseudo-label refinement strategy to mitigate false negative samples. Extensive experiments on benchmark datasets demonstrate the superiority of our method. The code is available at <uri>https://github.com/DuannYu/soft-neighbors-supported-clustering</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4315-4327"},"PeriodicalIF":0.0,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NSB-H2GAN: “Negative Sample”-Boosted Hierarchical Heterogeneous Graph Attention Network for Interpretable Classification of Whole-Slide Images","authors":"Meiyan Liang;Shupeng Zhang;Xikai Wang;Bo Li;Muhammad Hamza Javed;Xiaojun Jia;Lin Wang","doi":"10.1109/TIP.2025.3583127","DOIUrl":"10.1109/TIP.2025.3583127","url":null,"abstract":"Gigapixel whole-slide image (WSI) prediction and region-of-interest localization present considerable challenges due to the diverse range of features both across different slides and within individual slides. Most current methods rely on weakly supervised learning using homogeneous graphs to establish context-aware relevance within slides, often neglecting the rich diversity of heterogeneous information inherent in pathology images. Inspired by the negative sampling strategy of the Determinantal Point Process (DPP) and the hierarchical structure of pathology slides, we introduce the Negative Sample Boosted Hierarchical Heterogeneous Graph Attention Network (NSB-H2GAN). This model addresses the over-smoothing issue typically encountered in classical Graph Convolutional Networks (GCNs) when applied to pathology slides. By incorporating “negative samples” at multiple scales and utilizing hierarchical, heterogeneous feature discrimination, NSB-H2GAN more effectively captures the unique features of each patch, leading to an improved representation of gigapixel WSIs. We evaluated the performance of NSB-H2GAN on three publicly available datasets: CAMELYON16, TCGA-NSCLC and TCGA-COAD. The results show that NSB-H2GAN significantly outperforms existing state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, NSB-H2GAN generates more detailed and interpretable heatmaps, allowing for precise localization of tiny lesions as small as <inline-formula> <tex-math>$200mu mtimes 200mu m$ </tex-math></inline-formula> that are often missed by the human eye. The robust performance of NSB-H2GAN offers a new paradigm for computer-aided pathology diagnosis and holds great potential for advancing the clinical applications of computational pathology.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4215-4229"},"PeriodicalIF":0.0,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144547026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-View Clustering With Incremental Instances and Views","authors":"Chao Zhang;Zhi Wang;Xiuyi Jia;Zechao Li;Chunlin Chen;Huaxiong Li","doi":"10.1109/TIP.2025.3583122","DOIUrl":"10.1109/TIP.2025.3583122","url":null,"abstract":"Multi-view clustering (MVC) has attracted increasing attention with the emergence of various data collected from multiple sources. In real-world dynamic environment, instances are continually gathered, and the number of views expands as new data sources become available. Learning for such simultaneous increment of instances and views, particularly in unsupervised scenarios, is crucial yet underexplored. In this paper, we address this problem by proposing a novel MVC method with Incremental Instances and Views, MVC-IIV for short. MVC-IIV contains two stages, an initial stage and an incremental stage. In the initial stage, a basic latent multi-view subspace clustering model is constructed to handle existing data, which can be viewed as traditional static MVC. In the incremental stage, the previously trained model is reused to guide learning for newly arriving instances with new views, transferring historical knowledge while avoiding redundant computations. In specific, we design and reuse two modules, i.e., multi-view embedding module for low-dimensional representation learning, and consensus centroids module for cluster probability learning. By adding consistency regularization on the two modules, the knowledge acquired from previous data is used, which not only enhances the exploration within current data batch, but also extracts the between-batch data correlations. The proposed model can be efficiently solved with linear space and time complexity. Extensive experiments demonstrate the effectiveness and efficiency of our method compared with the state-of-the-art approaches.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4203-4214"},"PeriodicalIF":0.0,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144547027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WS-SAM: Generalizing SAM to Weakly Supervised Object Detection With Category Label","authors":"Hao Wang;Tong Jia;Qilong Wang;Wangmeng Zuo","doi":"10.1109/TIP.2025.3581729","DOIUrl":"10.1109/TIP.2025.3581729","url":null,"abstract":"Building an effective object detector usually depends on large well-annotated training samples. While annotating such dataset is extremely laborious and costly, where box-level supervision which contains both accurate classification category and localization coordinate is required. Compared to above box-level supervised annotation, those weakly supervised learning manners (e.g,, category, point and scribble) need relatively less laborious annotation cost, and provide a feasible way to mitigate the reliance on the dataset. Because of the lack of sufficient supervised information, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, Segment Anything Model (SAM) has appeared as a task-agnostic foundation model and shown promising performance improvement in many related works due to its powerful generalization and data processing abilities. The properties of the SAM inspire us to adopt such basic benchmark to weakly supervised object detection field to compensate the deficiencies in supervised information. However, directly deploying SAM on weakly supervised object detection task meets with two issues. Firstly, SAM needs meticulously-designed prompts, and such expert-level prompts restrict their applicability and practicality. Besides, SAM is a category unawareness model, and it cannot assign the category labels to the generated predictions. To solve above issues, we propose WS-SAM, which generalizes Segment Anything Model (SAM) to weakly supervised object detection with category label. Specifically, we design an adaptive prompt generator to take full advantages of the spatial and semantic information from the prompt. It employs in a self-prompting manner by taking the output of SAM from the previous iteration as the prompt input to guide the next iteration, where the prompts can be adaptively generated based on the classification activation map. We also develop a segmentation mask refinement module and formulate the label assignment process as a shortest path optimization problem by considering the similarity between each location and prompts. Furthermore, a bidirectional adapter is also implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate the effectiveness of our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experiment results show that our proposed method can achieve clear improvement over state-of-the-art methods, while performing favorably against state-of-the-arts.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4052-4066"},"PeriodicalIF":0.0,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144500691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Wavelet-Guided Deep Unfolding Network for Single Image Reflection Removal","authors":"Ya-Nan Zhang;Qiufu Li;Xu Wu;Nan Mu;Xiaoning Li;Linlin Shen","doi":"10.1109/TIP.2025.3581418","DOIUrl":"10.1109/TIP.2025.3581418","url":null,"abstract":"Removing unwanted reflections from images is a fundamental yet challenging problem in low-level computer vision. Recent deep learning-based Single Image Reflection Removal (SIRR) methods have made significant progress. However, separating reflections from transmission content remains difficult, particularly in complex scenes where the two exhibit high visual similarity. Upon careful analysis, we find that reflections predominantly reside in the high-frequency components of an image. These reflections tend to distort fine details in the high-frequency range, while the low-frequency information remains relatively less affected. This observation motivates us to explore a frequency-aware approach for SIRR by leveraging the Discrete Wavelet Transform (DWT). The wavelet decomposition enables us to distinguish and isolate reflective artifacts in the frequency domain while preserving the transmission information. Building on this insight, we propose a novel Wavelet-guided Deep Unfolding Network (WDUNet) that leverages the strengths of wavelet decomposition and deep unfolding techniques to improve interpretability and generalization in SIRR. Specifically, we formulate an optimization-based reflection removal model using DWT and convolutional dictionaries. The proposed model is optimized via a proximal gradient algorithm and then unfolded into a neural network architecture, where all parameters are learned end-to-end during training. By combining wavelet domain analysis with deep unfolding, WDUNet enhances both the interpretability and generalization of SIRR methods. Additionally, we design and integrate the Low-frequency Parameter Estimation Module (LPEM) and High-frequency Parameter Estimation Module (HPEM) modules into WDUNet, allowing the network to automatically learn and optimize the models’ hyperparameters. Extensive experiments conducted on four benchmark datasets demonstrate that WDUNet consistently outperforms existing state-of-the-art methods in both objective evaluation metrics and subjective visual quality.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4040-4051"},"PeriodicalIF":0.0,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144500690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation","authors":"Mengtan Zhang;Yi Feng;Qijun Chen;Rui Fan","doi":"10.1109/TIP.2025.3581422","DOIUrl":"10.1109/TIP.2025.3581422","url":null,"abstract":"There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A key challenge in this field is achieving robust and accurate depth estimation in regions with weak textures or where dynamic objects are present. This study makes three major contributions by delving deeply into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novel contribution is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences based on estimated ego-motion to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novel contribution arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is therefore designed to refine depth estimation with a specific emphasis on local variations. The third novel contribution is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid and optical flows, encouraging the former towards more accurate correspondence and making the latter more adaptable across various scenarios under the static scene hypotheses. DCPI-Depth, a framework that incorporates all these innovative components and couples two bidirectional and collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming all existing prior arts. Specifically, it demonstrates accurate depth estimation in texture-less and dynamic regions, and shows more reasonable smoothness. Our source code is publicly available at <uri>https://mias.group/DCPI-Depth</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4258-4272"},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}