{"title":"SDSFusion: A Semantic-Aware Infrared and Visible Image Fusion Network for Degraded Scenes","authors":"Jun Chen;Liling Yang;Wei Yu;Wenping Gong;Zhanchuan Cai;Jiayi Ma","doi":"10.1109/TIP.2025.3571339","DOIUrl":"10.1109/TIP.2025.3571339","url":null,"abstract":"A single-modal infrared or visible image offers limited representation in scenes with lighting degradation or extreme weather. We propose a multi-modal fusion framework, named SDSFusion, for all-day and all-weather infrared and visible image fusion. SDSFusion exploits the commonality in image processing to achieve enhancement, fusion, and semantic task interaction in a unified framework guided by semantic awareness and multi-scale features and losses. To address the disparity between infrared and visible images in degraded scenes, we differentiate modal features in a unified fusion model. Unlike existing joint fusion methods, we propose an adversarial generative network that refines the reconstruction of low-light images by embedding fused features. It provides feature-level brightness supplementation and image reconstruction to refine brightness and contrast. Extensive experiments in degraded scenes confirm that our approach is superior to state-of-the-art approaches in visual quality and performance, demonstrating the effectiveness of interaction improvement. The code will be posted at: <uri>https://github.com/Liling-yang/SDSFusion</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3139-3153"},"PeriodicalIF":0.0,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Modal Molecular Representation Learning via Structure Awareness","authors":"Rong Yin;Ruyue Liu;Xiaoshuai Hao;Xingrui Zhou;Yong Liu;Can Ma;Weiping Wang","doi":"10.1109/TIP.2025.3570604","DOIUrl":"10.1109/TIP.2025.3570604","url":null,"abstract":"Accurate extraction of molecular representations is a critical step in the drug discovery process. In recent years, significant progress has been made in molecular representation learning methods, among which multi-modal molecular representation methods based on images, and 2D/3D topologies have become increasingly mainstream. However, existing these multi-modal approaches often directly fuse information from different modalities, overlooking the potential of intermodal interactions and failing to adequately capture the complex higher-order relationships and invariant features between molecules. To overcome these challenges, we propose a structure-awareness-based multi-modal self-supervised molecular representation pre-training framework (MMSA) designed to enhance molecular graph representations by leveraging invariant knowledge between molecules. The framework consists of two main modules: the multi-modal molecular representation learning module and the structure-awareness module. The multi-modal molecular representation learning module collaboratively processes information from different modalities of the same molecule to overcome intermodal differences and generate a unified molecular embedding. Subsequently, the structure-awareness module enhances the molecular representation by constructing a hypergraph structure to model higher-order correlations between molecules. This module also introduces a memory mechanism for storing typical molecular representations, aligning them with memory anchors in the memory bank to integrate invariant knowledge, thereby improving the model’s generalization ability. Compared to existing multi-modal approaches, MMSA can be seamlessly integrated with any graph-based method and supports multiple molecular data modalities, ensuring both versatility and compatibility. Extensive experiments have demonstrated the effectiveness of MMSA, which achieves state-of-the-art performance on the MoleculeNet benchmark, with average ROC-AUC improvements ranging from 1.8% to 9.6% over baseline methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3225-3238"},"PeriodicalIF":0.0,"publicationDate":"2025-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HLDD: Hierarchically Learned Detector and Descriptor for Robust Image Matching","authors":"Maoqing Hu;Bin Sun;Fuhua Zhang;Shutao Li","doi":"10.1109/TIP.2025.3568310","DOIUrl":"10.1109/TIP.2025.3568310","url":null,"abstract":"Image matching is a critical task in computer vision research, focusing on aligning two or more images with similar features. Feature detection and description constitute the core of image matching. Handcrafted detectors are capable of obtaining distinctive points but these points may not be repeatable on the image pairs especially those with dramatic appearance changes. On the contrary, the learned detectors can extract a large number of repeatable points but many of them tend to be ambiguous points with low distinctiveness. Moreover, in the scenarios of dramatic appearance change, commonly used contrast or triplet loss in the training of descriptors employ the hard negative mining strategy, which may obtain overly challenging negative samples by global sampling, resulting in sluggish convergence or even overfitting. Those learned descriptors may not guarantee that the corresponding points enjoy larger similarities than unmatched ones, leading to inaccurate matches. To address those issues, we propose a hierarchically learned detector and descriptor (HLDD) for robust image matching, which contains three modules: a handcrafted-learned detector, a hierarchically learned descriptor, and a coarse-to-fine matching strategy. The handcrafted-learned detector integrates the advantages of handcrafted and learned detectors. It extracts distinctive feature points from a learned repeatability map robust to image changes and eliminates the ambiguous ones according to a learned distinctiveness map. The descriptor is trained by a proposed hierarchical triplet loss, which employs a dual window strategy. It can obtain the hardest negative samples in local windows, which are comparatively easier over global sampling, ensuring the effective training of descriptors. The coarse-to-fine matching strategy performs global and local mutual nearest neighbor matching on the coarse and fine descriptor maps respectively to improve the matching accuracy progressively. By comparing with other matching methods, experimental results demonstrate the superiority of the proposed method in the task of image matching, homography estimation, visual localization, and relative pose estimation. Moreover, ablation studies illustrate the effectiveness of the three proposed modules.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3123-3138"},"PeriodicalIF":0.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144122156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Invariant Feature Extraction Functions for UME-Based Point Cloud Detection and Registration","authors":"Amit Efraim;Yuval Haitman;Joseph M. Francos","doi":"10.1109/TIP.2025.3570628","DOIUrl":"10.1109/TIP.2025.3570628","url":null,"abstract":"Point clouds are unordered sets of coordinates in 3D with no functional relation imposed on them. The Rigid Transformation Universal Manifold Embedding (RTUME) is a mapping of volumetric or surface measurements on a 3D object to matrices, such that when two observations on the same object are related by a rigid transformation, this relation is preserved between their corresponding RTUME matrices, thus providing linear and robust solution to the registration and detection problems. To make the RTUME framework of 3D object detection and registration applicable for processing point cloud observations, there is a need to define a function that assigns each point in the cloud with a value (feature vector), invariant to the action of the transformation group. Since existing feature extraction functions do not achieve the desired level of invariance to rigid transformations, to the variability of sampling patterns, and to model mismatches, we present a novel approach for designing dense feature extraction functions, compatible with the requirements of the RTUME framework. One possible implementation of the approach is to adapt existing feature extracting functions, whether learned or analytic, designed for the estimation of point correspondences, to the RTUME framework. The novel feature-extracting function design employs integration over <inline-formula> <tex-math>$SO(3)$ </tex-math></inline-formula> to marginalize the pose dependency of extracted features, followed by projecting features between point clouds using nearest neighbor projection to overcome other sources of model mismatch. In addition, the non-linear functions that define the RTUME mapping are optimized using an MLP model, trained to minimize the RTUME registration errors. The overall RTUME registration performance is evaluated using standard registration benchmarks, and is shown to outperform existing SOTA methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3209-3224"},"PeriodicalIF":0.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144113867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion-Based Facial Aesthetics Enhancement With 3D Structure Guidance","authors":"Lisha Li;Jingwen Hou;Weide Liu;Yuming Fang;Jiebin Yan","doi":"10.1109/TIP.2025.3551077","DOIUrl":"10.1109/TIP.2025.3551077","url":null,"abstract":"Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1879-1894"},"PeriodicalIF":0.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143672286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Label Space-Induced Pseudo Label Refinement for Multi-Source Black-Box Domain Adaptation","authors":"Chaehwa Yoo;Xiaofeng Liu;Fangxu Xing;Jonghye Woo;Je-Won Kang","doi":"10.1109/TIP.2025.3570220","DOIUrl":"10.1109/TIP.2025.3570220","url":null,"abstract":"Conventional unsupervised domain adaptation (UDA) requires access to source data and/or source model parameters, prohibiting its practical application in terms of privacy, security, and intellectual property. Recent black-box UDA (BDA) reduces such constraints by defining a pseudo label from a single encapsulated source application programming interface (API) prediction, which allows for self-training of the target model. Nonetheless, existing methods have limited consideration for multi-source settings, in which multiple source domain APIs are available to generate pseudo labels. In this work, we introduce a novel training framework for multi-source BDA (MSBDA), dubbed Label Space-Induced Pseudo Label Refinement (LPR). Specifically, LPR incorporates a Pseudo label Refinery Network (PRN) that learns the relationship among source domains conditioned by the target domain only utilizing source API’s prediction. The target model is adapted by our dual phases PRN. First, a warm-up phase targets to avoid failure due to noisy samples and provide an initial pseudo-label, which is followed by a label refinement phase with domain relationship exploration. We provide theoretical support for the mechanism of the LPR. Experimental results on four benchmark datasets demonstrate that MSBDA using LPR achieves competitive performance compared to state-of-the-art approaches with different DA settings.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3181-3193"},"PeriodicalIF":0.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144113866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LIPT: Latency-Aware Image Processing Transformer","authors":"Junbo Qiao;Wei Li;Haizhen Xie;Hanting Chen;Jie Hu;Shaohui Lin;Jungong Han","doi":"10.1109/TIP.2025.3567832","DOIUrl":"10.1109/TIP.2025.3567832","url":null,"abstract":"Transformer is leading a trend in the field of image processing. While existing lightweight image processing transformers have achieved notable success, they primarily focus on reducing FLOPs (floating-point operations) or the number of parameters, rather than on practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block that substitutes memory-intensive operators with the combination of self-attention and convolutions to achieve practical speedup. Specifically, we propose a novel non-volatile sparse masking self-attention (NVSM-SA) that utilizes a pre-computing sparse mask to capture contextual information from a larger window with no extra computation overload. Besides, a high-frequency reparameterization module (HRM) is proposed to make LIPT block reparameterization friendly, enhancing the model’s ability to reconstruct fine details. Extensive experiments on multiple image processing tasks (e.g., image super-resolution (SR), JPEG artifact reduction, and image denoising) demonstrate the superiority of LIPT on both latency and PSNR. LIPT achieves real-time GPU inference with state-of-the-art performance on multiple image SR benchmarks. The source codes are released at <uri>https://github.com/Lucien66/LIPT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3056-3069"},"PeriodicalIF":0.0,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144104763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DGC-Net: Dynamic Graph Contrastive Network for Video Object Detection","authors":"Qiang Qi;Hanzi Wang;Yan Yan;Xuelong Li","doi":"10.1109/TIP.2025.3551158","DOIUrl":"10.1109/TIP.2025.3551158","url":null,"abstract":"Video object detection is a challenging task in computer vision since it needs to handle the object appearance degradation problem that seldom occurs in the image domain. Off-the-shelf video object detection methods typically aggregate multi-frame features at one stroke to alleviate appearance degradation. However, these existing methods do not take supervision knowledge into consideration and thus still suffer from insufficient feature aggregation, resulting in the false detection problem. In this paper, we take a different perspective on feature aggregation, and propose a dynamic graph contrastive network (DGC-Net) for video object detection, including three improvements against existing methods. First, we design a frame-level graph contrastive module to aggregate frame features, enabling our DGC-Net to fully exploit discriminative contextual feature representations to facilitate video object detection. Second, we develop a proposal-level graph contrastive module to aggregate proposal features, making our DGC-Net sufficiently learn discriminative semantic feature representations. Third, we present a graph transformer to dynamically adjust the graph structure by pruning the useless nodes and edges, which contributes to improving accuracy and efficiency as it can eliminate the geometric-semantic ambiguity and reduce the graph scale. Furthermore, inherited from the framework of DGC-Net, we develop DGC-Net Lite to perform real-time video object detection with a much faster inference speed. Extensive experiments conducted on the ImageNet VID dataset demonstrate that our DGC-Net outperforms the performance of current state-of-the-art methods. Notably, our DGC-Net obtains 86.3%/87.3% mAP when using ResNet-101/ResNeXt-101.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2269-2284"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Per-Pixel Calibration Based on Multi-View 3D Reconstruction Errors Beyond the Depth of Field","authors":"Rong Dai;Wenpan Li;Yun-Hui Liu","doi":"10.1109/TIP.2025.3551165","DOIUrl":"10.1109/TIP.2025.3551165","url":null,"abstract":"In 3D microscopic imaging, the extremely shallow depth of field presents a challenge for accurate 3D reconstruction in cases of significant defocus. Traditional calibration methods rely on the spatial extraction of feature points to establish spatial 3D information as the optimization objective. However, these methods suffer from reduced extraction accuracy under defocus conditions, which causes degradation of calibration performance. To extend calibration volume without compromising accuracy in defocused scenarios, we propose a per-pixel calibration based on multi-view 3D reconstruction errors. It utilizes 3D reconstruction errors among different binocular setups as an optimization objective. We first analyze multi-view 3D reconstruction error distributions under the poor-accuracy optical model by employing a multi-view microscopic 3D measurement system using telecentric lenses. Subsequently, the 3D proportion model is proposed for implementing our error-based per-pixel calibration, derived as a spatial linear expression directly correlated with the 3D reconstruction error distribution. The experimental results confirm the robust convergence of our method with multiple binocular setups. Near the focus volume, the multi-view 3D reconstruction error remains approximately <inline-formula> <tex-math>$8~mu $ </tex-math></inline-formula> m (less than 0.5 camera pixel pitch), with absolute accuracy maintained within 0.5% of the measurement range. Beyond tenfold depth of field, the multi-view 3D reconstruction error increases to around <inline-formula> <tex-math>$30~mu $ </tex-math></inline-formula> m (still less than 2 camera pixel pitches), while absolute accuracy remains within 1% of the measurement range. These high-precision measurement results validate the feasibility and accuracy of our proposed calibration.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2124-2132"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geodesic-Aligned Gradient Projection for Continual Task Learning","authors":"Benliu Qiu;Heqian Qiu;Haitao Wen;Lanxiao Wang;Yu Dai;Fanman Meng;Qingbo Wu;Hongliang Li","doi":"10.1109/TIP.2025.3551139","DOIUrl":"10.1109/TIP.2025.3551139","url":null,"abstract":"Deep networks notoriously suffer from performance deterioration on previous tasks when learning from sequential tasks, i.e., catastrophic forgetting. Recent methods of gradient projection show that the forgetting is resulted from the gradient interference on old tasks and accordingly propose to update the network in an orthogonal direction to the task space. However, these methods assume the task space is invariant and neglect the gradual change between tasks, resulting in sub-optimal gradient projection and a compromise of the continual learning capacity. To tackle this problem, we propose to embed each task subspace into a non-Euclidean manifold, which can naturally capture the change of tasks since the manifold is intrinsically non-static compared to the Euclidean space. Subsequently, we analytically derive the accumulated projection between any two subspaces on the manifold along the geodesic path by integrating an infinite number of intermediate subspaces. Building upon this derivation, we propose a novel geodesic-aligned gradient projection (GAGP) method that harnesses the accumulated projection to mitigate catastrophic forgetting. The proposed method utilizes the geometric structure information on the task manifold by capturing the gradual change between the new and the old tasks. Empirical studies on image classification demonstrate that the proposed method alleviates catastrophic forgetting and achieves on-par or better performance compared to the state-of-the-art approaches.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1995-2007"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}