IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society): Latest Articles

Hyperspectral Information Extraction With Full Resolution From Arbitrary Photographs
IF 13.7
Semin Kwon;Sang Mok Park;Yuhyun Ji;Haripriya Sakthivel;Jung Woo Leem;Young L. Kim
{"title":"Hyperspectral Information Extraction With Full Resolution From Arbitrary Photographs","authors":"Semin Kwon;Sang Mok Park;Yuhyun Ji;Haripriya Sakthivel;Jung Woo Leem;Young L. Kim","doi":"10.1109/TIP.2025.3597038","DOIUrl":"10.1109/TIP.2025.3597038","url":null,"abstract":"Because optical spectrometers capture abundant molecular, biological, and physical information beyond images, ongoing efforts focus on both algorithmic and hardware approaches to obtain detailed spectral information. Spectral reconstruction from red-green-blue (RGB) values acquired by conventional trichromatic cameras has been an active area of study. However, the resultant spectral profile is often affected not only by the unknown spectral properties of the sample itself, but also by light conditions, device characteristics, and image file formats. Existing machine learning models for spectral reconstruction are further limited in generalizability due to their reliance on task-specific training data or fixed models. Advanced spectrometer hardware employing sophisticated nanofabricated components also constrains scalability and affordability. Here we introduce a general computational framework, co-designed with spectrally incoherent color reference charts, to recover the spectral information of an arbitrary sample from a single-shot photo in the visible range. The mutual optimization of reference color selection and the computational algorithm eliminates the need for training data or pretrained models. In transmission mode, altered RGB values of reference colors are used to recover the spectral intensity of the sample, achieving spectral resolution comparable to that of scientific spectrometers. In reflection mode, a spectral hypercube of the sample can be constructed from a single-shot photo, analogous to hyperspectral imaging. The reported computational photography spectrometry has the potential to make optical spectroscopy and hyperspectral imaging accessible using off-the-shelf smartphones.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5429-5441"},"PeriodicalIF":13.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11125864","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
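The transmission-mode recovery described above is, at its core, a linear inverse problem: each reference patch with a known spectrum contributes a few equations relating the unknown sample spectrum to its altered RGB values. The sketch below illustrates only that idea; the camera-response matrix, the smoothness prior, and all variable names are assumptions for illustration, not the authors' algorithm.

```python
# Minimal sketch (assumed setup, not the paper's method): recover a sample's
# transmission spectrum from the altered RGB values of known reference patches.
import numpy as np

def recover_sample_spectrum(cam_resp, patch_spectra, observed_rgb, smooth=1e-2):
    """cam_resp:      (3, N) spectral sensitivities of the R, G, B channels
    patch_spectra: (K, N) known transmission spectra of the reference colors
    observed_rgb:  (K, 3) RGB values of the patches photographed through the sample
    Returns an N-point estimate of the sample's transmission spectrum."""
    K, N = patch_spectra.shape
    # Each patch k contributes 3 linear equations: rgb_k = cam_resp @ (r_k * s)
    A = np.concatenate([cam_resp * patch_spectra[k][None, :] for k in range(K)], axis=0)  # (3K, N)
    b = observed_rgb.reshape(-1)                                                          # (3K,)
    # Second-difference operator as a simple smoothness (Tikhonov) prior
    D = np.diff(np.eye(N), n=2, axis=0)
    A_reg = np.vstack([A, smooth * D])
    b_reg = np.concatenate([b, np.zeros(D.shape[0])])
    s, *_ = np.linalg.lstsq(A_reg, b_reg, rcond=None)
    return np.clip(s, 0.0, None)
```

The regularized least-squares form is only one of many ways to pose the recovery; the point is that once the reference spectra are fixed, the unknown spectrum enters the observations linearly.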
Semi-Supervised Medical Hyperspectral Image Segmentation Using Adversarial Consistency Constraint Learning and Cross Indication Network
IF 13.7
Geng Qin;Huan Liu;Xueyu Zhang;Wei Li;Yuxing Guo;Chuanbin Guo
{"title":"Semi-Supervised Medical Hyperspectral Image Segmentation Using Adversarial Consistency Constraint Learning and Cross Indication Network","authors":"Geng Qin;Huan Liu;Xueyu Zhang;Wei Li;Yuxing Guo;Chuanbin Guo","doi":"10.1109/TIP.2025.3598499","DOIUrl":"10.1109/TIP.2025.3598499","url":null,"abstract":"Hyperspectral imaging technology is considered a new paradigm for high-precision pathological image segmentation due to its ability to obtain spatial and spectral information of the detected object simultaneously. However, due to the time-consuming and laborious manual annotation, precise annotation of medical hyperspectral images is difficult to obtain. Therefore, there is an urgent need for a semi-supervised learning framework that can fully utilize unlabeled data for medical hyperspectral image segmentation. In this work, we propose an adversarial consistency constraint learning cross indication network (ACCL-CINet), which achieves accurate pathological image segmentation through adversarial consistency constraint learning training strategies. The ACCL-CINet comprises a contextual and structural encoder to form the spatial-spectral feature encoding part. The contextual and structural indications are aggregated into features through a cross indication attention module and finally decoded by a pixel decoder to generate prediction results. For the semi-supervised training strategy, a pixel perceptual consistency module encourages the two models to generate consistent and low-entropy predictions. Secondly, a pixel maximum neighborhood probability adversarial constraint strategy is designed, which produces high-quality pseudo labels for cross supervision training. The proposed ACCL-CINet has been rigorously evaluated on both public and private datasets, with experimental results demonstrating that it outperforms state-of-the-art semi-supervised methods. The code is available at: <uri>https://github.com/Qugeryolo/ACCL-CINet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5414-5428"},"PeriodicalIF":13.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
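For readers unfamiliar with the two training signals named in the abstract (consistency between two models and cross supervision with pseudo labels), here is a generic sketch of how such losses are commonly combined on unlabeled data. The loss forms and weights are illustrative assumptions and do not reproduce the ACCL-CINet implementation.

```python
# Generic semi-supervised losses for two segmentation models (illustrative only).
import torch.nn.functional as F

def semi_supervised_losses(model_a, model_b, unlabeled_x, w_cons=1.0, w_cross=0.5):
    logits_a = model_a(unlabeled_x)          # (B, C, H, W) segmentation logits
    logits_b = model_b(unlabeled_x)
    logp_a = F.log_softmax(logits_a, dim=1)
    logp_b = F.log_softmax(logits_b, dim=1)
    prob_a, prob_b = logp_a.exp(), logp_b.exp()

    # Consistency: symmetric KL pushes the two predictions toward agreement
    cons = 0.5 * (F.kl_div(logp_a, prob_b, reduction="batchmean")
                  + F.kl_div(logp_b, prob_a, reduction="batchmean"))

    # Cross pseudo supervision: each model is trained on the other's hard labels
    pseudo_a = prob_a.argmax(dim=1).detach()
    pseudo_b = prob_b.argmax(dim=1).detach()
    cross = F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)

    return w_cons * cons + w_cross * cross
```

In practice this unlabeled-data term is added to an ordinary supervised loss computed on the labeled subset.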
Alternating Direction Unfolding With a Cross Spectral Attention Prior for Dual-Camera Compressive Hyperspectral Imaging
IF 13.7
Yubo Dong;Dahua Gao;Danhua Liu;Yanli Liu;Guangming Shi
{"title":"Alternating Direction Unfolding With a Cross Spectral Attention Prior for Dual-Camera Compressive Hyperspectral Imaging","authors":"Yubo Dong;Dahua Gao;Danhua Liu;Yanli Liu;Guangming Shi","doi":"10.1109/TIP.2025.3597775","DOIUrl":"10.1109/TIP.2025.3597775","url":null,"abstract":"Coded Aperture Snapshot Spectral Imaging (CASSI) multiplexes 3D Hyperspectral Images (HSIs) into a 2D sensor to capture dynamic spectral scenes, which, however, sacrifices the spatial information. Dual-Camera Compressive Hyperspectral Imaging (DCCHI) enhances CASSI by incorporating a Panchromatic (PAN) camera to compensate for the loss of spatial information in CASSI. However, the dual-camera structure of DCCHI disrupts the diagonal property of the product of the sensing matrix and its transpose, making it difficult to efficiently and accurately solve the data subproblem in closed-form and thereby hindering the application of model-based methods and Deep Unfolding Networks (DUNs) that rely on such a closed-form solution. To address this issue, we propose an Alternating Direction DUN, named ADRNN, which decouples the imaging model of DCCHI into a CASSI subproblem and a PAN subproblem. The ADRNN alternately solves data terms analytically and a joint prior term in these subproblems. Additionally, we propose a Cross Spectral Transformer (XST) to exploit the joint prior. The XST utilizes cross spectral attention to exploit the correlation between the compressed HSI and the PAN image, and incorporates Grouped-Query Attention (GQA) to alleviate the burden of parameters and computational cost brought by impartially treating the compressed HSI and the PAN image. Furthermore, we built a real DCCHI system and captured large-scale indoor and outdoor scenes for future academic research. Extensive experiments on both simulation and real datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance. The code and datasets have been open-sourced at: <uri>https://github.com/ShawnDong98/ADRNN-XST</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5325-5340"},"PeriodicalIF":13.7,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
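The cross spectral attention with Grouped-Query Attention mentioned above can be pictured as cross-attention in which queries come from the compressed-HSI branch, keys/values come from the PAN branch, and several query heads share one key/value head. The module below is a minimal sketch under assumed shapes and head counts, not the released XST code.

```python
# Grouped-query cross-attention sketch (assumed dimensions, illustrative only).
import torch.nn as nn

class GroupedQueryCrossAttention(nn.Module):
    def __init__(self, dim=64, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0 and dim % n_q_heads == 0
        self.hd = dim // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.q = nn.Linear(dim, n_q_heads * self.hd)        # queries from the HSI branch
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.hd)  # fewer key/value heads from PAN
        self.proj = nn.Linear(n_q_heads * self.hd, dim)

    def forward(self, hsi_tokens, pan_tokens):
        # hsi_tokens: (B, N, dim) tokens from the compressed HSI; pan_tokens: (B, M, dim)
        B, N, _ = hsi_tokens.shape
        q = self.q(hsi_tokens).view(B, N, self.n_q, self.hd).transpose(1, 2)   # (B, Hq, N, hd)
        k, v = self.kv(pan_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, self.n_kv, self.hd).transpose(1, 2)                  # (B, Hkv, M, hd)
        v = v.view(B, -1, self.n_kv, self.hd).transpose(1, 2)
        rep = self.n_q // self.n_kv
        k = k.repeat_interleave(rep, dim=1)   # each KV head serves a group of query heads
        v = v.repeat_interleave(rep, dim=1)
        attn = (q @ k.transpose(-2, -1)) / self.hd ** 0.5
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

Sharing key/value heads is what reduces the parameter and compute cost relative to giving the PAN branch a full set of heads.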
Color Spike Camera Reconstruction via Long Short-Term Temporal Aggregation of Spike Signals
IF 13.7
Yanchen Dong;Ruiqin Xiong;Jing Zhao;Xiaopeng Fan;Xinfeng Zhang;Tiejun Huang
{"title":"Color Spike Camera Reconstruction via Long Short-Term Temporal Aggregation of Spike Signals","authors":"Yanchen Dong;Ruiqin Xiong;Jing Zhao;Xiaopeng Fan;Xinfeng Zhang;Tiejun Huang","doi":"10.1109/TIP.2025.3595368","DOIUrl":"10.1109/TIP.2025.3595368","url":null,"abstract":"With the prevalence of emerging computer vision applications, the demand for capturing dynamic scenes with high-speed motion has increased. A kind of neuromorphic sensor called spike camera shows great potential in this aspect since it generates a stream of binary spikes to describe the dynamic light intensity with a very high temporal resolution. Color spike camera (CSC) was recently invented to capture the color information of dynamic scenes via a color filter array (CFA) on the sensor. This paper proposes a long short-term temporal aggregation strategy of spike signals. First, we utilize short-term temporal correlation to adaptively extract temporal features of each time point. Then we align the features and aggregate them to exploit long-term temporal correlation, suppressing undesired motion blur. To implement the strategy, we design a CSC reconstruction network. Based on adaptive short-term temporal aggregation, we propose a spike representation module to extract temporal features of each color channel, leveraging multiple temporal scales. Considering the long-term temporal correlation, we develop an alignment module to align the temporal features. In particular, we perform motion alignment of red and blue channels with the guidance of the higher-sampling-rate green channel, leveraging motion consistency among color channels. Besides, we propose a module to aggregate the aligned temporal features for the restored color image, which exploits color channel correlation. We have also developed a CSC simulator for data generation. Experimental results demonstrate that our method can restore color images with fine texture details, achieving state-of-the-art CSC reconstruction performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5312-5324"},"PeriodicalIF":13.7,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
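As background for the short-term/long-term aggregation idea, a naive spike-camera baseline estimates intensity as the firing rate inside a short window and then averages several such estimates over a longer horizon. The sketch below shows only that baseline under assumed array shapes; the paper's network replaces the fixed windows and uniform weights with learned, motion-aligned aggregation per color channel.

```python
# Naive spike-to-intensity baseline (illustrative, not the proposed network).
import numpy as np

def short_term_intensity(spikes, t, win=16):
    """spikes: (T, H, W) binary spike stream; returns an (H, W) estimate at time t."""
    lo, hi = max(0, t - win // 2), min(spikes.shape[0], t + win // 2)
    return spikes[lo:hi].mean(axis=0)   # firing rate approximates light intensity

def long_term_aggregate(spikes, t, win=16, offsets=(-32, -16, 0, 16, 32), weights=None):
    """Average several short-term estimates taken at offsets around time t."""
    ests = [short_term_intensity(spikes, int(np.clip(t + o, 0, spikes.shape[0] - 1)), win)
            for o in offsets]
    weights = np.ones(len(ests)) / len(ests) if weights is None else np.asarray(weights)
    return np.tensordot(weights, np.stack(ests), axes=1)
```

Without alignment, such averaging blurs moving content, which is exactly the failure mode the alignment module in the paper is designed to suppress.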
Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance
IF 13.7
Yingqi Lin;Xiaogang Xu;Jiafei Wu;Yan Han;Zhe Liu
{"title":"Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance","authors":"Yingqi Lin;Xiaogang Xu;Jiafei Wu;Yan Han;Zhe Liu","doi":"10.1109/TIP.2025.3597046","DOIUrl":"10.1109/TIP.2025.3597046","url":null,"abstract":"Low-Light Enhancement (LLE) is aimed at improving the quality of photos/videos captured under low-light conditions. It is worth noting that most existing LLE methods do not take advantage of geometric modeling. We believe that incorporating geometric information can enhance LLE performance, as it provides insights into the physical structure of the scene that influences illumination conditions. To address this, we propose a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) designed to assist low-light enhancement models in learning improved features by integrating geometric priors into the feature representation space. In this paper, we employ depth priors as the geometric representation. Our approach focuses on the integration of depth priors into various LLE frameworks using a unified methodology. This methodology comprises two key novel modules. First, a depth-aware feature extraction module is designed to inject depth priors into the image representation. Then, the Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) is formulated with a cross-domain attention mechanism, which combines depth-aware features with the original image features within LLE models. We conducted extensive experiments on public low-light image and video enhancement benchmarks. The results illustrate that our framework significantly enhances existing LLE methods. The source code and pre-trained models are available at <uri>https://github.com/Estheryingqi/GG-LLERF</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5442-5457"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144857248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
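One common way to realize the depth-guided cross-domain attention described above is to let image features query features encoded from the depth prior and add the attended result back residually. The module below is a hedged sketch with assumed layer sizes and a single global attention for brevity; it is not the released GG-LLERF code.

```python
# Depth-guided feature fusion via cross-domain attention (illustrative sketch).
import torch
import torch.nn as nn

class DepthGuidedFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.to_q = nn.Conv2d(channels, channels, 1)   # queries from image features
        self.to_k = nn.Conv2d(channels, channels, 1)   # keys from the depth prior
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, img_feat, depth):
        # img_feat: (B, C, H, W) low-light image features; depth: (B, 1, H, W) prior
        d = self.depth_enc(depth)
        B, C, H, W = img_feat.shape
        q = self.to_q(img_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(d).flatten(2)                          # (B, C, HW)
        v = self.to_v(d).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)       # cross-domain attention
        fused = (attn @ v).transpose(1, 2).view(B, C, H, W)
        return img_feat + self.out(fused)                    # residual fusion
```

Global spatial attention is quadratic in H*W; practical models typically tile, downsample, or restrict the attention window.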
Confound Controlled Multimodal Neuroimaging Data Fusion and Its Application to Developmental Disorders
IF 13.7
Chuang Liang;Rogers F. Silva;Tülay Adali;Rongtao Jiang;Daoqiang Zhang;Shile Qi;Vince D. Calhoun
{"title":"Confound Controlled Multimodal Neuroimaging Data Fusion and Its Application to Developmental Disorders","authors":"Chuang Liang;Rogers F. Silva;Tülay Adali;Rongtao Jiang;Daoqiang Zhang;Shile Qi;Vince D. Calhoun","doi":"10.1109/TIP.2025.3597045","DOIUrl":"10.1109/TIP.2025.3597045","url":null,"abstract":"Multimodal fusion provides multiple benefits over single modality analysis by leveraging both shared and complementary information from different modalities. Notably, supervised fusion enjoys extensive interest for capturing multimodal co-varying patterns associated with clinical measures. A key challenge of brain data analysis is how to handle confounds, which, if unaddressed, can lead to an unrealistic description of the relationship between the brain and clinical measures. Current approaches often rely on linear regression to remove covariate effects prior to fusion, which may lead to information loss, rather than pursue the more global strategy of optimizing both fusion and covariates removal simultaneously. Thus, we propose “CR-mCCAR” to jointly optimize for confounds within a guided fusion model, capturing co-varying multimodal patterns associated with a specific clinical domain while also discounting covariate effects. Simulations show that CR-mCCAR separate the reference and covariate factors accurately. Functional and structural neuroimaging data fusion reveals co-varying patterns in attention deficit/hyperactivity disorder (ADHD, striato-thalamo-cortical and salience areas) and in autism spectrum disorder (ASD, salience and fronto-temporal areas) that link with core symptoms but uncorrelate with age and motion. These results replicate in an independent cohort. Downstream classification accuracy between ADHD/ASD and controls is markedly higher for CR-mCCAR compared to fusion and regression separately. CR-mCCAR can be extended to include multiple targets and multiple covariates. Overall, results demonstrate CR-mCCAR can jointly optimize for target components that correlate with the reference(s) while removing nuisance covariates. This approach can improve the meaningful detection of reliable phenotype-linked multimodal biomarkers for brain disorders.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5271-5284"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144851240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
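For contrast with the joint optimization proposed above, the two-step baseline the abstract critiques first regresses nuisance covariates out of each modality and then fuses the residuals. A minimal sketch of that residualization step, with illustrative shapes and names:

```python
# Conventional pre-fusion covariate removal by ordinary least squares (baseline only).
import numpy as np

def residualize(data, covariates):
    """data: (subjects, features); covariates: (subjects, k), e.g. age and motion.
    Returns data with the best linear fit of the covariates (plus intercept) removed."""
    X = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(X, data, rcond=None)   # (k+1, features)
    return data - X @ beta

# usage (shapes illustrative): fuse(residualize(fmri_maps, covs), residualize(smri_maps, covs))
```

The abstract's point is that discarding the covariate fit before fusion can also discard signal, which is why CR-mCCAR handles covariates inside the fusion objective instead.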
Parts2Whole: Generalizable Multi-Part Portrait Customization
IF 13.7
Hongxing Fan;Zehuan Huang;Lipeng Wang;Haohua Chen;Li Yin;Lu Sheng
{"title":"Parts2Whole: Generalizable Multi-Part Portrait Customization","authors":"Hongxing Fan;Zehuan Huang;Lipeng Wang;Haohua Chen;Li Yin;Lu Sheng","doi":"10.1109/TIP.2025.3597037","DOIUrl":"10.1109/TIP.2025.3597037","url":null,"abstract":"Multi-part portrait customization aims to generate realistic human images by assembling specified body parts from multiple reference images, with significant applications in digital human creation. Existing customization methods typically follow two approaches: 1) test-time fine-tuning, which learn concepts effectively but is time-consuming and struggles with multi-part composition; 2) generalizable feed-forward methods, which offer efficiency but lack fine control over appearance specifics. To address these limitations, we present Parts2Whole, a diffusion-based generalizable portrait generator that harmoniously integrates multiple reference parts into high-fidelity human images by our proposed multi-reference mechanism. To adequately characterize each part, we propose a detail-aware appearance encoder, which is initialized and inherits powerful image priors from the pre-trained denoising U-Net, enabling the encoding of detailed information from reference images. The extracted features are incorporated into the denoising U-Net by a shared self-attention mechanism, enhanced by mask information for precise part selection. Additionally, we integrate pose map conditioning to control the target posture of generated portraits, facilitating more flexible customization. Extensive experiments demonstrate the superiority of our approach over existing methods and applicability to related tasks like pose transfer and pose-guided human image generation, showcasing its versatile conditioning. Our project is available at <uri>https://huanngzh.github.io/Parts2Whole/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5241-5256"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144857249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
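The shared self-attention with mask-based part selection can be sketched as follows: keys and values from the encoded reference parts are concatenated with the denoising branch's own keys and values, and a binary mask suppresses reference tokens that do not belong to the requested parts. All tensor names and shapes below are assumptions for illustration, not the Parts2Whole release.

```python
# Shared self-attention over target tokens plus masked reference-part tokens (sketch).
import torch

def shared_self_attention(q, k_self, v_self, k_ref, v_ref, ref_mask, scale):
    """q, k_self, v_self: (B, N, d) tokens of the denoising branch;
    k_ref, v_ref: (B, M, d) tokens from the reference-part encoder;
    ref_mask: (B, M) float mask, 1.0 for tokens of the selected parts, 0.0 otherwise."""
    k = torch.cat([k_self, k_ref], dim=1)                 # (B, N+M, d)
    v = torch.cat([v_self, v_ref], dim=1)
    attn = (q @ k.transpose(-2, -1)) * scale              # (B, N, N+M)
    # target tokens are always visible; unwanted reference tokens are masked out
    full_mask = torch.cat([torch.ones_like(k_self[..., 0]), ref_mask], dim=1)  # (B, N+M)
    attn = attn.masked_fill(full_mask[:, None, :] == 0, float("-inf"))
    return attn.softmax(dim=-1) @ v                       # (B, N, d)
```

Concatenating reference keys/values into the existing self-attention is what lets a pre-trained U-Net attend to appearance details without an architectural overhaul.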
Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion
IF 13.7
Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Zhihui Wang
{"title":"Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion","authors":"Shenglun Chen;Xinzhu Ma;Hong Zhang;Haojie Li;Zhihui Wang","doi":"10.1109/TIP.2025.3597047","DOIUrl":"10.1109/TIP.2025.3597047","url":null,"abstract":"Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released in <uri>https://github.com/shenglunch/PSD</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5285-5299"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144857250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
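A very simple way to see how a monocular depth prior can carry sparse measurements into a dense map is to fit a global scale and shift so that the foundation model's relative depth agrees with the valid sparse points. The sketch below shows only that baseline idea; the paper's dual-space propagation and learnable correction module go well beyond it.

```python
# Scale-shift alignment of relative depth to sparse metric measurements (baseline sketch).
import numpy as np

def align_relative_depth(rel_depth, sparse_depth, valid_mask):
    """rel_depth: (H, W) relative depth from a foundation model;
    sparse_depth: (H, W) metric depth, valid only where valid_mask is True."""
    r = rel_depth[valid_mask]
    s = sparse_depth[valid_mask]
    A = np.stack([r, np.ones_like(r)], axis=1)           # solve s ≈ a * r + b
    (a, b), *_ = np.linalg.lstsq(A, s, rcond=None)
    dense = a * rel_depth + b
    dense[valid_mask] = sparse_depth[valid_mask]          # keep the exact measurements
    return dense
```

A single global affine fit ignores locally varying errors in the prior, which is precisely the gap that structure-aware propagation and correction are meant to close.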
DA3Attacker: A Diffusion-Based Attacker Against Aesthetics-Oriented Black-Box Models
IF 13.7
Shuai He;Shuntian Zheng;Anlong Ming;Yanni Wang;Huadong Ma
{"title":"DA3Attacker: A Diffusion-Based Attacker Against Aesthetics-Oriented Black-Box Models","authors":"Shuai He;Shuntian Zheng;Anlong Ming;Yanni Wang;Huadong Ma","doi":"10.1109/TIP.2025.3594068","DOIUrl":"10.1109/TIP.2025.3594068","url":null,"abstract":"The adage “Beautiful Outside But Ugly Inside” resonates with the security and explainability challenges encountered in image aesthetics assessment (IAA). Although deep neural networks (DNNs) have demonstrated remarkable performance in various IAA tasks, how to probe, explain, and enhance aesthetics-oriented “black-box” models has not yet been investigated to our knowledge. This lack of investigation has significantly impeded the commercial application of IAA. In this paper, we investigate the susceptibility of current IAA models to adversarial attacks and aim to elucidate the underlying mechanisms that contribute to their vulnerabilities. To address this, we propose a novel diffusion-based framework as an attacker (DA3Attacker), capable of generating adversarial examples (AEs) to deceive diverse black-box IAA models. DA3Attacker employs a dedicated Attack Diffusion Transformer, equipped with modular aesthetics-oriented filters. By undergoing two unsupervised training stages, it constructs a latent space to generate AEs and facilitates two distinct yet controllable attack modes: restricted and unrestricted. Extensive experiments on 26 baseline models demonstrate that our method effectively explores the vulnerabilities of these IAA models, while also providing multi-attribute explanations for their feature dependencies. To facilitate further research, we contribute the evaluation tools and four metrics for measuring adversarial robustness, as well as a dataset of 60,000 re-labeled AEs for fine-tuning IAA models. The resources are available here.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5300-5311"},"PeriodicalIF":13.7,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144802670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
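To make the "restricted attack" setting concrete, the sketch below shows a generic query-based black-box step: estimate the gradient of an aesthetic score by finite differences over random directions and take a bounded step that lowers the score. This is a textbook baseline given only for orientation; it is not DA3Attacker, which generates adversarial examples with a diffusion model.

```python
# Generic restricted (L-infinity bounded) black-box step against a scoring model (sketch).
import torch

@torch.no_grad()
def restricted_blackbox_step(score_fn, image, eps=8 / 255, sigma=1e-3, n_samples=16, lr=1 / 255):
    """score_fn: callable mapping a (1, 3, H, W) tensor in [0, 1] to a scalar score."""
    grad_est = torch.zeros_like(image)
    for _ in range(n_samples):
        u = torch.randn_like(image)
        # symmetric finite difference along a random direction
        delta = (score_fn(image + sigma * u) - score_fn(image - sigma * u)) / (2 * sigma)
        grad_est += delta * u
    grad_est /= n_samples
    adv = image - lr * grad_est.sign()                  # step that lowers the score
    adv = torch.clamp(adv, image - eps, image + eps)    # stay inside the epsilon-ball
    return torch.clamp(adv, 0.0, 1.0)
```

In an unrestricted setting the perturbation budget is dropped and the attacker instead edits image content, which is where generative models such as the one proposed above come in.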
Enhancing Text-Based Person Retrieval by Combining Fused Representation and Reciprocal Learning With Adaptive Loss Refinement
IF 13.7
Anh D. Nguyen;Hoa N. Nguyen
{"title":"Enhancing Text-Based Person Retrieval by Combining Fused Representation and Reciprocal Learning With Adaptive Loss Refinement","authors":"Anh D. Nguyen;Hoa N. Nguyen","doi":"10.1109/TIP.2025.3594880","DOIUrl":"10.1109/TIP.2025.3594880","url":null,"abstract":"Text-based person retrieval is defined as the challenging task of searching for people’s images based on given textual queries in natural language. Conventional methods primarily use deep neural networks to understand the relationship between visual and textual data, creating a shared feature space for cross-modal matching. The absence of awareness regarding variations in feature granularity between the two modalities, coupled with the diverse poses and viewing angles of images corresponding to the same individual, may lead to overlooking significant differences within each modality and across modalities, despite notable enhancements. Furthermore, the inconsistency in caption queries in large public datasets presents an additional obstacle to cross-modality mapping learning. Therefore, we introduce 3RTPR, a novel text-based person retrieval method that integrates a representation fusing mechanism and an adaptive loss refinement algorithm into a dual-encoder branch architecture. Moreover, we propose training two independent models simultaneously, which reciprocally support each other to enhance learning effectiveness. Consequently, our approach encompasses three significant contributions: (i) proposing a fused representation method to generate more discriminative representations for images and captions; (ii) introducing a novel algorithm to adjust loss and prioritize samples that contain valuable information; and (iii) proposing reciprocal learning involving a pair of independent models, which allows us to enhance general retrieval performance. In order to validate our method’s effectiveness, we also demonstrate superior performance over state-of-the-art methods by performing rigorous experiments on three well-known benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5147-5157"},"PeriodicalIF":13.7,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144796955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
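As a point of reference for the dual-encoder setup, the standard symmetric contrastive objective aligns matching image and caption embeddings while pushing apart mismatched pairs. The function below shows that common starting point only; 3RTPR's fused representations, adaptive loss refinement, and reciprocal learning are not reproduced here.

```python
# Symmetric InfoNCE-style loss for a dual-encoder retrieval model (common baseline).
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, d) embeddings of matching image/caption pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # image-to-text and text-to-image retrieval directions, averaged
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Adaptive refinements such as those in the paper typically re-weight the per-sample terms of a loss like this so that informative or noisy pairs are emphasized or down-weighted.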