{"title":"Recursive Confidence Training for Pseudo-Labeling Calibration in Semi-Supervised Few-Shot Learning","authors":"Kunlei Jing;Hebo Ma;Chen Zhang;Lei Wen;Zhaorui Zhang","doi":"10.1109/TIP.2025.3569196","DOIUrl":"10.1109/TIP.2025.3569196","url":null,"abstract":"Semi-Supervised Few-Shot Learning (SSFSL) aims to address the data scarcity in few-shot learning by leveraging both a few labeled support data and abundant unlabeled data. In SSFSL, a classifier trained on scarce support data is often biased and thus assigns inaccurate pseudo-labels to the unlabeled data, which will mislead downstream learning tasks. To combat this issue, we introduce a novel method called Certainty-Aware Recursive Confidence Training (CARCT). CARCT hinges on the insight that selecting pseudo-labeled data based on confidence levels can yield more informative support data, which is crucial for retraining an unbiased classifier to achieve accurate pseudo-labeling—a process we term pseudo-labeling calibration. We observe that accurate pseudo-labels typically exhibit smaller certainty entropy, indicating high-confidence pseudo-labeling compared to those of inaccurate pseudo-labels. Accordingly, CARCT constructs a joint double-Gaussian model to fit the certainty entropies collected across numerous SSFSL tasks. Thereby, A semi-supervised Prior Confidence Distribution (ssPCD) is learned to aid in distinguishing between high-confidence and low-confidence pseudo-labels. During an SSFSL task, ssPCD guides the selection of both high-confidence and low-confidence pseudo-labeled data to retrain the classifier that then assigns more accurate pseudo-labels to the low-confidence pseudo-labeled data. Such recursive confidence training continues until the low-confidence ones are exhausted, terminating the pseudo-labeling calibration. The unlabeled data all receive accurate pseudo-labels to expand the few support data to generalize the downstream learning task, which in return meta-refines the classifier, named self-training, to boost the pseudo-labeling in subsequent tasks. Extensive experiments on basic and extended SSFSL setups showcase the superiority of CARCT versus state-of-the-art methods, and comprehensive ablation studies and visualizations justify our insight. The source code is available at <uri>https://github.com/Klein-JING/CARCT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3194-3208"},"PeriodicalIF":0.0,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144067128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Source Frequency Transform for Cross-Scene Classification of Hyperspectral Image","authors":"Xizeng Huang;Yanni Dong;Yuxiang Zhang;Bo Du","doi":"10.1109/TIP.2025.3568749","DOIUrl":"10.1109/TIP.2025.3568749","url":null,"abstract":"Currently, the research on cross-scene classification of hyperspectral image (HSI) based on domain generalization (DG) has received wider attention. The majority of the existing methods achieve cross-scene classification of HSI via data manipulation that generates more feature-rich samples. The insufficient mining of complex features of HSIs in these methods leads to limiting the effectiveness of the newly generated HSI samples. Therefore, in this paper, we propose a novel single-source frequency transform (SFT), which realizes domain generalization by transforming the frequency features of samples, mainly including frequency transform (FT) and balanced attentional consistency (BAC). Firstly, FT is designed to learn dynamic attention maps in the frequency space of samples filtering frequency components to improve the diversity of features in new samples. Moreover, BAC is designed based on the class activation map to improve the reliability of newly generated samples. Comprehensive experiments on three public HSI datasets demonstrate that the proposed method outperforms the state-of-the-art method, with accuracy at most 5.14% higher than the second place.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3000-3012"},"PeriodicalIF":0.0,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144065843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CmdVIT: A Voluntary Facial Expression Recognition Model for Complex Mental Disorders","authors":"Jiayu Ye;Yanhong Yu;Qingxiang Wang;Guolong Liu;Wentao Li;An Zeng;Yiqun Zhang;Yang Liu;Yunshao Zheng","doi":"10.1109/TIP.2025.3567825","DOIUrl":"10.1109/TIP.2025.3567825","url":null,"abstract":"Facial Expression Recognition (FER) is a critical method for evaluating the emotional states of patients with mental disorders, playing a significant role in treatment monitoring. However, due to privacy constraints, facial expression data from patients with mental disorders is severely limited. Additionally, the more complex inter-class and intra-class similarities compared to healthy individuals make accurate recognition of facial expressions challenging. Therefore, we propose a Voluntary Facial Expression Mimicry (VFEM) experiment, which collected facial expression data from schizophrenia, depression, and anxiety. This experiment establishes the first dataset designed for facial expression recognition tasks exclusively composed of patients with mental disorders. Simultaneously, based on VFEM, we propose a Vision Transformer FER model tailored for Complex mental disorder patients (CmdVIT). CmdVIT integrates crucial facial expression features through both explicit and implicit mechanisms, including explicit visual center positional encoding and implicit sparse attention center loss function. These two key components enhance positional information and minimize the facial feature space distance between conventional attention and critical attention, effectively suppressing inter-class and intra-class similarities. In various FER tasks for different mental disorders in VFEM, CmdVIT achieves more competitive performance compared to contemporary benchmark models. Our works are available at <uri>https://github.com/yjy-97/CmdVIT</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3013-3024"},"PeriodicalIF":0.0,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143979611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-to-Instance Prompt Learning for Vision-Language Models at Test Time","authors":"Zhihe Lu;Jiawang Bai;Xin Li;Zeyu Xiao;Xinchao Wang","doi":"10.1109/TIP.2025.3546840","DOIUrl":"10.1109/TIP.2025.3546840","url":null,"abstract":"Prompt learning has been recently introduced into the adaption of pre-trained vision-language models (VLMs) by tuning a set of trainable tokens to replace hand-crafted text templates. Despite the encouraging results achieved, existing methods largely rely on extra annotated data for training. In this paper, we investigate a more realistic scenario, where only the unlabeled test data is available. Existing test-time prompt learning methods often separately learn a prompt for each test sample. However, relying solely on a single sample heavily limits the performance of the learned prompts, as it neglects the task-level knowledge that can be gained from multiple samples. To that end, we propose a novel test-time prompt learning method of VLMs, called Task-to-Instance PromPt LEarning (TIPPLE), which adopts a two-stage training strategy to leverage both task- and instance-level knowledge. Specifically, we reformulate the effective online pseudo-labeling paradigm along with two tailored components: an auxiliary text classification task and a diversity regularization term, to serve the task-oriented prompt learning. After that, the learned task-level prompt is further combined with a tunable residual for each test sample to integrate with instance-level knowledge. We demonstrate the superior performance of TIPPLE on 15 downstream datasets, e.g., the average improvement of 1.87% over the state-of-the-art method, using ViT-B/16 visual backbone. Our code is open-sourced at <uri>https://github.com/zhiheLu/TIPPLE</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1908-1920"},"PeriodicalIF":0.0,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143631135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Stage Statistical Texture-Guided GAN for Tilted Face Frontalization","authors":"Kangli Zeng;Zhongyuan Wang;Tao Lu;Jianyu Chen;Chao Liang;Zhen Han","doi":"10.1109/TIP.2025.3548896","DOIUrl":"10.1109/TIP.2025.3548896","url":null,"abstract":"Existing pose-invariant face recognition mainly focuses on frontal or profile, whereas high-pitch angle face recognition, prevalent under surveillance videos, has yet to be investigated. More importantly, tilted faces significantly differ from frontal or profile faces in the potential feature space due to self-occlusion, thus seriously affecting key feature extraction for face recognition. In this paper, we asymptotically reshape challenging high-pitch angle faces into a series of small-angle approximate frontal faces and exploit a statistical approach to learn texture features to ensure accurate facial component generation. In particular, we design a statistical texture-guided GAN for tilted face frontalization (STG-GAN) consisting of three main components. First, the face encoder extracts shallow features, followed by the face statistical texture modeling module that learns multi-scale face texture features based on the statistical distributions of the shallow features. Then, the face decoder performs feature deformation guided by the face statistical texture features while highlighting the pose-invariant face discriminative information. With the addition of multi-scale content loss, identity loss and adversarial loss, we further develop a pose contrastive loss of potential spatial features to constrain pose consistency and make its face frontalization process more reliable. On this basis, we propose a divide-and-conquer strategy, using STG-GAN to progressively synthesize faces with small pitch angles in multiple stages to achieve frontalization gradually. A unified end-to-end training across multiple stages facilitates the generation of numerous intermediate results to achieve a reasonable approximation of the ground truth. Extensive qualitative and quantitative experiments on multiple-face datasets demonstrate the superiority of our approach.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1726-1736"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximately Invertible Neural Network for Learned Image Compression","authors":"Yanbo Gao;Shuai Li;Meng Fu;Chong Lv;Zhiyuan Yang;Xun Cai;Hui Yuan;Mao Ye","doi":"10.1109/TIP.2025.3567830","DOIUrl":"10.1109/TIP.2025.3567830","url":null,"abstract":"Learned image compression has attracted considerable interests in recent years. An analysis transform and a synthesis transform, which can be regarded as coupled transforms, are used to encode an image to latent feature and decode the feature after quantization to reconstruct the image. Inspired by the success of invertible neural networks in generative modeling, invertible modules can be used to construct the coupled analysis and synthesis transforms. Considering the noise introduced in the feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization in lossy image compression when using INN with quantization, which differentiates from using INN for generative modelling. Generally speaking, A-INN can be used as the theoretical foundation for any INN based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce the quantization noise in the decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones to further reduce the noise in feature channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) is developed by explicitly enhancing the high-frequency components in an image to address the loss of high-frequency information inherent in neural network based image compression, thereby enhancing the reconstructed image quality. Extensive experiments demonstrate that the proposed A-INN framework achieves better or comparable compression efficiency than the conventional image compression approach and state-of-the-art learned image compression methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3041-3055"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments","authors":"Chubin Zhang;Juncheng Yan;Yi Wei;Jiaxin Li;Li Liu;Yansong Tang;Yueqi Duan;Jiwen Lu","doi":"10.1109/TIP.2025.3567828","DOIUrl":"10.1109/TIP.2025.3567828","url":null,"abstract":"Occupancy prediction reconstructs 3D structures of surrounding environments. It provides detailed information for autonomous driving planning and navigation. However, most existing methods heavily rely on the LiDAR point clouds to generate occupancy ground truth, which is not available in the vision-based system. In this paper, we propose an OccNeRF method for training occupancy networks without 3D ground truth. Different from previous works which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras’ infinite perceptive range. The neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments for both self-supervised depth estimation and 3D occupancy prediction tasks on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method. The code is available at <uri>https://github.com/LinShan-Bin/OccNeRF</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3096-3107"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization","authors":"Qi Bi;Wei Ji;Jingjun Yi;Haolan Zhan;Gui-Song Xia","doi":"10.1109/TIP.2025.3567834","DOIUrl":"10.1109/TIP.2025.3567834","url":null,"abstract":"High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent investigations find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-trained class-agnostic representation is built from every patch-wise embedding, while fine-grained categories are only determined by several key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, the multi-instance knowledge distillation is implemented on both the region/image crop pairs from the teacher and student net, and the region-image crops inside the teacher / student net, which we term as intra-level multi-instance distillation and inter-level multi-instance distillation. Extensive experiments on several commonly used datasets, including CUB-200-2011, Stanford Cars and FGVC Aircraft, demonstrate that the proposed method outperforms the contemporary methods by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% on both top-1 accuracy and Rank-1 retrieval metric. Source code is available at <uri>https://github.com/BiQiWHU/CMD</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2954-2969"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPAC: Sampling-Based Progressive Attribute Compression for Dense Point Clouds","authors":"Xiaolong Mao;Hui Yuan;Tian Guo;Shiqi Jiang;Raouf Hamzaoui;Sam Kwong","doi":"10.1109/TIP.2025.3565214","DOIUrl":"10.1109/TIP.2025.3565214","url":null,"abstract":"We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the best anchor of the latest geometry-based point cloud compression (G-PCC) standard that was proposed by the Moving Picture Experts Group (MPEG), the proposed method can achieve an average Bjøntegaard delta bitrate of -24.58% for the Y component (resp. -21.23% for YUV components) on the MPEG Category Solid dataset and -22.48% for the Y component (resp. -17.19% for YUV components) on the MPEG Category Dense dataset. This is the first instance that a learning-based attribute codec outperforms the G-PCC standard on these datasets by following the common test conditions specified by MPEG. Our source code will be made publicly available on <uri>https://github.com/sduxlmao/SPAC</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2939-2953"},"PeriodicalIF":0.0,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143939701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Face Video Coding: A Generative Compression Framework","authors":"Bolin Chen;Zhao Wang;Binzhe Li;Shurun Wang;Shiqi Wang;Yan Ye","doi":"10.1109/TIP.2025.3563762","DOIUrl":"10.1109/TIP.2025.3563762","url":null,"abstract":"In this paper, we propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals. The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression/headpose animation. In particular, we propose the Internal Dimension Increase (IDI) based representation, greatly enhancing the fidelity and flexibility in rendering the appearance while maintaining reasonable representation cost. By leveraging strong statistical regularities, the visual signals can be effectively projected into controllable semantics in the three dimensional space (e.g., mouth motion, eye blinking, head rotation, head translation and head location), which are compressed and transmitted. The editable bitstream, which naturally supports the interactivity at the semantic level, can synthesize the face frames via the strong inference ability of the deep generative model. Experimental results have demonstrated the performance superiority and application prospects of our proposed IFVC scheme. In particular, the proposed scheme not only outperforms the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes in terms of rate-distortion performance for face videos, but also enables the interactive coding without introducing additional manipulation processes. Furthermore, the proposed framework is expected to shed lights on the future design of the digital human communication in the metaverse. The project page can be found at <uri>https://github.com/Berlin0610/Interactive_Face_Video_Coding</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2910-2925"},"PeriodicalIF":0.0,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143939757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}