PRFormer: Matching Proposal and Reference Masks by Semantic and Spatial Similarity for Few-Shot Semantic Segmentation

Authors: Guangyu Gao; Anqi Zhang; Jianbo Jiao; Chi Harold Liu; Yunchao Wei
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. 8161-8173
Published: 2025-03-13
DOI: 10.1109/TCSVT.2025.3550879
URL: https://ieeexplore.ieee.org/document/10925417/
Citations: 0
Abstract
Few-shot Semantic Segmentation (FSS) aims to accurately segment query images with guidance from only a few annotated support images. Previous methods typically rely on pixel-level feature correlations, operating in a many-to-many (pixels-to-pixels) or few-to-many (prototype-to-pixels) manner. The recent mask-proposal classification pipeline in semantic segmentation enables a more efficient few-to-few (prototype-to-prototype) correlation between query proposal masks and support reference masks. However, these methods still involve intermediate pixel-level feature correlation, resulting in lower efficiency. In this paper, we introduce the Proposal and Reference masks matching transFormer (PRFormer), designed to rigorously address mask matching in both spatial and semantic aspects in a thoroughly few-to-few manner. Following the mask-classification paradigm, PRFormer starts with a class-agnostic proposal generator that partitions the query image into proposal masks. It then compares the features of query proposal masks and support reference masks using two strategies: semantic matching based on feature similarity across prototypes, and spatial matching through the mask intersection ratio. These strategies are implemented as the Prototype Contrastive Correlation (PrCC) and Prior-Proposals Intersection (PPI) modules, respectively; together they enhance matching precision and efficiency while eliminating dependence on pixel-level feature correlations. Additionally, we propose the category discrimination NCE (cdNCE) loss and the IoU-KLD loss to constrain the adapted prototypes and to align each similarity vector with the corresponding IoU between proposals and ground truth. Given that class-agnostic proposals tend to be more accurate for training classes than for novel classes in FSS, we introduce Weighted Proposal Refinement (WPR) to refine the most confident masks with detailed features, yielding more precise predictions. Experiments on the popular PASCAL-5^i and COCO-20^i benchmarks show that our few-to-few approach, PRFormer, outperforms previous methods, achieving mIoU scores of 70.4% and 49.4%, respectively, for 1-shot segmentation. Code is available at https://github.com/ANDYZAQ/PRFormer.
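To make the two matching strategies concrete, here is a minimal PyTorch sketch of the few-to-few idea: prototypes are pooled from query proposal masks and compared to the support reference prototype by cosine similarity (the semantic cue, in the spirit of PrCC), while a mask intersection ratio against a prior mask supplies the spatial cue (in the spirit of PPI). All function names and the mixing weight `alpha` are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_avg_pool(feats, masks):
    """Pool a feature map into one prototype per mask.

    feats: (C, H, W) query features; masks: (N, H, W) binary proposal masks.
    Returns (N, C) prototypes via masked average pooling.
    """
    m = masks.float()
    num = torch.einsum('chw,nhw->nc', feats, m)           # feature sum inside each mask
    den = m.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(1)   # mask areas
    return num / den

def semantic_match(proposal_protos, reference_proto):
    """Prototype-to-prototype cosine similarity: the few-to-few semantic cue."""
    return F.cosine_similarity(proposal_protos, reference_proto.unsqueeze(0), dim=1)

def spatial_match(proposal_masks, prior_mask):
    """Intersection ratio: fraction of each proposal covered by the prior mask."""
    p = proposal_masks.float()
    inter = (p * prior_mask.float()).sum(dim=(1, 2))
    area = p.sum(dim=(1, 2)).clamp(min=1.0)
    return inter / area

def score_proposals(query_feats, proposal_masks, reference_proto, prior_mask, alpha=0.5):
    """Blend semantic and spatial cues into one score per proposal (alpha is assumed)."""
    protos = masked_avg_pool(query_feats, proposal_masks)
    return alpha * semantic_match(protos, reference_proto) \
         + (1 - alpha) * spatial_match(proposal_masks, prior_mask)

# Usage: N=100 proposals over a 64x64 feature map with C=256 channels.
feats = torch.randn(256, 64, 64)
props = torch.rand(100, 64, 64) > 0.5
ref = torch.randn(256)
prior = torch.rand(64, 64) > 0.5
print(score_proposals(feats, props, ref, prior).shape)  # torch.Size([100])
```

Note that no pixel-to-pixel correlation tensor is ever built: each proposal is reduced to one prototype and one intersection ratio, which is the efficiency argument the abstract makes.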
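Similarly, a hedged sketch of the two training objectives: an InfoNCE-style contrastive term standing in for the cdNCE loss, and a KL-divergence term aligning the proposal similarity vector with ground-truth IoUs, standing in for the IoU-KLD loss. The temperatures and exact formulations here are assumptions; the paper's definitions may differ.

```python
import torch
import torch.nn.functional as F

def cd_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE over prototypes: pull the adapted prototype (anchor) toward its
    class prototype (positive) and away from other classes (negatives).

    anchor, positive: (C,); negatives: (K, C); tau is an assumed temperature.
    """
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau                 # scalar
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau   # (K,)
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)                 # (1, K+1)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))         # positive at index 0

def iou_kld_loss(sim_scores, ious, tau=0.5):
    """KL divergence pushing the similarity vector to rank proposals like their
    ground-truth IoUs. sim_scores, ious: (N,) over the query proposals."""
    log_pred = F.log_softmax(sim_scores / tau, dim=0)
    target = F.softmax(ious / tau, dim=0)
    return F.kl_div(log_pred, target, reduction='sum')
```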
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.