PRFormer: Matching Proposal and Reference Masks by Semantic and Spatial Similarity for Few-Shot Semantic Segmentation

Authors: Guangyu Gao; Anqi Zhang; Jianbo Jiao; Chi Harold Liu; Yunchao Wei
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. 8161-8173
Published: 2025-03-13
DOI: 10.1109/TCSVT.2025.3550879
URL: https://ieeexplore.ieee.org/document/10925417/
Citations: 0
Abstract
Few-shot Semantic Segmentation (FSS) aims to accurately segment query images with guidance from only a few annotated support images. Previous methods typically rely on pixel-level feature correlations, operating in a many-to-many (pixels-to-pixels) or few-to-many (prototype-to-pixels) manner. The recent mask-proposal classification pipeline in semantic segmentation enables a more efficient few-to-few (prototype-to-prototype) correlation between query proposal masks and support reference masks. However, these methods still involve intermediate pixel-level feature correlation, resulting in lower efficiency. In this paper, we introduce the Proposal and Reference masks matching transFormer (PRFormer), designed to rigorously address mask matching in both spatial and semantic aspects in a thoroughly few-to-few manner. Following the mask-classification paradigm, PRFormer starts with a class-agnostic proposal generator that partitions the query image into proposal masks. It then compares the features of query proposal masks and support reference masks using two strategies: semantic matching based on feature similarity across prototypes, and spatial matching through the mask intersection ratio. These strategies are implemented as the Prototype Contrastive Correlation (PrCC) and Prior-Proposals Intersection (PPI) modules, respectively; together they enhance matching precision and efficiency while eliminating dependence on pixel-level feature correlations. Additionally, we propose the category discrimination NCE (cdNCE) loss and the IoU-KLD loss to constrain the adapted prototypes and to align each similarity vector with the corresponding IoU between proposals and ground truth. Given that class-agnostic proposals tend to be more accurate for training classes than for novel classes in FSS, we introduce Weighted Proposal Refinement (WPR) to refine the most confident masks with detailed features, yielding more precise predictions. Experiments on the popular PASCAL-5^i and COCO-20^i benchmarks show that our few-to-few approach, PRFormer, outperforms previous methods, achieving mIoU scores of 70.4% and 49.4%, respectively, for 1-shot segmentation. Code is available at https://github.com/ANDYZAQ/PRFormer.
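To make the two matching strategies concrete, here is a minimal PyTorch sketch of the few-to-few idea: prototypes are pooled from query proposal masks and compared to the support reference prototype by cosine similarity (the semantic cue, in the spirit of PrCC), while a mask intersection ratio against a prior mask supplies the spatial cue (in the spirit of PPI). All function names and the mixing weight `alpha` are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_avg_pool(feats, masks):
    """Pool a feature map into one prototype per mask.

    feats: (C, H, W) query features; masks: (N, H, W) binary proposal masks.
    Returns (N, C) prototypes via masked average pooling.
    """
    m = masks.float()
    num = torch.einsum('chw,nhw->nc', feats, m)           # feature sum inside each mask
    den = m.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(1)   # mask areas
    return num / den

def semantic_match(proposal_protos, reference_proto):
    """Prototype-to-prototype cosine similarity: the few-to-few semantic cue."""
    return F.cosine_similarity(proposal_protos, reference_proto.unsqueeze(0), dim=1)

def spatial_match(proposal_masks, prior_mask):
    """Intersection ratio: fraction of each proposal covered by the prior mask."""
    p = proposal_masks.float()
    inter = (p * prior_mask.float()).sum(dim=(1, 2))
    area = p.sum(dim=(1, 2)).clamp(min=1.0)
    return inter / area

def score_proposals(query_feats, proposal_masks, reference_proto, prior_mask, alpha=0.5):
    """Blend semantic and spatial cues into one score per proposal (alpha is assumed)."""
    protos = masked_avg_pool(query_feats, proposal_masks)
    return alpha * semantic_match(protos, reference_proto) \
         + (1 - alpha) * spatial_match(proposal_masks, prior_mask)

# Usage: N=100 proposals over a 64x64 feature map with C=256 channels.
feats = torch.randn(256, 64, 64)
props = torch.rand(100, 64, 64) > 0.5
ref = torch.randn(256)
prior = torch.rand(64, 64) > 0.5
print(score_proposals(feats, props, ref, prior).shape)  # torch.Size([100])
```

Note that no pixel-to-pixel correlation tensor is ever built: each proposal is reduced to one prototype and one intersection ratio, which is the efficiency argument the abstract makes.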
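Similarly, a hedged sketch of the two training objectives: an InfoNCE-style contrastive term standing in for the cdNCE loss, and a KL-divergence term aligning the proposal similarity vector with ground-truth IoUs, standing in for the IoU-KLD loss. The temperatures and exact formulations here are assumptions; the paper's definitions may differ.

```python
import torch
import torch.nn.functional as F

def cd_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE over prototypes: pull the adapted prototype (anchor) toward its
    class prototype (positive) and away from other classes (negatives).

    anchor, positive: (C,); negatives: (K, C); tau is an assumed temperature.
    """
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau                 # scalar
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau   # (K,)
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)                 # (1, K+1)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))         # positive at index 0

def iou_kld_loss(sim_scores, ious, tau=0.5):
    """KL divergence pushing the similarity vector to rank proposals like their
    ground-truth IoUs. sim_scores, ious: (N,) over the query proposals."""
    log_pred = F.log_softmax(sim_scores / tau, dim=0)
    target = F.softmax(ious / tau, dim=0)
    return F.kl_div(log_pred, target, reduction='sum')
```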
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.