Beyond mask: Rethinking guidance types in few-shot segmentation

IF 7.5 | Zone 1, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition Pub Date : 2025-04-07 DOI:10.1016/j.patcog.2025.111635

Shijie Chang, Youwei Pang, Xiaoqi Zhao, Huchuan Lu, Lihe Zhang

Pattern Recognition, Volume 165, Article 111635. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S003132032500295X

Citations: 0

Abstract

Existing few-shot segmentation (FSS) methods mainly focus on prototype feature generation and the query-support matching mechanism. As a crucial prompt for generating prototype features, the image-mask pair in the support set has become the default setting. However, guidance types such as image, text, box, and mask can all provide valuable information about the objects in terms of context, class, localization, and shape appearance. Existing work focuses on specific combinations of guidance, which has split FSS into separate research branches. Rethinking guidance types in FSS is expected to encourage exploration of efficient joint representations of the coupling between the support set and the query set, and to open research directions in weakly or strongly annotated guidance that meet the customized requirements of practical users. In this work, we generalize FSS with seven guidance paradigms and develop a universal vision–language framework (UniFSS) that integrates prompts from text, mask, box, and image. Leveraging the strengths of large-scale pre-trained vision–language models in textual and visual embeddings, UniFSS proposes high-level spatial correction and embedding interactive units to overcome the semantic ambiguity that pure visual matching methods typically encounter when facing intra-class appearance diversity. Extensive experiments show that UniFSS significantly outperforms state-of-the-art methods. Notably, the weakly annotated class-aware box paradigm even surpasses the finely annotated mask paradigm.
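The prototype generation and query-support matching that the abstract describes as the conventional FSS recipe can be illustrated with a minimal sketch. This is not UniFSS and is not taken from the paper: it assumes the common masked-average-pooling prototype and a cosine-similarity threshold, and all function names (`box_to_mask`, `masked_average_pooling`, `match_query`) and the threshold value are hypothetical. It also shows how a box, a weaker annotation than a mask, could be rasterized and used as guidance in the same pipeline.

```python
import math

def box_to_mask(box, height, width):
    """Rasterize a box (top, left, bottom, right), end-exclusive, into an H x W binary mask."""
    top, left, bottom, right = box
    return [[1.0 if top <= y < bottom and left <= x < right else 0.0
             for x in range(width)]
            for y in range(height)]

def masked_average_pooling(features, mask):
    """Average the feature vectors at positions selected by the mask.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1 weights.
    Returns a single C-dimensional prototype vector.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    weight = 0.0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            weight += m
            for c in range(channels):
                total[c] += m * vec[c]
    if weight == 0:
        raise ValueError("mask selects no positions")
    return [t / weight for t in total]

def cosine(a, b):
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def match_query(query_features, prototype, threshold=0.5):
    """Label each query position 1 if its feature is similar enough to the prototype."""
    return [[1 if cosine(vec, prototype) > threshold else 0 for vec in row]
            for row in query_features]

# Toy 2 x 2 feature maps with 2 channels (illustrative values only).
support = [[[1, 0], [1, 0]],
           [[0, 1], [0, 1]]]
query = [[[0.9, 0.1], [0.1, 0.9]],
         [[1, 0], [0, 1]]]

mask_guidance = [[1, 1], [0, 0]]               # fine annotation: object is the top row
box_guidance = box_to_mask((0, 0, 1, 2), 2, 2)  # weak annotation: box around the top row

for name, guidance in [("mask", mask_guidance), ("box", box_guidance)]:
    prototype = masked_average_pooling(support, guidance)
    print(name, match_query(query, prototype))
```

In this toy case the box covers exactly the masked region, so both guidance types yield the same prototype. On real images a box also includes background pixels, which is one reason purely visual matching with weak guidance is usually harder.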


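Source journal

Pattern Recognition (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months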

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.