{"title":"用于参考图像分割的释放分割模型","authors":"Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang","doi":"10.1109/TCSVT.2024.3524543","DOIUrl":null,"url":null,"abstract":"The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at <uri>https://github.com/lsa1997/ReferSAM</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4910-4922"},"PeriodicalIF":8.3000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation\",\"authors\":\"Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang\",\"doi\":\"10.1109/TCSVT.2024.3524543\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. 
Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at <uri>https://github.com/lsa1997/ReferSAM</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4910-4922\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819432/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819432/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation
The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at https://github.com/lsa1997/ReferSAM.
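The abstract describes the two components only at a high level. The PyTorch sketch below is a minimal illustration of how such a pair of modules could be wired: a cross-attention "interactor" that injects linguistic features into visual tokens through a residual branch (leaving the pre-trained encoder weights untouched), and a "prompter" that pools the aligned tokens into sparse and dense prompt embeddings for a SAM-style mask decoder. The class names, dimensions, fusion scheme, and number of sparse prompts are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

# Illustrative sketch only: cross-modal interactor + prompter in the spirit of
# ReferSAM's VLI/VLP. All names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class VisionLanguageInteractor(nn.Module):
    """Aligns visual tokens with linguistic features via cross-attention,
    added alongside the frozen image encoder rather than modifying it."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, text_feats):
        # vis_tokens: (B, N, dim) flattened visual tokens; text_feats: (B, L, text_dim)
        txt = self.text_proj(text_feats)
        fused, _ = self.cross_attn(query=vis_tokens, key=txt, value=txt)
        # Residual update keeps the pre-trained visual features intact.
        return self.norm(vis_tokens + fused)


class VisionLanguagePrompter(nn.Module):
    """Aggregates the aligned features into sparse and dense prompt embeddings
    of the shapes a SAM-style mask decoder expects."""

    def __init__(self, dim: int, num_sparse: int = 4, num_heads: int = 8):
        super().__init__()
        self.sparse_queries = nn.Parameter(torch.randn(num_sparse, dim))
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dense_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, vis_tokens, h, w):
        # Sparse prompts: learnable queries attend over the aligned tokens.
        q = self.sparse_queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        sparse, _ = self.query_attn(q, vis_tokens, vis_tokens)
        # Dense prompt: reshape tokens back into a feature map and project.
        dense = vis_tokens.transpose(1, 2).reshape(-1, vis_tokens.size(-1), h, w)
        return sparse, self.dense_proj(dense)


# Toy usage with random tensors standing in for encoder outputs.
vli = VisionLanguageInteractor(dim=256, text_dim=512)
vlp = VisionLanguagePrompter(dim=256)
vis = torch.randn(2, 64 * 64, 256)   # flattened 64x64 visual token map
txt = torch.randn(2, 20, 512)        # 20 word-level linguistic features
sparse, dense = vlp(vli(vis, txt), h=64, w=64)
print(sparse.shape, dense.shape)     # (2, 4, 256) and (2, 256, 64, 64)

In this reading, the sparse embeddings play the role of point/box prompt tokens and the dense embedding plays the role of the mask prompt, so the frozen mask decoder can be reused unchanged; how ReferSAM actually fuses multi-scale features is detailed in the paper and repository.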
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.