{"title":"用于参考图像分割的释放分割模型","authors":"Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang","doi":"10.1109/TCSVT.2024.3524543","DOIUrl":null,"url":null,"abstract":"The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at <uri>https://github.com/lsa1997/ReferSAM</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4910-4922"},"PeriodicalIF":8.3000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation\",\"authors\":\"Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang\",\"doi\":\"10.1109/TCSVT.2024.3524543\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. 
Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at <uri>https://github.com/lsa1997/ReferSAM</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4910-4922\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10819432/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10819432/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation
The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at https://github.com/lsa1997/ReferSAM.
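The abstract describes the two components only at a high level. The PyTorch sketch below is a minimal illustration of how such a pair of modules could be wired: a cross-attention "interactor" that injects linguistic features into visual tokens through a residual branch (leaving the pre-trained encoder weights untouched), and a "prompter" that pools the aligned tokens into sparse and dense prompt embeddings for a SAM-style mask decoder. The class names, dimensions, fusion scheme, and number of sparse prompts are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

# Illustrative sketch only: cross-modal interactor + prompter in the spirit of
# ReferSAM's VLI/VLP. All names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class VisionLanguageInteractor(nn.Module):
    """Aligns visual tokens with linguistic features via cross-attention,
    added alongside the frozen image encoder rather than modifying it."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, text_feats):
        # vis_tokens: (B, N, dim) flattened visual tokens; text_feats: (B, L, text_dim)
        txt = self.text_proj(text_feats)
        fused, _ = self.cross_attn(query=vis_tokens, key=txt, value=txt)
        # Residual update keeps the pre-trained visual features intact.
        return self.norm(vis_tokens + fused)


class VisionLanguagePrompter(nn.Module):
    """Aggregates the aligned features into sparse and dense prompt embeddings
    of the shapes a SAM-style mask decoder expects."""

    def __init__(self, dim: int, num_sparse: int = 4, num_heads: int = 8):
        super().__init__()
        self.sparse_queries = nn.Parameter(torch.randn(num_sparse, dim))
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dense_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, vis_tokens, h, w):
        # Sparse prompts: learnable queries attend over the aligned tokens.
        q = self.sparse_queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        sparse, _ = self.query_attn(q, vis_tokens, vis_tokens)
        # Dense prompt: reshape tokens back into a feature map and project.
        dense = vis_tokens.transpose(1, 2).reshape(-1, vis_tokens.size(-1), h, w)
        return sparse, self.dense_proj(dense)


# Toy usage with random tensors standing in for encoder outputs.
vli = VisionLanguageInteractor(dim=256, text_dim=512)
vlp = VisionLanguagePrompter(dim=256)
vis = torch.randn(2, 64 * 64, 256)   # flattened 64x64 visual token map
txt = torch.randn(2, 20, 512)        # 20 word-level linguistic features
sparse, dense = vlp(vli(vis, txt), h=64, w=64)
print(sparse.shape, dense.shape)     # (2, 4, 256) and (2, 256, 64, 64)

In this reading, the sparse embeddings play the role of point/box prompt tokens and the dense embedding plays the role of the mask prompt, so the frozen mask decoder can be reused unchanged; how ReferSAM actually fuses multi-scale features is detailed in the paper and repository.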
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.