Title: ViT-CAPS: Vision transformer with contrastive adaptive prompt segmentation
Authors: Khawaja Iftekhar Rashid, Chenhui Yang
DOI: 10.1016/j.neucom.2025.129578
Journal: Neurocomputing, Volume 625, Article 129578 (Q1, Computer Science, Artificial Intelligence; impact factor 5.5)
Publication date: 2025-01-30
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0925231225002504
Citation count: 0
Abstract
Real-time segmentation plays an important role in numerous applications, including autonomous driving and medical imaging, where accurate and instantaneous segmentation influences essential decisions. Previous approaches suffer from a lack of cross-domain transferability and a need for large amounts of labeled data, which prevents them from being applied successfully to real-world scenarios. This study presents a new model, ViT-CAPS, that utilizes Vision Transformers in the encoder to improve segmentation performance in challenging and large-scale scenes. We employ the Adaptive Context Embedding (ACE) module, which incorporates contrastive learning to improve domain adaptation by matching features from support and query images. In addition, the Meta Prompt Generator (MPG) is designed to generate prompts from the aligned features, enabling segmentation in complicated environments without requiring much human input. ViT-CAPS has shown promising results in resolving domain-shift problems and improving few-shot segmentation in dynamic low-annotation settings. We conducted extensive experiments on four well-known datasets, FSS-1000, Cityscapes, ISIC, and DeepGlobe, and achieved noteworthy performance. Compared to previous approaches, we achieved performance gains of 4.6 % on FSS-1000, 4.2 % on DeepGlobe, and 6.1 % on Cityscapes, with a slight drop of −3 % on the ISIC dataset. We achieved average mean IoU scores of 60.52 and 69.3, which are 2.7 % and 5.1 % higher than state-of-the-art Cross-Domain Few-Shot Segmentation (CD-FSS) models in the 1-shot and 5-shot settings, respectively.
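The abstract does not spell out the ACE module's contrastive objective. As an illustration only, the sketch below shows a generic InfoNCE-style contrastive loss that pulls each query feature toward its corresponding support feature and pushes it away from the others — the function name, shapes, and temperature value are all hypothetical, not taken from the paper.

```python
import numpy as np

def info_nce_loss(support, query, temperature=0.07):
    """Generic InfoNCE-style contrastive loss between two feature sets.

    Row i of `query` is treated as the positive match for row i of
    `support`; every other support row is a negative. Lower loss means
    the two feature sets are better aligned.
    """
    # L2-normalize so dot products become cosine similarities
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    logits = q @ s.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
# Near-identical support/query features -> low loss
aligned = info_nce_loss(base, base + 0.01 * rng.normal(size=(4, 16)))
# Unrelated query features -> higher loss
shuffled = info_nce_loss(base, rng.normal(size=(4, 16)))
print(aligned, shuffled)
```

Minimizing such a loss drives matched support/query features together in embedding space, which is the general mechanism the abstract attributes to ACE for domain adaptation.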
Journal description:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.