{"title":"通过减少模内重叠进行 CLIP 适应","authors":"Alexey Kravets, Vinay Namboodiri","doi":"arxiv-2409.11338","DOIUrl":null,"url":null,"abstract":"Numerous methods have been proposed to adapt a pre-trained foundational CLIP\nmodel for few-shot classification. As CLIP is trained on a large corpus, it\ngeneralises well through adaptation to few-shot classification. In this work,\nwe analyse the intra-modal overlap in image space in terms of embedding\nrepresentation. Our analysis shows that, due to contrastive learning,\nembeddings from CLIP model exhibit high cosine similarity distribution overlap\nin the image space between paired and unpaired examples affecting the\nperformance of few-shot training-free classification methods which rely on\nsimilarity in the image space for their predictions. To tackle intra-modal\noverlap we propose to train a lightweight adapter on a generic set of samples\nfrom the Google Open Images dataset demonstrating that this improves accuracy\nfor few-shot training-free classification. We validate our contribution through\nextensive empirical analysis and demonstrate that reducing the intra-modal\noverlap leads to a) improved performance on a number of standard datasets, b)\nincreased robustness to distribution shift and c) higher feature variance\nrendering the features more discriminative for downstream tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP Adaptation by Intra-modal Overlap Reduction\",\"authors\":\"Alexey Kravets, Vinay Namboodiri\",\"doi\":\"arxiv-2409.11338\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Numerous methods have been proposed to adapt a pre-trained foundational CLIP\\nmodel for few-shot classification. As CLIP is trained on a large corpus, it\\ngeneralises well through adaptation to few-shot classification. In this work,\\nwe analyse the intra-modal overlap in image space in terms of embedding\\nrepresentation. Our analysis shows that, due to contrastive learning,\\nembeddings from CLIP model exhibit high cosine similarity distribution overlap\\nin the image space between paired and unpaired examples affecting the\\nperformance of few-shot training-free classification methods which rely on\\nsimilarity in the image space for their predictions. To tackle intra-modal\\noverlap we propose to train a lightweight adapter on a generic set of samples\\nfrom the Google Open Images dataset demonstrating that this improves accuracy\\nfor few-shot training-free classification. 
We validate our contribution through\\nextensive empirical analysis and demonstrate that reducing the intra-modal\\noverlap leads to a) improved performance on a number of standard datasets, b)\\nincreased robustness to distribution shift and c) higher feature variance\\nrendering the features more discriminative for downstream tasks.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11338\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11338","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Numerous methods have been proposed to adapt a pre-trained foundational CLIP
model for few-shot classification. As CLIP is trained on a large corpus, it
generalises well when adapted to few-shot classification. In this work,
we analyse the intra-modal overlap in image space in terms of embedding
representation. Our analysis shows that, due to contrastive learning,
embeddings from the CLIP model exhibit a high overlap between the cosine
similarity distributions of paired and unpaired examples in the image space,
which affects the performance of few-shot training-free classification methods
that rely on image-space similarity for their predictions. To tackle intra-modal
overlap, we propose to train a lightweight adapter on a generic set of samples
from the Google Open Images dataset, demonstrating that this improves accuracy
for few-shot training-free classification. We validate our contribution through
extensive empirical analysis and demonstrate that reducing the intra-modal
overlap leads to a) improved performance on a number of standard datasets, b)
increased robustness to distribution shift, and c) higher feature variance,
rendering the features more discriminative for downstream tasks.
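
To make the image-space similarity mechanism concrete, the following is a minimal sketch, assuming pre-computed CLIP image embeddings: a training-free classifier that labels a query image by its cosine similarity to few-shot support embeddings, an optional lightweight adapter applied to the embeddings beforehand, and a simple diagnostic for intra-modal overlap (here taken as the gap between same-class and different-class similarity distributions). The adapter design (a small residual MLP), the helper names (LightweightAdapter, training_free_predict, intra_modal_overlap) and all hyper-parameters are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (illustrative, not the paper's exact method): training-free
# few-shot classification from pre-computed CLIP image embeddings, with an
# optional lightweight adapter applied before computing image-space similarity.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightAdapter(nn.Module):
    """Small residual MLP over CLIP image embeddings (assumed architecture)."""

    def __init__(self, dim: int = 512, hidden: int = 256, alpha: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))
        self.alpha = alpha  # residual mixing weight (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        out = self.alpha * self.mlp(x) + (1.0 - self.alpha) * x
        return F.normalize(out, dim=-1)


def training_free_predict(query: torch.Tensor,
                          support: torch.Tensor,
                          support_labels: torch.Tensor,
                          num_classes: int,
                          adapter: Optional[nn.Module] = None) -> torch.Tensor:
    """Label queries by aggregated cosine similarity to few-shot support embeddings."""
    if adapter is not None:
        query, support = adapter(query), adapter(support)
    query = F.normalize(query, dim=-1)
    support = F.normalize(support, dim=-1)
    sims = query @ support.t()                        # (Q, S) cosine similarities
    one_hot = F.one_hot(support_labels, num_classes).float()
    logits = sims @ one_hot                           # sum similarity per class
    return logits.argmax(dim=-1)


def intra_modal_overlap(emb: torch.Tensor, labels: torch.Tensor) -> Tuple[float, float]:
    """Mean cosine similarity of same-class vs. different-class image pairs
    (used here as a rough proxy for 'paired' vs. 'unpaired' examples)."""
    emb = F.normalize(emb, dim=-1)
    sims = emb @ emb.t()
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    return sims[same & ~eye].mean().item(), sims[~same].mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for pre-computed CLIP image embeddings (e.g. ViT-B/32, dim 512).
    support = torch.randn(4 * 16, 512)                # 4 classes x 16 shots
    labels = torch.arange(4).repeat_interleave(16)
    query = torch.randn(8, 512)
    preds = training_free_predict(query, support, labels, num_classes=4,
                                  adapter=LightweightAdapter(512))
    print(preds, intra_modal_overlap(support, labels))
```

In this setup, reducing intra-modal overlap corresponds to widening the gap reported by intra_modal_overlap once the adapter is applied, which is what would make image-space similarities more discriminative for the training-free classifier.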