Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
arXiv - CS - Multimedia, 2024-09-12
DOI: arxiv-2409.08381 (https://doi.org/arxiv-2409.08381)
Abstract
Vision-language models (VLMs) like CLIP have been adapted for Multi-Label
Recognition (MLR) with partial annotations by leveraging prompt-learning, where
positive and negative prompts are learned for each class to associate their
embeddings with class presence or absence in the shared vision-text feature
space. While this approach improves MLR performance by relying on VLM priors,
we hypothesize that learning negative prompts may be suboptimal, as the
datasets used to train VLMs lack image-caption pairs explicitly focusing on
class absence. To analyze the impact of positive and negative prompt learning
on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is
learned with VLM guidance while the other is replaced by an embedding vector
learned directly in the shared feature space without relying on the text
encoder. Through empirical analysis, we observe that negative prompts degrade
MLR performance, and learning only positive prompts, combined with learned
negative embeddings (PositiveCoOp), outperforms dual prompt learning
approaches. Moreover, we quantify the performance benefits that prompt-learning
offers over a simple vision-features-only baseline, observing that the baseline
displays strong performance, comparable to the dual prompt learning approach
(DualCoOp), when the proportion of missing labels is low, while requiring half
the training compute and 16 times fewer parameters.
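
The PositiveCoOp idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, the loss, or the encoder interface, so the cosine-similarity scoring, the two-way softmax, the masked binary cross-entropy, and every name below (toy_text_encoder, presence_probability, partial_bce, temperature) are assumptions made for illustration. The text encoder in particular is a toy stand-in for a frozen VLM text encoder such as CLIP's.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_text_encoder(prompt_tokens):
    """Hypothetical stand-in for a frozen VLM text encoder.

    It simply averages the prompt token vectors; a real encoder would be
    a transformer. Only the data flow matters for this sketch.
    """
    dim = len(prompt_tokens[0])
    count = len(prompt_tokens)
    return [sum(tok[i] for tok in prompt_tokens) / count for i in range(dim)]


def presence_probability(image_feat, pos_prompt_tokens, neg_embedding,
                         temperature=0.1):
    """Probability that a class is present, PositiveCoOp-style (assumed form).

    Positive side: learnable prompt tokens pass through the text encoder.
    Negative side: a vector learned directly in the shared feature space,
    bypassing the text encoder.
    """
    pos_embedding = toy_text_encoder(pos_prompt_tokens)
    s_pos = cosine(image_feat, pos_embedding) / temperature
    s_neg = cosine(image_feat, neg_embedding) / temperature
    # Numerically stable two-way softmax over (present, absent).
    m = max(s_pos, s_neg)
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


def partial_bce(probs, labels):
    """Binary cross-entropy averaged over annotated classes only.

    labels: 1 = present, 0 = absent, None = annotation missing (skipped).
    """
    terms = []
    for p, y in zip(probs, labels):
        if y is None:
            continue
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        terms.append(-math.log(p) if y == 1 else -math.log(1.0 - p))
    return sum(terms) / len(terms) if terms else 0.0


# Usage with made-up 2-D features for a single class.
image_feat = [1.0, 0.0]
pos_prompt_tokens = [[1.0, 0.0], [1.0, 0.2]]  # would be learned
neg_embedding = [0.0, 1.0]                    # would be learned
p = presence_probability(image_feat, pos_prompt_tokens, neg_embedding)
loss = partial_bce([p, 0.2, 0.5], [1, None, 0])
```

Under these assumptions, NegativeCoOp would be the mirror image: the negative side would go through the text encoder and the positive side would be a free vector. The vision-features-only baseline mentioned in the abstract would use no text encoder on either side.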