Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
{"title":"Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations","authors":"Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja","doi":"arxiv-2409.08381","DOIUrl":null,"url":null,"abstract":"Vision-language models (VLMs) like CLIP have been adapted for Multi-Label\nRecognition (MLR) with partial annotations by leveraging prompt-learning, where\npositive and negative prompts are learned for each class to associate their\nembeddings with class presence or absence in the shared vision-text feature\nspace. While this approach improves MLR performance by relying on VLM priors,\nwe hypothesize that learning negative prompts may be suboptimal, as the\ndatasets used to train VLMs lack image-caption pairs explicitly focusing on\nclass absence. To analyze the impact of positive and negative prompt learning\non MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is\nlearned with VLM guidance while the other is replaced by an embedding vector\nlearned directly in the shared feature space without relying on the text\nencoder. Through empirical analysis, we observe that negative prompts degrade\nMLR performance, and learning only positive prompts, combined with learned\nnegative embeddings (PositiveCoOp), outperforms dual prompt learning\napproaches. Moreover, we quantify the performance benefits that prompt-learning\noffers over a simple vision-features-only baseline, observing that the baseline\ndisplays strong performance comparable to dual prompt learning approach\n(DualCoOp), when the proportion of missing labels is low, while requiring half\nthe training compute and 16 times fewer parameters","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"201 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt learning offers over a simple vision-features-only baseline, observing that the baseline performs comparably to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
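To make the setup above concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the PositiveCoOp idea: each class gets learnable positive prompt context tokens that pass through a frozen text encoder, while its negative representation is a plain embedding vector optimized directly in the shared feature space. All names, dimensions, and the toy text encoder are illustrative assumptions; in the actual method the frozen encoders would be CLIP's, and image features would typically be spatially aggregated rather than a single pooled vector.

```python
# Minimal sketch of the PositiveCoOp idea (assumptions throughout, not the paper's code):
# per class, a *positive* prompt is learned and passed through a frozen text encoder,
# while the *negative* representation is an embedding vector optimized directly in the
# shared vision-text feature space, bypassing the text encoder.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PositiveCoOpHead(nn.Module):
    def __init__(self, text_encoder, num_classes, ctx_len=16, embed_dim=512):
        super().__init__()
        self.text_encoder = text_encoder  # frozen CLIP-style text encoder (assumed callable)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)

        # Learnable positive prompt context tokens, one set per class.
        self.pos_ctx = nn.Parameter(0.02 * torch.randn(num_classes, ctx_len, embed_dim))
        # Negative class representations learned directly in the shared space,
        # with no text encoder involved (the PositiveCoOp variant).
        self.neg_embed = nn.Parameter(0.02 * torch.randn(num_classes, embed_dim))
        # Learnable temperature; exp(4.6) ~ 100, roughly CLIP's trained value.
        self.logit_scale = nn.Parameter(torch.tensor(4.6))

    def forward(self, image_feats):
        # image_feats: (B, D) pooled image features from a frozen vision encoder.
        pos_embed = self.text_encoder(self.pos_ctx)  # (C, D) positive class embeddings

        img = F.normalize(image_feats, dim=-1)
        pos = F.normalize(pos_embed, dim=-1)
        neg = F.normalize(self.neg_embed, dim=-1)

        scale = self.logit_scale.exp()
        pos_logits = scale * img @ pos.t()  # (B, C) evidence for class presence
        neg_logits = scale * img @ neg.t()  # (B, C) evidence for class absence
        # Per-class presence probability from the positive/negative pair.
        return torch.softmax(torch.stack([pos_logits, neg_logits], dim=-1), dim=-1)[..., 0]


# Toy stand-in for the frozen text encoder so the sketch runs end to end:
# it mean-pools the context tokens and projects them into the shared space.
class ToyTextEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, ctx):               # ctx: (C, L, D)
        return self.proj(ctx.mean(dim=1))  # (C, D)


head = PositiveCoOpHead(ToyTextEncoder(), num_classes=80)
probs = head(torch.randn(4, 512))  # (4, 80) per-class presence probabilities
```

Training would then apply a loss only on the labels actually observed for each image (the partial-annotation setting), e.g. a masked binary cross-entropy or asymmetric loss on the per-class presence probabilities. NegativeCoOp would mirror the sketch with the roles swapped (learned negative prompt, free positive embedding), and the vision-features-only baseline would presumably drop the prompt branch entirely in favor of a lightweight classifier on the frozen image features.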