MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models

IF 3.5 3区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Multimedia Systems Pub Date : 2024-06-05 DOI:10.1007/s00530-024-01373-1

Yunpeng Jia, Xiufen Ye, Xinkui Mei, Yusong Liu, Shuxiang Guo

{"title":"MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models","authors":"Yunpeng Jia, Xiufen Ye, Xinkui Mei, Yusong Liu, Shuxiang Guo","doi":"10.1007/s00530-024-01373-1","DOIUrl":null,"url":null,"abstract":"<p>Vision-language models (VLM), such as Contrastive Language-Image Pretraining (CLIP), have demonstrated powerful capabilities in image classification under zero-shot settings. However, current zero-shot learning (ZSL) relies on manually tagged samples of known classes through supervised learning, resulting in a waste of labor costs and limitations on foreseeable classes in real-world applications. To address these challenges, we propose the mixup long-tail unsupervised (MLTU) approach for open-world ZSL problems. The proposed approach employs a novel long-tail mixup loss that integrated class-based re-weighting assignments with a given mixup factor for each mixed visual embedding. To mitigate the adverse impact over time, we adopt a noisy learning strategy to filter out samples that generated incorrect labels. We reproduce the unsupervised experiments of existing state-of-the-art long-tail and noisy learning approaches. Experimental results demonstrate that MLTU achieves significant improvements in classification compared to these proven existing approaches on public datasets. Moreover, it serves as a plug-and-play solution for amending previous assignments and enhancing unsupervised performance. MLTU enables the automatic classification and correction of incorrect predictions caused by the projection bias of CLIP.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"12 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01373-1","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Vision-language models (VLM), such as Contrastive Language-Image Pretraining (CLIP), have demonstrated powerful capabilities in image classification under zero-shot settings. However, current zero-shot learning (ZSL) relies on manually tagged samples of known classes through supervised learning, resulting in a waste of labor costs and limitations on foreseeable classes in real-world applications. To address these challenges, we propose the mixup long-tail unsupervised (MLTU) approach for open-world ZSL problems. The proposed approach employs a novel long-tail mixup loss that integrated class-based re-weighting assignments with a given mixup factor for each mixed visual embedding. To mitigate the adverse impact over time, we adopt a noisy learning strategy to filter out samples that generated incorrect labels. We reproduce the unsupervised experiments of existing state-of-the-art long-tail and noisy learning approaches. Experimental results demonstrate that MLTU achieves significant improvements in classification compared to these proven existing approaches on public datasets. Moreover, it serves as a plug-and-play solution for amending previous assignments and enhancing unsupervised performance. MLTU enables the automatic classification and correction of incorrect predictions caused by the projection bias of CLIP.

Abstract Image

查看原文本刊更多论文

MLTU：基于视觉语言模型的混合长尾无监督零镜头图像分类

视觉语言模型（VLM），如对比语言-图像预训练模型（CLIP），已在零镜头设置下的图像分类中展现出强大的能力。然而，目前的零镜头学习（ZSL）依赖于通过监督学习手动标记已知类别的样本，造成了人力成本的浪费，并限制了实际应用中的可预见类别。为了应对这些挑战，我们提出了针对开放世界 ZSL 问题的混合长尾无监督（MLTU）方法。该方法采用了一种新颖的长尾混合损失，将基于类别的再加权分配与每个混合视觉嵌入的给定混合因子整合在一起。为了减轻随着时间推移产生的不利影响，我们采用了一种噪声学习策略，以过滤掉产生错误标签的样本。我们重现了现有最先进的长尾和噪声学习方法的无监督实验。实验结果表明，在公共数据集上，与这些成熟的现有方法相比，MLTU 在分类方面取得了显著的改进。此外，MLTU 还是一种即插即用的解决方案，可用于修改以前的分配并提高无监督性能。MLTU 能够自动分类和修正 CLIP 的投影偏差所导致的错误预测。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Multimedia Systems 工程技术-计算机：理论方法

CiteScore

5.40

自引率

7.70%

发文量

148

审稿时长

4.5 months

期刊介绍： This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.