Yunpeng Jia, Xiufen Ye, Xinkui Mei, Yusong Liu, Shuxiang Guo
MLTU: mixup long-tail unsupervised zero-shot image classification on vision-language models
Multimedia Systems, published 2024-06-05
DOI: 10.1007/s00530-024-01373-1 (https://doi.org/10.1007/s00530-024-01373-1)
Impact factor: 3.5 · JCR: Q2 (Computer Science, Information Systems)
Citations: 0
Abstract
Vision-language models (VLMs), such as Contrastive Language-Image Pretraining (CLIP), have demonstrated powerful capabilities in image classification under zero-shot settings. However, current zero-shot learning (ZSL) relies on supervised learning with manually labeled samples of known classes, which incurs labeling costs and restricts real-world applications to classes that can be foreseen. To address these challenges, we propose the mixup long-tail unsupervised (MLTU) approach for open-world ZSL problems. The proposed approach employs a novel long-tail mixup loss that integrates class-based re-weighting assignments with a given mixup factor for each mixed visual embedding. To mitigate the adverse impact of incorrect labels over time, we adopt a noisy learning strategy that filters out samples with incorrectly generated labels. We reproduce existing state-of-the-art long-tail and noisy learning approaches in the unsupervised setting. Experimental results on public datasets demonstrate that MLTU achieves significant improvements in classification over these established approaches. Moreover, it serves as a plug-and-play solution for amending previous assignments and enhancing unsupervised performance. MLTU enables the automatic classification and correction of incorrect predictions caused by the projection bias of CLIP.
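The abstract does not give the exact form of the loss or the filtering rule, so the following is only a rough sketch of the general idea, not the authors' formulation. It assumes inverse-frequency class weights computed from pseudo-labels, a standard two-sample mixup of embeddings with a given factor `lam`, and a simple confidence threshold as the noisy-sample filter. All function names and the threshold rule are hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_weights(pseudo_labels, num_classes):
    """Inverse-frequency weights (an assumption, not the paper's scheme).

    Weights are scaled so that they average to 1 over the classes that
    actually occur; classes with no samples get weight 0.
    """
    counts = [0] * num_classes
    for y in pseudo_labels:
        counts[y] += 1
    raw = [1.0 / c if c > 0 else 0.0 for c in counts]
    present = sum(1 for c in counts if c > 0)
    scale = present / sum(raw)
    return [r * scale for r in raw]


def mixup(emb_a, emb_b, lam):
    """Convex combination of two visual embeddings with mixup factor lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


def reweighted_mixup_loss(logits, y_a, y_b, lam, weights):
    """Class-weighted cross-entropy on a mixed sample.

    Each of the two source labels contributes its own weighted
    cross-entropy term, combined with the same factor used for mixing.
    """
    probs = softmax(logits)
    loss_a = -weights[y_a] * math.log(probs[y_a])
    loss_b = -weights[y_b] * math.log(probs[y_b])
    return lam * loss_a + (1.0 - lam) * loss_b


def keep_confident(prob_rows, threshold):
    """Hypothetical noise filter: keep indices whose top probability
    reaches the threshold."""
    return [i for i, p in enumerate(prob_rows) if max(p) >= threshold]


# Toy usage: three samples pseudo-labeled as class 0, one as class 1.
weights = class_weights([0, 0, 0, 1], num_classes=2)
mixed = mixup([1.0, 0.0], [0.0, 1.0], lam=0.7)
loss = reweighted_mixup_loss([2.0, 0.5], y_a=0, y_b=1, lam=0.7, weights=weights)
kept = keep_confident([[0.9, 0.1], [0.55, 0.45]], threshold=0.8)
```

In this toy run the rarer class receives the larger weight, so errors on tail classes cost more, and only the first sample survives the confidence filter. In a real pipeline the logits would come from comparing the mixed image embedding with CLIP text embeddings of the class names.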
About the journal:
This journal details innovative research ideas, emerging technologies, and state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.