Dual-View Alignment Learning With Hierarchical-Prompt for Class-Imbalance Multi-Label Image Classification

Impact Factor: 13.7

IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society), vol. 34, pp. 5989-6001. Pub Date: 2025-09-18. DOI: 10.1109/TIP.2025.3609185

Full text: https://ieeexplore.ieee.org/document/11169416/

Sheng Huang; Jiexuan Yan; Beiyan Liu; Bo Liu; Richang Hong


Citations: 0

Abstract

Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.
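The results above are reported as mAP (mean average precision). As a rough illustration of that metric only, and not of the authors' evaluation code, the following sketch computes non-interpolated per-class average precision and averages it over classes. The function names and the toy inputs are hypothetical, and the paper's exact protocol (for example, whether an interpolated AP variant is used) is not stated in the abstract.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted confidence per image; labels: 1 if the class is present.
    Returns the mean of precision@k taken at the rank k of each positive image,
    or None when the class has no positive image.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_matrix, label_matrix):
    """mAP: unweighted mean of per-class AP (assumed convention).

    Both arguments are lists of rows, one row per image and one column per
    class. Classes without any positive image are skipped.
    """
    num_classes = len(score_matrix[0])
    per_class = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in label_matrix],
        )
        if ap is not None:
            per_class.append(ap)
    return sum(per_class) / len(per_class)


# Toy example: 4 images, 2 classes (made-up numbers, not from the paper).
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

Because every class contributes equally to the average regardless of how many images contain it, rare (tail) classes weigh as much as frequent ones under this metric.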

