{"title":"Active Supervised Cross-Modal Retrieval","authors":"Huaiwen Zhang;Yang Yang;Fan Qi;Shengsheng Qian;Changsheng Xu","doi":"10.1109/TPAMI.2025.3550526","DOIUrl":null,"url":null,"abstract":"Supervised Cross-Modal Retrieval (SCMR) achieves significant performance with the supervision provided by substantial label annotations of multi-modal data. However, the requirement for large annotated multi-modal datasets restricts the use of supervised cross-modal retrieval in many practical scenarios. Active Learning (AL) has been proposed to reduce labeling costs while improving performance in various label-dependent tasks, in which the most informative unlabeled samples are selected for labeling and training. Directly exploiting the existing AL methods for supervised cross-modal retrieval may not be a good idea since they only focus on the uncertainty within each modality, ignoring the inter-modality relationship within the text-image pairs. Furthermore, existing methods focus exclusively on the informativeness of data during sample selection, leading to a biased, homogenized set where selected samples often contain nearly identical semantics and are densely distributed in a region of the feature space. Persistent training with such biased data selections can disturb multi-modal representation learning and substantially degrade the retrieval performance of SCMR. In this work, we propose an Active Supervised Cross-Modal Retrieval (ASCMR) framework, which effectively identifies informative multi-modal samples and generates unbiased sample selections. In particular, we propose a probabilistic multi-modal informativeness estimation that captures both the intra-modality and inter-modality uncertainty of multi-modal pairs within a unified representation. To ensure unbiased sample selection, we introduce a density-aware budget allocation strategy that constrains the active learning objective of maximizing the informativeness of selection with a novel semantic density regularization term. The proposed methods are evaluated on three widely used benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr, demonstrating our effectiveness in significantly reducing the annotation cost while outperforming other baselines of active learning strategies. We could achieve over 95% of the fully supervised model’s performance by only utilizing 6%, 3%, and 4% active selected samples for MS-COCO, NUS-WIDE, and MIRFlickr, respectively.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"5112-5126"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10923693/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Supervised Cross-Modal Retrieval (SCMR) achieves strong performance when supervised by substantial label annotations of multi-modal data. However, the requirement for large annotated multi-modal datasets restricts the use of supervised cross-modal retrieval in many practical scenarios. Active Learning (AL) has been proposed to reduce labeling costs while improving performance in various label-dependent tasks, by selecting the most informative unlabeled samples for labeling and training. Directly applying existing AL methods to supervised cross-modal retrieval is suboptimal, since they focus only on the uncertainty within each modality and ignore the inter-modality relationship within text-image pairs. Furthermore, existing methods focus exclusively on the informativeness of data during sample selection, leading to a biased, homogenized set in which the selected samples often carry nearly identical semantics and are densely distributed in one region of the feature space. Persistent training with such biased selections can disturb multi-modal representation learning and substantially degrade the retrieval performance of SCMR. In this work, we propose an Active Supervised Cross-Modal Retrieval (ASCMR) framework, which effectively identifies informative multi-modal samples and generates unbiased sample selections. In particular, we propose a probabilistic multi-modal informativeness estimation that captures both the intra-modality and inter-modality uncertainty of multi-modal pairs within a unified representation. To ensure unbiased sample selection, we introduce a density-aware budget allocation strategy that constrains the active learning objective of maximizing the informativeness of the selection with a novel semantic density regularization term. The proposed methods are evaluated on three widely used benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr, demonstrating that our approach significantly reduces annotation cost while outperforming other active learning baselines. We achieve over 95% of the fully supervised model's performance using only 6%, 3%, and 4% of actively selected samples on MS-COCO, NUS-WIDE, and MIRFlickr, respectively.
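As a rough illustration of the kind of selection objective the abstract describes (informativeness maximization constrained by a semantic density term), the following minimal Python sketch greedily picks uncertain samples while penalizing candidates that lie close to already-selected ones in a joint feature space. This is not the authors' ASCMR implementation; the function name, scoring scheme, and parameters are hypothetical.

```python
import numpy as np

def select_batch(features, uncertainty, budget, density_weight=0.5, sigma=1.0):
    """Greedy selection sketch: favor samples that are highly uncertain but
    not densely clustered around already-selected samples.

    features:    (N, d) joint multi-modal embeddings of unlabeled pairs
    uncertainty: (N,) per-pair informativeness scores (e.g., a combination of
                 intra- and inter-modality uncertainty)
    budget:      number of pairs to send for annotation
    """
    n = features.shape[0]
    selected = []
    penalty = np.zeros(n)
    for _ in range(budget):
        score = uncertainty - density_weight * penalty
        score[selected] = -np.inf          # never re-pick a sample
        idx = int(np.argmax(score))
        selected.append(idx)
        # Raise the penalty of samples near the newly picked one,
        # acting as a crude semantic-density regularizer.
        dist = np.linalg.norm(features - features[idx], axis=1)
        penalty += np.exp(-dist ** 2 / (2 * sigma ** 2))
    return selected

# Toy usage with random embeddings and scores
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
unc = rng.random(1000)
batch = select_batch(feats, unc, budget=32)
print(len(batch), "pairs selected for annotation")
```

The greedy loop trades off informativeness against local density; in the paper this trade-off is instead formulated as a density-aware budget allocation with a semantic density regularization term.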