{"title":"主动监督跨模态检索","authors":"Huaiwen Zhang;Yang Yang;Fan Qi;Shengsheng Qian;Changsheng Xu","doi":"10.1109/TPAMI.2025.3550526","DOIUrl":null,"url":null,"abstract":"Supervised Cross-Modal Retrieval (SCMR) achieves significant performance with the supervision provided by substantial label annotations of multi-modal data. However, the requirement for large annotated multi-modal datasets restricts the use of supervised cross-modal retrieval in many practical scenarios. Active Learning (AL) has been proposed to reduce labeling costs while improving performance in various label-dependent tasks, in which the most informative unlabeled samples are selected for labeling and training. Directly exploiting the existing AL methods for supervised cross-modal retrieval may not be a good idea since they only focus on the uncertainty within each modality, ignoring the inter-modality relationship within the text-image pairs. Furthermore, existing methods focus exclusively on the informativeness of data during sample selection, leading to a biased, homogenized set where selected samples often contain nearly identical semantics and are densely distributed in a region of the feature space. Persistent training with such biased data selections can disturb multi-modal representation learning and substantially degrade the retrieval performance of SCMR. In this work, we propose an Active Supervised Cross-Modal Retrieval (ASCMR) framework, which effectively identifies informative multi-modal samples and generates unbiased sample selections. In particular, we propose a probabilistic multi-modal informativeness estimation that captures both the intra-modality and inter-modality uncertainty of multi-modal pairs within a unified representation. To ensure unbiased sample selection, we introduce a density-aware budget allocation strategy that constrains the active learning objective of maximizing the informativeness of selection with a novel semantic density regularization term. The proposed methods are evaluated on three widely used benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr, demonstrating our effectiveness in significantly reducing the annotation cost while outperforming other baselines of active learning strategies. We could achieve over 95% of the fully supervised model’s performance by only utilizing 6%, 3%, and 4% active selected samples for MS-COCO, NUS-WIDE, and MIRFlickr, respectively.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"5112-5126"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Active Supervised Cross-Modal Retrieval\",\"authors\":\"Huaiwen Zhang;Yang Yang;Fan Qi;Shengsheng Qian;Changsheng Xu\",\"doi\":\"10.1109/TPAMI.2025.3550526\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Supervised Cross-Modal Retrieval (SCMR) achieves significant performance with the supervision provided by substantial label annotations of multi-modal data. However, the requirement for large annotated multi-modal datasets restricts the use of supervised cross-modal retrieval in many practical scenarios. Active Learning (AL) has been proposed to reduce labeling costs while improving performance in various label-dependent tasks, in which the most informative unlabeled samples are selected for labeling and training. 
Directly exploiting the existing AL methods for supervised cross-modal retrieval may not be a good idea since they only focus on the uncertainty within each modality, ignoring the inter-modality relationship within the text-image pairs. Furthermore, existing methods focus exclusively on the informativeness of data during sample selection, leading to a biased, homogenized set where selected samples often contain nearly identical semantics and are densely distributed in a region of the feature space. Persistent training with such biased data selections can disturb multi-modal representation learning and substantially degrade the retrieval performance of SCMR. In this work, we propose an Active Supervised Cross-Modal Retrieval (ASCMR) framework, which effectively identifies informative multi-modal samples and generates unbiased sample selections. In particular, we propose a probabilistic multi-modal informativeness estimation that captures both the intra-modality and inter-modality uncertainty of multi-modal pairs within a unified representation. To ensure unbiased sample selection, we introduce a density-aware budget allocation strategy that constrains the active learning objective of maximizing the informativeness of selection with a novel semantic density regularization term. The proposed methods are evaluated on three widely used benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr, demonstrating our effectiveness in significantly reducing the annotation cost while outperforming other baselines of active learning strategies. We could achieve over 95% of the fully supervised model’s performance by only utilizing 6%, 3%, and 4% active selected samples for MS-COCO, NUS-WIDE, and MIRFlickr, respectively.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 6\",\"pages\":\"5112-5126\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-03-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10923693/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10923693/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
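To make the two uncertainty notions concrete, the sketch below scores an image-text pair under the assumption that each modality encoder outputs a diagonal Gaussian over a shared embedding space. The specific measures chosen here (differential entropy for intra-modality uncertainty, a squared 2-Wasserstein gap for inter-modality uncertainty) and the function name `pair_informativeness` are illustrative stand-ins, not the estimator defined in the paper.

```python
import numpy as np

def pair_informativeness(mu_img, var_img, mu_txt, var_txt):
    """Combine intra- and inter-modality uncertainty into one scalar.

    Hypothetical sketch: each modality is a diagonal Gaussian
    N(mu, diag(var)) in a shared embedding space.
    """
    # Intra-modality uncertainty: differential entropy of each
    # diagonal Gaussian (larger variance -> more uncertain modality).
    intra = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var_img)) \
          + 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var_txt))
    # Inter-modality uncertainty: squared 2-Wasserstein distance between
    # the two Gaussians, which grows when image and text disagree.
    inter = np.sum((mu_img - mu_txt) ** 2) \
          + np.sum((np.sqrt(var_img) - np.sqrt(var_txt)) ** 2)
    return intra + inter

# Toy usage with random 64-d embeddings.
rng = np.random.default_rng(0)
mu_i, mu_t = rng.normal(size=64), rng.normal(size=64)
var_i, var_t = rng.uniform(0.1, 1.0, 64), rng.uniform(0.1, 1.0, 64)
print(pair_informativeness(mu_i, var_i, mu_t, var_t))
```

Under these assumptions, a pair scores high when either encoder is unsure of its own embedding or when the two modalities land far apart, which mirrors the abstract's combination of intra- and inter-modality uncertainty in a single score.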
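Similarly, the density-aware budget allocation can be pictured as a greedy trade-off between informativeness and local semantic density, so that the selected set does not collapse into one dense region of the feature space. In the sketch below, the mean RBF similarity to already-selected samples stands in for the semantic density term, and `lam` for the regularization weight; both are hypothetical choices, not the paper's exact formulation.

```python
import numpy as np

def select_unbiased(scores, feats, budget, lam=1.0, gamma=1.0):
    """Greedily pick `budget` samples maximizing informativeness
    minus a density penalty toward already-selected samples."""
    selected, remaining = [], list(range(len(scores)))
    while len(selected) < budget and remaining:
        best_i, best_val = None, -np.inf
        for i in remaining:
            if selected:
                # Squared distances to the current selection.
                d2 = np.sum((feats[selected] - feats[i]) ** 2, axis=1)
                # Mean RBF similarity: high when i sits in a region
                # already covered by the selection.
                density = np.mean(np.exp(-gamma * d2))
            else:
                density = 0.0
            val = scores[i] - lam * density
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy usage: 200 candidates with random features and scores.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))
scores = rng.uniform(size=200)
print(select_unbiased(scores, feats, budget=10))
```

The density penalty plays the role the abstract assigns to the semantic density regularization term: without it, greedy selection by score alone tends to pick near-duplicates, which is exactly the biased, homogenized selection the paper argues against.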