{"title":"通过代表性和多样性样本选择加强半监督学习","authors":"Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu","doi":"arxiv-2409.11653","DOIUrl":null,"url":null,"abstract":"Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep\nlearning tasks, which reduces the need for human labor. Previous studies\nprimarily focus on effectively utilising the labelled and unlabeled data to\nimprove performance. However, we observe that how to select samples for\nlabelling also significantly impacts performance, particularly under extremely\nlow-budget settings. The sample selection task in SSL has been under-explored\nfor a long time. To fill in this gap, we propose a Representative and Diverse\nSample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm\nto minimise a novel criterion $\\alpha$-Maximum Mean Discrepancy ($\\alpha$-MMD),\nRDSS samples a representative and diverse subset for annotation from the\nunlabeled data. We demonstrate that minimizing $\\alpha$-MMD enhances the\ngeneralization ability of low-budget learning. 
Experimental results show that\nRDSS consistently improves the performance of several popular SSL frameworks\nand outperforms the state-of-the-art sample selection approaches used in Active\nLearning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained\nannotation budgets.","PeriodicalId":501301,"journal":{"name":"arXiv - CS - Machine Learning","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection\",\"authors\":\"Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu\",\"doi\":\"arxiv-2409.11653\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep\\nlearning tasks, which reduces the need for human labor. Previous studies\\nprimarily focus on effectively utilising the labelled and unlabeled data to\\nimprove performance. However, we observe that how to select samples for\\nlabelling also significantly impacts performance, particularly under extremely\\nlow-budget settings. The sample selection task in SSL has been under-explored\\nfor a long time. To fill in this gap, we propose a Representative and Diverse\\nSample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm\\nto minimise a novel criterion $\\\\alpha$-Maximum Mean Discrepancy ($\\\\alpha$-MMD),\\nRDSS samples a representative and diverse subset for annotation from the\\nunlabeled data. We demonstrate that minimizing $\\\\alpha$-MMD enhances the\\ngeneralization ability of low-budget learning. 
Experimental results show that\\nRDSS consistently improves the performance of several popular SSL frameworks\\nand outperforms the state-of-the-art sample selection approaches used in Active\\nLearning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained\\nannotation budgets.\",\"PeriodicalId\":501301,\"journal\":{\"name\":\"arXiv - CS - Machine Learning\",\"volume\":\"11 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11653\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11653","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Abstract
Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep
learning tasks because it reduces the need for human labeling effort. Previous
studies primarily focus on effectively utilizing labeled and unlabeled data to
improve performance. However, we observe that how samples are selected for
labeling also significantly impacts performance, particularly under extremely
low-budget settings. The sample selection task in SSL has long been
under-explored. To fill this gap, we propose a Representative and Diverse
Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm
to minimize a novel criterion, the $\alpha$-Maximum Mean Discrepancy
($\alpha$-MMD), RDSS samples a representative and diverse subset for annotation
from the unlabeled data. We demonstrate that minimizing $\alpha$-MMD enhances
the generalization ability of low-budget learning. Experimental results show
that RDSS consistently improves the performance of several popular SSL
frameworks and outperforms state-of-the-art sample selection approaches used in
Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with
constrained annotation budgets.
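To make the selection idea concrete, the following is a minimal illustrative sketch, not the authors' RDSS implementation: instead of the paper's $\alpha$-MMD and modified Frank-Wolfe algorithm, it greedily picks a subset that keeps the plain (unweighted) MMD to the unlabeled pool small, using kernel herding, whose updates correspond to Frank-Wolfe steps on the squared-MMD objective. All function names and the RBF bandwidth below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between rows of X and rows of Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd_greedy_select(X, budget, gamma=1.0):
    """Greedily choose `budget` indices whose empirical distribution has low
    MMD to the full pool X (kernel herding; each pick is a Frank-Wolfe-style
    step on the squared-MMD objective)."""
    K = rbf_kernel(X, X, gamma)
    pool_mean = K.mean(axis=1)        # k(x_i, .) averaged over the whole pool
    running = np.zeros(len(X))        # sum of k(x_i, x_s) over chosen s
    selected = []
    for t in range(budget):
        # Herding score: favor points representative of the pool (high
        # pool_mean) yet far from already-chosen points (low running sum),
        # which trades off representativeness against diversity.
        score = pool_mean - running / (t + 1)
        score[selected] = -np.inf     # sample without replacement
        j = int(np.argmax(score))
        selected.append(j)
        running += K[:, j]
    return selected
```

Usage: `idx = mmd_greedy_select(features, budget=40)` returns the indices of the chosen annotation subset; the trade-off the score encodes is the same representativeness-versus-diversity tension the abstract describes, while the $\alpha$ weighting that formalizes it in the paper is omitted here.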