Impact of Strategic Sampling and Supervision Policies on Semi-Supervised Learning

IF 5.3 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Emerging Topics in Computational Intelligence | Pub Date: 2024-11-27 | DOI: 10.1109/TETCI.2024.3502453

Shuvendu Roy; Ali Etemad

Volume 9, Issue 4, pp. 2806-2817 | Journal Article | https://ieeexplore.ieee.org/document/10769604/

Citations: 0

Abstract

In semi-supervised representation learning frameworks, when labelled data are very scarce, the quality and representativeness of the labelled samples become increasingly important. Existing work on semi-supervised learning randomly samples a limited number of data points for labelling, and all of these labelled samples are then used alongside the unlabelled data throughout the training process. In this work, we ask two important questions in this context: 1) does it matter which samples are selected for labelling? 2) does it matter how the labelled samples are used throughout the training process along with the unlabelled data? To answer the first question, we explore a number of unsupervised methods for selecting specific subsets of data to label (without prior knowledge of their labels), with the goal of maximizing representativeness w.r.t. the unlabelled set. Then, for our second line of inquiry, we define a variety of label injection strategies in the training process. Extensive experiments on four popular datasets, CIFAR-10, CIFAR-100, SVHN, and STL-10, show that unsupervised selection of samples that are more representative of the entire data improves performance by up to approximately 2% over existing semi-supervised frameworks such as MixMatch, ReMixMatch, FixMatch and others with random sample labelling. We show that this boost can increase to 7.5% in scenarios with very few labels. However, our study shows that gradually injecting the labels throughout the training procedure does not considerably affect performance compared with using all the existing labels throughout the entire training.
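The abstract does not say which unsupervised selection methods or label injection strategies were evaluated, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes each sample is already represented by a feature vector, uses k-means with nearest-to-centroid selection as one plausible way to pick representative samples, and uses a linear ramp as one plausible injection schedule. All function names and parameters here are hypothetical and are not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an illustrative choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns the k centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_for_labelling(points, budget):
    """Pick `budget` indices: the not-yet-chosen point nearest to each centroid."""
    chosen = []
    for c in kmeans(points, budget):
        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(min(candidates, key=lambda i: squared_distance(points[i], c)))
    return chosen


def labels_available(num_labels, epoch, ramp_epochs):
    """Hypothetical linear injection: labels in use at a 0-based epoch."""
    if ramp_epochs <= 0:
        return num_labels  # no ramp: all labels from the start
    ramped = (num_labels * (epoch + 1) + ramp_epochs - 1) // ramp_epochs  # ceiling
    return min(num_labels, ramped)


# Toy features: two well-separated groups of five points each (made-up data).
features = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
            (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
picked = select_for_labelling(features, budget=2)
schedule = [labels_available(40, epoch, ramp_epochs=4) for epoch in range(6)]
```

On this toy input the selection returns one index from each group, which is the sense in which the labelled set "represents" the unlabelled pool. Setting `ramp_epochs=0` recovers the conventional setup in which every label is used from the first epoch, the baseline against which the abstract reports gradual injection makes little difference.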


Source journal

IEEE Transactions on Emerging Topics in Computational Intelligence (category: Mathematics, Control and Optimization)

CiteScore: 10.30

Self-citation rate: 7.50%

Articles published: 147

Journal description: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronic-only publication and publishes six issues per year. Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.