{"title":"基于匈牙利方法的光谱聚类零初始化主动学习","authors":"Dávid Papp","doi":"10.14232/actacyb.288006","DOIUrl":null,"url":null,"abstract":"Supervised machine learning tasks often require a large number of labeled training data to set up a model, and then prediction - for example the classification - is carried out based on this model. Nowadays tremendous amount of data is available on the web or in data warehouses, although only a portion of those data is annotated and the labeling process can be tedious, expensive and time consuming. Active learning tries to overcome this problem by reducing the labeling cost through allowing the learning system to iteratively select the data from which it learns. In special case of active learning, the process starts from zero initialized scenario, where the labeled training dataset is empty, and therefore only unsupervised methods can be performed. In this paper a novel query strategy framework is presented for this problem, called Clustering Based Balanced Sampling Framework (CBBSF), which is not only select the initial labeled training dataset, but uniformly selects the items among the categories to get a balanced labeled training dataset. The framework includes an assignment technique to implicitly determine the class membership probabilities. Assignment solution is updated during CBBSF iterations, hence it simulates supervised machine learning more accurately as the process progresses. The proposed Spectral Clustering Based Sampling (SCBS) query startegy realizes the CBBSF framework, and therefore it is applicable in the special zero initialized situation. This selection approach uses ClusterGAN (Clustering using Generative Adversarial Networks) integrated in the spectral clustering algorithm and then it selects an unlabeled instance depending on the class membership probabilities. Global and local versions of SCBS were developed, furthermore, most confident and minimal entropy measures were calculated, thus four different SCBS variants were examined in total. Experimental evaluation was conducted on the MNIST dataset, and the results showed that SCBS outperforms the state-of-the-art zero initialized active learning query strategies.","PeriodicalId":187125,"journal":{"name":"Acta Cybern.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Zero Initialized Active Learning with Spectral Clustering using Hungarian Method\",\"authors\":\"Dávid Papp\",\"doi\":\"10.14232/actacyb.288006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Supervised machine learning tasks often require a large number of labeled training data to set up a model, and then prediction - for example the classification - is carried out based on this model. Nowadays tremendous amount of data is available on the web or in data warehouses, although only a portion of those data is annotated and the labeling process can be tedious, expensive and time consuming. Active learning tries to overcome this problem by reducing the labeling cost through allowing the learning system to iteratively select the data from which it learns. In special case of active learning, the process starts from zero initialized scenario, where the labeled training dataset is empty, and therefore only unsupervised methods can be performed. In this paper a novel query strategy framework is presented for this problem, called Clustering Based Balanced Sampling Framework (CBBSF), which is not only select the initial labeled training dataset, but uniformly selects the items among the categories to get a balanced labeled training dataset. The framework includes an assignment technique to implicitly determine the class membership probabilities. Assignment solution is updated during CBBSF iterations, hence it simulates supervised machine learning more accurately as the process progresses. The proposed Spectral Clustering Based Sampling (SCBS) query startegy realizes the CBBSF framework, and therefore it is applicable in the special zero initialized situation. This selection approach uses ClusterGAN (Clustering using Generative Adversarial Networks) integrated in the spectral clustering algorithm and then it selects an unlabeled instance depending on the class membership probabilities. Global and local versions of SCBS were developed, furthermore, most confident and minimal entropy measures were calculated, thus four different SCBS variants were examined in total. Experimental evaluation was conducted on the MNIST dataset, and the results showed that SCBS outperforms the state-of-the-art zero initialized active learning query strategies.\",\"PeriodicalId\":187125,\"journal\":{\"name\":\"Acta Cybern.\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Acta Cybern.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14232/actacyb.288006\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Acta Cybern.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14232/actacyb.288006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
监督式机器学习任务通常需要大量标记的训练数据来建立模型,然后根据该模型进行预测(例如分类)。如今,在网络上或数据仓库中有大量的数据,尽管这些数据中只有一部分被注释了,而且标记过程可能是乏味、昂贵和耗时的。主动学习试图克服这个问题,通过允许学习系统迭代地选择学习的数据来减少标记成本。在主动学习的特殊情况下,该过程从零初始化场景开始,其中标记的训练数据集是空的,因此只能执行无监督方法。针对这一问题,本文提出了一种新的查询策略框架——基于聚类的平衡采样框架(CBBSF),该框架不仅选择初始标记训练数据集,而且在分类中统一选择项目,从而得到一个平衡标记训练数据集。该框架包括一种赋值技术来隐式地确定类隶属概率。分配解决方案在CBBSF迭代期间更新,因此随着过程的进行,它更准确地模拟了监督机器学习。提出的基于谱聚类采样(SCBS)查询策略实现了CBBSF框架,因此适用于特殊的零初始化情况。该方法将ClusterGAN (Clustering using Generative Adversarial Networks)集成到谱聚类算法中,然后根据类隶属度概率选择一个未标记的实例。开发了SCBS的全球和本地版本,此外,计算了最自信和最小熵度量,从而总共检查了四种不同的SCBS变体。在MNIST数据集上进行了实验评估,结果表明SCBS优于最先进的零初始化主动学习查询策略。
Zero Initialized Active Learning with Spectral Clustering using Hungarian Method
Supervised machine learning tasks often require a large number of labeled training data to set up a model, and then prediction - for example the classification - is carried out based on this model. Nowadays tremendous amount of data is available on the web or in data warehouses, although only a portion of those data is annotated and the labeling process can be tedious, expensive and time consuming. Active learning tries to overcome this problem by reducing the labeling cost through allowing the learning system to iteratively select the data from which it learns. In special case of active learning, the process starts from zero initialized scenario, where the labeled training dataset is empty, and therefore only unsupervised methods can be performed. In this paper a novel query strategy framework is presented for this problem, called Clustering Based Balanced Sampling Framework (CBBSF), which is not only select the initial labeled training dataset, but uniformly selects the items among the categories to get a balanced labeled training dataset. The framework includes an assignment technique to implicitly determine the class membership probabilities. Assignment solution is updated during CBBSF iterations, hence it simulates supervised machine learning more accurately as the process progresses. The proposed Spectral Clustering Based Sampling (SCBS) query startegy realizes the CBBSF framework, and therefore it is applicable in the special zero initialized situation. This selection approach uses ClusterGAN (Clustering using Generative Adversarial Networks) integrated in the spectral clustering algorithm and then it selects an unlabeled instance depending on the class membership probabilities. Global and local versions of SCBS were developed, furthermore, most confident and minimal entropy measures were calculated, thus four different SCBS variants were examined in total. Experimental evaluation was conducted on the MNIST dataset, and the results showed that SCBS outperforms the state-of-the-art zero initialized active learning query strategies.