Co-training Based on Semi-Supervised Ensemble Classification Approach for Multi-label Data Stream

2019 IEEE International Conference on Big Knowledge (ICBK) Pub Date : 2019-11-01 DOI:10.1109/ICBK.2019.00016

Zhe Chu, Peipei Li, Xuegang Hu


Citation count: 10

Abstract

Large volumes of data streams in the form of text and images are emerging in many real-world applications. These streams often exhibit multiple labels, missing labels, and newly emerging classes, which challenge existing data stream classification algorithms in terms of precision, space, and time performance. On the one hand, data stream classification algorithms are mostly trained on fully labeled single-class data, whereas in practice unlabeled data are abundant and labeled data are scarce because labels are difficult to obtain. On the other hand, most existing multi-label data stream classification algorithms assume fully labeled data and no emerging classes, and few semi-supervised methods exist. Therefore, this paper proposes a semi-supervised ensemble classification algorithm for multi-label data streams based on co-training. First, the algorithm uses a sliding window mechanism to partition the data stream into data chunks. On the first w data chunks, the co-training-based multi-label semi-supervised classification algorithm COINS is used to train a base classifier on each chunk, and an ensemble model of w COINS classifiers is generated to adapt to a data stream environment with a large amount of unlabeled data. Meanwhile, a detection mechanism for emerging classes is introduced: the ensemble model predicts the (w+1)-th data chunk to detect whether a new class has emerged. When a new label is detected, the classifier is retrained on the current data chunk and the ensemble model is updated. Finally, experimental results on five real data sets show that, compared with classical algorithms, the proposed approach can improve the classification accuracy on multi-label data streams with a large number of missing labels and newly emerging labels.
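The chunk-and-ensemble workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration. `CentroidBase` is a hypothetical placeholder for COINS that fits one centroid per label and, unlike co-training, ignores unlabeled instances. The new-class test (a label in the current chunk that no ensemble member has seen) and the update rule (train a classifier on the current chunk and drop the oldest member) are also assumed, since the abstract does not specify them. The names `chunks`, `ChunkEnsemble`, `process`, and the `radius` parameter are likewise hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator, List, Optional, Set, Tuple

Features = Tuple[float, ...]
Instance = Tuple[Features, Optional[Set[str]]]  # label set is None if unlabeled


def chunks(stream: Iterable[Instance], size: int) -> Iterator[List[Instance]]:
    """Partition the stream into consecutive chunks of `size` instances."""
    buf: List[Instance] = []
    for inst in stream:
        buf.append(inst)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


class CentroidBase:
    """Placeholder base learner (NOT COINS): one centroid per label,
    fitted on the labeled instances of a single chunk."""

    def __init__(self, chunk: List[Instance]) -> None:
        sums, counts = {}, {}
        for x, ys in chunk:
            for y in ys or ():  # unlabeled instances contribute nothing
                acc = sums.setdefault(y, [0.0] * len(x))
                for i, v in enumerate(x):
                    acc[i] += v
                counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
        self.labels = set(self.centroids)

    def predict(self, x: Features, radius: float = 1.0) -> Set[str]:
        """Return every label whose centroid lies within `radius` of x."""
        return {
            y for y, c in self.centroids.items()
            if sum((a - b) ** 2 for a, b in zip(x, c)) <= radius ** 2
        }


class ChunkEnsemble:
    """Holds at most w base classifiers; appending to a full deque drops the oldest."""

    def __init__(self, w: int) -> None:
        self.members = deque(maxlen=w)

    def known_labels(self) -> Set[str]:
        return set().union(*(m.labels for m in self.members))

    def predict(self, x: Features) -> Set[str]:
        """Emit a label if at least half of the members that know it predict it."""
        out = set()
        for y in self.known_labels():
            knowers = [m for m in self.members if y in m.labels]
            votes = sum(1 for m in knowers if y in m.predict(x))
            if 2 * votes >= len(knowers):
                out.add(y)
        return out


def process(stream: Iterable[Instance], w: int, chunk_size: int):
    """Build the ensemble on the first w chunks, then watch later chunks for new labels."""
    ens = ChunkEnsemble(w)
    events = []  # (chunk index, set of newly seen labels)
    for idx, chunk in enumerate(chunks(stream, chunk_size)):
        if len(ens.members) < w:
            ens.members.append(CentroidBase(chunk))
            continue
        seen = {y for _, ys in chunk for y in ys or ()}
        new = seen - ens.known_labels()
        if new:  # assumed update rule: train on the current chunk, drop the oldest
            ens.members.append(CentroidBase(chunk))
            events.append((idx, new))
    return ens, events
```

Calling `process(stream, w, chunk_size)` returns the final ensemble and a list of the chunks in which a previously unseen label appeared. A faithful version would replace `CentroidBase` with a COINS learner and use the detection mechanism defined in the paper.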



