André G. Maletzke, Denis M. dos Reis, Gustavo E. A. P. A. Batista
{"title":"结合实例选择和自训练,提高数据流量化","authors":"André G. Maletzke, Denis M. dos Reis, Gustavo E. A. P. A. Batista","doi":"10.1186/s13173-018-0076-0","DOIUrl":null,"url":null,"abstract":"In the last years, learning from data streams has attracted the attention of researchers and practitioners due to its large number of applications. These applications have motivated the research community to propose a significant amount of methods to solve problems in diverse tasks, more prominently in classification, clustering, and anomaly detection. However, a relevant task known as quantification has remained mostly unexplored. The quantification goal is to provide an estimate of the class prevalence in an unlabeled set. Recently, we proposed the SQSI algorithm to quantify data streams with concept drifts. SQSI uses a statistical test to identify concept drifts and retrain the classifiers. However, the retraining involves requiring the labels for all newly arrived instances. In this paper, we extend SQSI algorithm by exploring instance selection techniques allied to semi-supervised learning. The idea is to request the classes of a smaller subset of recent examples. Our experiments demonstrate that SQSI’s extension significantly reduces the dependency on actual labels while maintaining or improving the quantification accuracy.","PeriodicalId":39760,"journal":{"name":"Journal of the Brazilian Computer Society","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Combining instance selection and self-training to improve data stream quantification\",\"authors\":\"André G. Maletzke, Denis M. dos Reis, Gustavo E. A. P. A. Batista\",\"doi\":\"10.1186/s13173-018-0076-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the last years, learning from data streams has attracted the attention of researchers and practitioners due to its large number of applications. These applications have motivated the research community to propose a significant amount of methods to solve problems in diverse tasks, more prominently in classification, clustering, and anomaly detection. However, a relevant task known as quantification has remained mostly unexplored. The quantification goal is to provide an estimate of the class prevalence in an unlabeled set. Recently, we proposed the SQSI algorithm to quantify data streams with concept drifts. SQSI uses a statistical test to identify concept drifts and retrain the classifiers. However, the retraining involves requiring the labels for all newly arrived instances. In this paper, we extend SQSI algorithm by exploring instance selection techniques allied to semi-supervised learning. The idea is to request the classes of a smaller subset of recent examples. Our experiments demonstrate that SQSI’s extension significantly reduces the dependency on actual labels while maintaining or improving the quantification accuracy.\",\"PeriodicalId\":39760,\"journal\":{\"name\":\"Journal of the Brazilian Computer Society\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the Brazilian Computer Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s13173-018-0076-0\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Brazilian Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s13173-018-0076-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Combining instance selection and self-training to improve data stream quantification
In the last years, learning from data streams has attracted the attention of researchers and practitioners due to its large number of applications. These applications have motivated the research community to propose a significant amount of methods to solve problems in diverse tasks, more prominently in classification, clustering, and anomaly detection. However, a relevant task known as quantification has remained mostly unexplored. The quantification goal is to provide an estimate of the class prevalence in an unlabeled set. Recently, we proposed the SQSI algorithm to quantify data streams with concept drifts. SQSI uses a statistical test to identify concept drifts and retrain the classifiers. However, the retraining involves requiring the labels for all newly arrived instances. In this paper, we extend SQSI algorithm by exploring instance selection techniques allied to semi-supervised learning. The idea is to request the classes of a smaller subset of recent examples. Our experiments demonstrate that SQSI’s extension significantly reduces the dependency on actual labels while maintaining or improving the quantification accuracy.
期刊介绍:
JBCS is a formal quarterly publication of the Brazilian Computer Society. It is a peer-reviewed international journal which aims to serve as a forum to disseminate innovative research in all fields of computer science and related subjects. Theoretical, practical and experimental papers reporting original research contributions are welcome, as well as high quality survey papers. The journal is open to contributions in all computer science topics, computer systems development or in formal and theoretical aspects of computing, as the list of topics below is not exhaustive. Contributions will be considered for publication in JBCS if they have not been published previously and are not under consideration for publication elsewhere.