Zeyu Zheng, Jun Yan, Shuicheng Yan, Ning Liu, Zheng Chen, Ming Zhang
2010 IEEE International Conference on Data Mining, published 2010-12-13. DOI: 10.1109/ICDM.2010.23 (https://doi.org/10.1109/ICDM.2010.23)
A Novel Contrast Co-learning Framework for Generating High Quality Training Data
The good performance of most classical learning algorithms generally depends on high-quality training data that are clean and unbiased. Such data are, however, becoming harder than ever to obtain in many real-world problems because of the difficulty of collecting large-scale unbiased data and labeling them precisely for training. In this paper, we propose a general Contrast Co-learning (CCL) framework to refine biased and noisy training data when an unbiased yet unlabeled data pool is available. CCL starts with multiple sets of possibly biased and noisy training data and trains a set of classifiers individually. Then, under the assumption that confidently classified samples are more likely to be correctly classified, CCL iteratively and automatically filters out likely noisy samples and adds confidently classified samples from the unlabeled data pool to correct the bias. Through this process, we can generate a cleaner and unbiased training dataset with theoretical guarantees. Extensive experiments on two public text datasets show that CCL consistently improves classification performance on biased and noisy training data compared with several state-of-the-art classical algorithms.
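The abstract describes the loop only at a high level, so the following is a minimal sketch of one way such a confidence-based refinement could look, not the authors' CCL procedure. Everything specific here is an assumption made for illustration: the toy nearest-centroid classifier, the softmax-style confidence score, the `threshold` and `rounds` parameters, and the rule that a labeled sample is dropped when the classifier trained on a different set confidently disagrees with its label. The function names `fit_centroids`, `predict`, and `refine` are hypothetical.

```python
import math


def fit_centroids(samples):
    """Toy classifier (assumption): one mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict(centroids, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {}
    for y, c in centroids.items():
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))
        weights[y] = math.exp(-dist)
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def refine(train_sets, pool, threshold=0.9, rounds=3):
    """Hypothetical refinement loop over several noisy training sets.

    Each round: (1) train one classifier per set, (2) drop labeled samples
    that the classifier of the next set confidently contradicts, (3) move
    pool samples that all classifiers label identically and confidently
    into every training set.
    """
    train_sets = [list(s) for s in train_sets]
    pool = list(pool)
    for _ in range(rounds):
        models = [fit_centroids(s) for s in train_sets]

        # Step 2: filter likely label noise using a peer classifier.
        new_sets = []
        for i, samples in enumerate(train_sets):
            peer = models[(i + 1) % len(models)]
            kept = []
            for x, y in samples:
                label, conf = predict(peer, x)
                if label != y and conf >= threshold:
                    continue  # confident disagreement: treat as noise
                kept.append((x, y))
            new_sets.append(kept)

        # Step 3: add confidently and unanimously classified pool samples.
        remaining = []
        for x in pool:
            votes = [predict(m, x) for m in models]
            unanimous = len({label for label, _ in votes}) == 1
            if unanimous and min(conf for _, conf in votes) >= threshold:
                for s in new_sets:
                    s.append((x, votes[0][0]))
            else:
                remaining.append(x)

        train_sets, pool = new_sets, remaining
    return train_sets, pool


# Made-up one-dimensional data; the last sample of set_a carries a wrong label.
set_a = [((0.0,), 0), ((0.5,), 0), ((10.0,), 1), ((9.5,), 1), ((9.8,), 0)]
set_b = [((0.2,), 0), ((0.4,), 0), ((9.9,), 1), ((10.2,), 1)]
unlabeled = [(0.1,), (10.1,), (5.0,)]

refined, leftover = refine([set_a, set_b], unlabeled)
print(refined[0])
print(leftover)
```

In this sketch the pool is what counters bias: samples enter the training sets only when every classifier agrees with high confidence, so ambiguous points stay unlabeled instead of being guessed. A real implementation would substitute a proper text classifier and a calibrated confidence estimate.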