具有概念漂移的文本流单类分类

2008 IEEE International Conference on Data Mining Workshops Pub Date : 2008-12-15 DOI:10.1109/ICDMW.2008.54

Yang Zhang, Xue Li, M. Orlowska

{"title":"具有概念漂移的文本流单类分类","authors":"Yang Zhang, Xue Li, M. Orlowska","doi":"10.1109/ICDMW.2008.54","DOIUrl":null,"url":null,"abstract":"Research on streaming data classification has been mostly based on the assumption that data can be fully labelled. However, this is impractical. Firstly it is impossible to make a complete labelling before all data has arrived. Secondly it is generally very expensive to obtain fully labelled data by using man power. Thirdly user interests may change with time so the labels issued earlier may be inconsistent with the labels issued later - this represents concept drift. In this paper, we consider the problem of one-class classification on text stream with respect to concept drift where a large volume of documents arrives at a high speed and with change of user interests and data distribution. In this case, only a small number of positively labelled documents is available for training. We propose a stacking style ensemble-based approach and have compared it to all other window-based approaches, such as single window, fixed window, and full memory approaches. Our experiment results demonstrate that the proposed ensemble approach outperforms all other approaches.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":"{\"title\":\"One-Class Classification of Text Streams with Concept Drift\",\"authors\":\"Yang Zhang, Xue Li, M. Orlowska\",\"doi\":\"10.1109/ICDMW.2008.54\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Research on streaming data classification has been mostly based on the assumption that data can be fully labelled. However, this is impractical. Firstly it is impossible to make a complete labelling before all data has arrived. Secondly it is generally very expensive to obtain fully labelled data by using man power. Thirdly user interests may change with time so the labels issued earlier may be inconsistent with the labels issued later - this represents concept drift. In this paper, we consider the problem of one-class classification on text stream with respect to concept drift where a large volume of documents arrives at a high speed and with change of user interests and data distribution. In this case, only a small number of positively labelled documents is available for training. We propose a stacking style ensemble-based approach and have compared it to all other window-based approaches, such as single window, fixed window, and full memory approaches. Our experiment results demonstrate that the proposed ensemble approach outperforms all other approaches.\",\"PeriodicalId\":175955,\"journal\":{\"name\":\"2008 IEEE International Conference on Data Mining Workshops\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"37\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 IEEE International Conference on Data Mining Workshops\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDMW.2008.54\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 IEEE International Conference on Data Mining Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDMW.2008.54","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 37

摘要

流数据分类的研究大多基于数据可以被完全标记的假设。然而，这是不切实际的。首先，在所有数据到达之前不可能做出完整的标签。其次，通过人力获得完全标记的数据通常是非常昂贵的。第三，用户的兴趣可能会随着时间的推移而变化，因此较早发布的标签可能与较晚发布的标签不一致——这代表了概念漂移。在本文中，我们考虑了文本流上的一类分类问题，其中大量文档以高速到达，并且用户兴趣和数据分布发生变化。在这种情况下，只有少数正面标记的文件可用于培训。我们提出了一种基于堆叠风格的集成方法，并将其与所有其他基于窗口的方法(如单窗口、固定窗口和全内存方法)进行了比较。实验结果表明，所提出的集成方法优于所有其他方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

One-Class Classification of Text Streams with Concept Drift

Research on streaming data classification has been mostly based on the assumption that data can be fully labelled. However, this is impractical. Firstly it is impossible to make a complete labelling before all data has arrived. Secondly it is generally very expensive to obtain fully labelled data by using man power. Thirdly user interests may change with time so the labels issued earlier may be inconsistent with the labels issued later - this represents concept drift. In this paper, we consider the problem of one-class classification on text stream with respect to concept drift where a large volume of documents arrives at a high speed and with change of user interests and data distribution. In this case, only a small number of positively labelled documents is available for training. We propose a stacking style ensemble-based approach and have compared it to all other window-based approaches, such as single window, fixed window, and full memory approaches. Our experiment results demonstrate that the proposed ensemble approach outperforms all other approaches.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2008 IEEE International Conference on Data Mining Workshops

自引率

0.00%

发文量