On Appropriate Assumptions to Mine Data Streams: Analysis and Practice

Seventh IEEE International Conference on Data Mining (ICDM 2007). Pub Date: 2007-10-28. DOI: 10.1109/ICDM.2007.96

Jing Gao, W. Fan, Jiawei Han

Citations: 176

Abstract

Recent years have witnessed an increasing number of studies in stream mining, which aim to build accurate models for continuously arriving data. Most existing work implicitly assumes that the training data and the yet-to-come testing data are always sampled from the "same distribution", even though this "same distribution" evolves over time. We demonstrate that this may not be true, and that one may never actually know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. In practice, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting the same class label. Importantly, we formally and experimentally demonstrate the robustness of a framework based on model averaging and simple voting for data streams, particularly when incoming data "continuously follows significantly different" distributions. On a real streaming data set, this framework reduces the expected error of baseline models by 60% and remains the most accurate when compared with those baseline models.
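The abstract names model averaging and simple voting but gives no implementation details, so the following is only a minimal sketch of what those two combination rules could look like. It assumes uniform weights and base models that return class-probability estimates; the helper `make_threshold_model`, the labels `"pos"` and `"neg"`, and all numeric values are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Assumption: a base model maps a feature vector to class-probability estimates.
Model = Callable[[Sequence[float]], Dict[str, float]]


def average_probabilities(models: List[Model], x: Sequence[float]) -> Dict[str, float]:
    """Uniformly average the class-probability estimates of all base models."""
    totals: Dict[str, float] = {}
    for model in models:
        for label, p in model(x).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: total / len(models) for label, total in totals.items()}


def predict_by_averaging(models: List[Model], x: Sequence[float]) -> str:
    """Return the label with the highest averaged probability."""
    avg = average_probabilities(models, x)
    return max(avg, key=avg.get)


def predict_by_voting(models: List[Model], x: Sequence[float]) -> str:
    """Each model casts one vote for its most probable label; the majority wins."""
    votes = Counter()
    for model in models:
        probs = model(x)
        votes[max(probs, key=probs.get)] += 1
    return votes.most_common(1)[0][0]


def make_threshold_model(threshold: float, confidence: float) -> Model:
    """Hypothetical toy model: predicts "pos" when x[0] exceeds the threshold."""
    def model(x: Sequence[float]) -> Dict[str, float]:
        p_pos = confidence if x[0] > threshold else 1.0 - confidence
        return {"pos": p_pos, "neg": 1.0 - p_pos}
    return model


# Three toy models, imagined as trained on different chunks of the stream.
models = [
    make_threshold_model(0.2, 0.6),
    make_threshold_model(0.4, 0.6),
    make_threshold_model(0.9, 0.95),
]

x = [0.5]
averaged_label = predict_by_averaging(models, x)
voted_label = predict_by_voting(models, x)
```

In this toy setup the two rules can disagree: two weakly confident models outvote one highly confident model under voting, while averaging lets the confident model dominate. Whether either behavior is preferable on a given stream is an empirical question that this sketch does not address.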
