On Appropriate Assumptions to Mine Data Streams: Analysis and Practice

Jing Gao, W. Fan, Jiawei Han
Published in: Seventh IEEE International Conference on Data Mining (ICDM 2007)
Publication date: 2007-10-28
DOI: 10.1109/ICDM.2007.96 (https://doi.org/10.1109/ICDM.2007.96)
Citations: 176

Abstract

Recent years have witnessed an increasing number of studies in stream mining, which aim to build accurate models for continuously arriving data. Most existing work implicitly assumes that the training data and the yet-to-come testing data are always sampled from the "same distribution", even though this "same distribution" evolves over time. We demonstrate that this assumption may not hold, and that one may never know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. In practice, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting the same class label. Importantly, we formally and experimentally demonstrate the robustness of a model-averaging and simple voting-based framework for data streams, particularly when incoming data "continuously follows significantly different" distributions. On a real data stream, this framework reduces the expected error of baseline models by 60% and remains the most accurate when compared with those baseline models.
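
To make the idea of model averaging concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each base model returns class-probability estimates and that the ensemble simply averages those estimates with equal weight before picking the most probable label. The names `average_probabilities`, `predict`, `model_a`, and `model_b` are hypothetical, and the two toy models stand in for classifiers trained on different chunks of a stream.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a model maps a feature vector to class probabilities.
Probs = Dict[str, float]
Model = Callable[[Sequence[float]], Probs]


def average_probabilities(models: List[Model], x: Sequence[float]) -> Probs:
    """Average the class-probability estimates of all models (equal weights)."""
    totals: Probs = {}
    for model in models:
        for label, p in model(x).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: total / len(models) for label, total in totals.items()}


def predict(models: List[Model], x: Sequence[float]) -> str:
    """Return the label with the highest averaged probability."""
    averaged = average_probabilities(models, x)
    return max(averaged, key=averaged.get)


# Toy stand-ins for models trained on two different stream chunks (assumed).
def model_a(x: Sequence[float]) -> Probs:
    p = 0.9 if x[0] > 0.5 else 0.2
    return {"pos": p, "neg": 1.0 - p}


def model_b(x: Sequence[float]) -> Probs:
    p = 0.6 if x[0] > 0.3 else 0.1
    return {"pos": p, "neg": 1.0 - p}


if __name__ == "__main__":
    ensemble = [model_a, model_b]
    print(predict(ensemble, [0.4]))  # the two models disagree on this input
    print(predict(ensemble, [0.7]))  # both models favor "pos" here
```

Equal weighting is a deliberate choice in this sketch: it does not try to guess which past chunk best resembles the incoming data, which matches the abstract's point that one may not know how or when the distribution changes.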